Releases: aphp/edsnlp
v0.19.0
Changelog
Added
- New `DocToMarkupConverter` to convert documents to markdown, and improved `MarkupToDocConverter` to allow overlapping markup annotations (e.g., `This is a <a>text <b>with</a> overlapping</b> tags`).
- New helper `edsnlp.utils.fuzzy_alignment.align` to map the entities of an annotated document onto another document with similar but not identical text (e.g., after some text normalization or minor edits).
- We now support `span_getter="sents"` to apply various pipes to sentences instead of entities or spans.
- New generic LLM extractor pipe, `eds.llm_markup_extractor`, which can be used to extract entities with a large language model served through an OpenAI-style API.
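The fuzzy-alignment idea behind the new helper can be sketched in plain Python with `difflib`; `align_char_span` below is a hypothetical stand-in for illustration, not the actual `edsnlp.utils.fuzzy_alignment.align` implementation:

```python
import difflib

def align_char_span(old_text, new_text, start, end):
    """Project a character span from old_text onto a slightly edited
    new_text by walking difflib's matching blocks (toy illustration)."""
    matcher = difflib.SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    new_start = new_end = None
    for a0, b0, size in matcher.get_matching_blocks():
        # if the span start/end falls inside a matching block, project it
        if new_start is None and a0 <= start < a0 + size:
            new_start = b0 + (start - a0)
        if a0 < end <= a0 + size:
            new_end = b0 + (end - a0)
    if new_start is None or new_end is None:
        return None  # the span falls in an edited region and cannot be mapped
    return new_start, new_end

old = "Le  patient présente une  toux."   # raw text with double spaces
new = "Le patient présente une toux."     # normalized text
span = align_char_span(old, new, 26, 30)  # "toux" in the raw text
```

Under these assumptions, `span` recovers the position of "toux" in the normalized text even though every character offset after the first double space has shifted.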
Fixed
- Since `fork` hangs when HDFS has been used in the main process, we now auto-detect whether the currently running program has interacted with HDFS before picking a process start method.
- We now account for pipe selection (i.e., `enable`, `disable` and `exclude`) when loading a model from the Hugging Face hub.
- We no longer instantiate pipes listed in `exclude` when loading a model (previously they were instantiated but not added to the pipeline).
Pull Requests
- correcting typo by @ohassanaly in #447
- Slurm integration by @percevalw in #449
- Llm extraction by @percevalw in #450
- chore: bump version to 0.19.0 by @percevalw in #453
New Contributors
- @ohassanaly made their first contribution in #447
Full Changelog: v0.18.0...v0.19.0
v0.18.0
Changelog
📢 EDS-NLP will drop support for Python 3.7, 3.8 and 3.9 in the next major release (v0.19.0), in October 2025. Please upgrade to Python 3.10 or later.
Added
- Added support for multiple loggers (`tensorboard`, `wandb`, `comet_ml`, `aim`, `mlflow`, `clearml`, `dvclive`, `csv`, `json`, `rich`) in `edsnlp.train` via the `logger` parameter. The default is [`json`, `rich`] for backward compatibility.
- Sub-batch sizes for gradient accumulation can now be defined as simple "splits" of the original batch, e.g., `batch_size = 10000 tokens` and `sub_batch_size = 5 splits` to accumulate sub-batches of 2000 tokens.
- The Parquet writer now has a `pyarrow_write_kwargs` parameter, passed to `pyarrow.dataset.write_dataset`.
- `LinearSchedule` (mostly used for LR scheduling) now allows an `end_value` parameter to configure whether the learning rate should decay to zero or to another value.
- New `eds.explode` pipe that splits one document into multiple documents, one per span yielded by its `span_getter` parameter, each new document containing exactly that single span.
- New "Training a span classifier" tutorial, and reorganized deep-learning docs.
- `ScheduledOptimizer` now warns when a parameter selector does not match any parameter.
Fixed
- `use_sections` in `eds.history` now correctly handles cases where other sections follow history sections.
- Added clickable snippets in the documentation for more registered functions.
- Pyarrow dataset writing with multiprocessing should be faster, as we removed a useless data transfer.
- We now correctly support loading transformers in offline mode if they are already in Hugging Face's cache.
- We now support the `words[-10:10]` syntax in the trainable span classifier's `context_getter` parameter.
- 🚑 Until now, `post_init` was applied after the instantiation of the optimizer: if the model discovered new labels, and therefore changed its parameter tensors to reflect that, these new tensors were not taken into account by the optimizer, which could lead to subpar performance. Now, `post_init` is applied before the optimizer is instantiated, so that the optimizer correctly handles the new tensors.
- Added missing entry points for readers and writers in the registry, including `write_parquet`, and support for `polars` in `pyproject.toml`. All implemented readers and writers are now correctly registered as entry points.
- Parameters are now updated in place when `post_init` is run in `eds.ner_crf` and `eds.span_classifier`, and are therefore correctly taken into account by the optimizer.
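The optimizer issue fixed above can be illustrated with a toy model and optimizer (plain Python, not edsnlp's actual classes): an optimizer captures references to parameter tensors when it is built, so building it before `post_init` replaces those tensors leaves it updating stale objects.

```python
class TinyModel:
    def __init__(self):
        self.params = {"w": [0.0] * 4}

    def post_init(self, n_new_labels):
        # discovering new labels replaces the parameter tensor
        self.params["w"] = [0.0] * (4 + n_new_labels)

class TinyOptimizer:
    def __init__(self, params):
        # an optimizer holds references to the tensors at build time
        self.tracked = list(params.values())

# Old (buggy) order: optimizer built first, then post_init
model = TinyModel()
optim = TinyOptimizer(model.params)
model.post_init(2)
stale = optim.tracked[0] is not model.params["w"]  # optimizer holds the old tensor

# New (fixed) order: post_init first, then the optimizer
model2 = TinyModel()
model2.post_init(2)
optim2 = TinyOptimizer(model2.params)
fresh = optim2.tracked[0] is model2.params["w"]    # optimizer tracks the new tensor
```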
Changed
- Section cues in `eds.history` are now section titles, not the full section.
- 💥 Validation metrics are now found under the root field `validation` in the training logs (e.g., `metrics['validation']['ner']['micro']['f']`).
- It is now recommended to define optimizer groups of `ScheduledOptimizer` as a list of dicts of optim hyper-parameters, each containing a `selector` regex key, rather than as a single dict mapping `selector` regexes to dicts of optim hyper-parameters. This allows more flexibility in defining the optimizer groups and is more consistent with the rest of the EDS-NLP API. It also makes it easier to reference group values from other places in config files, since their path no longer contains a complex regex string. See the updated training tutorials for more details.
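A minimal sketch of the two styles, assuming keys as described above (a `selector` regex plus optim hyper-parameters; the exact accepted keys beyond `selector` are illustrative here):

```python
# New recommended style: a list of dicts, each with a "selector" regex key
groups = [
    {"selector": "^transformer", "lr": 5e-5},
    {"selector": "", "lr": 3e-4},
]

# Old style: a single dict mapping selector regexes to hyper-parameter dicts
old_groups = {
    "^transformer": {"lr": 5e-5},
    "": {"lr": 3e-4},
}

# Referencing a group's value no longer requires the regex in its path:
first_group_lr = groups[0]["lr"]
```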
Pull Requests
- fix: use_sections in eds.history should now work by @percevalw in #430
- docs: fix read parquet parameters docs by @percevalw in #425
- Explode pipe + span classifier training tutorial by @percevalw in #432
- Update, fix and refactor doc dependencies by @percevalw in #438
- fix: entrypoints by @aricohen93 in #420
- fix: take filter_expr into account in dependency parsing evaluation by @percevalw in #382
- Update ner_crf & span_classifier params in place in post_init to avoid optimizer issues by @percevalw in #443
- chore: bump version to 0.18.0 by @percevalw in #439
Full Changelog: v0.17.2...v0.18.0
v0.17.2
Changelog
Added
- Handle intra-word linebreaks as pollution: adds a pollution pattern that detects intra-word linebreaks, which can then be removed by the `get_text` method.
- Qualifiers can process `Span` or `Doc` objects: this especially makes it easier to nest qualifier components in other components.
- New `label_weights` parameter in `eds.span_classifier`, which allows the user to set per label-value loss weights during training.
- New `edsnlp.data.converters.MarkupToDocConverter` to convert Markdown- or XML-like markup to documents, which is particularly useful to create annotated documents from scratch (e.g., for testing purposes).
- New Metrics documentation page to document the available metrics and how to use them.
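To make the XML-like markup idea concrete, here is a toy parser for non-overlapping tags (illustration only; the actual `MarkupToDocConverter` produces spaCy `Doc` objects and, as of v0.19.0, also handles overlapping tags):

```python
import re

def parse_markup(markup):
    """Toy parser for non-overlapping XML-like markup, e.g. 'a <b>c</b>':
    returns the plain text and (start, end, label) character spans."""
    text, spans, stack = "", [], []
    pos = 0
    for m in re.finditer(r"<(/?)(\w+)>", markup):
        text += markup[pos:m.start()]
        pos = m.end()
        if m.group(1):  # closing tag: pop the matching opening position
            label, start = stack.pop()
            spans.append((start, len(text), label))
        else:           # opening tag: remember the label and current offset
            stack.append((m.group(2), len(text)))
    text += markup[pos:]
    return text, spans

text, spans = parse_markup("Le patient a une <disease>pneumonie</disease>.")
```

Here `text` is the tag-free sentence and `spans` holds one `(17, 26, "disease")` annotation covering "pneumonie".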
Fixed
- Various disorders/behaviors patches
Changed
- Deduplicate spans between `doc.ents` and `doc.spans` during training: previously, a `span_getter` requesting entities from both `ents` and `spans` could yield duplicates.
Pull Requests
- feat: Various patches by @Thomzoy in #391
- Metrics doc by @percevalw in #417
- chore: bump version to 0.17.2 by @percevalw in #424
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Changelog
Added
- Added gradient spike detection to the `edsnlp.train` script, and per-weight-layer gradient logging.
Fixed
- Fixed mini-batch accumulation for multi-task training
- Fixed a pickling error when applying a pipeline in multiprocessing mode. This occurred in some cases when one of the pipes was declared in a "difficultly importable" module (e.g., causing a "PicklingWarning: Cannot locate reference to <class...").
- Fixed a typo in `eds.consultation_dates` towns: `berck.sur.mer`.
- Fixed a bug where relative date expressions with bounds (e.g., "depuis hier") raised an error when converted to durations.
- Fixed the ADICAP pipe to handle cases where no code is found after "codification"/"adicap".
- Support "00"-like hours and minutes in the `eds.dates` component.
- Fixed arc minute, arc second and degree unit scales in `eds.quantities`, used when converting between different time (or angle) units.
Pull Requests
- fix: add grad spike detection by @percevalw in #375
- fix: avoid pickling error in multiprocessing mode by @percevalw in #408
- fix: correct town name typo (berck.sur.mer) by @percevalw in #409
- fix: error when converting relative date expressions with bounds to durations by @percevalw in #411
- Fix adicap by @aricohen93 in #410
- Fix time matching by @LoickChardon in #413
- chore: bump version to 0.17.1 by @percevalw in #416
New Contributors
- @LoickChardon made their first contribution in #413
Full Changelog: v0.17.0...v0.17.1
v0.17.0
Changelog
Added
- Support for numpy>2.0, and formal support for Python 3.11 and Python 3.12
- Expose the default patterns of `eds.negation`, `eds.hypothesis`, `eds.family`, `eds.history` and `eds.reported_speech` under an `eds.negation.default_patterns` attribute.
- Added a `context_getter` SpanGetter argument to the `eds.matcher` class to only retrieve entities inside the spans returned by the getter.
- Added a `filter_expr` parameter to scorers to filter the documents to score.
- Added a new `required` field to `eds.contextual_matcher` assign patterns so they only match if the required field has been found, and an `include` parameter (similar to `exclude`) to search for required patterns without assigning them to the entity.
- Added context strings (e.g., `"words[0:5] | sent[0:1]"`) to the `eds.contextual_matcher` component to allow for more complex patterns in the selection of the window around the trigger spans.
- Include and exclude patterns in the contextual matcher now dismiss matches that occur inside the anchor pattern (e.g., an "anti" exclude pattern for the anchor pattern "antibiotics" will not match the "anti" part of "antibiotics").
- Pull Requests will now build a public accessible preview of the docs
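The new exclude-inside-anchor behavior can be mimicked with plain regexes (a rough sketch; the real component works on spaCy spans): an exclude match is dismissed when it lies entirely inside the anchor match.

```python
import re

text = "patient on antibiotics since May"
anchor = re.search(r"antibiotics", text)

# keep only exclude matches that are NOT inside the anchor span
excludes = [
    m for m in re.finditer(r"anti", text)
    if not (anchor.start() <= m.start() and m.end() <= anchor.end())
]
```

Here the lone `anti` match sits inside `antibiotics`, so no exclude survives and the anchor match is kept.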
Changed
- Improve the contextual matcher documentation.
Fixed
- `edsnlp.package` now correctly detects whether a project uses an old-style poetry pyproject or a PEP 621 pyproject.toml.
- PEP 621 projects containing nested directories (e.g., "my_project/pipes/foo.py") are now supported.
- Try several paths to find the current pip executable.
- The `value_extract` parameter of `eds.score` now correctly handles lists of patterns.
- "Zero variance" errors when computing parameter tuning importance are now caught and converted to warnings.
Pull Requests
- Fix packaging by @percevalw in #395
- fix: avoid non-standard (pytoml) syntax in pyproject.toml by @percevalw in #399
- fix: try several paths to find current pip executable by @percevalw in #401
- Fix optuna issue by @LucasDedieu in #398
- Improve contextual matcher by @percevalw in #289
Full Changelog: v0.16.0...v0.17.0
v0.16.0
Changelog
Added
- Hyperparameter tuning for EDS-NLP: introduced a new script, `edsnlp.tune`, for hyperparameter tuning using Optuna. This feature allows users to efficiently optimize model parameters with options for single-phase or two-phase tuning strategies. Includes support for parameter importance analysis, visualization, pruning, and automatic handling of GPU time budgets.
- Provided a detailed tutorial on hyperparameter tuning, covering usage scenarios and configuration options.
- `ScheduledOptimizer` (e.g., `@core: "optimizer"`) now supports importing optimizers using their qualified name (e.g., `optim: "torch.optim.Adam"`).
- `eds.ner_crf` now computes confidence scores on spans.
Changed
- The loss of `eds.ner_crf` is now computed as the mean over the words instead of the sum. This change is compatible with multi-GPU training.
- Having multiple stats keys matching a batching pattern now warns instead of raising an error.
Fixed
- Support packaging with poetry 2.0
- Solve pickling issues with multiprocessing when pytorch is installed
- Allow deep attributes like `a.b.c` for `span_attributes` in the Standoff and OMOP doc2dict converters.
- Fixed various aspects of stream shuffling:
  - Ensure the Parquet reader shuffles the data when `shuffle=True`
  - Ensure we don't overwrite the RNG of the data reader when calling `stream.shuffle()` with no seed
  - Raise an error if the batch size in `stream.shuffle(batch_size=...)` is not compatible with the stream
- `eds.split` now keeps doc and span attributes in the sub-documents.
Pull Requests
- fix: support packaging with poetry 2.0 by @percevalw in #362
- Solve pickling issues with multiprocessing when pytorch is installed by @percevalw in #367
- Feat: add hyperparameters tuning by @LucasDedieu in #361
- Fix issue 368: Add `metric` parameter and write optimal `config.yml` at the end of tuning by @LucasDedieu in #369
- Fix issue 370: two-phase tuning now writes phase 1 frozen best values into the phase 2 `results_summary.txt` by @LucasDedieu in #371
- fix: allow deep attributes in Standoff and OMOP doc2dict converters by @percevalw in #381
- fix: improve various aspect of stream shuffling by @percevalw in #380
- fix: eds.split now keeps doc and span attributes in the sub-documents by @percevalw in #363
- feat: allow importing optims using qualified names in ScheduledOptimizer by @percevalw in #383
- feat: compute eds.ner_crf loss as mean over words by @percevalw in #384
- Fix issue 372: resulting tuning config file now preserve comments by @LucasDedieu in #373
- Feat: add checkpoint management for tuning by @LucasDedieu in #385
- feat: add ner confidence score by @LucasDedieu in #387
- chore: bump version to 0.16.0 by @LucasDedieu in #393
New Contributors
- @LucasDedieu made their first contribution in #361
Full Changelog: v0.15.0...v0.16.0
v0.15.0
Changelog
Added
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of by row. When this is enabled, workers do not read every fragment while skipping 1 in n rows, but instead read all rows of 1/n of the fragments, which should be faster.
- Accept no validation data in the `edsnlp.train` script.
- Log the training config at the beginning of trainings.
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`).
- Specify whether to log the validation results or not (`logger=False`).
- Added support for the CoNLL format with `edsnlp.data.read_conll` and a specific `eds.conll_dict2doc` converter.
- Added a trainable biaffine dependency parser (`eds.biaffine_dep_parser`) component and metrics.
- New `eds.extractive_qa` component to perform extractive question answering, using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.
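The difference between the two work-splitting strategies can be sketched with plain lists (2 workers, 4 fragments of 3 rows each; the row-based variant and all names here are illustrative, only `work_unit="fragment"` comes from the changelog):

```python
fragments = [[f"f{i}r{j}" for j in range(3)] for i in range(4)]
n_workers = 2

# Row-based splitting: every worker scans all fragments, keeping 1 row in n
flat = [row for frag in fragments for row in frag]
by_row = [
    [row for k, row in enumerate(flat) if k % n_workers == w]
    for w in range(n_workers)
]

# work_unit="fragment": each worker reads all rows of 1/n of the fragments,
# so each fragment is opened by exactly one worker
by_fragment = [
    [row for i, frag in enumerate(fragments) if i % n_workers == w for row in frag]
    for w in range(n_workers)
]
```

With fragment-based splitting, worker 0 reads fragments 0 and 2 in full and never touches fragments 1 and 3, avoiding the skip-reads of the row-based scheme.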
Fixed
- Fix missing `join_thread` attribute in `SimpleQueue` when cleaning up a multiprocessing executor.
- Support Hugging Face transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings).
- Fix a changing-scorers-dict-size issue when evaluating during training.
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers. This is important:
  - for reproducibility
  - in multiprocessing mode, to ensure that the same data is shuffled in the same way in all workers
- Bubble up `BaseComponent` instantiation errors correctly.
- Improved support for multi-GPU gradient accumulation (only sync the gradients at the end of the accumulation), now controlled by the optional `sub_batch_size` argument of `TrainingData`.
- Support edsnlp without pytorch installed again.
- We now test that edsnlp works without pytorch installed.
- Fix unit scales, e.g., 1 L = 1 dm³ and 1 mL = 1 cm³.
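The seeding fix matters because, in multiprocessing mode, each worker must produce the same permutation; a generic sketch (not the reader's actual internals):

```python
import random

def shuffled(items, seed):
    rng = random.Random(seed)  # dedicated, seeded RNG per call
    items = list(items)
    rng.shuffle(items)
    return items

worker_a = shuffled(range(5), seed=42)
worker_b = shuffled(range(5), seed=42)  # same seed → same order in every worker
```

With an unseeded `random.RandomState()` (or `random.Random()`), each worker would produce a different permutation and see a different slice of the data.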
Pull Requests
- fix: check join_thread attribute in queue when cleaning mp exec by @percevalw in #345
- fix: support hf transformers with cls_token_id and sep_token_id set to None by @percevalw in #346
- fix: changing scorers dict size issue when evaluating during training by @percevalw in #347
- Fix streams by @percevalw in #350
- Various trainer fixes by @percevalw in #352
- Trainable biaffine dependency parser by @percevalw in #353
- feat: new eds.extractive_qa component by @percevalw in #351
- Fix training and multiprocessing by @percevalw in #354
- fix: correct conversions for volumes, areas by @etienneguevel in #349
- chore: bump version to 0.15.0 by @percevalw in #355
Full Changelog: v0.14.0...v0.15.0
v0.14.0
Changelog
Added
- Support for setuptools-based projects in the `edsnlp.package` command.
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the `@core = "pipeline"` or `"load"` field in the pipeline section.
- `edsnlp.load` now correctly takes `disable`, `enable` and `exclude` parameters into account.
- `Pipeline` now has a basic repr showing its base language (mostly useful to know its tokenizer) and its pipes.
- New `python -m edsnlp.evaluate` script to evaluate a model on a dataset.
- Sentence detection can now be configured to change the minimum number of newlines that trigger a newline-based sentence split, and to disable capitalization checking.
- New `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training).
- Allow the `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter.
- New revamped and documented `edsnlp.train` script and API.
- Support YAML config files (only CFG/INI files were supported before).
- Most EDS-NLP functions are now clickable in the documentation.
- `ScheduledOptimizer` now accepts schedules directly in place of parameters, and easy parameter selection:

```python
ScheduledOptimizer(
    optim="adamw",
    module=nlp,
    total_steps=2000,
    groups={
        "^transformer": {
            # lr will go from 0 to 5e-5, then back to 0, for params matching "transformer"
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        "": {
            # lr will stay at 3e-4 for 200 steps, then go to 0, for other params
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)
```
Changed
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context.
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU workers, since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters.
- The `batch_size` argument of `Pipeline` is deprecated and no longer used. Use the `batch_size` argument of `stream.map_pipeline` instead.
Fixed
- Sort files before iterating over a standoff or JSON folder, to ensure reproducibility.
- Sentence detection now correctly matches capitalized letters followed by an apostrophe.
- We now ensure that the worker pool is properly closed whatever happens (exception, garbage collection, data ending) in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- Propagate the torch sharing strategy to other workers in the `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. The torch sharing strategy can also be set via the `TORCH_SHARING_STRATEGY` environment variable (the default is `file_descriptor`; consider using `file_system` if you encounter issues).
Data API changes
- `LazyCollection` objects are now called `Stream` objects.
- By default, the `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method.
- 🚀 Parallelized GPU inference throughput improvements!
  - For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order).
  - For multitask pipelines, GPU inference can be up to twice as fast (measured on a two-task BERT+NER+Qualif pipeline on T4 and A100 GPUs).
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of a single batch size for all pipes.
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training).
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it.
- In `multiprocessing` mode, file-based readers now read the data in the workers (this was an option before).
- We now support two new special batch sizes:
  - "fragment", in the case of parquet datasets: the rows of one full parquet file fragment per batch
  - "dataset", which is mostly useful during training, for instance to shuffle the dataset at each epoch
  These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
- 💥 Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users, since most writers (`to_pandas`, `to_polars`, `to_parquet`, ...) still flatten the output.
- 💥 Breaking change: `chunk_size` and `sort_chunks` are now deprecated. To sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`.
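The `flatten()` breaking change in plain Python terms (generic generators standing in for the Stream API):

```python
def my_map(fn, items):
    # a map step whose function returns a list per input item
    return (fn(x) for x in items)

def flatten(stream):
    for x in stream:
        yield from x

# Before: nested results were implicitly flattened.
# Now: flattening must be requested explicitly.
nested = list(my_map(lambda d: [d, d], [1, 2]))         # nested lists
flat = list(flatten(my_map(lambda d: [d, d], [1, 2])))  # one flat sequence
```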
Training API changes
- We now provide a training script, `python -m edsnlp.train --config config.cfg`, that should fit many use cases. Check out the docs!
- In particular, we do not require pytorch's Dataloader for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time a noised preprocessing op is applied to a sample).
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...). These stats are used:
  - for batching (e.g., make batches of no more than "25000 tokens")
  - for logging
  - for computing correct loss means when accumulating gradients over multiple mini-mini-batches
  - for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs
- Support multi-GPU training via huggingface `accelerate` and the EDS-NLP `Stream` API, taking the `WORLD_SIZE` and `LOCAL_RANK` environment variables into account.
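Why per-sample `stats` give correct loss means under gradient accumulation can be seen with a bit of arithmetic (the numbers below are hypothetical):

```python
# two accumulated sub-batches with unequal token counts
sub_batches = [
    {"loss_sum": 120.0, "tokens": 40},
    {"loss_sum": 50.0, "tokens": 10},
]

# averaging the per-sub-batch means over-weights the small sub-batch...
naive_mean = sum(b["loss_sum"] / b["tokens"] for b in sub_batches) / len(sub_batches)

# ...while weighting by the token stats gives the true per-token mean
correct_mean = sum(b["loss_sum"] for b in sub_batches) / sum(b["tokens"] for b in sub_batches)
```

Here the naive average is 4.0 while the stats-weighted mean is 3.4, i.e., the value a single large batch would have produced.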
Pull Requests
- Improve training tutorials by @percevalw in #331
- Various fixes by @percevalw in #332
- Multiprocessing related fixes by @percevalw in #333
- chore: bump version to 0.14.0 by @percevalw in #334
Full Changelog: v0.13.1...v0.14.0
v0.13.1
Changelog
Added
- `eds.tables` accepts a `minimum_table_size` argument (default 2) to reduce pollution.
- `RuleBasedQualifier` now exposes a `process` method that only returns qualified entities and tokens without actually tagging them, deferring this task to the `__call__` method.
- Added new patterns for metastasis detection, developed on CT-scan reports.
- Added citations of articles.
Fixed
- Disorder and behavior pipes don't use a "PRESENT" or "ABSENT" `status` anymore. Instead, `status=None` by default, and `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the tobacco and alcohol pipes now use the `NegationQualifier` internally.
- Numbers are now detected without trying to remove pollution in between digits, i.e., `55 @ 77777` could be detected as a full number before, but not anymore.
- Fix fsspec open file encoding to "utf-8".
Changed
- Renamed `eds.measurements` to `eds.quantities`.
- scikit-learn (used in `eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`.
Pull Requests
- Remove pollution exclusion during numbers matching by @percevalw in #316
- Rename eds.measurements by @svittoz in #313
- Adding minimum_table_size argument to eds.tables by @svittoz in #318
- Fs encoding fix by @Aremaki in #320
- chore(deps): bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in #319
- fix: skip spacy 3.8.0 due to numpy build dep by @percevalw in #321
- Fix behavior, disorder and qualifier pipes by @Thomzoy in #322
- Metastatic status by @aricohen93 in #308
- chore: bump version to 0.13.1 by @percevalw in #327
- Test 3.12 by @percevalw in #328
New Contributors
- @dependabot made their first contribution in #319
Full Changelog: v0.13.0...v0.13.1
v0.13.0
Changelog
Added
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of tensors during processing. Autocasting should result in a slight speedup, but may lead to numerical instability.
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference.
- Added a new NER pipeline for suicide attempt detection.
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`.
- Added table processing in `eds.measurement`.
- Added "all" as a possible input in the `eds.measurement` measurements config.
- Added new units in `eds.measurement`.
Changed
- Default to mixed precision inference
Fixed
- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the python pip executable (especially on Colab) to auto-install the model dependencies.
- We now better handle empty documents in the `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components.
- Support mixed precision in the `eds.text_cnn` and `eds.ner_crf` components.
- Support pre-quantization (<4.30) transformers versions.
- Verify that all batches are non-empty.
- Fix `span_context_getter` for `context_words = 0` and `context_sents > 2`, and support asymmetric contexts.
- Don't split sentences on rare unicode symbols.
- Better detect abbreviations like `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`].
What's Changed
- Various ml fixes by @percevalw in #303
- TS by @aricohen93 in #269
- date cues by @cvinot in #265
- Fix fast inference by @percevalw in #305
- Fix typo in diabetes patterns by @isabelbt in #306
- Fix span context getter by @aricohen93 in #307
- Fix sentences by @percevalw in #310
- chore: bump version to 0.13.0 by @percevalw in #312
Full Changelog: v0.12.3...v0.13.0