Releases: aphp/edsnlp
v0.19.0
Changelog
Added
- New `DocToMarkupConverter` to convert documents to markdown, and improved `MarkupToDocConverter` to allow overlapping markup annotations (e.g., `This is a <a>text <b>with</a> overlapping</b> tags`).
- New helper `edsnlp.utils.fuzzy_alignment.align` to map the entities of an annotated document onto another document with similar but not identical text (e.g., after some text normalization or minor edits).
- We now support `span_getter="sents"` to apply various pipes to sentences instead of entities or spans.
- New generic LLM extractor pipe, `eds.llm_markup_extractor`, which can be used to extract entities with a large language model served through an OpenAI-style API.
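The fuzzy-alignment idea behind the new helper can be sketched in plain Python with `difflib`; `align_char_span` below is a hypothetical stand-in for illustration, not the actual `edsnlp.utils.fuzzy_alignment.align` implementation:

```python
import difflib

def align_char_span(old_text, new_text, start, end):
    """Project a character span from old_text onto a slightly edited
    new_text by walking difflib's matching blocks (toy illustration)."""
    matcher = difflib.SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    new_start = new_end = None
    for a0, b0, size in matcher.get_matching_blocks():
        # if the span start/end falls inside a matching block, project it
        if new_start is None and a0 <= start < a0 + size:
            new_start = b0 + (start - a0)
        if a0 < end <= a0 + size:
            new_end = b0 + (end - a0)
    if new_start is None or new_end is None:
        return None  # the span falls in an edited region and cannot be mapped
    return new_start, new_end

old = "Le  patient présente une  toux."   # raw text with double spaces
new = "Le patient présente une toux."     # normalized text
span = align_char_span(old, new, 26, 30)  # "toux" in the raw text
```

Under these assumptions, `span` recovers the position of "toux" in the normalized text even though every character offset after the first double space has shifted.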
Fixed
- Since `fork` hangs when HDFS has been used in the main process, we now auto-detect whether the currently running program has interacted with HDFS before picking a process start method.
- We now account for pipe selection (i.e., `enable`, `disable` and `exclude`) when loading a model from the Hugging Face hub.
- We no longer instantiate pipes listed in `exclude` when loading a model (previously they were instantiated but not added to the pipeline).
Pull Requests
- correcting typo by @ohassanaly in #447
- Slurm integration by @percevalw in #449
- Llm extraction by @percevalw in #450
- chore: bump version to 0.19.0 by @percevalw in #453
New Contributors
- @ohassanaly made their first contribution in #447
Full Changelog: v0.18.0...v0.19.0
v0.18.0
Changelog
📢 EDS-NLP will drop support for Python 3.7, 3.8 and 3.9 in the next major release (v0.19.0), in October 2025. Please upgrade to Python 3.10 or later.
Added
- Added support for multiple loggers (`tensorboard`, `wandb`, `comet_ml`, `aim`, `mlflow`, `clearml`, `dvclive`, `csv`, `json`, `rich`) in `edsnlp.train` via the `logger` parameter. The default is [`json`, `rich`] for backward compatibility.
- Sub-batch sizes for gradient accumulation can now be defined as simple "splits" of the original batch, e.g., `batch_size = 10000 tokens` and `sub_batch_size = 5 splits` to accumulate sub-batches of 2000 tokens.
- The Parquet writer now has a `pyarrow_write_kwargs` parameter, passed to `pyarrow.dataset.write_dataset`.
- `LinearSchedule` (mostly used for LR scheduling) now allows an `end_value` parameter to configure whether the learning rate should decay to zero or to another value.
- New `eds.explode` pipe that splits one document into multiple documents, one per span yielded by its `span_getter` parameter, each new document containing exactly that single span.
- New "Training a span classifier" tutorial, and reorganized deep-learning docs.
- `ScheduledOptimizer` now warns when a parameter selector does not match any parameter.
Fixed
- `use_sections` in `eds.history` now correctly handles cases where other sections follow history sections.
- Added clickable snippets in the documentation for more registered functions.
- Pyarrow dataset writing with multiprocessing should be faster, as we removed a useless data transfer.
- We now correctly support loading transformers in offline mode if they are already in Hugging Face's cache.
- We now support the `words[-10:10]` syntax in the trainable span classifier's `context_getter` parameter.
- 🚑 Until now, `post_init` was applied after the instantiation of the optimizer: if the model discovered new labels, and therefore changed its parameter tensors to reflect that, these new tensors were not taken into account by the optimizer, which could lead to subpar performance. Now, `post_init` is applied before the optimizer is instantiated, so that the optimizer correctly handles the new tensors.
- Added missing entry points for readers and writers in the registry, including `write_parquet`, and support for `polars` in `pyproject.toml`. All implemented readers and writers are now correctly registered as entry points.
- Parameters are now updated in place when `post_init` is run in `eds.ner_crf` and `eds.span_classifier`, and are therefore correctly taken into account by the optimizer.
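The optimizer issue fixed above can be illustrated with a toy model and optimizer (plain Python, not edsnlp's actual classes): an optimizer captures references to parameter tensors when it is built, so building it before `post_init` replaces those tensors leaves it updating stale objects.

```python
class TinyModel:
    def __init__(self):
        self.params = {"w": [0.0] * 4}

    def post_init(self, n_new_labels):
        # discovering new labels replaces the parameter tensor
        self.params["w"] = [0.0] * (4 + n_new_labels)

class TinyOptimizer:
    def __init__(self, params):
        # an optimizer holds references to the tensors at build time
        self.tracked = list(params.values())

# Old (buggy) order: optimizer built first, then post_init
model = TinyModel()
optim = TinyOptimizer(model.params)
model.post_init(2)
stale = optim.tracked[0] is not model.params["w"]  # optimizer holds the old tensor

# New (fixed) order: post_init first, then the optimizer
model2 = TinyModel()
model2.post_init(2)
optim2 = TinyOptimizer(model2.params)
fresh = optim2.tracked[0] is model2.params["w"]    # optimizer tracks the new tensor
```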
Changed
- Section cues in `eds.history` are now section titles, not the full section.
- 💥 Validation metrics are now found under the root field `validation` in the training logs (e.g., `metrics['validation']['ner']['micro']['f']`).
- It is now recommended to define optimizer groups of `ScheduledOptimizer` as a list of dicts of optim hyper-parameters, each containing a `selector` regex key, rather than as a single dict mapping `selector` regexes to dicts of optim hyper-parameters. This allows more flexibility in defining the optimizer groups and is more consistent with the rest of the EDS-NLP API. It also makes it easier to reference group values from other places in config files, since their path no longer contains a complex regex string. See the updated training tutorials for more details.
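A minimal sketch of the two styles, assuming keys as described above (a `selector` regex plus optim hyper-parameters; the exact accepted keys beyond `selector` are illustrative here):

```python
# New recommended style: a list of dicts, each with a "selector" regex key
groups = [
    {"selector": "^transformer", "lr": 5e-5},
    {"selector": "", "lr": 3e-4},
]

# Old style: a single dict mapping selector regexes to hyper-parameter dicts
old_groups = {
    "^transformer": {"lr": 5e-5},
    "": {"lr": 3e-4},
}

# Referencing a group's value no longer requires the regex in its path:
first_group_lr = groups[0]["lr"]
```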
Pull Requests
- fix: use_sections in eds.history should now work by @percevalw in #430
- docs: fix read parquet parameters docs by @percevalw in #425
- Explode pipe + span classifier training tutorial by @percevalw in #432
- Update, fix and refactor doc dependencies by @percevalw in #438
- fix: entrypoints by @aricohen93 in #420
- fix: take filter_expr into account in dependency parsing evaluation by @percevalw in #382
- Update ner_crf & span_classifier params in place in post_init to avoid optimizer issues by @percevalw in #443
- chore: bump version to 0.18.0 by @percevalw in #439
Full Changelog: v0.17.2...v0.18.0
v0.17.2
Changelog
Added
- Handle intra-word linebreaks as pollution: adds a pollution pattern that detects intra-word linebreaks, which can then be removed by the `get_text` method.
- Qualifiers can process `Span` or `Doc` objects: this especially makes it easier to nest qualifier components in other components.
- New `label_weights` parameter in `eds.span_classifier`, which allows the user to set per label-value loss weights during training.
- New `edsnlp.data.converters.MarkupToDocConverter` to convert Markdown- or XML-like markup to documents, which is particularly useful to create annotated documents from scratch (e.g., for testing purposes).
- New Metrics documentation page to document the available metrics and how to use them.
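To make the XML-like markup idea concrete, here is a toy parser for non-overlapping tags (illustration only; the actual `MarkupToDocConverter` produces spaCy `Doc` objects and, as of v0.19.0, also handles overlapping tags):

```python
import re

def parse_markup(markup):
    """Toy parser for non-overlapping XML-like markup, e.g. 'a <b>c</b>':
    returns the plain text and (start, end, label) character spans."""
    text, spans, stack = "", [], []
    pos = 0
    for m in re.finditer(r"<(/?)(\w+)>", markup):
        text += markup[pos:m.start()]
        pos = m.end()
        if m.group(1):  # closing tag: pop the matching opening position
            label, start = stack.pop()
            spans.append((start, len(text), label))
        else:           # opening tag: remember the label and current offset
            stack.append((m.group(2), len(text)))
    text += markup[pos:]
    return text, spans

text, spans = parse_markup("Le patient a une <disease>pneumonie</disease>.")
```

Here `text` is the tag-free sentence and `spans` holds one `(17, 26, "disease")` annotation covering "pneumonie".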
Fixed
- Various disorders/behaviors patches
Changed
- Deduplicate spans between `doc.ents` and `doc.spans` during training: previously, a `span_getter` requesting entities from both `ents` and `spans` could yield duplicates.
Pull Requests
- feat: Various patches by @Thomzoy in #391
- Metrics doc by @percevalw in #417
- chore: bump version to 0.17.2 by @percevalw in #424
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Changelog
Added
- Added gradient spike detection to the `edsnlp.train` script, and per-weight-layer gradient logging.
Fixed
- Fixed mini-batch accumulation for multi-task training
- Fixed a pickling error when applying a pipeline in multiprocessing mode. This occurred in some cases when one of the pipes was declared in a "difficultly importable" module (e.g., causing a "PicklingWarning: Cannot locate reference to <class...").
- Fixed a typo in `eds.consultation_dates` towns: `berck.sur.mer`.
- Fixed a bug where relative date expressions with bounds (e.g., "depuis hier") raised an error when converted to durations.
- Fixed the ADICAP pipe to handle cases where no code is found after "codification"/"adicap".
- Support "00"-like hours and minutes in the `eds.dates` component.
- Fixed arc minute, arc second and degree unit scales in `eds.quantities`, used when converting between different time (or angle) units.
Pull Requests
- fix: add grad spike detection by @percevalw in #375
- fix: avoid pickling error in multiprocessing mode by @percevalw in #408
- fix: correct town name typo (berck.sur.mer) by @percevalw in #409
- fix: error when converting relative date expressions with bounds to durations by @percevalw in #411
- Fix adicap by @aricohen93 in #410
- Fix time matching by @LoickChardon in #413
- chore: bump version to 0.17.1 by @percevalw in #416
New Contributors
- @LoickChardon made their first contribution in #413
Full Changelog: v0.17.0...v0.17.1
v0.17.0
Changelog
Added
- Support for numpy>2.0, and formal support for Python 3.11 and Python 3.12
- Expose the default patterns of `eds.negation`, `eds.hypothesis`, `eds.family`, `eds.history` and `eds.reported_speech` under an `eds.negation.default_patterns` attribute.
- Added a `context_getter` SpanGetter argument to the `eds.matcher` class to only retrieve entities inside the spans returned by the getter.
- Added a `filter_expr` parameter to scorers to filter the documents to score.
- Added a new `required` field to `eds.contextual_matcher` assign patterns so they only match if the required field has been found, and an `include` parameter (similar to `exclude`) to search for required patterns without assigning them to the entity.
- Added context strings (e.g., `"words[0:5] | sent[0:1]"`) to the `eds.contextual_matcher` component to allow for more complex patterns in the selection of the window around the trigger spans.
- Include and exclude patterns in the contextual matcher now dismiss matches that occur inside the anchor pattern (e.g., an "anti" exclude pattern for the anchor pattern "antibiotics" will not match the "anti" part of "antibiotics").
- Pull Requests will now build a public accessible preview of the docs
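The new exclude-inside-anchor behavior can be mimicked with plain regexes (a rough sketch; the real component works on spaCy spans): an exclude match is dismissed when it lies entirely inside the anchor match.

```python
import re

text = "patient on antibiotics since May"
anchor = re.search(r"antibiotics", text)

# keep only exclude matches that are NOT inside the anchor span
excludes = [
    m for m in re.finditer(r"anti", text)
    if not (anchor.start() <= m.start() and m.end() <= anchor.end())
]
```

Here the lone `anti` match sits inside `antibiotics`, so no exclude survives and the anchor match is kept.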
Changed
- Improve the contextual matcher documentation.
Fixed
- `edsnlp.package` now correctly detects whether a project uses an old-style poetry pyproject or a PEP 621 pyproject.toml.
- PEP 621 projects containing nested directories (e.g., "my_project/pipes/foo.py") are now supported.
- Try several paths to find the current pip executable.
- The `value_extract` parameter of `eds.score` now correctly handles lists of patterns.
- "Zero variance" errors when computing parameter tuning importance are now caught and converted to warnings.
Pull Requests
- Fix packaging by @percevalw in #395
- fix: avoid non-standard (pytoml) syntax in pyproject.toml by @percevalw in #399
- fix: try several paths to find current pip executable by @percevalw in #401
- Fix optuna issue by @LucasDedieu in #398
- Improve contextual matcher by @percevalw in #289
Full Changelog: v0.16.0...v0.17.0
v0.16.0
Changelog
Added
- Hyperparameter tuning for EDS-NLP: introduced a new script, `edsnlp.tune`, for hyperparameter tuning using Optuna. This feature allows users to efficiently optimize model parameters with options for single-phase or two-phase tuning strategies. Includes support for parameter importance analysis, visualization, pruning, and automatic handling of GPU time budgets.
- Provided a detailed tutorial on hyperparameter tuning, covering usage scenarios and configuration options.
- `ScheduledOptimizer` (e.g., `@core: "optimizer"`) now supports importing optimizers using their qualified name (e.g., `optim: "torch.optim.Adam"`).
- `eds.ner_crf` now computes confidence scores on spans.
Changed
- The loss of `eds.ner_crf` is now computed as the mean over the words instead of the sum. This change is compatible with multi-GPU training.
- Having multiple stats keys matching a batching pattern now warns instead of raising an error.
Fixed
- Support packaging with poetry 2.0
- Solve pickling issues with multiprocessing when pytorch is installed
- Allow deep attributes like `a.b.c` for `span_attributes` in the Standoff and OMOP doc2dict converters.
- Fixed various aspects of stream shuffling:
  - Ensure the Parquet reader shuffles the data when `shuffle=True`
  - Ensure we don't overwrite the RNG of the data reader when calling `stream.shuffle()` with no seed
  - Raise an error if the batch size in `stream.shuffle(batch_size=...)` is not compatible with the stream
- `eds.split` now keeps doc and span attributes in the sub-documents.
Pull Requests
- fix: support packaging with poetry 2.0 by @percevalw in #362
- Solve pickling issues with multiprocessing when pytorch is installed by @percevalw in #367
- Feat: add hyperparameters tuning by @LucasDedieu in #361
- Fix issue 368: Add `metric` parameter and write optimal `config.yml` at the end of tuning by @LucasDedieu in #369
- Fix issue 370: two-phase tuning now writes phase 1 frozen best values into the phase 2 `results_summary.txt` by @LucasDedieu in #371
- fix: allow deep attributes in Standoff and OMOP doc2dict converters by @percevalw in #381
- fix: improve various aspect of stream shuffling by @percevalw in #380
- fix: eds.split now keeps doc and span attributes in the sub-documents by @percevalw in #363
- feat: allow importing optims using qualified names in ScheduledOptimizer by @percevalw in #383
- feat: compute eds.ner_crf loss as mean over words by @percevalw in #384
- Fix issue 372: resulting tuning config file now preserve comments by @LucasDedieu in #373
- Feat: add checkpoint management for tuning by @LucasDedieu in #385
- feat: add ner confidence score by @LucasDedieu in #387
- chore: bump version to 0.16.0 by @LucasDedieu in #393
New Contributors
- @LucasDedieu made their first contribution in #361
Full Changelog: v0.15.0...v0.16.0
v0.15.0
Changelog
Added
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of by row. When this is enabled, workers do not read every fragment while skipping 1 in n rows, but instead read all rows of 1/n of the fragments, which should be faster.
- Accept no validation data in the `edsnlp.train` script.
- Log the training config at the beginning of trainings.
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`).
- Specify whether to log the validation results or not (`logger=False`).
- Added support for the CoNLL format with `edsnlp.data.read_conll` and a specific `eds.conll_dict2doc` converter.
- Added a trainable biaffine dependency parser (`eds.biaffine_dep_parser`) component and metrics.
- New `eds.extractive_qa` component to perform extractive question answering, using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.
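The difference between the two work-splitting strategies can be sketched with plain lists (2 workers, 4 fragments of 3 rows each; the row-based variant and all names here are illustrative, only `work_unit="fragment"` comes from the changelog):

```python
fragments = [[f"f{i}r{j}" for j in range(3)] for i in range(4)]
n_workers = 2

# Row-based splitting: every worker scans all fragments, keeping 1 row in n
flat = [row for frag in fragments for row in frag]
by_row = [
    [row for k, row in enumerate(flat) if k % n_workers == w]
    for w in range(n_workers)
]

# work_unit="fragment": each worker reads all rows of 1/n of the fragments,
# so each fragment is opened by exactly one worker
by_fragment = [
    [row for i, frag in enumerate(fragments) if i % n_workers == w for row in frag]
    for w in range(n_workers)
]
```

With fragment-based splitting, worker 0 reads fragments 0 and 2 in full and never touches fragments 1 and 3, avoiding the skip-reads of the row-based scheme.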
Fixed
- Fix missing `join_thread` attribute in `SimpleQueue` when cleaning up a multiprocessing executor.
- Support Hugging Face transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings).
- Fix a changing-scorers-dict-size issue when evaluating during training.
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers. This is important:
  - for reproducibility
  - in multiprocessing mode, to ensure that the same data is shuffled in the same way in all workers
- Bubble up `BaseComponent` instantiation errors correctly.
- Improved support for multi-GPU gradient accumulation (only sync the gradients at the end of the accumulation), now controlled by the optional `sub_batch_size` argument of `TrainingData`.
- Support edsnlp without pytorch installed again.
- We now test that edsnlp works without pytorch installed.
- Fix unit scales, e.g., 1 L = 1 dm³ and 1 mL = 1 cm³.
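The seeding fix matters because, in multiprocessing mode, each worker must produce the same permutation; a generic sketch (not the reader's actual internals):

```python
import random

def shuffled(items, seed):
    rng = random.Random(seed)  # dedicated, seeded RNG per call
    items = list(items)
    rng.shuffle(items)
    return items

worker_a = shuffled(range(5), seed=42)
worker_b = shuffled(range(5), seed=42)  # same seed → same order in every worker
```

With an unseeded `random.RandomState()` (or `random.Random()`), each worker would produce a different permutation and see a different slice of the data.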
Pull Requests
- fix: check join_thread attribute in queue when cleaning mp exec by @percevalw in #345
- fix: support hf transformers with cls_token_id and sep_token_id set to None by @percevalw in #346
- fix: changing scorers dict size issue when evaluating during training by @percevalw in #347
- Fix streams by @percevalw in #350
- Various trainer fixes by @percevalw in #352
- Trainable biaffine dependency parser by @percevalw in #353
- feat: new eds.extractive_qa component by @percevalw in #351
- Fix training and multiprocessing by @percevalw in #354
- fix: correct conversions for volumes, areas by @etienneguevel in #349
- chore: bump version to 0.15.0 by @percevalw in #355
Full Changelog: v0.14.0...v0.15.0
v0.14.0
Changelog
Added
- Support for setuptools-based projects in the `edsnlp.package` command.
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the `@core = "pipeline"` or `"load"` field in the pipeline section.
- `edsnlp.load` now correctly takes `disable`, `enable` and `exclude` parameters into account.
- `Pipeline` now has a basic repr showing its base language (mostly useful to know its tokenizer) and its pipes.
- New `python -m edsnlp.evaluate` script to evaluate a model on a dataset.
- Sentence detection can now be configured to change the minimum number of newlines that trigger a newline-based sentence split, and to disable capitalization checking.
- New `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training).
- Allow the `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter.
- New revamped and documented `edsnlp.train` script and API.
- Support YAML config files (only CFG/INI files were supported before).
- Most EDS-NLP functions are now clickable in the documentation.
- `ScheduledOptimizer` now accepts schedules directly in place of parameters, and easy parameter selection:

```python
ScheduledOptimizer(
    optim="adamw",
    module=nlp,
    total_steps=2000,
    groups={
        "^transformer": {
            # lr will go from 0 to 5e-5, then back to 0, for params matching "transformer"
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        "": {
            # lr will stay at 3e-4 for 200 steps, then go to 0, for other params
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)
```
Changed
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context.
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU workers, since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters.
- The `batch_size` argument of `Pipeline` is deprecated and no longer used. Use the `batch_size` argument of `stream.map_pipeline` instead.
Fixed
- Sort files before iterating over a standoff or JSON folder, to ensure reproducibility.
- Sentence detection now correctly matches capitalized letters followed by an apostrophe.
- We now ensure that the worker pool is properly closed whatever happens (exception, garbage collection, data ending) in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- Propagate the torch sharing strategy to other workers in the `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. The torch sharing strategy can also be set via the `TORCH_SHARING_STRATEGY` environment variable (the default is `file_descriptor`; consider using `file_system` if you encounter issues).
Data API changes
- `LazyCollection` objects are now called `Stream` objects.
- By default, the `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method.
- 🚀 Parallelized GPU inference throughput improvements!
  - For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order).
  - For multitask pipelines, GPU inference can be up to twice as fast (measured on a two-task BERT+NER+Qualif pipeline on T4 and A100 GPUs).
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of a single batch size for all pipes.
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training).
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it.
- In `multiprocessing` mode, file-based readers now read the data in the workers (this was an option before).
- We now support two new special batch sizes:
  - "fragment", in the case of parquet datasets: the rows of one full parquet file fragment per batch
  - "dataset", which is mostly useful during training, for instance to shuffle the dataset at each epoch
  These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
- 💥 Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users, since most writers (`to_pandas`, `to_polars`, `to_parquet`, ...) still flatten the output.
- 💥 Breaking change: `chunk_size` and `sort_chunks` are now deprecated. To sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`.
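The `flatten()` breaking change in plain Python terms (generic generators standing in for the Stream API):

```python
def my_map(fn, items):
    # a map step whose function returns a list per input item
    return (fn(x) for x in items)

def flatten(stream):
    for x in stream:
        yield from x

# Before: nested results were implicitly flattened.
# Now: flattening must be requested explicitly.
nested = list(my_map(lambda d: [d, d], [1, 2]))         # nested lists
flat = list(flatten(my_map(lambda d: [d, d], [1, 2])))  # one flat sequence
```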
Training API changes
- We now provide a training script, `python -m edsnlp.train --config config.cfg`, that should fit many use cases. Check out the docs!
- In particular, we do not require pytorch's Dataloader for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time a noised preprocessing op is applied to a sample).
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...). These stats are used:
  - for batching (e.g., make batches of no more than "25000 tokens")
  - for logging
  - for computing correct loss means when accumulating gradients over multiple mini-mini-batches
  - for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs
- Support multi-GPU training via huggingface `accelerate` and the EDS-NLP `Stream` API, taking the `WORLD_SIZE` and `LOCAL_RANK` environment variables into account.
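Why per-sample `stats` give correct loss means under gradient accumulation can be seen with a bit of arithmetic (the numbers below are hypothetical):

```python
# two accumulated sub-batches with unequal token counts
sub_batches = [
    {"loss_sum": 120.0, "tokens": 40},
    {"loss_sum": 50.0, "tokens": 10},
]

# averaging the per-sub-batch means over-weights the small sub-batch...
naive_mean = sum(b["loss_sum"] / b["tokens"] for b in sub_batches) / len(sub_batches)

# ...while weighting by the token stats gives the true per-token mean
correct_mean = sum(b["loss_sum"] for b in sub_batches) / sum(b["tokens"] for b in sub_batches)
```

Here the naive average is 4.0 while the stats-weighted mean is 3.4, i.e., the value a single large batch would have produced.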
Pull Requests
- Improve training tutorials by @percevalw in #331
- Various fixes by @percevalw in #332
- Multiprocessing related fixes by @percevalw in #333
- chore: bump version to 0.14.0 by @percevalw in #334
Full Changelog: v0.13.1...v0.14.0
v0.13.1
Changelog
Added
- `eds.tables` accepts a `minimum_table_size` argument (default 2) to reduce pollution.
- `RuleBasedQualifier` now exposes a `process` method that only returns qualified entities and tokens without actually tagging them, deferring this task to the `__call__` method.
- Added new patterns for metastasis detection, developed on CT-scan reports.
- Added citations of articles.
Fixed
- Disorder and behavior pipes don't use a "PRESENT" or "ABSENT" `status` anymore. Instead, `status=None` by default, and `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the tobacco and alcohol pipes now use the `NegationQualifier` internally.
- Numbers are now detected without trying to remove pollution in between digits, i.e., `55 @ 77777` could be detected as a full number before, but not anymore.
- Fix fsspec open file encoding to "utf-8".
Changed
- Renamed `eds.measurements` to `eds.quantities`.
- scikit-learn (used in `eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`.
Pull Requests
- Remove pollution exclusion during numbers matching by @percevalw in #316
- Rename eds.measurements by @svittoz in #313
- Adding minimum_table_size argument to eds.tables by @svittoz in #318
- Fs encoding fix by @Aremaki in #320
- chore(deps): bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in #319
- fix: skip spacy 3.8.0 due to numpy build dep by @percevalw in #321
- Fix behavior, disorder and qualifier pipes by @Thomzoy in #322
- Metastatic status by @aricohen93 in #308
- chore: bump version to 0.13.1 by @percevalw in #327
- Test 3.12 by @percevalw in #328
New Contributors
- @dependabot made their first contribution in #319
Full Changelog: v0.13.0...v0.13.1
v0.13.0
Changelog
Added
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of tensors during processing. Autocasting should result in a slight speedup, but may lead to numerical instability.
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference.
- Added a new NER pipeline for suicide attempt detection.
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`.
- Added table processing in `eds.measurement`.
- Added "all" as a possible input in the `eds.measurement` measurements config.
- Added new units in `eds.measurement`.
Changed
- Default to mixed precision inference
Fixed
- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the python pip executable (especially on Colab) to auto-install the model dependencies.
- We now better handle empty documents in the `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components.
- Support mixed precision in the `eds.text_cnn` and `eds.ner_crf` components.
- Support pre-quantization (<4.30) transformers versions.
- Verify that all batches are non-empty.
- Fix `span_context_getter` for `context_words = 0` and `context_sents > 2`, and support asymmetric contexts.
- Don't split sentences on rare unicode symbols.
- Better detect abbreviations like `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`].
What's Changed
- Various ml fixes by @percevalw in #303
- TS by @aricohen93 in #269
- date cues by @cvinot in #265
- Fix fast inference by @percevalw in #305
- Fix typo in diabetes patterns by @isabelbt in #306
- Fix span context getter by @aricohen93 in #307
- Fix sentences by @percevalw in #310
- chore: bump version to 0.13.0 by @percevalw in #312
Full Changelog: v0.12.3...v0.13.0