Ev_simple_fasta_header #2

anna-parker · 2025-08-12T10:03:57Z

Examples created by @alejandra-gonzalezsanchez

resolves #4847 ### Screenshot Improves #4821, comes after #5398 You can use pathoplexus/dev_example_data#2 for testing. Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. ## Prepro config changes Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. ### PR Checklist - [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping ## Future Work - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://sort-multi-path.loculus.org

resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 ### BREAKING CHANGES When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. ### Testing You can use pathoplexus/dev_example_data#2 for testing. ### Prepro config changes Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. ### PR Checklist - [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping ## Future Work - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <[email protected]>

resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. - [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <[email protected]>

… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382) resolves #4999 #4708, #4734, #5511 partially resolves #5392, #5185 (comment) includes work done in #5398 and #5402 This PR additionally fixes submission, subtype assignment and search for EVs and other multi-path organisms. ### BREAKING CHANGES When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaIds` column with a space -separated list of the `fastaId`s (fasta header IDs) of the respective sequences. If no `fastaIds` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments): ``` segment_classification_method: "minimizer" or "align" minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format `<submissionId>_<segmentName>` (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaId in the processedData, the map is called: `sequenceNameToFastaId`. This allows us to surface the segment assignment on the edit page. ### Nextclade Preprocessing pipeline config changes Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a dictionary where each item includes all information required to run nextclade. I.e. we change from: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` to: ``` nextclade_sequence_and_datasets: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > genes: [RdRp] - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M genes: [GPC] - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S genes: [NP] nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align"> minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort> ``` ### Ingest Pipeline Config changes `minimizer_index` is changed to `minimizer_url` for consistency (can be used in ingest and preprocessing and should both be the same) ### Optional additional Config changes Limit the number of sequences the backend will accept per submission by using - should be added for multi-segmented organisms: ` submissionDataTypes: &defaultSubmissionDataTypes consensusSequences: true maxSequencesPerEntry: 1 ` ### Testing You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing. ### PR Checklist - [x] Update values.schema.json and other READMEs - [x] add fastaId to commonMetadata (ensure it is downloaded in templates): #5561 - [x] Fix how genes are returned (will cause a config update): #5563 - [x] Improve prepro code (less duplication and more tests): #5554 - [x] ingest EVs as single segmented to ensure search works: #5511 - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - ~add integration testing for full EV submission user journey~ -> will be done in a later PR - [x] improve CCHF minimizer (some segments are again not assigned) - [x] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) -> decided against - [x] update PPX docs with new multi-segment submission format -> test PR here: pathoplexus/pathoplexus#759 - [x] update example data for demo 🚀 Preview: https://edit-page-anya.loculus.org --------- Co-authored-by: Cornelius Roemer <[email protected]> Co-authored-by: Fabian Engelniederhammer <[email protected]> Co-authored-by: Theo Sanderson <[email protected]>

anna-parker added 3 commits May 9, 2025 11:49

feat: add example data for hmpv, rsv-a and rsv-b

bfa260e

fix

f835d95

feat: add enteroviruses with no assignment in fasta header

e6da1ff

anna-parker mentioned this pull request Aug 12, 2025

feat!(prepro, config): assign segment with nextclade sort loculus-project/loculus#4783

Closed

3 tasks

anna-parker mentioned this pull request Nov 10, 2025

feat(prepro): assign segment/subtype using nextclade sort (3/n) loculus-project/loculus#5402

Merged

9 tasks

also add CCHF with new submission format

2139441

anna-parker mentioned this pull request Nov 20, 2025

feat!(backend): refactor multi-segment submission (2/n) loculus-project/loculus#5398

Merged

9 tasks

anna-parker mentioned this pull request Nov 20, 2025

feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments loculus-project/loculus#5382

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Ev_simple_fasta_header #2

Ev_simple_fasta_header #2

Uh oh!

anna-parker commented Aug 12, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Ev_simple_fasta_header #2

Are you sure you want to change the base?

Ev_simple_fasta_header #2

Uh oh!

Conversation

anna-parker commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

anna-parker commented Aug 12, 2025 •

edited

Loading