Commit a1c29d5

feat!(backend): refactor multi-segment submission (2/n) (#5398)
Resolves #4708, #4734; partially resolves #5392, #5185 (comment). Builds on #5382.

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaId` column containing a space- (or comma-) separated list of the `fastaId`s (FASTA header IDs) of the respective sequences; see the example metadata and FASTA snippet below. If no `fastaId` column is supplied, the `submissionId` is used instead and the backend assumes (as in the single-segmented case) a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo (https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399) and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the FASTA headers to be used to determine the segment/subtype; entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it returns a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

You can use pathoplexus/dev_example_data#2 for testing.

Instead of having a dictionary for the nextclade datasets and servers, we make `nucleotideSequences` a list of sequences.

Before:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

After:

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional, overwrites nextclade_dataset_server for this sequence>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match; if not given, nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by the nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note that the templates now also generate the `genes` list from the merged config.
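As a concrete illustration of the `fastaId` column described above (a hedged sketch: the sample IDs, the `geoLocCountry` column, and the sequence values are invented; only `id`/`submissionId` and `fastaId` are the columns this change actually requires), a single metadata entry grouping three segments could look like this (metadata columns tab-separated):

```
# metadata.tsv
id       fastaId                          geoLocCountry
sample1  cchf_L_01 cchf_M_01 cchf_S_01    Nigeria

# sequences.fasta
>cchf_L_01
ACTG...
>cchf_M_01
GTCA...
>cchf_S_01
TTAG...
```

Each FASTA header appears in the `fastaId` list of exactly one metadata row; preprocessing (via Nextclade sort, or via `<submissionId>_<segmentName>` headers where no Nextclade dataset exists) then decides which header corresponds to which segment.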
- [ ] Update values.schema.json
- [x] Keep tests for the alignment NONE case
- [x] Create a minimizer for tests using https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Document the manual testing that has been done: EVs from the test folder were submitted with the same fastaHeader as the submissionId, which succeeded; submission of CCHF with a `fastaId` column in the metadata was tested (also in the folder above); revision of a segment was tested
- [x] Have preprocessing send back a segment-to-fastaHeader mapping
- [ ] Add integration testing for the full EV submission user journey
- [ ] Improve the CCHF minimizer (some segments are again not assigned)
- [ ] Discuss whether the originalData dictionary should be migrated (the persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] Update the PPX docs with the new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <[email protected]>
1 parent 123b94f commit a1c29d5

67 files changed (+1361, −649 lines)


backend/docs/db/schema.sql (12 additions, 5 deletions)

```diff
@@ -378,7 +378,8 @@ CREATE TABLE public.metadata_upload_aux_table (
     group_id integer,
     uploaded_at timestamp without time zone NOT NULL,
     metadata jsonb NOT NULL,
-    files jsonb
+    files jsonb,
+    fasta_ids jsonb DEFAULT '[]'::jsonb
 );


@@ -538,9 +539,8 @@ ALTER VIEW public.sequence_entries_view OWNER TO postgres;

 CREATE TABLE public.sequence_upload_aux_table (
     upload_id text NOT NULL,
-    submission_id text NOT NULL,
-    segment_name text NOT NULL,
-    compressed_sequence_data text NOT NULL
+    compressed_sequence_data text NOT NULL,
+    fasta_id text NOT NULL
 );


@@ -753,7 +753,7 @@ ALTER TABLE ONLY public.sequence_entries_preprocessed_data
 --

 ALTER TABLE ONLY public.sequence_upload_aux_table
-    ADD CONSTRAINT sequence_upload_aux_table_pkey PRIMARY KEY (upload_id, submission_id, segment_name);
+    ADD CONSTRAINT sequence_upload_aux_table_pkey PRIMARY KEY (upload_id, fasta_id);


 --
@@ -794,6 +794,13 @@ CREATE INDEX data_use_terms_table_accession_idx ON public.data_use_terms_table U
 CREATE INDEX flyway_schema_history_s_idx ON public.flyway_schema_history USING btree (success);


+--
+-- Name: metadata_upload_aux_table_fasta_ids_idx; Type: INDEX; Schema: public; Owner: postgres
+--
+
+CREATE INDEX metadata_upload_aux_table_fasta_ids_idx ON public.metadata_upload_aux_table USING gin (fasta_ids jsonb_path_ops);
+
+
 --
 -- Name: sequence_entries_organism_idx; Type: INDEX; Schema: public; Owner: postgres
 --
```

backend/src/main/kotlin/org/loculus/backend/api/SubmissionTypes.kt (8 additions, 2 deletions)

```diff
@@ -8,6 +8,7 @@ import com.fasterxml.jackson.databind.JsonDeserializer
 import com.fasterxml.jackson.databind.JsonNode
 import com.fasterxml.jackson.databind.annotation.JsonDeserialize
 import io.swagger.v3.oas.annotations.media.Schema
+import org.loculus.backend.model.FastaId
 import org.loculus.backend.model.SubmissionId
 import org.loculus.backend.service.files.FileId
 import org.loculus.backend.utils.Accession
@@ -166,6 +167,11 @@ data class ProcessedData<SequenceType>(
         description = "The key is the gene name, the value is a list of amino acid insertions",
     )
     val aminoAcidInsertions: Map<GeneName, List<Insertion>>,
+    @Schema(
+        example = """{"segment1": "fastaHeader1", "segment2": "fastaHeader2"}""",
+        description = "The key is the segment name, the value is the fastaHeader of the original Data",
+    )
+    val sequenceNameToFastaHeaderMap: Map<SegmentName, String> = emptyMap(),
     @Schema(
         example = """{"raw_reads": [{"fileId": "s0m3-uUiDd", "name": "data.fastaq"}], "sequencing_logs": []}""",
         description = "The key is the file category name, the value is a list of files, with ID and name.",
@@ -300,9 +306,9 @@ data class OriginalDataInternal<SequenceType, FilesType>(
     val metadata: Map<String, String>,
     @Schema(
         example = "{\"segment1\": \"ACTG\", \"segment2\": \"GTCA\"}",
-        description = "The key is the segment name, the value is the nucleotide sequence",
+        description = "The key is the fastaID, the value is the nucleotide sequence",
     )
-    val unalignedNucleotideSequences: Map<SegmentName, SequenceType?>,
+    val unalignedNucleotideSequences: Map<FastaId, SequenceType?>,
     @Schema(
         example = """{"raw_reads": [{"fileId": "f1le-uuId-asdf", "name": "myfile.fastaq"]}""",
         description = "A map from file categories, to lists of files. The files can also have URLs.",
```

backend/src/main/kotlin/org/loculus/backend/controller/SubmissionControllerDescriptions.kt (9 additions, 7 deletions)

```diff
@@ -1,13 +1,13 @@
 package org.loculus.backend.controller

-import org.loculus.backend.model.HEADER_TO_CONNECT_METADATA_AND_SEQUENCES
+import org.loculus.backend.model.METADATA_ID_HEADER

 const val SUBMIT_RESPONSE_DESCRIPTION = """
 Returns a list of accession, version and submissionId of the submitted sequence entries.
-The submissionId is the (locally unique) '$HEADER_TO_CONNECT_METADATA_AND_SEQUENCES' provided by the submitter in the metadata file.
+The submissionId is the (locally unique) '$METADATA_ID_HEADER' provided by the submitter in the metadata file.
 The version will be 1 for every sequence.
 The accession is the (globally unique) id that the system assigned to the sequence entry.
-You can use this response to associate the user provided $HEADER_TO_CONNECT_METADATA_AND_SEQUENCES with the system assigned accession.
+You can use this response to associate the user provided $METADATA_ID_HEADER with the system assigned accession.
 """

 const val SUBMIT_ERROR_RESPONSE = """
@@ -18,16 +18,18 @@ const val METADATA_FILE_DESCRIPTION = """
 A TSV (tab separated values) file containing the metadata of the submitted sequence entries.
 The file may be compressed with zstd, xz, zip, gzip, lzma, bzip2 (with common extensions).
 It must contain the column names.
-The field '$HEADER_TO_CONNECT_METADATA_AND_SEQUENCES' is required and must be unique within the provided dataset.
+The field '$METADATA_ID_HEADER' is required and must be unique within the provided dataset.
 It is used to associate metadata to the sequences in the sequences fasta file.
 """
+
+// TODO: update description
 const val SEQUENCE_FILE_DESCRIPTION = """
 A fasta file containing the unaligned nucleotide sequences of the submitted sequences.
 The file may be compressed with zstd, xz, zip, gzip, lzma, bzip2 (with common extensions).
 If the underlying organism has a single segment,
-the headers of the fasta file must match the '$HEADER_TO_CONNECT_METADATA_AND_SEQUENCES' field in the metadata file.
+the headers of the fasta file must match the '$METADATA_ID_HEADER' field in the metadata file.
 If the underlying organism has multiple segments,
-the headers of the fasta file must be of the form '>[$HEADER_TO_CONNECT_METADATA_AND_SEQUENCES]_[segmentName]'.
+the headers of the fasta file must be of the form '>[$METADATA_ID_HEADER]_[segmentName]'.
 """

 const val FILE_MAPPING_DESCRIPTION = """
@@ -114,7 +116,7 @@ The version will increase by one in respect to the original accession version.

 const val REVISED_METADATA_FILE_DESCRIPTION = """
 A TSV (tab separated values) file containing the metadata of the revised data.
-The first row must contain the column names. The column '$HEADER_TO_CONNECT_METADATA_AND_SEQUENCES' is required and must be unique within the
+The first row must contain the column names. The column '$METADATA_ID_HEADER' is required and must be unique within the
 provided dataset. It is used to associate metadata to the sequences in the sequences fasta file.
 Additionally, the column 'accession' is required and must match the accession of the original sequence entry.
 """
```

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt (39 additions, 29 deletions)

```diff
@@ -31,13 +31,15 @@ import java.io.BufferedInputStream
 import java.io.File
 import java.io.InputStream

-const val HEADER_TO_CONNECT_METADATA_AND_SEQUENCES = "id"
-const val HEADER_TO_CONNECT_METADATA_AND_SEQUENCES_ALTERNATE_FOR_BACKCOMPAT = "submissionId"
+const val METADATA_ID_HEADER = "id"
+const val METADATA_ID_HEADER_ALTERNATE_FOR_BACKCOMPAT = "submissionId"
+const val FASTA_ID_HEADER = "fastaId"

 const val ACCESSION_HEADER = "accession"
 private val log = KotlinLogging.logger { }

 typealias SubmissionId = String
+typealias FastaId = String
 typealias SegmentName = String

 const val UNIQUE_CONSTRAINT_VIOLATION_SQL_STATE = "23505"
@@ -126,8 +128,13 @@ class SubmitModel(
         val metadataSubmissionIds = uploadDatabaseService.getMetadataUploadSubmissionIds(uploadId).toSet()
         if (requiresConsensusSequenceFile(submissionParams.organism)) {
             log.debug { "Validating submission with uploadId $uploadId" }
-            val sequenceSubmissionIds = uploadDatabaseService.getSequenceUploadSubmissionIds(uploadId).toSet()
-            validateSubmissionIdSetsForConsensusSequences(metadataSubmissionIds, sequenceSubmissionIds)
+            val metadataFastaIds = uploadDatabaseService.getFastaIdsForMetadata(uploadId).flatten()
+            val metadataFastaIdsSet = metadataFastaIds.toSet()
+            if (metadataFastaIdsSet.size < metadataFastaIds.size) {
+                throw UnprocessableEntityException("Metadata file contains duplicate fastaIds.")
+            }
+            val sequenceFastaIds = uploadDatabaseService.getSequenceUploadSubmissionIds(uploadId).toSet()
+            validateSubmissionIdSetsForConsensusSequences(metadataFastaIdsSet, sequenceFastaIds)
         }

         if (submissionParams is SubmissionParams.RevisionSubmissionParams) {
@@ -167,38 +174,39 @@ class SubmitModel(
             metadataFileTypes,
             metadataTempFileToDelete,
         )
+        val addFastaId = requiresConsensusSequenceFile(submissionParams.organism)
         try {
-            uploadMetadata(uploadId, submissionParams, metadataStream, batchSize)
+            uploadMetadata(uploadId, submissionParams, metadataStream, batchSize, addFastaId = addFastaId)
         } finally {
             metadataTempFileToDelete.delete()
         }

         val sequenceFile = submissionParams.sequenceFile
         if (sequenceFile == null) {
-            if (requiresConsensusSequenceFile(submissionParams.organism)) {
+            if (addFastaId) {
                 throw BadRequestException(
                     "Submissions for organism ${submissionParams.organism.name} require a sequence file.",
                 )
             }
-        } else {
-            if (!requiresConsensusSequenceFile(submissionParams.organism)) {
-                throw BadRequestException(
-                    "Sequence uploads are not allowed for organism ${submissionParams.organism.name}.",
-                )
-            }
+            return
+        }
+        if (!addFastaId) {
+            throw BadRequestException(
+                "Sequence uploads are not allowed for organism ${submissionParams.organism.name}.",
+            )
+        }

-            val sequenceTempFileToDelete = MaybeFile()
-            try {
-                val sequenceStream = getStreamFromFile(
-                    sequenceFile,
-                    uploadId,
-                    sequenceFileTypes,
-                    sequenceTempFileToDelete,
-                )
-                uploadSequences(uploadId, sequenceStream, batchSize, submissionParams.organism)
-            } finally {
-                sequenceTempFileToDelete.delete()
-            }
+        val sequenceTempFileToDelete = MaybeFile()
+        try {
+            val sequenceStream = getStreamFromFile(
+                sequenceFile,
+                uploadId,
+                sequenceFileTypes,
+                sequenceTempFileToDelete,
+            )
+            uploadSequences(uploadId, sequenceStream, batchSize, submissionParams.organism)
+        } finally {
+            sequenceTempFileToDelete.delete()
         }
     }

@@ -244,6 +252,7 @@ class SubmitModel(
         submissionParams: SubmissionParams,
         metadataStream: InputStream,
         batchSize: Int,
+        addFastaId: Boolean,
     ) {
         log.debug {
             "intermediate storing uploaded metadata of type ${submissionParams.uploadType.name} " +
@@ -253,7 +262,7 @@ class SubmitModel(
         try {
             when (submissionParams) {
                 is SubmissionParams.OriginalSubmissionParams -> {
-                    metadataEntryStreamAsSequence(metadataStream)
+                    metadataEntryStreamAsSequence(metadataStream, addFastaId)
                         .chunked(batchSize)
                         .forEach { batch ->
                             uploadDatabaseService.batchInsertMetadataInAuxTable(
@@ -269,7 +278,7 @@ class SubmitModel(
                 }

                 is SubmissionParams.RevisionSubmissionParams -> {
-                    revisionEntryStreamAsSequence(metadataStream)
+                    revisionEntryStreamAsSequence(metadataStream, addFastaId)
                         .chunked(batchSize)
                         .forEach { batch ->
                             uploadDatabaseService.batchInsertRevisedMetadataInAuxTable(
@@ -344,14 +353,15 @@ class SubmitModel(

         if (metadataKeysNotInSequences.isNotEmpty() || sequenceKeysNotInMetadata.isNotEmpty()) {
             val metadataNotPresentErrorText = if (metadataKeysNotInSequences.isNotEmpty()) {
-                "Metadata file contains ${metadataKeysNotInSequences.size} ids that are not present " +
+                "Metadata file contains ${metadataKeysNotInSequences.size} FASTA ids that are not present " +
                     "in the sequence file: " + metadataKeysNotInSequences.toList().joinToString(limit = 10) + "; "
             } else {
                 ""
             }
             val sequenceNotPresentErrorText = if (sequenceKeysNotInMetadata.isNotEmpty()) {
-                "Sequence file contains ${sequenceKeysNotInMetadata.size} ids that are not present " +
-                    "in the metadata file: " + sequenceKeysNotInMetadata.toList().joinToString(limit = 10)
+                "Sequence file contains ${sequenceKeysNotInMetadata.size} FASTA ids that are not present " +
+                    "in the metadata file: " +
+                    sequenceKeysNotInMetadata.toList().joinToString(limit = 10)
             } else {
                 ""
             }
```
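For readability, here is a condensed, self-contained Kotlin sketch of the fastaId consistency checks introduced above (the standalone function and the simplified exception class are assumptions for this sketch; in the PR the duplicate check lives in `SubmitModel` and the set comparison in `validateSubmissionIdSetsForConsensusSequences`):

```kotlin
// Sketch of the two checks: metadata fastaIds must be unique, and the metadata and
// sequence files must reference exactly the same set of fastaIds.

class UnprocessableEntityException(message: String) : RuntimeException(message)

fun validateFastaIds(metadataFastaIds: List<String>, sequenceFastaIds: Set<String>) {
    val metadataFastaIdsSet = metadataFastaIds.toSet()
    if (metadataFastaIdsSet.size < metadataFastaIds.size) {
        throw UnprocessableEntityException("Metadata file contains duplicate fastaIds.")
    }

    val metadataKeysNotInSequences = metadataFastaIdsSet - sequenceFastaIds
    val sequenceKeysNotInMetadata = sequenceFastaIds - metadataFastaIdsSet
    if (metadataKeysNotInSequences.isNotEmpty() || sequenceKeysNotInMetadata.isNotEmpty()) {
        throw UnprocessableEntityException(
            "Metadata file contains ${metadataKeysNotInSequences.size} FASTA ids that are not present " +
                "in the sequence file: ${metadataKeysNotInSequences.toList().joinToString(limit = 10)}; " +
                "Sequence file contains ${sequenceKeysNotInMetadata.size} FASTA ids that are not present " +
                "in the metadata file: ${sequenceKeysNotInMetadata.toList().joinToString(limit = 10)}",
        )
    }
}
```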

backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt (2 additions, 0 deletions)

```diff
@@ -102,6 +102,7 @@ class CompressionService(private val compressionDictService: CompressionDictServ
             }
         },
         processedData.aminoAcidInsertions,
+        processedData.sequenceNameToFastaHeaderMap,
         processedData.files,
     )

@@ -128,6 +129,7 @@
             }
         },
         processedData.aminoAcidInsertions,
+        processedData.sequenceNameToFastaHeaderMap,
         processedData.files,
     )

```

backend/src/main/kotlin/org/loculus/backend/service/submission/EmptyProcessedDataProvider.kt (1 addition, 0 deletions)

```diff
@@ -20,6 +20,7 @@ class EmptyProcessedDataProvider(private val backendConfig: BackendConfig) {
         alignedAminoAcidSequences = referenceGenome.genes.map { it.name }.associateWith { null },
         nucleotideInsertions = referenceGenome.nucleotideSequences.map { it.name }.associateWith { emptyList() },
         aminoAcidInsertions = referenceGenome.genes.map { it.name }.associateWith { emptyList() },
+        sequenceNameToFastaHeaderMap = referenceGenome.nucleotideSequences.map { it.name }.associateWith { "" },
         files = null,
     )
 }
```

backend/src/main/kotlin/org/loculus/backend/service/submission/ProcessedSequenceEntryValidator.kt (5 additions, 0 deletions)

```diff
@@ -232,6 +232,11 @@ class ProcessedSequenceEntryValidator(private val schema: Schema, private val re
             "alignedNucleotideSequences",
         )

+        validateNoUnknownSegment(
+            processedData.sequenceNameToFastaHeaderMap,
+            "sequenceNameToFastaHeaderMap",
+        )
+
         validateNoUnknownSegment(
             processedData.unalignedNucleotideSequences,
             "unalignedNucleotideSequences",
```

backend/src/main/kotlin/org/loculus/backend/service/submission/SubmissionDatabaseService.kt (2 additions, 1 deletion)

```diff
@@ -457,6 +457,7 @@ class SubmissionDatabaseService(
         aminoAcidInsertions = processedData.aminoAcidInsertions.mapValues { (_, it) ->
             it.map { insertion -> insertion.copy(sequence = insertion.sequence.uppercase(Locale.US)) }
         },
+        sequenceNameToFastaHeaderMap = processedData.sequenceNameToFastaHeaderMap,
     )

     private fun validateExternalMetadata(
@@ -1224,7 +1225,7 @@ class SubmissionDatabaseService(
             .fetchSize(streamBatchSize)
             .asSequence()
             .map {
-                // Revoked sequences have no original metdadata, hence null can happen
+                // Revoked sequences have no original metadata, hence null can happen
                 @Suppress("USELESS_ELVIS")
                 val metadata = it[originalMetadata] ?: null
                 val selectedMetadata = fields?.associateWith { field -> metadata?.get(field) }
```
