Parallelizing Bedboss Reprocessing (Heavy Processing)

Currently, heavy processing can happen in two ways:

1) By providing an identifier — the BED file ID that needs to be processed.
2) By running `reprocess-all` — Bedboss will query all files that haven’t been processed and process them sequentially.

There are a few possible solutions for enabling parallel processing:

1) Launch multiple instances of `reprocess-all`.
To do this safely, we need to know exactly which files are in the processing queue, and we must ensure that when two processes run at the same time, one can't "steal" a job (ID) that the other has already picked up. In other words, each ID must be claimed by exactly one process.

2) Use Looper with subsamples, where each sample represents a job and its subsamples are the IDs to be processed.
This approach introduces downstream challenges, such as deciding how to batch the IDs into subsamples. Also, Looper isn’t designed to handle thousands of samples efficiently.
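
To make the "no stolen jobs" requirement from option 1 concrete, here is a minimal sketch of an atomic claim step. It is not Bedboss code: the `queue` table, its columns, and the `setup_queue` / `claim_next` helpers are hypothetical, and SQLite is used only so the example runs on its own. The idea is that a worker owns an ID only if its conditional `UPDATE` actually changed a row. On PostgreSQL, `SELECT ... FOR UPDATE SKIP LOCKED` would be a common alternative.

```python
import sqlite3


def setup_queue(conn, bed_ids):
    # Hypothetical queue table: one row per BED file ID awaiting heavy processing.
    conn.execute(
        "CREATE TABLE queue ("
        "bed_id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "worker TEXT)"
    )
    conn.executemany("INSERT INTO queue (bed_id) VALUES (?)", [(b,) for b in bed_ids])
    conn.commit()


def claim_next(conn, worker):
    """Claim one pending ID for `worker`; return None when nothing is pending."""
    while True:
        row = conn.execute(
            "SELECT bed_id FROM queue WHERE status = 'pending' "
            "ORDER BY bed_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE makes the claim conditional: if another
        # worker claimed this ID first, no row matches and rowcount is 0.
        cur = conn.execute(
            "UPDATE queue SET status = 'processing', worker = ? "
            "WHERE bed_id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # Lost the race for this ID; loop and try the next pending one.


# Two workers drawing from the same queue never receive the same ID.
conn = sqlite3.connect(":memory:")
setup_queue(conn, ["bed1", "bed2", "bed3"])
claimed_a = claim_next(conn, "worker-a")
claimed_b = claim_next(conn, "worker-b")
```

With a claim step like this, each `reprocess-all` instance would process only the IDs it has claimed, and the `status` column doubles as a record of what is currently in the queue.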

After discussion, we’re satisfied with the current upload time, so none of these options is being pursued for now. If needed, we’ll revisit this issue in the future.

Parallelizing Bedboss Reprocessing (Heavy Processing) #120
