Parallelizing Bedboss Reprocessing (Heavy Processing) #120

@khoroshevskyi

Description

Currently, heavy processing can happen in two ways:

  1. By providing an identifier — the BED file ID that needs to be processed.
  2. By running reprocess-all — Bedboss will query all files that haven’t been processed and process them sequentially.

There are a few possible solutions for enabling parallel processing:

  1. Launch multiple instances of reprocess-all.
    To do this safely, we need to know exactly which files are in the processing queue, and each job (ID) must be claimed by exactly one process. If two instances are running, one must never "steal" a job from the other or process the same file twice.

  2. Use Looper with subsamples, where each sample represents a job and subsamples are the IDs to be processed.
    This approach introduces downstream challenges, such as batching the subsamples. Also, Looper isn’t designed to handle thousands of samples efficiently.
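
For option 1, one way to keep instances from taking the same ID is to make "claiming" a job a single conditional update, so only one process can win each ID. The following is a minimal sketch of that idea, not Bedboss code: the `jobs` table, its columns, and the `setup_queue` and `claim_next` helpers are all hypothetical, and SQLite is used only so the example is self-contained. If the queue lived in PostgreSQL, `SELECT ... FOR UPDATE SKIP LOCKED` is a common way to get a similar effect.

```python
import sqlite3


def setup_queue(conn, bed_ids):
    """Create a hypothetical job table and enqueue the given BED file IDs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " bed_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (bed_id) VALUES (?)",
        [(bed_id,) for bed_id in bed_ids],
    )
    conn.commit()


def claim_next(conn, worker):
    """Claim one pending ID for `worker`; return None when nothing is pending."""
    while True:
        row = conn.execute(
            "SELECT bed_id FROM jobs WHERE status = 'pending' "
            "ORDER BY bed_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE makes the claim conditional: if another
        # worker changed the row first, rowcount is 0 and we try the next ID.
        cur = conn.execute(
            "UPDATE jobs SET status = 'processing', worker = ? "
            "WHERE bed_id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
```

Each reprocess-all instance would then loop on `claim_next` with its own worker name and process whatever ID it receives, stopping when it gets `None`. A real version would also need to mark jobs as done or failed and to recover IDs left in `processing` by a crashed instance.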

After discussion, we are satisfied with the current upload time, so we are not pursuing parallel processing for now. We will revisit this issue if the need arises.
