@imartayan
Contributor

Hey @bede!

This PR aims to speed up short-read filtering by packing multiple records until a length threshold is reached (currently 8000 bp) and processing the batch with a single call to simd_minimizers. It introduces a RecordBuffer type that packs multiple sequences and headers into a single Vec<u8>, and uses this type internally in FilterProcessor to batch the operations. It is currently implemented only for the ParallelProcessor trait and isn't used for paired reads yet.

This might involve a bit more copying than before, since some records have to be kept around longer (thus slightly slowing down long-read processing), but it should bring a significant speedup for short reads.
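
To make the batching idea concrete, here is a minimal sketch under stated assumptions. Only the RecordBuffer name and the single Vec<u8> layout come from the description above; the fields, the methods (`push`, `is_full`, `records`, `clear`) and the `process_in_batches` helper are hypothetical and are not the PR's actual code. The `process` closure stands in for the one simd_minimizers call per batch, whose real signature is not shown here.

```rust
/// Illustrative sketch only: packs several records into one contiguous buffer.
/// Field and method names are hypothetical, not the PR's actual API.
#[derive(Default)]
pub struct RecordBuffer {
    /// Headers and sequences stored back to back.
    data: Vec<u8>,
    /// Per record: (header_start, seq_start, seq_end) offsets into `data`.
    bounds: Vec<(usize, usize, usize)>,
    /// Total number of sequence bytes (bases) currently buffered.
    seq_len: usize,
}

impl RecordBuffer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Copies one record into the buffer (this is where the extra copying happens).
    pub fn push(&mut self, header: &[u8], seq: &[u8]) {
        let header_start = self.data.len();
        self.data.extend_from_slice(header);
        let seq_start = self.data.len();
        self.data.extend_from_slice(seq);
        self.bounds.push((header_start, seq_start, self.data.len()));
        self.seq_len += seq.len();
    }

    /// True once the buffered sequence length reaches the threshold (e.g. 8000 bp).
    pub fn is_full(&self, threshold: usize) -> bool {
        self.seq_len >= threshold
    }

    pub fn len(&self) -> usize {
        self.bounds.len()
    }

    pub fn is_empty(&self) -> bool {
        self.bounds.is_empty()
    }

    /// Iterates over the buffered (header, sequence) slices.
    pub fn records(&self) -> impl Iterator<Item = (&[u8], &[u8])> + '_ {
        self.bounds
            .iter()
            .map(move |&(h, s, e)| (&self.data[h..s], &self.data[s..e]))
    }

    /// Empties the buffer while keeping its allocations for reuse.
    pub fn clear(&mut self) {
        self.data.clear();
        self.bounds.clear();
        self.seq_len = 0;
    }
}

/// Buffers records and calls `process` once per batch; returns the batch count.
pub fn process_in_batches<'a, I, F>(records: I, threshold: usize, mut process: F) -> usize
where
    I: IntoIterator<Item = (&'a [u8], &'a [u8])>,
    F: FnMut(&RecordBuffer),
{
    let mut buf = RecordBuffer::new();
    let mut batches = 0;
    for (header, seq) in records {
        buf.push(header, seq);
        if buf.is_full(threshold) {
            process(&buf);
            buf.clear();
            batches += 1;
        }
    }
    // Flush the last, partially filled batch.
    if !buf.is_empty() {
        process(&buf);
        batches += 1;
    }
    batches
}
```

With a threshold of 8000, a single long read fills a batch on its own, while short reads (for example 150 bp) are grouped by the dozen before each `process` call.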

Let me know if you're happy with the new performance and if you get results consistent with the previous implementation. If so, I can adapt the code to support paired reads as well.

Best,
Igor

@bede
Owner

bede commented Aug 18, 2025

Many thanks @imartayan!
Results look great at a glance. For the uncompressed fastq R1 reads (forward reads only) of the 2x150bp simulated reads for rsviruses17900, this PR increases throughput from 300Mbp/s to 542Mbp/s on my local M1 machine. For fastq.gz, throughput remains capped at ~190Mbp/s (though, as discussed, we can hopefully double that in the future with parallel readers). This approach should also deliver improvements in conjunction with faster compression approaches and binary formats like uBAM (#33) and vbq (#31) that may be supported in the future.

@bede
Owner

bede commented Sep 10, 2025

Hi Igor,
I'm sorry for not responding here yet; first impressions are really good. I'll get back to you after a closer look.

Thanks,
Bede

@bede force-pushed the main branch 2 times, most recently from 08ec7de to 97868a0 (November 20, 2025 16:56)