-
Notifications
You must be signed in to change notification settings - Fork 0
handling fasta files with biopython #202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request refactors FASTA file parsing by replacing manual string manipulation with BioPython's SeqIO library. The change improves code maintainability and robustness by using a well-tested, standard library for biological sequence format handling.
Key Changes:
- Replaced manual line-by-line FASTA parsing with BioPython's
SeqIO.parse() - Added biopython 1.84 dependency to requirements.txt
- Simplified header matching logic using BioPython's record.id and record.description attributes
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| requirements.txt | Added biopython==1.84 dependency |
| worker.py | Added imports for BioPython and StringIO (not used in this file) |
| helpers.py | Replaced manual FASTA parsing with BioPython's SeqIO.parse in check_for_sequence_data function |
Comments suppressed due to low confidence (2)
worker.py:6
- Import of 'StringIO' is not used.
from io import StringIO
worker.py:7
- Import of 'SeqIO' is not used.
from Bio import SeqIO
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return False, f"Empty sequence found for header: {fasta_header}" | ||
|
|
||
| # Reconstruct FASTA format for storage | ||
| sequence_data = f">{sequence_found.description}\n{str(sequence_found.seq)}" |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor behavioral change: The BioPython implementation returns the sequence as a single unwrapped line, whereas the original implementation preserved line breaks from the source file. While both are valid FASTA format, if line wrapping is desired (typically 60-80 characters per line), consider using textwrap.fill() or BioPython's SeqIO.write() to wrap the sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
| if not sequence_data.strip(): | ||
| return False, f"No sequence data found for isolate '{isolate_sample_id}'" |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is now redundant. Since sequence_found is guaranteed to be non-None (line 1100) and have a non-empty sequence (line 1104), the reconstructed sequence_data at line 1108 will always contain content and never be empty. Consider removing this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
@desafinadude I've opened a new pull request, #203, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@desafinadude I've opened a new pull request, #204, to work on those changes. Once the pull request is ready, I'll request review from you. |
No description provided.