Hello tskit-dev team,
I am having extreme difficulty trying to infer an ARG from haploid virus data.
My pipeline is:
- FASTA (84 sequences) -> `mafft` (alignment)
- Alignment -> `snp-sites` -> VCF (this VCF has 1848 SNPs)
- VCF -> `tsinfer`
The VCF from snp-sites seems incompatible with tsinfer. No matter what I try, I always get the error: Must add at least one sample individual (or Must add at least one site).
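Since both error messages suggest that no samples or sites reached tsinfer, one way to rule out the file itself is to count its sample columns and data records directly. The following is only a minimal sketch using the Python standard library; `summarize_vcf` is a hypothetical helper name, not part of tsinfer or snp-sites, and the three-sample VCF text is made-up illustration data, not my real file.

```python
import io


def summarize_vcf(handle):
    """Count sample columns, data records, and GT ploidies in a VCF text stream.

    Hypothetical helper for illustration; assumes a tab-delimited VCF.
    """
    num_samples = 0
    num_sites = 0
    ploidies = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start after the 9 fixed columns (CHROM..FORMAT).
            num_samples = len(fields[9:])
            continue
        num_sites += 1
        if len(fields) < 10:
            continue  # record has no genotype columns
        gt_index = fields[8].split(":").index("GT")
        for call in fields[9:]:
            gt = call.split(":")[gt_index]
            # A haploid call such as "0" or "1" has no "/" or "|" separator.
            ploidies.add(len(gt.replace("|", "/").split("/")))
    return {"samples": num_samples, "sites": num_sites, "ploidies": sorted(ploidies)}


# Made-up example data: 3 haploid samples, 2 sites (one multi-allelic).
example_vcf = "\n".join(
    [
        "##fileformat=VCFv4.1",
        "##contig=<ID=1,length=100>",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3",
        "1\t10\t.\tA\tG\t.\t.\t.\tGT\t0\t1\t0",
        "1\t25\t.\tC\tT,G\t.\t.\t.\tGT\t0\t1\t2",
    ]
)
summary = summarize_vcf(io.StringIO(example_vcf))
print(summary)
```

If a check like this reports a non-zero number of samples and sites with ploidy 1, the file contents are probably not the issue, and the problem is more likely in how the VCF is being handed to tsinfer.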
Here is what I have tried, and all of them fail with the same error:
- Using the old `SampleData` API (which shows a `DeprecationWarning`).
- Fixing the contig name (from '0' to '1', which is the contig in my VCF).
- Setting `ploidy=1` in `add_vcf`.
- Adding individuals manually (`samples.add_individual`) before calling `add_vcf`.
- Adding individuals (`samples.add_individual`) and samples (`samples.add_sample`) before calling `add_vcf`.
I also wrote my own Python script to "clean" the VCF from snp-sites. My script successfully splits multi-allelic sites and creates a new, clean VCF with 2091 valid SNPs.
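To make the splitting step concrete, here is a rough sketch of the general idea, not the actual script. It assumes haploid calls and a `FORMAT` column containing only `GT`; `split_multiallelic` is a hypothetical name. One design choice is labeled in the comments: a sample carrying a different ALT allele is recoded as missing (`.`) rather than as reference, which is only one of several possible conventions.

```python
def split_multiallelic(record):
    """Split one VCF data line into one biallelic line per ALT allele.

    Illustrative sketch: assumes haploid genotypes and FORMAT == "GT".
    """
    fields = record.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return ["\t".join(fields)]  # already biallelic, return unchanged

    split_records = []
    for alt_index, alt in enumerate(alts, start=1):
        new_fields = fields[:9]
        new_fields[4] = alt
        for call in fields[9:]:
            if call == "0":
                new_fields.append("0")  # reference allele stays reference
            elif call == str(alt_index):
                new_fields.append("1")  # carrier of this ALT allele
            else:
                # Missing call, or carrier of a different ALT allele:
                # recoded as missing (an assumption of this sketch).
                new_fields.append(".")
        split_records.append("\t".join(new_fields))
    return split_records


# Made-up record with two ALT alleles and three haploid samples.
record = "1\t25\t.\tC\tT,G\t.\t.\t.\tGT\t0\t1\t2"
biallelic = split_multiallelic(record)
for line in biallelic:
    print(line)
```

Note that the split records share the same `POS`, so it is worth checking whether the downstream tool accepts more than one site at a single position.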
This is the main problem: Even when I try to load this "perfectly clean" VCF, tsinfer still fails with the exact same error: Must add at least one sample individual.
This makes me think the SampleData API is fundamentally broken for this task.
What is the correct, modern way (using SampleFile?) to load a haploid VCF (especially one from snp-sites)? All my attempts have failed.
Thank you for any help you can provide.
(My environment: Linux, Python 3.11, tsinfer version 0.4.1)