Help: Cannot load haploid VCF from snp-sites. Fails with "Must add at least one sample individual" #1061

@mandagaby18ag-design

Hello tskit-dev team,

I am having great difficulty inferring an ARG from haploid virus data.

My pipeline is:

  1. FASTA (84 sequences) -> mafft (alignment)
  2. Alignment -> snp-sites -> VCF (this VCF has 1848 SNPs)
  3. VCF -> tsinfer

The VCF from snp-sites seems to be incompatible with tsinfer. No matter what I try, I always get the error "Must add at least one sample individual" (or "Must add at least one site").

Here is what I have tried, and all of them fail with the same error:

  • Using the old SampleData API (which shows a DeprecationWarning).
  • Fixing the contig name (from '0' to '1', which is the contig in my VCF).
  • Setting ploidy=1 in add_vcf.
  • Adding individuals manually (samples.add_individual) before calling add_vcf.
  • Adding individuals (samples.add_individual) and samples (samples.add_sample) before calling add_vcf.

I also wrote my own Python script to "clean" the VCF from snp-sites. My script successfully splits multi-allelic sites and creates a new, clean VCF with 2091 valid SNPs.
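A minimal sketch of what such a splitting step might look like, using only the standard library. The function name `split_multiallelic` is illustrative, and the sketch makes several assumptions that should be checked against the real file: genotype fields are plain haploid values such as `0`, `1`, `2`, or `.`; an ALT entry of `*` is treated as a gap or unknown base and skipped; and carriers of any other ALT allele are recoded as missing rather than as reference.

```python
def split_multiallelic(line):
    """Split one VCF data line into one biallelic line per ALT allele.

    Assumes tab-separated columns with haploid GT values after FORMAT.
    """
    fields = line.rstrip("\n").split("\t")
    fixed, genotypes = fields[:9], fields[9:]
    alts = fixed[4].split(",")
    rows = []
    for index, alt in enumerate(alts, start=1):
        if alt == "*":
            # Assumption: "*" marks a gap/unknown base, not a real allele.
            continue
        recoded = []
        for gt in genotypes:
            if gt == "0":
                recoded.append("0")      # reference stays reference
            elif gt == str(index):
                recoded.append("1")      # this ALT becomes allele 1
            else:
                recoded.append(".")      # other ALTs or missing -> missing
        rows.append("\t".join(fixed[:4] + [alt] + fixed[5:] + recoded))
    return rows


example = "1\t42\t.\tA\tC,T\t.\t.\t.\tGT\t0\t1\t2\t."
for row in split_multiallelic(example):
    print(row)
```

Recoding other ALT carriers as missing is one design choice among several; recoding them as reference would keep more data but asserts a state those samples do not actually have.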

This is the main problem: even when I try to load this "perfectly clean" VCF, tsinfer still fails with the exact same error, "Must add at least one sample individual".
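One way to confirm that the file itself contains samples and sites, independently of tsinfer, is to count the sample columns, contig names, and data lines directly. The following is a minimal standard-library sketch; `summarize_vcf` is a hypothetical helper name, not part of tsinfer, and it assumes an uncompressed, tab-separated VCF with a `#CHROM` header line.

```python
import io


def summarize_vcf(handle):
    """Return (sample names, contig names, number of data lines)."""
    samples = []
    contigs = set()
    n_sites = 0
    for line in handle:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        if len(fields) >= 10:
            contigs.add(fields[0])
            n_sites += 1
    return samples, sorted(contigs), n_sites


# Tiny in-memory example; replace with open("your_file.vcf") for a real file.
example = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tseqA\tseqB\n"
    "1\t10\t.\tA\tG\t.\t.\t.\tGT\t0\t1\n"
    "1\t25\t.\tC\tT\t.\t.\t.\tGT\t1\t0\n"
)
samples, contigs, n_sites = summarize_vcf(io.StringIO(example))
print(len(samples), "samples;", "contigs:", contigs, ";", n_sites, "sites")
```

If this reports the expected 84 samples and a non-zero site count, the problem is more likely in how the file is handed to the loader (for example, a contig-name filter that matches nothing) than in the VCF contents.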

This makes me think the SampleData API is fundamentally broken for this task.

What is the correct, modern way (using SampleFile?) to load a haploid VCF (especially one from snp-sites)? All my attempts have failed.

Thank you for any help you can provide.
(My environment: Linux, Python 3.11, tsinfer version 0.4.1)
