pgen-samplebind

A focused CLI for the one thing plink2 --pmerge doesn't yet do: bind two or more PFILE/BFILE/EIGENSTRAT genotype datasets that share a variant set but contain different samples. Targeted at the ancient-DNA / population-genetics community working with the 1240k SNP capture panel and downstream AdmixTools 2 / qpAdm pipelines.

pgen-samplebind merge panel_a panel_b -o merged

That's it for the happy path. Variant alignment, strand canonicalization, ambiguous-strand policy, missing-call fill, IID collision handling, pseudohaploid detection, population-label preservation, and report emission all run with one command.

Why this exists

plink2 --pmerge errors with "under development" for the non-concatenating case (same variants, different samples). Today's workarounds — plink1.9 --bmerge, EIGENSOFT's mergeit, or VCF round-trips through bcftools merge — each carry friction (deprecated tool, format-conversion overhead, single-thread C engine, multi-format hops). The AdmixTools / ancient-DNA community spends nontrivial time on this every panel build.

pgen-samplebind is a stopgap until plink2 lands the feature natively; it's deliberately scoped to the bind operation only — not a plink2 replacement.

Install

pip install pgen-samplebind

Requires Python 3.11 through 3.14. CI tests all four versions on both Ubuntu and macOS. For EIGENSTRAT or BFILE inputs, also install plink2 v2.0.0-a.7.1 or newer and put it on PATH. Pick the right asset for your platform:

# Linux x86_64
ASSET=plink2_linux_x86_64.zip
# macOS Apple Silicon
# ASSET=plink2_mac_arm64.zip
# macOS Intel
# ASSET=plink2_mac.zip

curl -fsSL "https://github.com/chrchang/plink-ng/releases/download/v2.0.0-a.7.1/${ASSET}" -o /tmp/plink2.zip
unzip /tmp/plink2.zip -d ~/bin
plink2 --version  # confirm v2.0.0-a.7.1 (or newer) on PATH

The v2.0.0-a.7.x line introduced the --eigfile --make-pgen path pgen-samplebind shells out to for EIGENSTRAT inputs; older versions including v2.00a5.x (still shipped by Homebrew at the time of this writing) will silently lack EIGENSTRAT support. PFILE-only workflows have no plink2 dependency.

Canonical use cases

Four workflows the tool is shaped around. Each is one command.

1. Panel extension

Existing AADR-derived panel (~700 samples × ~1.15M 1240k variants), add new populations from a newer AADR release. Variants identical, alleles canonical (same source).

pgen-samplebind merge \
    /data/aadr_v66_phase6 \
    /data/aadr_v66_new_pops \
    -o /data/aadr_v66_phase7 \
    --report phase7_bind_report.tsv

2. Single-sample target append

A user's WGS pseudohaploid genotypes (from pileupCaller --randomDiploid) merged into an AADR-derived panel for downstream qpAdm. The user is missing some panel sites (low-coverage regions) — the tool emits missing-call codes for them, doesn't drop the variants. --target activates the asymmetric strand-check + per-sample call-rate gate (default 0.40).

pgen-samplebind merge \
    --target /data/carsten_pileupcaller \
    /data/aadr_v66_subset \
    -o /data/aadr_with_carsten

3. Cross-source merge (different formats, different strand conventions)

Two cohort releases from different labs — one in EIGENSTRAT, one in PFILE — both built on 1240k. Strand canonicalization may differ; ambiguous A/T and C/G SNPs may need explicit policy.

# cohort_a is EIGENSTRAT (cohort_a.geno + .snp + .ind); cohort_b is PFILE
# (cohort_b.pgen + .pvar + .psam). Format is auto-detected from the prefix.
pgen-samplebind merge \
    /data/cohort_a \
    /data/cohort_b \
    -o /data/cohort_ab \
    --on-strand flip \
    --validate-strand-fail-pct 10

4. AADR cross-version cohort assembly

Recreating a published cohort using AADR v66 genotypes (more SNPs) requires mapping v44.3 sample IDs to v66 sample IDs through the Master-ID join — AADR de-anonymized many samples between releases (I0001 became Loschbour.AG, etc.). Sample-bind plus identity remapping, one step.

pgen-samplebind merge \
    /data/aadr_v44_3_panel \
    /data/aadr_v66_extras \
    --id-column 'Master ID' \
    --relabel-from /data/aadr_v66.anno \
    --relabel-input-col 'Master ID' \
    --relabel-output-col 'Group ID' \
    -o /data/cross_version_panel

Subcommands

pgen-samplebind merge    INPUT [INPUT ...] -o OUTPUT  [options]
pgen-samplebind validate INPUT [INPUT ...]            [options]
pgen-samplebind afs      INPUT             -o OUTDIR  [options]
pgen-samplebind hash     INPUT                        [options]
pgen-samplebind inspect  INPUT                        [options]

Inputs are PFILE/BFILE/EIGENSTRAT prefixes; format auto-detected from companion files (.pgen+.pvar+.psam / .bed+.bim+.fam / .geno+.snp+.ind). The variant companion may be either .pvar or .pvar.zst (plink2 v2.0.0-a.6+ default; HuggingFace / Dataverse panels typically arrive zstd-compressed).

`merge` — bind inputs into one output PFILE

Option	Default	Purpose
`-o, --out PREFIX`	required	Output PFILE prefix
`--target PATH`	none	Single-sample / small-cohort mode; activates asymmetric strand-check + per-target call-rate gate. Targets are appended after the positional inputs; canonical remains input[0]. Repeatable — pass `--target` multiple times to append several targets in one merge
`--variant-key {chr_pos\|id}`	`chr_pos`	Match key
`--on-mismatch {drop\|error}`	`drop`	Allele mismatch beyond resolution
`--on-missing {fill_missing\|drop_variant\|error}`	`fill_missing`	Variant in input[0] absent in input N
`--on-extra {warn\|drop\|error}`	`warn`	Variant in input N absent from input[0]
`--on-strand {drop\|flip\|error}`	`flip`	Strand mismatch on unambiguous SNPs
`--trust-strand`	off	Pass A/T and C/G ambiguous SNPs without flagging (footgun for cross-source panels)
`--on-collision {error\|first\|suffix}`	`error`	IID collision policy. `suffix` uses `_<input_idx>` (general); `_target` (single-target mode); `_target_<input_idx>` (multi-target mode, ≥ 2 targets, so renames disambiguate). Idempotent numeric retry on further collisions
`--id-column NAME`	`IID`	`.psam` column for identity ops (e.g., `'Master ID'` for AADR anno)
`--population-column NAME`	auto	Holds population labels (default: POP / PHENO / PHENO1 fallback)
`--target-min-call-rate FLOAT`	`0.40`	Target-mode per-sample minimum call rate before exit-1
`--validate-strand-fail-pct N`	`10.0`	Exit 1 if ambiguous-strand drops exceed N% of intersection
`--relabel-from FILE`	none	TSV-driven POP relabel. 2-col header-less form maps POP→POP; N-col form (with `--relabel-input-col` / `--relabel-output-col`) joins per-sample on `--id-column`
`--relabel-input-col NAME`	none	For N-col TSVs: which column matches `--id-column`
`--relabel-output-col NAME`	none	For N-col TSVs: which column becomes the new POP value
`--report PATH`	none	Per-variant action TSV (streamed; constant memory)
`--report-json PATH`	none	Run-level summary JSON (~few KB; rows excluded by default)
`--report-json-include-rows`	off	Include per-variant rows in JSON (buffered; warns at >100 MB predicted size)
`--quiet`	off	Suppress the stdout summary block and the stderr progress bar
`--block-size N`	`2048`	Variants per pgenlib read block

`validate` — check alignment without writing

Same alignment / strand options as merge. No output written; reports go to stdout plus optional --report / --report-json. Exits 0 if alignment OK, 1 if any of the gates below fires.

--no-population-column: skip the population-column requirement on input psams. Use when a user PFILE has only [IID, SEX] (e.g., a single-sample VCF intersected with a reference panel before fraposa OADP projection — populations are downstream classification output, not user input). Variant-alignment, strand-orientation, and IID-collision checks still run; population-aware report fields come out empty for inputs that lack the column. Mutually exclusive with --population-column.

`hash` — emit canonical variant-set hash

# Hashing the same variant set via two different formats yields the same
# digest: PFILE on /data/aadr_v66_subset.{pgen,pvar,psam} and EIGENSTRAT
# on /data/aadr_v66_subset_eig.{geno,snp,ind}. Format is auto-detected
# from the companion files present at the prefix.
pgen-samplebind hash /data/aadr_v66_subset
# 7c4f8e...  PFILE
pgen-samplebind hash /data/aadr_v66_subset_eig
# 7c4f8e...  EIGENSTRAT  ← same hash → format-equivalent panels

--emit-canonical prints the canonicalized bytestream that's hashed (for diagnosis when two inputs should match but don't).

`afs` — per-population allele-frequency-spectrum TSVs

pgen-samplebind afs panel -o panel_afs/

Streams genotypes from a PFILE/BFILE/EIGENSTRAT input and emits three TSVs + a manifest JSON matching the shape AdmixTools 2's *_to_afs() family returns:

panel_afs/
├── afs_snp.tsv       (variant_id, chrom, pos, ref, alt, cm)
├── afs_freq.tsv      (variant_id × population, ALT-allele frequency)
├── afs_counts.tsv    (variant_id × population, called-allele counts)
└── afs_manifest.json (tool version, sample counts per pop, parameters)

Useful for the PFILE-native pipeline: skip the plink2 --pfile … --make-bed last-mile step before AT2's non-qpfstats f2 / qpAdm path. Bridge until pfile_to_afs() lands in admixtools upstream.

Feeding AFS into AT2 — use the end-to-end bridge script:

Rscript scripts/pgensb_afs_to_at2_f2_cache.R panel_afs/ panel_at2_f2_cache/

This loads the AFS bundle, applies the discard_from_aftable(maxmiss=0, …) filter that AT2's extract_f2 silently applies before writing its cache, then calls afs_to_f2() twice (once with poly_only=TRUE for type='f2', once with poly_only=FALSE for type='ap') to produce an AT2-ready f2 cache. From R:

library(admixtools)
f2 <- f2_from_precomp("panel_at2_f2_cache/", pops = my_pops, afprod = TRUE)
qpadm(f2, left = ..., right = ..., target = ...)

The sibling scripts/load_pgensb_afs.R is a raw loader that returns the AFS as in-memory data frames without filtering — use it for inspection / debugging, not as the AT2 entry point (raw AFS fed into afs_to_f2() produces divergent f2 because extract_f2's maxmiss=0 filter isn't applied).

Limitation: AT2's extract_f2(qpfstats=TRUE) path reads genotypes directly and bypasses the AFS layer entirely. For workflows that need qpfstats (e.g., ancient-DNA with high missingness), the PFILE→BED last-mile remains. The byte-equal-to-mergeit-qpfstats proof comes from the PFILE→BED→qpfstats path, not this AFS bridge.

Key flags:

--populations POP (repeatable) — restrict to a subset of populations
--no-pseudohaploid-adjust — skip pseudohaploid called-allele adjustment (treats all samples as diploid). Default applies the adjustment when the input has a PSEUDOHAPLOID column — pseudohaploid samples then contribute 1 called allele (not 2) for correct effective sample sizes in downstream variance estimates. Allele frequencies are unchanged.
--include-sex-chrom — extend to chr 23/24/25/26 (default: autosomes only)
--population-column NAME — aggregate by a .psam column other than POP

`inspect` — structured summary of one input

Format, sample count, variant count, populations, pseudohaploid mix, sex distribution, missingness histogram. TSV by default; --json for machine-readable.

Validation gates (exit 1)

merge applies soft-validation gates (a)-(c); validate applies all four (a)-(d). Any firing gate exits 1.

(a) Extras above warn threshold. Variants in input[N] absent from input[0] exceed the --on-extra warn threshold (10% of input[0] by default). Catches the input-order-reversed failure mode (smaller panel placed first; larger panel's distinct variants silently dropped).
(b) Ambiguous-strand drops above intersection threshold. Drops due to A/T or C/G allele ambiguity exceed --validate-strand-fail-pct of the alignment intersection (10% by default). Intersection denominator is deliberate: catches the wrong-panel failure mode (tiny intersection × 30% drop rate) that a canonical denominator would silently hide.
(c) Target call rate below threshold. In --target mode, target's non-missing call count over canonical variants is below --target-min-call-rate.
(d) --on-* error policy would have triggered. Validate-mode only — validate softens the error policies into a count + gate-(d) trigger so it can report the full picture rather than aborting at the first hit. In merge mode those policies aren't softened: they raise an InvariantViolation and exit 3 (explicit invariant enforcement is honored).

For merge, gates (a)-(c) run between pass 1 (alignment) and pass 2 (genotype streaming) — failing fast saves the pass-2 wallclock for an alignment that wouldn't have validated.

Exit codes

Stable across versions; safe to script against.

Code	Meaning
`0`	Success
`1`	Validation failure — gate (a)/(b)/(c)/(d) fired
`2`	I/O failure — read/write failure, plink2 subprocess failed, advisory output-prefix lock held
`3`	Invariant violation — multi-allelic input, duplicate canonical variant keys, `--on-* error` policy triggered, `--on-collision error` on duplicate IIDs
`4`	Usage error — bad CLI argument combination, unknown input prefix

Concurrency

Same input prefixes: safe, read-only.
Same output prefix: tool advisory-locks {out}.lock via fcntl.flock; fails clearly with exit 2 if held; cleans up on success or signal.
Cache directory: not managed by this tool — caller's problem.

The lock prevents two concurrent invocations from corrupting each other's output. It does not synchronize across machines (file lock is local-fs only). On Linux, the tool detects NFS/SMB/CIFS at lock-acquire time (via /proc/self/mountinfo) and emits a stderr warning since fcntl.flock semantics over network filesystems are implementation-defined and effectively no-op; the lock is still attempted. On macOS the detection is suppressed (no /proc-equivalent fs-type API) to avoid false positives — the network-fs caveat applies the same way, you just won't get the diagnostic warning.

Performance

Expected throughput 50-70 M genotypes/sec end-to-end on a typical Linux x86_64 machine (one core, default --block-size 2048), post-v0.3 vectorization. The current CI baseline measured 70.84 M g/s on a 100M-genotype synthetic fixture (ubuntu-latest); the perf benchmark gates against regression below 80% of that recorded baseline. For a real-world bind of two 700-sample × 1.15M-variant 1240k panels (~1.6B total genotypes), plan on roughly 25-35 seconds wallclock including pass-1 alignment, pass-2 genotype rewrite, and .psam finalization — roughly 20-40% faster than v0.1.

Troubleshooting

"plink2 not found on PATH; required for eigenstrat input"

Install plink2 v2.0.0-a.7.1+ (see Install). PFILE-only workflows have no plink2 dependency. The same message appears for BFILE inputs (with "bfile" instead of "eigenstrat").

"FID column header on line 1 is not at the beginning"

Symptom of an older plink2 reading our .psam output. We emit FID-first column order per the plink2 spec; if you see this from a downstream tool, verify the consumer is plink2 v2.0.0-a.7.x or newer. Earlier plink2 lines require --no-fid or an .fam-style fallback.

`--target` mode call-rate gate fires unexpectedly

Default threshold is 0.40 (40% of canonical variants called for the target sample). For very-low-coverage targets this may be too strict. Pass --target-min-call-rate 0.20 (or lower) if appropriate; with 0.0 the gate is fully disabled.

"ASCII per-line EIGENSTRAT" inputs

Both flavors of EIGENSTRAT input are supported: PACKEDANCESTRYMAP (binary, GENO /TGENO header — converted via plink2 --eigfile) and the older ASCII per-line variant (one digit per sample-variant cell — parsed natively, no plink2 dependency for the ASCII path). Format is auto-detected from the .geno header bytes.

Cross-source merges drop more variants than expected

By default A/T and C/G ambiguous SNPs are dropped wherever strand cannot be verified — including the case where both inputs have the same allele pair (e.g., both have A/T at the same position). The strand cannot be proven the same because complementing A/T gives T/A which is the same pair. This matches mergeit's strandcheck: YES convention and is the safe default for cross-source merges (different cohorts, different processing pipelines).

For single-source merges where REF/ALT calls are guaranteed consistent across inputs (same AADR release, same conversion pipeline), pass --trust-strand to passthrough matching ambiguous variants. The 1240k panel has ~1.3% A/T+C/G matching ambiguous SNPs so this typically recovers 10-20K additional variants on a full 1.15M-variant panel.

"ambiguous-strand drops exceed 10% of intersection"

Gate (b) fired. Three common causes:

Wrong panel pairing: tiny intersection × normal drop rate computes as a high fraction of intersection. Run pgen-samplebind hash on each input — if the hashes differ unexpectedly, the panels aren't what you thought they were.
Different strand conventions: cross-source merges where one source already strand-flipped at A/T+C/G sites.
Legitimate but high A/T+C/G fraction: 1240k naturally has ~5-8% ambiguous; a panel restricted to ambiguous-only sites would push above 10%. Raise the threshold with --validate-strand-fail-pct 20 if that's your case, or use --trust-strand for explicit pass-through (footgun warning: silently passes potentially-flipped genotypes).

Half-built output files after a failure

Don't happen. The merge orchestrator wraps pass 2 + .psam finalization in a try/except that unlinks the .pgen/.pvar/.psam triplet on any tool-internal exception, so downstream pipelines never silently consume a partial output. (Atomic-rename across the triplet isn't actually atomic on NFS — three separate file ops — so unlink-on-failure is the safer default.)

Output prefix is locked by another pgen-samplebind process

Another invocation holds the lock at {prefix}.lock. If that process has died (kill -9, OOM, lost terminal), the lock file persists and you must remove it manually:

ls -la /data/merged.lock     # check lock-file age vs. expected runtime
rm /data/merged.lock         # only after confirming no live process holds it

Verification

End-to-end byte-equal qpAdm parity against the established mergeit + plink2 + AdmixTools 2 pipeline is verified on every commit via a vendored AADR-derivative regression test (Patterson 7-source + 4 English target pops + 1 individual target, drawn from AADR v66 under fair-use). The result is the project's trust artifact — anyone can clone, run, and confirm pgen-samplebind reproduces the established pipeline on a published-research-shape workload without trusting the maintainer's claims. See CONTRIBUTING.md §Dogfood for run instructions and the three-tier breakdown.

Status

v0.3.2 — bugfix release. Closes #6: re-binding pgen-samplebind's own output with --population-column FID previously crashed because the prior output's POP column collided with the rename target. psam.rename_to_pop is now round-trippable through itself, unblocking the documented "panel extension" canonical use case (incrementally adding populations to a previously-bound dataset without rebuilding the full merge from scratch).

See CHANGELOG.md for the full feature list and known limitations.

Contributing

Issues and pull requests welcome at https://github.com/carstenerickson/pgen-samplebind/issues. Dev setup, test-runner conventions, commit + release process, and design philosophy are in CONTRIBUTING.md; the architecture tour for new contributors is in DEVELOPMENT.md. The project is small and opinionated — substantive scope changes (e.g., multi-allelic merges, dosage data, BFILE-only output) should start with a design discussion before implementation.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
scripts		scripts
src/pgen_samplebind		src/pgen_samplebind
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
DEVELOPMENT.md		DEVELOPMENT.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

pgen-samplebind

Why this exists

Install

Canonical use cases

1. Panel extension

2. Single-sample target append

3. Cross-source merge (different formats, different strand conventions)

4. AADR cross-version cohort assembly

Subcommands

merge — bind inputs into one output PFILE

validate — check alignment without writing

hash — emit canonical variant-set hash

afs — per-population allele-frequency-spectrum TSVs

inspect — structured summary of one input

Validation gates (exit 1)

Exit codes

Concurrency

Performance

Troubleshooting

"plink2 not found on PATH; required for eigenstrat input"

"FID column header on line 1 is not at the beginning"

--target mode call-rate gate fires unexpectedly

"ASCII per-line EIGENSTRAT" inputs

Cross-source merges drop more variants than expected

"ambiguous-strand drops exceed 10% of intersection"

Half-built output files after a failure

Output prefix is locked by another pgen-samplebind process

Verification

Status

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`merge` — bind inputs into one output PFILE

`validate` — check alignment without writing

`hash` — emit canonical variant-set hash

`afs` — per-population allele-frequency-spectrum TSVs

`inspect` — structured summary of one input

`--target` mode call-rate gate fires unexpectedly

Packages