Bidirectional AnnData <-> Seurat/SCE conversion (v0.3.0)#4

Open
Huangy57 wants to merge 5 commits into main from yhuang2/combine-convert2seurat-dev

@Huangy57
Contributor

Summary

Adds bidirectional conversion to convert2anndata (v0.3.0): in addition
to SCE/Seurat -> AnnData, the package now converts AnnData (.h5ad) into
Seurat or SingleCellExperiment objects, with cli_convert() dispatching
on file extension.
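
The extension-based dispatch can be pictured with a minimal sketch. It is written in Python purely for illustration and is not the package's R implementation; the direction labels and the assumption that R objects arrive as `.rds` files are hypothetical.

```python
from pathlib import Path


def pick_direction(input_path: str) -> str:
    """Return a hypothetical direction label from the input file's extension."""
    suffix = Path(input_path).suffix.lower()
    if suffix == ".h5ad":
        # AnnData on disk: convert towards Seurat / SingleCellExperiment.
        return "anndata_to_r"
    if suffix == ".rds":
        # Assumption: serialized SCE/Seurat objects use the .rds extension.
        return "r_to_anndata"
    raise ValueError(f"Unsupported input extension: {suffix!r}")
```

Failing loudly on an unknown extension, rather than guessing a direction, keeps a mistyped path from silently running the wrong converter.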

New: convert_anndata_to_seurat(), convert_anndata_to_sce(), the
extract_anndata_* family, attach_reductions_seurat(), and the
setup/check/diagnose_anndata_python() reticulate helpers. Includes
Seurat 5 layer/slot robustness and use_raw='auto'. ~10 new test files
plus a standalone eval harness.

This PR also includes PR-polish on top of the feature commits:

  • Export extract_anndata_obsp() / extract_anndata_raw() for parity
    with the rest of the extract_anndata_* family; regenerate
    NAMESPACE + man pages.
  • CI: Python 3.8 -> 3.11, setup-python v2 -> v5, and pin
    RETICULATE_PYTHON across all steps so the AnnData tests run instead
    of silently skipping.
  • pkgdown: bidirectional description + grouped reference index.
  • Add NEWS.md (0.3.0).

Validation

Validated with R 4.4.1, Seurat 5.4.0, SingleCellExperiment 1.28.1,
reticulate 1.45.0, Python anndata 0.12.0:

  • R CMD check: 0 errors, 0 warnings, 3 benign notes.
  • testthat: 387 pass, 0 fail, 1 skip (CLI subprocess test, skips
    until the package is installed on the libpath).
  • Eval harness (roundtrip, edge cases, real pbmc3k, new-vs-original
    comparison) all pass.
  • Quantitative roundtrip fidelity: 22 pass / 3 known-gap / 0 fail.

Inspectable evidence (logs + fidelity table) is attached to
[settylab/sarah-nexus#18](https://github.com/settylab/sarah-nexus/issues/18).

Known limitations

  • The Seurat-mediated path (AnnData -> Seurat -> ...) cannot preserve
    arbitrary extra layers or var metadata — a Seurat assay has only
    counts/data/scale.data slots. Use the SCE-mediated path when
    verbatim layers / var matter.
  • convert_anndata_to_sce() does not yet attach obsp as colPairs
    (the Seurat target does attach obsp as graphs). Planned follow-up.

Test plan

  • CI green on the updated workflow (Python 3.11, reticulate pinned).
  • R CMD check clean locally.
  • pkgdown site builds with the new reference index.

Huangy57 and others added 4 commits May 8, 2026 16:51
Standalone script merged: convert_anndata_to_seurat() now accepts a
.h5ad path or an in-memory AnnData. setup_anndata_python() reproduces
the script's conda/RETICULATE_PYTHON/CONDA_PREFIX resolution chain.
The standalone convert_anndata2seurat.R becomes a thin shim.

Hardcoded knobs made flexible:
* counts_layer accepts a vector of candidates (default covers
  "counts", "raw_counts", "raw_count").
* attach_reductions_seurat() exposes a user-extensible reduction_map;
  default_reduction_map() covers the common scanpy embeddings;
  reduction_map = list(<key> = list(name = NA)) disables a default.
* assay name override propagates to data layer, var feature meta, and
  reductions.
* orig.ident resolution: explicit arg -> uns$conversion_source ->
  file basename -> "AnnData".
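
The orig.ident precedence ladder above can be sketched as a simple fallback chain. This is an illustrative Python sketch, not the package's R code; the function name `resolve_orig_ident` is hypothetical, and stripping the file extension from the basename is an assumption.

```python
import os


def resolve_orig_ident(explicit=None, uns=None, file_path=None):
    """Pick orig.ident: explicit arg -> uns['conversion_source'] -> basename -> 'AnnData'."""
    if explicit:
        return explicit
    source = (uns or {}).get("conversion_source")
    if source:
        return str(source)
    if file_path:
        # Assumption: the extension is dropped from the basename.
        return os.path.splitext(os.path.basename(file_path))[0]
    return "AnnData"
```

Each rung is consulted only when every earlier rung is empty, so an explicit argument always wins over anything stored in the file.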

New scanpy-shaped data:
* use_raw = c("auto", "always", "never") wires adata.raw into the
  Seurat counts layer, including the typical scanpy pattern where raw
  carries the unfiltered gene set.
* attach_obsp = TRUE maps adata.obsp graphs (connectivities,
  distances) to Seurat Graphs.

Python env hardening (the user's biggest concern):
* check_anndata_python() runs five layered probes: interpreter
  reachable, PYTHONPATH not pinning a different Python's
  site-packages (the cryptic HPC numpy-source-directory error),
  anndata importable, numpy importable, end-to-end smoke probe.
  Each failure raises a tailored, actionable message naming the
  selected interpreter and the one-line fix.
* diagnose_anndata_python() prints a one-shot env snapshot (python,
  version, anndata, numpy, RETICULATE_PYTHON, CONDA_PREFIX,
  PYTHONPATH).
* setup_anndata_python() now resolves a name against
  ~/micromamba/envs/, ~/miniconda3/envs/, ~/anaconda3/envs/, and
  $MAMBA_ROOT_PREFIX/envs/, and validates by default.
* Conversion path entrypoints auto-call check_anndata_python() before
  read_h5ad(), so the user never sees the original cryptic numpy
  error from deep inside Python.
* .onAttach() prints a one-liner pointing at diagnose_anndata_python().
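
The layered-probe idea can be sketched as an ordered list of checks that stops at the first failure and reports a fix. This is an illustrative Python sketch, not the package's R implementation; `run_probes`, `ProbeError`, and `pythonpath_is_clean` are hypothetical names, and treating any non-empty PYTHONPATH as suspicious is a simplification of the real probe, which looks for a different Python's site-packages.

```python
import os


class ProbeError(RuntimeError):
    """Raised when an environment probe fails."""


def run_probes(probes):
    """Run (name, check, fix_hint) probes in order; stop at the first failure."""
    passed = []
    for name, check, fix_hint in probes:
        detail = ""
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing probe counts as a failed probe
            ok = False
            detail = f" ({exc})"
        if not ok:
            raise ProbeError(f"Probe '{name}' failed{detail}. Fix: {fix_hint}")
        passed.append(name)
    return passed


def pythonpath_is_clean(environ=None):
    """Simplified probe: pass only when PYTHONPATH is unset or empty."""
    environ = os.environ if environ is None else environ
    return not environ.get("PYTHONPATH")
```

Ordering the probes from cheapest and most fundamental to the end-to-end smoke test means the reported failure is the earliest root cause rather than a downstream symptom.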

Real bugs uncovered & fixed:
* Seurat sanitises feature names ('gene_01' -> 'gene-01') during
  CreateSeuratObject; previously SetAssayData(layer = 'data') and the
  var feature-meta assignment failed with "No feature overlap"
  whenever inputs contained underscores. Realign rownames/colnames to
  the seurat object's actual ones after construction.
* convert_seurat_to_sce: ref_features/ref_cells were overwritten per
  iteration with the *current* assay's, defeating the mismatched-assay
  -> altExp routing. Set them once from the default assay.
* convert_to_anndata: process_other_assays(sce) was missing the
  assayName argument, so the assay used as X was duplicated under
  layers. Pass assayName through.
* anndata_mapping_keys() helper handles both the older-reticulate
  Python-KeysView and the newer R-character-vector return shapes.
  Without this, single-key layers/obsm with name "raw_count" got
  split character-by-character ('r','a','w','_'...) corrupting layer
  detection and SCE exclude lists.
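
The character-splitting failure in the last item has a close Python analogue, sketched below. It is not the package's R helper; `mapping_keys` is a hypothetical name, and the exact return shapes reticulate produces are only approximated here by a string and a keys view.

```python
def mapping_keys(keys):
    """Normalize a key container to a list of strings without splitting a lone string."""
    if keys is None:
        return []
    if isinstance(keys, str):
        # A single key must stay whole; iterating it would yield characters.
        return [keys]
    return [str(k) for k in keys]


# The pitfall: naive iteration over one key name splits it into characters.
naive = list("raw_count")
```

Here `naive` starts with 'r', 'a', 'w', '_', while `mapping_keys("raw_count")` keeps the name as a single element.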

Test coverage (all green):
* 35 comprehensive tests covering matrix dtypes, obs/var fidelity
  with NAs and factors, layer fallback chain, reduction map override
  and disable, orig.ident precedence ladder, file-path entry, custom
  assay, full round-trip, edge sizes.
* 12 tests for use_raw / obsp / Seurat Graphs.
* 4 realistic scanpy-shaped tests (1000 cells, raw + layers + obsm +
  obsp + uns), including backed='r' read.
* 14 tests for check_anndata_python and diagnose_anndata_python
  covering each failure branch via mocking.
* Pre-existing failures fixed in process_layers, ensure_csparse_matrix,
  process_main_assay, sparse_matrix_conversion, convert_seurat_to_sce,
  cli_convert (gracefully skips when not installed).

R CMD check: 0 errors, 0 warnings, 3 NOTEs (pre-existing housekeeping
items: .github dir, sandbox clock, LICENSE not in DESCRIPTION).
testthat: 387 expectations, 0 failures, 0 errors.
4 standalone eval scripts (eval_new_pkg, eval_compare, eval_roundtrip,
eval_edge_cases) all green.

Readme: new "Python environment" section plus a Troubleshooting
subsection covering the common reticulate failure modes (no Python
found, anndata not installed, PYTHONPATH pollution, silent typos
in conda env names).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aw='auto'

Two issues surfaced when running the same test suite against a second
R + Python configuration (micromamba r_env: R 4.4.1 + Python 3.11 +
anndata 0.12 + Seurat 5.x with strict-mode SeuratObject):

* SeuratObject 5.x in strict mode treats `slot=` as defunct (it was
  deprecated in 5.0.0 and now errors). Three call sites still passed
  `slot = "counts"`: extract_counts_matrix, identify_alt_exps_seurat,
  attach_alt_experiments_sce.
  Switch to `layer = "counts"` -- accepted by both pre-5.0 and >= 5.0.

* use_raw = "auto" was using adata.raw whenever no counts_layer matched.
  On real scanpy datasets like pbmc3k_processed, raw retains the
  unfiltered gene set (13714 genes) while adata.X is the
  analysis-ready filtered set (1838 genes). Auto silently swapped the
  user's filtered Seurat object for the unfiltered one. Now "auto"
  only uses raw when raw and adata.X share the gene set (the
  `adata.raw = adata` "save-raw-then-normalize" pattern). When they
  differ, prefer adata.X and emit a one-line notice; users who
  explicitly want the unfiltered raw can pass use_raw = "always".
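
The revised `use_raw` decision can be sketched as follows. This is an illustrative Python sketch, not the package's R code: `choose_counts_source` is a hypothetical name, comparing gene sets as ordered lists is an assumption, and so is letting "always" take precedence over a matching counts layer.

```python
def choose_counts_source(use_raw, layers, x_genes, raw_genes=None,
                         counts_candidates=("counts", "raw_counts", "raw_count")):
    """Return (source, notice) describing where the counts matrix comes from."""
    if use_raw not in ("auto", "always", "never"):
        raise ValueError(f"unknown use_raw value: {use_raw!r}")
    has_raw = raw_genes is not None
    if use_raw == "always":
        if not has_raw:
            raise ValueError("use_raw='always' but adata.raw is missing")
        return "raw", None
    # A matching counts layer is tried before raw.
    for name in counts_candidates:
        if name in layers:
            return f"layer:{name}", None
    if use_raw == "auto" and has_raw:
        if list(raw_genes) == list(x_genes):
            # The "save-raw-then-normalize" pattern: same genes, safe to use raw.
            return "raw", None
        notice = ("adata.raw has a different gene set than adata.X; using adata.X. "
                  "Pass use_raw='always' to use raw.")
        return "X", notice
    return "X", None
```

With a pbmc3k-like input, where raw carries more genes than X, "auto" returns "X" plus a notice instead of silently swapping in the unfiltered matrix.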

Verified against the real pbmc3k_processed h5ad: 2638 cells x 1838
filtered genes, 8 louvain clusters preserved, all 4 obsm reductions
(pca, tsne, umap, draw_graph_fr) attached with exact-match
embeddings, both kNN graphs (connectivities + distances) attached as
Seurat Graphs, and downstream FindNeighbors -> FindClusters succeeds.

testthat: 387 expectations, 0 failures, 0 errors under both
configurations (r_env and fhR + scvi_env).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormly

- Export `extract_anndata_obsp()` and `extract_anndata_raw()` for parity
  with the rest of the `extract_anndata_*` family (X / layers / obsm
  were already exported). Regenerate NAMESPACE + man pages.
- pkgdown: bidirectional description on the home page + grouped
  reference index (Bidirectional conversion / Python environment /
  AnnData component extractors / Attaching components / Internal
  building blocks). All 35 exports covered.
- NEWS.md: new 0.3.0 entry covering bidirectional conversion, CLI
  dispatch by extension, reticulate helpers, Seurat 5 layer/slot
  robustness, `use_raw='auto'`, expanded tests / eval harness,
  validation results, and the documented known limitations.
- DESCRIPTION: bump RoxygenNote 7.3.1 -> 7.3.3 (regenerated by
  devtools::document() under roxygen2 7.3.3); update URL to list the
  GitHub repo + the pkgdown site (required by pkgdown::check_pkgdown).
- Bump `actions/setup-python` v2 -> v5 and pin Python 3.8 (EOL) -> 3.11.
  Modern `anndata` will not install on 3.8.
- Export `RETICULATE_PYTHON` from `steps.setup-python.outputs.python-path`
  via `$GITHUB_ENV` so it persists to the test/coverage step.
  Previously it was set only on the "Install Python dependencies" step
  (and to a hard-coded /opt/hostedtoolcache path that no longer exists),
  so reticulate had no interpreter pinned at test time and the new
  AnnData <-> Seurat/SCE tests silently skipped.
@codecov

codecov Bot commented May 19, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@katosh
Collaborator

katosh commented May 19, 2026

Please ensure good testing coverage and include round-trip testing for complex AnnData and Seurat objects.

Add testthat coverage that round-trips COMPLEX objects through the full
conversion cycle and asserts structural + numeric fidelity, per the
issue #18 ask. Three new files plus a shared fixture/helper:

- helper-roundtrip.R: shared skip guard + seeded-RNG (reused from the
  existing suite), output-quieting + comparison utilities, and complex
  fixture builders for AnnData, SCE, and Seurat objects (multiple
  layers/assays, PCA+UMAP, varm, obsp/varp graphs, altExps, raw,
  sparse+dense matrices, and obs/var with factor/NA/logical/character).
- test-roundtrip_complex_anndata.R: AnnData -> SCE -> AnnData (faithful,
  through on-disk .h5ad) and AnnData -> Seurat -> SCE -> AnnData; asserts
  X/layers/obsm/obs/var fidelity and obsp connectivity.
- test-roundtrip_complex_sce.R: SCE -> AnnData -> SCE; asserts assays,
  reducedDims, colData/rowData, and colPair->obsp / altExp->uns carry.
- test-roundtrip_complex_seurat.R: Seurat -> SCE -> AnnData -> Seurat;
  asserts counts, reductions, factor/NA metadata, and NN-graph
  connectivity.

The converters are one-directional per component; the documented
structural gaps (Seurat-mediated path drops extra layers and var;
convert_anndata_to_sce has no obsp->colPairs; altExps/colPairs/varm/varp/raw
not restored on the reverse leg) are written as skip()'d tests with
TODO(#18) so they are recorded rather than silently missing.

Test-only change: no edits to R/, NAMESPACE, man/, DESCRIPTION, or NEWS.md.
Verified with R 4.4.1 + Seurat 5.4.0 + anndata: testthat 448 pass / 0 fail
/ 8 skip; R CMD check 0 errors / 0 warnings / 3 benign notes; package
coverage 80.1% -> 82.7%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>