Bidirectional AnnData <-> Seurat/SCE conversion (v0.3.0) #4
Open
Huangy57 wants to merge 5 commits into
Standalone script merged: convert_anndata_to_seurat() now accepts a
.h5ad path or an in-memory AnnData. setup_anndata_python() reproduces
the script's conda/RETICULATE_PYTHON/CONDA_PREFIX resolution chain.
The standalone convert_anndata2seurat.R becomes a thin shim.
Hardcoded knobs made flexible:
* counts_layer accepts a vector of candidates (default covers
"counts", "raw_counts", "raw_count").
* attach_reductions_seurat() exposes a user-extensible reduction_map;
default_reduction_map() covers the common scanpy embeddings;
reduction_map = list(<key> = list(name = NA)) disables a default.
* assay name override propagates to data layer, var feature meta, and
reductions.
* orig.ident resolution: explicit arg -> uns$conversion_source ->
file basename -> "AnnData".
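The two fallback chains above (counts-layer candidates and the orig.ident ladder) can be sketched in a few lines. This is a language-neutral illustration in Python, not the package's R code: `pick_counts_layer` and `resolve_orig_ident` are hypothetical names, and stripping the file extension from the basename is an assumption.

```python
from pathlib import Path

# Default candidates as listed in the commit message.
DEFAULT_COUNTS_CANDIDATES = ("counts", "raw_counts", "raw_count")


def pick_counts_layer(available_layers, candidates=DEFAULT_COUNTS_CANDIDATES):
    """Return the first candidate name present among the layers, else None."""
    available = set(available_layers)
    for name in candidates:
        if name in available:
            return name
    return None


def resolve_orig_ident(explicit=None, uns=None, path=None):
    """Walk the ladder: explicit arg -> uns entry -> file basename -> "AnnData"."""
    if explicit:
        return explicit
    source = (uns or {}).get("conversion_source")
    if source:
        return str(source)
    if path:
        # Assumption: the extension is dropped from the basename.
        return Path(path).stem
    return "AnnData"
```

Candidate order matters: when both "counts" and "raw_count" exist, the earlier candidate wins regardless of the order in which the layers are stored.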
New scanpy-shaped data:
* use_raw = c("auto", "always", "never") wires adata.raw into the
Seurat counts layer, including the typical scanpy pattern where raw
carries the unfiltered gene set.
* attach_obsp = TRUE maps adata.obsp graphs (connectivities,
distances) to Seurat Graphs.
Python env hardening (the user's biggest concern):
* check_anndata_python() runs five layered probes: interpreter
reachable, PYTHONPATH not pinning a different Python's
site-packages (the cryptic HPC numpy-source-directory error),
anndata importable, numpy importable, end-to-end smoke probe.
Each failure raises a tailored, actionable message naming the
selected interpreter and the one-line fix.
* diagnose_anndata_python() prints a one-shot env snapshot (python,
version, anndata, numpy, RETICULATE_PYTHON, CONDA_PREFIX,
PYTHONPATH).
* setup_anndata_python() now resolves a name against
~/micromamba/envs/, ~/miniconda3/envs/, ~/anaconda3/envs/, and
$MAMBA_ROOT_PREFIX/envs/, and validates by default.
* Conversion path entrypoints auto-call check_anndata_python() before
read_h5ad(), so the user never sees the original cryptic numpy
error from deep inside Python.
* .onAttach() prints a one-liner pointing at diagnose_anndata_python().
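Two of the probes above are easy to picture in isolation: resolving an environment name against the listed roots, and spotting PYTHONPATH entries that point at another interpreter's site-packages. The sketch below is a Python illustration under stated assumptions, not the package's implementation: all three function names are hypothetical, the `bin/python` layout assumes a POSIX-style conda environment, and "foreign" is approximated as "a site-packages directory outside the selected interpreter's prefix".

```python
import os
from pathlib import Path


def candidate_env_roots(home, mamba_root=None):
    """List the env directories named in the commit message, in order."""
    roots = [Path(home) / d / "envs" for d in ("micromamba", "miniconda3", "anaconda3")]
    if mamba_root:
        roots.append(Path(mamba_root) / "envs")
    return roots


def resolve_env_python(name, roots):
    """Return the first existing <root>/<name>/bin/python, else None."""
    for root in roots:
        candidate = Path(root) / name / "bin" / "python"
        if candidate.is_file():
            return candidate
    return None


def foreign_site_packages(pythonpath, interpreter_prefix):
    """Return PYTHONPATH entries that are site-packages dirs outside the prefix."""
    prefix = os.path.normpath(str(interpreter_prefix)) + os.sep
    flagged = []
    for entry in pythonpath.split(os.pathsep):
        if not entry:
            continue
        normalized = os.path.normpath(entry)
        is_site_packages = "site-packages" in Path(normalized).parts
        if is_site_packages and not (normalized + os.sep).startswith(prefix):
            flagged.append(entry)
    return flagged
```

Returning `None` for an unknown name, rather than falling back to some default interpreter, is what lets a caller turn a typo in the env name into an explicit error.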
Real bugs uncovered & fixed:
* Seurat sanitises feature names ('gene_01' -> 'gene-01') during
CreateSeuratObject; previously SetAssayData(layer = 'data') and the
var feature-meta assignment failed with "No feature overlap"
whenever inputs contained underscores. The fix realigns rownames/colnames to
the Seurat object's actual ones after construction.
* convert_seurat_to_sce: ref_features/ref_cells were overwritten per
iteration with the *current* assay's, defeating the mismatched-assay
-> altExp routing. Set them once from the default assay.
* convert_to_anndata: process_other_assays(sce) was missing the
assayName argument, so the assay used as X was duplicated under
layers. Pass assayName through.
* anndata_mapping_keys() helper handles both the older-reticulate
Python-KeysView and the newer R-character-vector return shapes.
Without this, single-key layers/obsm with name "raw_count" got
split character-by-character ('r','a','w','_'...) corrupting layer
detection and SCE exclude lists.
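Two of the fixes above come down to small, general pitfalls. A lone string treated as a sequence iterates character by character, and a name lookup fails once the target object has rewritten the names. The Python sketch below illustrates both; it is not the package's R code. `normalize_keys` and `realign_feature_names` are hypothetical helpers, and positional matching assumes the constructed object keeps features in input order.

```python
def normalize_keys(keys):
    """Return mapping keys as a list of whole strings."""
    if keys is None:
        return []
    if isinstance(keys, str):
        # A lone string is one key, not a sequence of characters.
        return [keys]
    return [str(k) for k in keys]


def realign_feature_names(rows_by_input_name, object_feature_names):
    """Re-key rows by the names the constructed object actually uses.

    Assumption: the object keeps features in input order, so matching
    is positional rather than by name.
    """
    object_feature_names = list(object_feature_names)
    if len(rows_by_input_name) != len(object_feature_names):
        raise ValueError("feature count changed during construction")
    return dict(zip(object_feature_names, rows_by_input_name.values()))
```

For example, `list("raw_count")` yields single characters, while `normalize_keys("raw_count")` keeps the key whole. Rows keyed `gene_01` can then be re-keyed to `gene-01` without a name-based lookup.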
Test coverage (all green):
* 35 comprehensive tests covering matrix dtypes, obs/var fidelity
with NAs and factors, layer fallback chain, reduction map override
and disable, orig.ident precedence ladder, file-path entry, custom
assay, full round-trip, edge sizes.
* 12 tests for use_raw / obsp / Seurat Graphs.
* 4 realistic scanpy-shaped tests (1000 cells, raw + layers + obsm +
obsp + uns), including backed='r' read.
* 14 tests for check_anndata_python and diagnose_anndata_python
covering each failure branch via mocking.
* Pre-existing failures fixed in process_layers, ensure_csparse_matrix,
process_main_assay, sparse_matrix_conversion, convert_seurat_to_sce,
cli_convert (gracefully skips when not installed).
R CMD check: 0 errors, 0 warnings, 3 NOTEs (pre-existing housekeeping
items: .github dir, sandbox clock, LICENSE not in DESCRIPTION).
testthat: 387 expectations, 0 failures, 0 errors.
4 standalone eval scripts (eval_new_pkg, eval_compare, eval_roundtrip,
eval_edge_cases) all green.
README: new "Python environment" section plus a Troubleshooting
subsection covering the common reticulate failure modes (no Python
found, anndata not installed, PYTHONPATH pollution, silent typos
in conda env names).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aw='auto'

Two issues surfaced when running the same test suite against a second R + Python configuration (micromamba r_env: R 4.4.1 + Python 3.11 + anndata 0.12 + Seurat 5.x with strict-mode SeuratObject):

* SeuratObject >= 5.0.0 made `slot=` defunct (was deprecated in 5.0.0, now errors). Three call sites still passed `slot = "counts"`: extract_counts_matrix, identify_alt_exps_seurat, attach_alt_experiments_sce. Switch to `layer = "counts"` -- accepted by both pre-5.0 and >= 5.0.
* use_raw = "auto" was using adata.raw whenever no counts_layer matched. On real scanpy datasets like pbmc3k_processed, raw retains the unfiltered gene set (13714 genes) while adata.X is the analysis-ready filtered set (1838 genes). Auto silently swapped the user's filtered Seurat object for the unfiltered one. Now "auto" only uses raw when raw and adata.X share the gene set (the `adata.raw = adata` "save-raw-then-normalize" pattern). When they differ, prefer adata.X and emit a one-line notice; users who explicitly want the unfiltered raw can pass use_raw = "always".

Verified against the real pbmc3k_processed h5ad: 2638 cells x 1838 filtered genes, 8 louvain clusters preserved, all 4 obsm reductions (pca, tsne, umap, draw_graph_fr) attached with exact-match embeddings, both kNN graphs (connectivities + distances) attached as Seurat Graphs, and downstream FindNeighbors -> FindClusters succeeds.

testthat: 387 expectations, 0 failures, 0 errors under both configurations (r_env and fhR + scvi_env).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
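The revised use_raw = "auto" rule described in this commit can be written as a small decision function. The Python sketch below is an illustration, not the package's R code. `choose_counts_source` is a hypothetical name, and three choices are assumptions: gene sets are compared as unordered sets, a matched counts layer wins over raw in every mode except "always", and "always" errors when adata.raw is absent.

```python
def choose_counts_source(use_raw, raw_genes, x_genes, matched_layer=None):
    """Return (source, notice) where source is "raw", "X", or "layer:<name>"."""
    if use_raw not in ("auto", "always", "never"):
        raise ValueError("use_raw must be 'auto', 'always', or 'never'")
    has_raw = raw_genes is not None
    if use_raw == "always":
        if not has_raw:
            # Assumption: an explicit request for a missing raw is an error.
            raise ValueError("use_raw='always' but adata.raw is absent")
        return "raw", None
    if matched_layer is not None:
        # Assumption: a matched counts layer takes priority over raw.
        return "layer:" + matched_layer, None
    if use_raw == "never" or not has_raw:
        return "X", None
    # "auto": only trust raw when it describes the same genes as X.
    if set(raw_genes) == set(x_genes):
        return "raw", None
    notice = ("adata.raw and adata.X have different gene sets; using adata.X "
              "(pass use_raw='always' to use raw)")
    return "X", notice
```

With the pbmc3k_processed shape from the commit message (raw has more genes than X), "auto" returns X plus a notice, and only "always" selects the unfiltered raw.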
…ormly

- Export `extract_anndata_obsp()` and `extract_anndata_raw()` for parity with the rest of the `extract_anndata_*` family (X / layers / obsm were already exported). Regenerate NAMESPACE + man pages.
- pkgdown: bidirectional description on the home page + grouped reference index (Bidirectional conversion / Python environment / AnnData component extractors / Attaching components / Internal building blocks). All 35 exports covered.
- NEWS.md: new 0.3.0 entry covering bidirectional conversion, CLI dispatch by extension, reticulate helpers, Seurat 5 layer/slot robustness, `use_raw='auto'`, expanded tests / eval harness, validation results, and the documented known limitations.
- DESCRIPTION: bump RoxygenNote 7.3.1 -> 7.3.3 (regenerated by devtools::document() under roxygen2 7.3.3); update URL to list the GitHub repo + the pkgdown site (required by pkgdown::check_pkgdown).
- Bump `actions/setup-python` v2 -> v5 and pin Python 3.8 (EOL) -> 3.11. Modern `anndata` will not install on 3.8.
- Export `RETICULATE_PYTHON` from `steps.setup-python.outputs.python-path` via `$GITHUB_ENV` so it persists to the test/coverage step. Previously it was set only on the "Install Python dependencies" step (and to a hard-coded /opt/hostedtoolcache path that no longer exists), so reticulate had no interpreter pinned at test time and the new AnnData <-> Seurat/SCE tests silently skipped.
Please ensure good testing coverage and include round-trip testing for complex AnnData and Seurat objects.
Add testthat coverage that round-trips COMPLEX objects through the full conversion cycle and asserts structural + numeric fidelity, per the issue #18 ask. Three new files plus a shared fixture/helper:

- helper-roundtrip.R: shared skip guard + seeded-RNG (reused from the existing suite), output-quieting + comparison utilities, and complex fixture builders for AnnData, SCE, and Seurat objects (multiple layers/assays, PCA+UMAP, varm, obsp/varp graphs, altExps, raw, sparse+dense matrices, and obs/var with factor/NA/logical/character).
- test-roundtrip_complex_anndata.R: AnnData -> SCE -> AnnData (faithful, through on-disk .h5ad) and AnnData -> Seurat -> SCE -> AnnData; asserts X/layers/obsm/obs/var fidelity and obsp connectivity.
- test-roundtrip_complex_sce.R: SCE -> AnnData -> SCE; asserts assays, reducedDims, colData/rowData, and colPair->obsp / altExp->uns carry.
- test-roundtrip_complex_seurat.R: Seurat -> SCE -> AnnData -> Seurat; asserts counts, reductions, factor/NA metadata, and NN-graph connectivity.

The converters are one-directional per component; the documented structural gaps (Seurat-mediated path drops extra layers and var; convert_anndata_to_sce has no obsp->colPairs; altExps/colPairs/varm/varp/raw not restored on the reverse leg) are written as skip()'d tests with TODO(#18) so they are recorded rather than silently missing.

Test-only change: no edits to R/, NAMESPACE, man/, DESCRIPTION, or NEWS.md.

Verified with R 4.4.1 + Seurat 5.4.0 + anndata: testthat 448 pass / 0 fail / 8 skip; R CMD check 0 errors / 0 warnings / 3 benign notes; package coverage 80.1% -> 82.7%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Adds bidirectional conversion to `convert2anndata` (v0.3.0): in addition to SCE/Seurat -> AnnData, the package now converts AnnData (.h5ad) into Seurat or SingleCellExperiment objects, with `cli_convert()` dispatching on file extension.

New: `convert_anndata_to_seurat()`, `convert_anndata_to_sce()`, the `extract_anndata_*` family, `attach_reductions_seurat()`, and the `setup/check/diagnose_anndata_python()` reticulate helpers. Includes Seurat 5 layer/slot robustness and `use_raw='auto'`. ~10 new test files plus a standalone eval harness.

This PR also includes PR-polish on top of the feature commits:

- Export `extract_anndata_obsp()` / `extract_anndata_raw()` for parity with the rest of the `extract_anndata_*` family; regenerate NAMESPACE + man pages.
- Bump `setup-python` v2 -> v5, and pin `RETICULATE_PYTHON` across all steps so the AnnData tests run instead of silently skipping.
- `NEWS.md` (0.3.0).

Validation
Validated with R 4.4.1, Seurat 5.4.0, SingleCellExperiment 1.28.1, reticulate 1.45.0, Python anndata 0.12.0:

- `R CMD check`: 0 errors, 0 warnings, 3 benign notes.
- `testthat`: 387 pass, 0 fail, 1 skip (CLI subprocess test, skips until the package is installed on the libpath).
- … comparison) all pass.

Inspectable evidence (logs + fidelity table) is attached to [settylab/sarah-nexus#18](https://github.com/settylab/sarah-nexus/issues/18).

Known limitations
- The Seurat-mediated path (AnnData -> Seurat -> ...) cannot preserve arbitrary extra `layers` or `var` metadata; a Seurat assay has only `counts`/`data`/`scale.data` slots. Use the SCE-mediated path when verbatim layers / `var` matter.
- `convert_anndata_to_sce()` does not yet attach `obsp` as `colPairs` (the Seurat target does attach `obsp` as graphs). Planned follow-up.

Test plan

- `R CMD check` clean locally.