Represent single-cell experiments

This package provides container class to represent single-cell experimental data as 2-dimensional matrices. In these matrices, the rows typically denote features or genomic regions of interest, while columns represent cells. In addition, a SingleCellExperiment (SCE) object may contain low-dimensionality embeddings, alternative experiments performed on same sample or set of cells.

Important

The design of SingleCellExperiment class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent cells.

Note

These classes follow a functional paradigm for accessing or setting properties, with further details discussed in functional paradigm section.

Installation

To get started, install the package from PyPI

pip install singlecellexperiment

Construction

The SingleCellExperiment extends RangeSummarizedExperiment and contains additional attributes:

  • reduced_dims: Slot for low-dimensionality embeddings for each cell.

  • alternative_experiments: Manages multi-modal experiments performed on the same sample or set of cells.

  • row_pairs or column_pairs: Stores relationships between features or cells.

Note

In contrast to R, matrices in Python are unnamed and do not contain row or column names. Hence, these matrices cannot be directly used as values in assays or alternative experiments. We strictly enforce type checks in these cases. To relax these restrictions for alternative experiments, set type_check_alternative_experiments to False.

Important

If you are using the alternative_experiments slot, the number of cells must match the parent experiment. Otherwise, the expectation is that the cells do not share the same sample or annotations and cannot be set in alternative experiments!

Before we construct a SingleCellExperiment object, lets generate information about rows, columns and a mock experimental data from single-cell rna-seq experiments:

import pandas as pd
import numpy as np
from scipy import sparse as sp
from biocframe import BiocFrame
from genomicranges import GenomicRanges
from random import random

nrows = 200
ncols = 6
counts = sp.rand(nrows, ncols, density=0.2, format="csr")
row_data = BiocFrame(
    {
        "seqnames": [
            "chr1",
            "chr2",
            "chr2",
            "chr2",
            "chr1",
            "chr1",
            "chr3",
            "chr3",
            "chr3",
            "chr3",
        ]
        * 20,
        "starts": range(100, 300),
        "ends": range(110, 310),
        "strand": ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20,
        "score": range(0, 200),
        "GC": [random() for _ in range(10)] * 20,
    }
)

col_data = pd.DataFrame(
    {
        "celltype": ["cluster1", "cluster2"] * 3,
    }
)

Now lets create the SingleCellExperiment instance:

from singlecellexperiment import SingleCellExperiment

sce = SingleCellExperiment(
    assays={"counts": counts}, row_data=row_data, column_data=col_data,
    reduced_dims = {"random_embeds": np.random.rand(ncols, 4)}
)

print(sce)
class: SingleCellExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['celltype']
column_names(6): ['0', '1', '2', '3', '4', '5']
main_experiment_name:  
reduced_dimensions(1): ['random_embeds']
alternative_experiments(0): []
row_pairs(0): []
column_pairs(0): []
metadata(0): 

Tip

You can also use delayed or file-backed arrays for representing experimental data, check out this section from summarized experiment.

Interop with anndata

We provide convenient methods for loading an AnnData or h5ad file into SingleCellExperiment objects.

For example, lets create an AnnData object,

import anndata as ad
from scipy import sparse as sp

counts = sp.csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
print(adata)
AnnData object with n_obs × n_vars = 100 × 2000

Converting AnnData as SingleCellExperiment is straightforward:

sce_adata = SingleCellExperiment.from_anndata(adata)
print(sce_adata)
class: SingleCellExperiment
dimensions: (2000, 100)
assays(1): ['X']
row_data columns(0): []
row_names(2000): ['0', '1', '2', ..., '1997', '1998', '1999']
column_data columns(0): []
column_names(100): ['0', '1', '2', ..., '97', '98', '99']
main_experiment_name:  
reduced_dimensions(0): []
alternative_experiments(0): []
row_pairs(0): []
column_pairs(0): []
metadata(1): uns

and vice-verse. All assays from SCE are represented in the layers slot of the AnnData object:

adata2 = sce_adata.to_anndata()
print(adata2)
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata2 = sce_adata.to_anndata()
      2 print(adata2)

File ~/work/SingleCellExperiment/SingleCellExperiment/.tox/docs/lib/python3.12/site-packages/singlecellexperiment/SingleCellExperiment.py:1147, in SingleCellExperiment.to_anndata(self, include_alternative_experiments)
   1137 def to_anndata(self, include_alternative_experiments: bool = False):
   1138     """Transform ``SingleCellExperiment``-like into a :py:class:`~anndata.AnnData` representation.
   1139 
   1140     Args:
   (...)   1145         A tuple with ``AnnData`` main experiment and a list of alternative experiments.
   1146     """
-> 1147     obj = super().to_anndata()
   1149     from delayedarray import (
   1150         DelayedArray,
   1151         is_sparse,
   1152         to_dense_array,
   1153         to_scipy_sparse_matrix,
   1154     )
   1156     if self.reduced_dims is not None:

File ~/work/SingleCellExperiment/SingleCellExperiment/.tox/docs/lib/python3.12/site-packages/summarizedexperiment/base.py:1192, in BaseSE.to_anndata(self)
   1189 if tcols.empty:
   1190     tcols.index = range(self._shape[1])
-> 1192 obj = AnnData(
   1193     obs=tcols,
   1194     var=trows,
   1195     uns=self.metadata,
   1196     layers=layers,
   1197 )
   1199 return obj

    [... skipping hidden 1 frame]

File ~/work/SingleCellExperiment/SingleCellExperiment/.tox/docs/lib/python3.12/site-packages/anndata/_core/anndata.py:252, in AnnData.__init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    250     self._init_as_view(X, oidx, vidx)
    251 else:
--> 252     self._init_as_actual(
    253         X=X,
    254         obs=obs,
    255         var=var,
    256         uns=uns,
    257         obsm=obsm,
    258         varm=varm,
    259         raw=raw,
    260         layers=layers,
    261         dtype=dtype,
    262         shape=shape,
    263         obsp=obsp,
    264         varp=varp,
    265         filename=filename,
    266         filemode=filemode,
    267     )

File ~/work/SingleCellExperiment/SingleCellExperiment/.tox/docs/lib/python3.12/site-packages/anndata/_core/anndata.py:459, in AnnData._init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    456         raise ValueError(msg)
    458 # unstructured annotations
--> 459 self.uns = uns or OrderedDict()
    461 self.obsm = obsm
    462 self.varm = varm

File ~/work/SingleCellExperiment/SingleCellExperiment/.tox/docs/lib/python3.12/site-packages/anndata/_core/anndata.py:889, in AnnData.uns(self, value)
    887 if not isinstance(value, MutableMapping):
    888     msg = "Only mutable mapping types (e.g. dict) are allowed for `.uns`."
--> 889     raise ValueError(msg)
    890 if isinstance(value, DictView):
    891     value = value.copy()

ValueError: Only mutable mapping types (e.g. dict) are allowed for `.uns`.

Similarly, one can load a h5ad file:

from singlecellexperiment import read_h5ad
sce_h5 = read_h5ad("../../assets/data/adata.h5ad")
print(sce_h5)

From 10X formats

In addition, we also provide convenient methods to load a 10X Genomics HDF5 Feature-Barcode Matrix Format file.

from singlecellexperiment import read_tenx_h5
sce_h5 = read_tenx_h5("../../assets/data/tenx.sub.h5")
print(sce_h5)

Note

Methods are also available to read a 10x matrix market directory using the read_tenx_mtx function.

Getters/Setters

Getters are available to access various attributes using either the property notation or functional style.

Since SingleCellExperiment extends RangedSummarizedExperiment, all getters and setters from the base class are accessible here; more details here.

# access assay names
print("reduced dim names (as property): ", sce.reduced_dim_names)
print("reduced dim names (functional style): ", sce.get_reduced_dim_names())

# access row data
print(sce.row_data)

Access a reduced dimension

One can access an reduced dimension by index or name:

sce.reduced_dim(0) # same as se.reduced_dim("random_embeds")

Subset experiments

You can subset experimental data by using the subset ([]) operator. This operation accepts different slice input types, such as a boolean vector, a slice object, a list of indices, or names (if available) to subset.

In our previous example, we didn’t include row or column names. Let’s create another SingleCellExperiment object that includes names.

subset_sce = sce[0:10, 0:3]
print(subset_sce)

Combining experiments

SingleCellExperiment implements methods for the combine generic from BiocUtils.

These methods enable the merging or combining of multiple SingleCellExperiment objects, allowing users to aggregate data from different experiments or conditions. Note: row_pairs and column_pairs are not ignored as part of this operation.

To demonstrate, let’s create multiple SingleCellExperiment objects (read more about this in combine section from SummarizedExperiment).

ncols = 10
nrows = 100
sce1 = SingleCellExperiment(
    assays={"counts": np.random.poisson(lam=10, size=(nrows, ncols))},
    row_data=BiocFrame({"A": [1] * nrows}),
    column_data=BiocFrame({"A": [1] * ncols}),
)

sce2 = SingleCellExperiment(
    assays={
        "counts": np.random.poisson(lam=10, size=(nrows, ncols)),
        # "normalized": np.random.normal(size=(nrows, ncols)),
    },
    row_data=BiocFrame({"A": [3] * nrows}),
    column_data=BiocFrame({"A": [3] * ncols}),
)

rowdata1 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [500, 300, 200],
        "end": [510, 310, 210],
    },
    index=["HER2", "BRCA1", "TPFK"],
)
coldata1 = pd.DataFrame(
    {
        "sample": ["SAM_1", "SAM_2", "SAM_3"],
        "disease": ["True", "True", "True"],
        "doublet_score": [0.15, 0.62, 0.18],
    },
    index=["cell_1", "cell_2", "cell_3"],
)
sce_alts1 = SingleCellExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3)),
    },
    row_data=rowdata1,
    column_data=coldata1,
    row_names=["HER2", "BRCA1", "TPFK"],
    column_names=["cell_1", "cell_2", "cell_3"],
    metadata={"seq_type": "paired"},
    reduced_dims={"PCA": np.random.poisson(lam=10, size=(3, 5))},
    alternative_experiments={
        "modality1": SingleCellExperiment(
            assays={"counts2": np.random.poisson(lam=10, size=(3, 3))},
        )
    },
)

rowdata2 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [500, 300, 200],
        "end": [510, 310, 210],
    },
    index=["HER2", "BRCA1", "TPFK"],
)
coldata2 = pd.DataFrame(
    {
        "sample": ["SAM_4", "SAM_5", "SAM_6"],
        "disease": ["True", "False", "True"],
        "doublet_score": [0.05, 0.23, 0.54],
    },
    index=["cell_4", "cell_5", "cell_6"],
)
sce_alts2 = SingleCellExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        # "lognorm": np.random.lognormal(size=(3, 3)),
    },
    row_data=rowdata2,
    column_data=coldata2,
    metadata={"seq_platform": "Illumina NovaSeq 6000"},
    reduced_dims={"PCA": np.random.poisson(lam=5, size=(3, 5))},
    alternative_experiments={
        "modality1": SingleCellExperiment(
            assays={"counts2": np.random.poisson(lam=5, size=(3, 3))},
        )
    },
)

The combine_rows or combine_columns operations, expect all experiments to contain the same assay names. To combine experiments by row:

from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
sce_combined = combine_rows(sce2, sce1)
print(sce_combined)

Similarly to combine by column:

sce_combined = combine_columns(sce2, sce1)
print(sce_combined)

Note

You can use relaxed_combine_columns or relaxed_combined_rows when there’s mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay’s are replaced with a masked numpy array.

# sce_alts1 contains an additional assay not present in sce_alts2
sce_relaxed_combine = relaxed_combine_columns(sce_alts1, sce_alts2)
print(sce_relaxed_combine)

Export as AnnData or MuData

The package also provides methods to convert a SingleCellExperiment object into a MuData representation:

mdata = sce.to_mudata()
mdata

or coerce to an AnnData object:

adata, alts = sce.to_anndata()
print("main experiment: ", adata)
print("alternative experiments: ", alts)