This package provides containers to represent genomic experimental data as 2-dimensional matrices. In these matrices, the rows typically denote features or genomic regions of interest, while columns represent samples or cells.
The package currently includes two representations: SummarizedExperiment and RangedSummarizedExperiment. The key distinction is that the rows of a RangedSummarizedExperiment object are expected to be GenomicRanges (see the GenomicRanges tutorial), representing genomic regions of interest.
Important
The design of the SummarizedExperiment class and its derivatives adheres to the R/Bioconductor specification, where rows correspond to features and columns represent samples or cells.
Note
These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the functional paradigm section.
A SummarizedExperiment contains three key attributes:
assays: A dictionary of matrices with assay names as keys, e.g. counts, logcounts etc.
row_data: Feature information e.g. genes, transcripts, exons, etc.
column_data: Sample information about the columns of the matrices.
Important
Both row_data and column_data are expected to be BiocFrame objects and will be coerced to a BiocFrame for consistent downstream operations.
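To make the relationship between the three attributes concrete, here is a minimal, library-free sketch. `ToyExperiment` is a hypothetical stand-in, not the package's class: it uses nested lists for assays and plain dictionaries of columns in place of BiocFrame objects, and it only checks that the dimensions of the annotations agree with the matrices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # rows (features) x columns (samples)


@dataclass
class ToyExperiment:
    """Hypothetical stand-in for a SummarizedExperiment-like container."""

    assays: Dict[str, Matrix]
    row_data: Dict[str, list] = field(default_factory=dict)     # one value per feature
    column_data: Dict[str, list] = field(default_factory=dict)  # one value per sample

    @staticmethod
    def _shape_of(matrix: Matrix) -> Tuple[int, int]:
        # Assumes rectangular matrices; ragged rows are not checked here.
        return (len(matrix), len(matrix[0]) if matrix else 0)

    @property
    def shape(self) -> Tuple[int, int]:
        if not self.assays:
            return (0, 0)
        return self._shape_of(next(iter(self.assays.values())))

    def __post_init__(self):
        # All assays must share the same dimensions.
        shapes = {self._shape_of(m) for m in self.assays.values()}
        if len(shapes) > 1:
            raise ValueError(f"assays have inconsistent shapes: {sorted(shapes)}")
        n_rows, n_cols = self.shape
        # Row annotations describe features, column annotations describe samples.
        for name, values in self.row_data.items():
            if len(values) != n_rows:
                raise ValueError(f"row_data column '{name}' must have {n_rows} values")
        for name, values in self.column_data.items():
            if len(values) != n_cols:
                raise ValueError(f"column_data column '{name}' must have {n_cols} values")


counts = [[1, 2, 3], [4, 5, 6]]  # 2 features x 3 samples
toy = ToyExperiment(
    assays={"counts": counts},
    row_data={"symbol": ["geneA", "geneB"]},
    column_data={"treatment": ["ChIP", "Input", "ChIP"]},
)
print(toy.shape)
```

The real classes perform richer validation and coercion; the sketch only shows why rows of every assay line up with row_data and columns with column_data.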
In addition, these classes can optionally accept row_names and column_names. Since row_data and column_data may also contain names, the following rules are used in the implementation:
On construction, if row_names or column_names are not provided, these are automatically inferred from row_data and column_data objects.
On accessors of these objects, the names stored in row_data and column_data are replaced by the row_names and column_names held at the SE level.
On setters for these attributes, especially with the functional style (set_row_data and set_column_data methods), additional options are available to replace the names in the SE object.
Caution
These rules help avoid unexpected modifications to names when either the row_data or column_data objects are modified.
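The three naming rules above can be illustrated with a small, library-free sketch. `ToyTable` and `ToySE` are hypothetical classes written for this illustration, and the `replace_row_names` flag is an assumption about how such an option might look; consult the package documentation for the actual setter options.

```python
from typing import List, Optional


class ToyTable:
    """Hypothetical annotation table that may carry its own row names."""

    def __init__(self, n_rows: int, row_names: Optional[List[str]] = None):
        self.n_rows = n_rows
        self.row_names = row_names


class ToySE:
    """Hypothetical container that keeps names at the experiment level."""

    def __init__(self, row_data: ToyTable, row_names: Optional[List[str]] = None):
        # Rule 1: if names are not provided, infer them from row_data.
        if row_names is None:
            row_names = row_data.row_names
        self._row_names = list(row_names) if row_names is not None else None
        self._row_data = row_data

    def get_row_names(self) -> Optional[List[str]]:
        return self._row_names

    def get_row_data(self) -> ToyTable:
        # Rule 2: the accessor returns a table whose names come from the SE level.
        return ToyTable(self._row_data.n_rows, self._row_names)

    def set_row_data(self, row_data: ToyTable, replace_row_names: bool = False) -> "ToySE":
        # Rule 3: the functional-style setter returns a new object and only
        # replaces the SE-level names when explicitly asked to.
        names = row_data.row_names if replace_row_names else self._row_names
        return ToySE(row_data, names)


se = ToySE(ToyTable(2, ["geneA", "geneB"]))

# New annotations arrive with different names; the SE-level names are kept.
updated = se.set_row_data(ToyTable(2, ["x", "y"]))
print(updated.get_row_names())

# Opting in replaces the names held by the experiment.
replaced = se.set_row_data(ToyTable(2, ["x", "y"]), replace_row_names=True)
print(replaced.get_row_names())
```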
To construct a SummarizedExperiment, we’ll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we’ll create a BiocFrame object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.
Similarly, we can use the same information to construct a RangedSummarizedExperiment. We convert feature information into a GenomicRanges object and provide this as row_ranges:
Delayed or file-backed arrays
The general idea is that DelayedArrays are a drop-in replacement for NumPy arrays, at least for BiocPy applications. See the DelayedArray documentation to learn more.
For example, we can use the DelayedArray inside a SummarizedExperiment:
    import numpy
    import delayedarray

    # create a delayed array, can also be file-backed
    x = numpy.random.rand(100, 20)
    d = delayedarray.wrap(x)

    # operate over delayed arrays
    filtered = d[1:100:2, 1:8]
    total = filtered.sum(axis=0)
    normalized = filtered / total
    transformed = numpy.log1p(normalized)

    import summarizedexperiment as SE

    se_delayed = SE.SummarizedExperiment({"counts": filtered, "lognorm": transformed})
    print(se_delayed)
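For readers new to the idea of "delayed" operations, the following library-free sketch shows the principle in one dimension. `ToyDelayed` is a hypothetical class and is not how the delayedarray package is implemented; it only demonstrates that operations can be recorded first and evaluated later, on just the subset that is requested.

```python
import math
from typing import Callable, List, Tuple


class ToyDelayed:
    """Hypothetical delayed vector: a seed plus pending element-wise operations."""

    def __init__(self, seed: List[float], ops: Tuple[Callable[[float], float], ...] = ()):
        self._seed = seed
        self._ops = tuple(ops)

    def map(self, fn: Callable[[float], float]) -> "ToyDelayed":
        # Record the operation; nothing is computed yet.
        return ToyDelayed(self._seed, self._ops + (fn,))

    def __getitem__(self, index: slice) -> "ToyDelayed":
        # Element-wise operations commute with subsetting, so we can subset
        # the seed and keep the operations pending.
        return ToyDelayed(self._seed[index], self._ops)

    def realize(self) -> List[float]:
        # Only now are the recorded operations applied, in order.
        out = []
        for value in self._seed:
            for fn in self._ops:
                value = fn(value)
            out.append(value)
        return out


d = ToyDelayed([0.0, 1.0, 3.0, 7.0])
transformed_toy = d.map(lambda v: v * 2).map(math.log1p)
subset_toy = transformed_toy[1:3]
print(subset_toy.realize())
```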
Tip
To convert an AnnData object to a BiocPy representation, utilize the from_anndata method in the SingleCellExperiment class. This minimizes the loss of information when converting between these two representations.
Getters/Setters
Getters are available to access various attributes using either the property notation or functional style.
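The difference between property notation and the functional style can be sketched without the library. `ToyContainer` is a hypothetical class, and the `in_place` flag is an assumption made for this illustration: the functional setter returns a modified copy by default, so the original object is left untouched.

```python
import copy


class ToyContainer:
    """Hypothetical container exposing both access styles."""

    def __init__(self, metadata=None):
        self._metadata = dict(metadata or {})

    def get_metadata(self) -> dict:
        # Functional-style getter.
        return self._metadata

    @property
    def metadata(self) -> dict:
        # Property notation simply forwards to the functional getter.
        return self.get_metadata()

    def set_metadata(self, metadata: dict, in_place: bool = False) -> "ToyContainer":
        # Functional-style setter: modify a shallow copy unless told otherwise.
        target = self if in_place else copy.copy(self)
        target._metadata = dict(metadata)
        return target


original = ToyContainer({"genome": "hg38"})
modified = original.set_metadata({"genome": "mm10"})

print(original.metadata)  # unchanged
print(modified.get_metadata())
```

Returning a new object makes chained calls safe, at the cost of a shallow copy per call.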
You can subset experimental data by using the subset ([]) operator. This operation accepts different slice input types, such as a boolean vector, a slice object, a list of indices, or names (if available) to subset.
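As a rough illustration of how the different input types can be handled, the sketch below converts each of them into a list of integer positions for one dimension. `normalize_index` is a hypothetical helper written for this example and is not part of the package.

```python
from typing import List, Optional, Sequence, Union

Index = Union[slice, Sequence[bool], Sequence[int], Sequence[str]]


def normalize_index(index: Index, length: int, names: Optional[Sequence[str]] = None) -> List[int]:
    """Convert a slice, boolean vector, index list, or name list to positions."""
    if isinstance(index, slice):
        return list(range(*index.indices(length)))

    items = list(index)
    if not items:
        return []

    # bool is a subclass of int, so boolean vectors must be checked first.
    if all(isinstance(item, bool) for item in items):
        if len(items) != length:
            raise ValueError("boolean vector must match the dimension length")
        return [pos for pos, keep in enumerate(items) if keep]

    if all(isinstance(item, str) for item in items):
        if names is None:
            raise ValueError("names are not available for this dimension")
        lookup = {}
        for pos, name in enumerate(names):
            lookup.setdefault(name, pos)  # first occurrence wins
        return [lookup[item] for item in items]

    if all(isinstance(item, int) and not isinstance(item, bool) for item in items):
        for item in items:
            if not 0 <= item < length:
                raise IndexError(f"index {item} is out of range")
        return items

    raise TypeError("unsupported index type")


row_names = ["geneA", "geneB", "geneC", "geneD"]
print(normalize_index(slice(0, 2), 4))
print(normalize_index([True, False, True, False], 4))
print(normalize_index([3, 1], 4))
print(normalize_index(["geneC", "geneA"], 4, row_names))
```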
In our previous example, we didn’t include row or column names. Let’s create another SummarizedExperiment object that includes names.
Additionally, since RangedSummarizedExperiment contains row_ranges, we can perform many of the range-based operations that are possible on a GenomicRanges object.
For example, to subset a RangedSummarizedExperiment with a query set of regions:
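Conceptually, subsetting by a query set of regions means keeping the rows whose ranges overlap at least one query region. The sketch below shows that idea with plain tuples. `Region`, `overlaps`, and `rows_overlapping` are hypothetical names, intervals are assumed to be closed on both ends, and the linear scan is for clarity only; the package delegates this work to GenomicRanges.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    seqname: str
    start: int  # inclusive (assumption of this sketch)
    end: int    # inclusive (assumption of this sketch)


def overlaps(a: Region, b: Region) -> bool:
    # Two closed intervals on the same sequence overlap when neither ends
    # before the other starts.
    return a.seqname == b.seqname and a.start <= b.end and b.start <= a.end


def rows_overlapping(row_ranges: List[Region], query: List[Region]) -> List[int]:
    # Keep the position of every row that overlaps at least one query region.
    return [
        pos
        for pos, region in enumerate(row_ranges)
        if any(overlaps(region, q) for q in query)
    ]


row_ranges = [
    Region("chr1", 100, 200),
    Region("chr1", 500, 600),
    Region("chr2", 100, 200),
]
toy_counts = [[1, 2], [3, 4], [5, 6]]  # one row per range

keep = rows_overlapping(row_ranges, [Region("chr1", 150, 550)])
toy_subset = [toy_counts[pos] for pos in keep]
print(keep, toy_subset)
```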
Additionally, RSE supports many other interval-based operations. Check out the documentation for more details.
Combining experiments
SummarizedExperiment implements methods for the combine generics from BiocUtils.
These methods enable the merging or combining of multiple SummarizedExperiment objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let’s create multiple SummarizedExperiment objects.
You can use relaxed_combine_columns or relaxed_combine_rows when there is a mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assays are replaced with a masked NumPy array.
    # se3 contains an additional assay not present in se1
    se_relaxed_combine = relaxed_combine_columns(se3, se1)
    print(se_relaxed_combine)
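The placeholder-filling behaviour of a relaxed column-wise combine can be sketched without the library. In this hypothetical helper, each experiment is reduced to a dictionary of assays stored as nested lists, and `None` stands in for the masked values that the package uses; the function name and data layout are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Matrix = List[List[Optional[float]]]


def toy_relaxed_combine_columns(*experiments: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Combine assays column-wise, filling absent assays with None placeholders."""
    # Assumes every experiment has at least one assay and the same rows.
    n_rows = len(next(iter(experiments[0].values())))
    for assays in experiments:
        if any(len(matrix) != n_rows for matrix in assays.values()):
            raise ValueError("all experiments must have the same number of rows")

    # Union of assay names, in order of first appearance.
    assay_names: List[str] = []
    for assays in experiments:
        for name in assays:
            if name not in assay_names:
                assay_names.append(name)

    combined: Dict[str, Matrix] = {}
    for name in assay_names:
        rows: Matrix = [[] for _ in range(n_rows)]
        for assays in experiments:
            width = len(next(iter(assays.values()))[0])  # columns in this experiment
            block = assays.get(name)
            if block is None:
                # Assay absent here: fill its columns with placeholders.
                block = [[None] * width for _ in range(n_rows)]
            for row, extra in zip(rows, block):
                row.extend(extra)
        combined[name] = rows
    return combined


toy_se1 = {"counts": [[1, 2], [3, 4]]}
toy_se3 = {"counts": [[5], [6]], "lognorm": [[0.5], [0.6]]}

toy_combined = toy_relaxed_combine_columns(toy_se3, toy_se1)
print(toy_combined["counts"])
print(toy_combined["lognorm"])
```

The combined object keeps every assay seen in any input, and the columns contributed by an experiment that lacks an assay are marked as missing rather than dropped.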
Both classes can also be created without any experimental data. Such empty objects are useful when they are integrated into more extensive data structures but do not contain any data themselves.