Represent genomic experimental data¶
This package provides classes to represent genomic experiments, commonly generated by sequencing experiments. It provides similar interfaces and functionality as the Bioconductor equivalent. The package currently provides both SummarizedExperiment
& RangedSummarizedExperiment
representations. A key difference between these two is the RangedSummarizedExperiment
object provides an additional slot to store genomic regions for each feature and is expected to be GenomicRanges
(more here).
Important
The design of SummarizedExperiment
class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent samples or cells.
Note
These classes follow a functional paradigm for accessing or setting properties, with further details discussed in functional paradigm section.
Construct a SummarizedExperiment
object¶
A SummarizedExperiment
contains three key attributes,
assays
: A dictionary of matrices with assay names as keys, e.g. counts, logcounts etc.row_data
: Feature information e.g. genes, transcripts, exons, etc.column_data
: Sample information about the columns of the matrices.
Important
Both row_data
and column_data
are expected to be BiocFrame objects and will be coerced to a BiocFrame
for consistent downstream operations.
In addition, these classes can optionally accept row_names
and column_names
. Since row_data
and column_data
may also contain names, the following rules are used in the implementation:
On construction, if
row_names
orcolumn_names
are not provided, these are automatically inferred fromrow_data
andcolumn_data
objects.On accessors of these objects, the
row_names
inrow_data
andcolumn_data
are replaced by the equivalents from the SE level.On setters for these attributes, especially with the functional style (
set_row_data
andset_column_data
methods), additional options are available to replace the names in the SE object.
Caution
These rules help avoid unexpected mdifications in names, when either row_data
or column_data
objects are modified.
To construct a SummarizedExperiment
, we’ll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we’ll create a BiocFrame
object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.
from random import random
import pandas as pd
import numpy as np
from biocframe import BiocFrame
nrows = 200
ncols = 6
counts = np.random.rand(nrows, ncols)
row_data = BiocFrame(
{
"seqnames": [
"chr1",
"chr2",
"chr2",
"chr2",
"chr1",
"chr1",
"chr3",
"chr3",
"chr3",
"chr3",
]
* 20,
"starts": range(100, 300),
"ends": range(110, 310),
"strand": ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20,
"score": range(0, 200),
"GC": [random() for _ in range(10)] * 20,
}
)
col_data = pd.DataFrame(
{
"treatment": ["ChIP", "Input"] * 3,
}
)
Note
The inputs row_data
and column_data
are expected to be BiocFrame
objects and will be coerced to a BiocFrame
if a pandas DataFrame
is supplied.
SummarizedExperiment
¶
A SummarizedExperiment
is a base class to represent any genomic experiment. This class expects features (row_data
) to be either a pandas DataFrame
or any variant of BiocFrame
.
from summarizedexperiment import SummarizedExperiment
se = SummarizedExperiment(
assays={"counts": counts}, row_data=row_data, column_data=col_data
)
print(se)
class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0):
RangedSummarizedExperiment
¶
Similarly, we can use the same information to construct a RangeSummarizedExperiment
. We convert feature information into a GenomicRanges
object and provide this as row_ranges
:
from genomicranges import GenomicRanges
from summarizedexperiment import RangedSummarizedExperiment
gr = GenomicRanges.from_pandas(row_data.to_pandas())
rse = RangedSummarizedExperiment(
assays={"counts": counts}, row_data=row_data, row_ranges=gr, column_data=col_data
)
print(rse)
class: RangedSummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0):
Delayed or file-backed arrays¶
The general idea is that DelayedArray’s are a drop-in replacement for NumPy arrays, at least for BiocPy applications.
For example, we can use the DelayedArray
inside a SummarizedExperiment
:
import numpy
import delayedarray
# create a delayed array, can also be file-backed
x = numpy.random.rand(100, 20)
d = delayedarray.wrap(x)
# operate over delayed arrays
filtered = d[1:100:2,1:8]
total = filtered.sum(axis=0)
normalized = filtered / total
transformed = numpy.log1p(normalized)
import summarizedexperiment as SE
se_delayed = SE.SummarizedExperiment({ "counts": filtered, "lognorm": transformed })
print(se_delayed)
class: SummarizedExperiment
dimensions: (50, 7)
assays(2): ['counts', 'lognorm']
row_data columns(0): []
row_names(0):
column_data columns(0): []
column_names(0):
metadata(0):
Interop with anndata
¶
Converting a SummarizedExperiment
to an AnnData
representation is straightforward:
adata = se.to_anndata()
print(adata)
AnnData object with n_obs × n_vars = 6 × 200
obs: 'treatment', 'rownames'
var: 'seqnames', 'starts', 'ends', 'strand', 'score', 'GC'
layers: 'counts'
/home/runner/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
warnings.warn("Transforming to str index.", ImplicitModificationWarning)
Tip
To convert an AnnData
object to a BiocPy representation, utilize the from_anndata
method in the SingleCellExperiment class. This minimizes the loss of information when converting between these two representations.
Getters/Setters¶
Getters are available to access various attributes using either the property notation or functional style.
# access assay names
print("assay names (as property): ", se.assay_names)
print("assay names (functional style): ", se.get_assay_names())
# access row data
print(se.row_data)
assay names (as property): ['counts']
assay names (functional style): ['counts']
BiocFrame with 200 rows and 6 columns
seqnames starts ends strand score GC
<list> <range> <range> <list> <range> <list>
[0] chr1 100 110 - 0 0.5021027032251704
[1] chr2 101 111 + 1 0.11315767945779376
[2] chr2 102 112 + 2 0.6856567171517574
... ... ... ... ... ...
[197] chr3 297 307 + 197 0.6187942318721821
[198] chr3 298 308 - 198 0.6006342380871723
[199] chr3 299 309 - 199 0.5026034089630075
Access an assay¶
One can access an assay by index or name:
se.assay(0) # same as se.assay("counts")
array([[8.47617084e-01, 1.41818725e-01, 8.80281453e-01, 2.40265063e-01,
8.68808112e-01, 5.24471696e-02],
[8.84237314e-01, 3.16558083e-01, 8.74361868e-01, 8.09549643e-01,
6.17999048e-01, 8.80996819e-01],
[5.93596681e-01, 9.71742228e-01, 8.98851414e-01, 2.56496425e-01,
9.22791075e-01, 1.24107551e-01],
...,
[8.16900783e-01, 4.61170709e-02, 2.13884096e-01, 6.58021495e-01,
9.62391637e-01, 1.59028841e-01],
[1.47188775e-01, 2.02690708e-01, 4.75456048e-01, 6.48863881e-01,
2.15161648e-01, 4.61195571e-01],
[8.26708808e-01, 1.55946421e-01, 5.87897745e-01, 8.17005659e-01,
6.84922678e-01, 4.46766510e-05]], shape=(200, 6))
Setters¶
Important
All property-based setters are in_place
operations, with further details discussed in functional paradigm section.
modified_column_data = se.column_data.set_column("score", range(10,16))
modified_se = se.set_column_data(modified_column_data)
print(modified_se)
class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):
column_data columns(2): ['treatment', 'score']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0):
Now, lets check the column_data
on the original object.
print(se.column_data)
BiocFrame with 6 rows and 1 column
treatment
<list>
0 ChIP
1 Input
2 ChIP
3 Input
4 ChIP
5 Input
Subset experiments¶
You can subset experimental data by using the subset ([]
) operator. This operation accepts different slice input types, such as a boolean vector, a slice
object, a list of indices, or names (if available) to subset.
In our previous example, we didn’t include row or column names. Let’s create another SummarizedExperiment
object that includes names.
row_data = BiocFrame({
"seqnames": ["chr_5", "chr_3", "chr_2"],
"start": [100, 200, 300],
"end": [110, 210, 310],
})
col_data = BiocFrame({
"sample": ["SAM_1", "SAM_3", "SAM_3"],
"disease": ["True", "True", "True"],
},
row_names=["cell_1", "cell_2", "cell_3"],
)
se_with_names = SummarizedExperiment(
assays={
"counts": np.random.poisson(lam=5, size=(3, 3)),
"lognorm": np.random.lognormal(size=(3, 3)),
},
row_data=row_data,
column_data=col_data,
row_names=["HER2", "BRCA1", "TPFK"],
column_names=["cell_1", "cell_2", "cell_3"],
)
print(se_with_names)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0):
Subset by index position¶
A straightforward slice operation:
subset_se = se_with_names[0:10, 0:3]
print(subset_se)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0):
Subset by row names or column names¶
Either one or both of the slices can contain names. These names are mapped to row_names
and column_names
of the SummarizedExperiment
object.
subset_se = se_with_names[:2, ["cell_1", "cell_3"]]
print(subset_se)
class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0):
An Exception
is raised if a names does not exist.
Subset by boolean vector¶
Similarly, you can also slice by a boolean array.
Important
Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.
subset_se_with_bools = se_with_names[[True, True, False], [True, False, True]]
print(subset_se_with_bools)
class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0):
Subset by empty list¶
This is a feature not a bug ;), you can specify an empty list to completely remove all rows or samples.
Warning
An empty array ([]
) is not the same as an empty slice (:
). This helps us avoid unintented operations.
subset = se_with_names[:2, []]
print(subset)
class: SummarizedExperiment
dimensions: (2, 0)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(0): []
metadata(0):
Range-based operations¶
Additionally, since RangeSummarizedExperiment
contains row_ranges
, this allows us to perform a number of range-based operations that are possible on a GenomicRanges
object.
For example, to subset RangeSummarizedExperiment
with a query set of regions:
from iranges import IRanges
query = GenomicRanges(seqnames=["chr2"], ranges=IRanges([4], [6]), strand=["+"])
result = rse.subset_by_overlaps(query)
print(result)
class: RangedSummarizedExperiment
dimensions: (0, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0):
Additionally, RSE supports many other interval based operations. Checkout the documentation for more details.
Combining experiments¶
SummarizedExperiment
implements methods for the combine
generics from BiocUtils.
These methods enable the merging or combining of multiple SummarizedExperiment
objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let’s create multiple SummarizedExperiment
objects.
rowData1 = pd.DataFrame(
{
"seqnames": ["chr_5", "chr_3", "chr_2"],
"start": [10293804, 12098948, 20984392],
"end": [28937947, 3872839, 329837492]
},
index=["HER2", "BRCA1", "TPFK"],
)
colData1 = pd.DataFrame(
{
"sample": ["SAM_1", "SAM_3", "SAM_3"],
"disease": ["True", "True", "True"],
},
index=["cell_1", "cell_2", "cell_3"],
)
se1 = SummarizedExperiment(
assays={
"counts": np.random.poisson(lam=5, size=(3, 3)),
"lognorm": np.random.lognormal(size=(3, 3))
},
row_data=rowData1,
column_data=colData1,
metadata={"seq_type": "paired"},
)
rowData2 = pd.DataFrame(
{
"seqnames": ["chr_5", "chr_3", "chr_2"],
"start": [10293804, 12098948, 20984392],
"end": [28937947, 3872839, 329837492]
},
index=["HER2", "BRCA1", "TPFK"],
)
colData2 = pd.DataFrame(
{
"sample": ["SAM_4", "SAM_5", "SAM_6"],
"disease": ["True", "False", "True"],
},
index=["cell_4", "cell_5", "cell_6"],
)
se2 = SummarizedExperiment(
assays={
"counts": np.random.poisson(lam=5, size=(3, 3)),
"lognorm": np.random.lognormal(size=(3, 3))
},
row_data=rowData2,
column_data=colData2,
metadata={"seq_platform": "Illumina NovaSeq 6000"},
)
rowData3 = pd.DataFrame(
{
"seqnames": ["chr_7", "chr_1", "chr_Y"],
"start": [1084390, 1874937, 243879798],
"end": [243895239, 358908298, 390820395]
},
index=["MYC", "BRCA2", "TPFK"],
)
colData3 = pd.DataFrame(
{
"sample": ["SAM_7", "SAM_8", "SAM_9"],
"disease": ["True", "False", "False"],
"doublet_score": [.15, .62, .18]
},
index=["cell_7", "cell_8", "cell_9"],
)
se3 = SummarizedExperiment(
assays={
"counts": np.random.poisson(lam=5, size=(3, 3)),
"lognorm": np.random.lognormal(size=(3, 3)),
"beta": np.random.beta(a=1, b=1, size=(3, 3))
},
row_data=rowData3,
column_data=colData3,
metadata={"seq_platform": "Illumina NovaSeq 6000"},
)
print(se1)
print(se2)
print(se3)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(1): seq_type
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_4', 'cell_5', 'cell_6']
metadata(1): seq_platform
class: SummarizedExperiment
dimensions: (3, 3)
assays(3): ['counts', 'lognorm', 'beta']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['MYC', 'BRCA2', 'TPFK']
column_data columns(3): ['sample', 'disease', 'doublet_score']
column_names(3): ['cell_7', 'cell_8', 'cell_9']
metadata(1): seq_platform
Important
The combine_rows
or combine_columns
operations, expect all experiments to contain the same assay names.
To combine experiments by row:
from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
se_combined = combine_rows(se2, se1)
print(se_combined)
class: SummarizedExperiment
dimensions: (6, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(6): ['HER2', 'BRCA1', 'TPFK', 'HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_4', 'cell_5', 'cell_6']
metadata(1): seq_platform
/home/runner/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/summarizedexperiment/BaseSE.py:91: UserWarning: 'row_data' does not contain unique 'row_names'.
warn("'row_data' does not contain unique 'row_names'.", UserWarning)
Similarly to combine by column:
se_combined = combine_columns(se2, se1)
print(se_combined)
class: SummarizedExperiment
dimensions: (3, 6)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(6): ['cell_4', 'cell_5', 'cell_6', 'cell_1', 'cell_2', 'cell_3']
metadata(1): seq_platform
Important
You can use relaxed_combine_columns
or relaxed_combined_rows
when there’s mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay’s are replaced with a masked numpy array.
# se3 contains an additional assay not present in se1
se_relaxed_combine = relaxed_combine_columns(se3, se1)
print(se_relaxed_combine)
class: SummarizedExperiment
dimensions: (3, 6)
assays(3): ['counts', 'lognorm', 'beta']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['MYC', 'BRCA2', 'TPFK']
column_data columns(3): ['sample', 'disease', 'doublet_score']
column_names(6): ['cell_7', 'cell_8', 'cell_9', 'cell_1', 'cell_2', 'cell_3']
metadata(1): seq_platform
Empty experiments¶
Both these classes can also contain no experimental data, and they tend to be useful when integrated into more extensive data structures but do not contain any data themselves.
To create an empty SummarizedExperiment
:
empty_se = SummarizedExperiment()
print(empty_se)
class: SummarizedExperiment
dimensions: (0, 0)
assays(0): []
row_data columns(0): []
row_names(0):
column_data columns(0): []
column_names(0):
metadata(0):
Similarly an empty RangeSummarizedExperiment
:
empty_rse = RangedSummarizedExperiment()
print(empty_rse)
class: RangedSummarizedExperiment
dimensions: (0, 0)
assays(0): []
row_data columns(0): []
row_names(0):
column_data columns(0): []
column_names(0):
metadata(0):