Represent genomic experimental data

This package provides classes to represent genomic experiments, commonly generated by sequencing experiments. It provides similar interfaces and functionality as the Bioconductor equivalent. The package currently provides both SummarizedExperiment & RangedSummarizedExperiment representations. A key difference between these two is the RangedSummarizedExperiment object provides an additional slot to store genomic regions for each feature and is expected to be GenomicRanges (more here).

Important

The design of SummarizedExperiment class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent samples or cells.

Note

These classes follow a functional paradigm for accessing or setting properties, with further details discussed in functional paradigm section.

Construct a SummarizedExperiment object

A SummarizedExperiment contains three key attributes,

  • assays: A dictionary of matrices with assay names as keys, e.g. counts, logcounts etc.

  • row_data: Feature information e.g. genes, transcripts, exons, etc.

  • column_data: Sample information about the columns of the matrices.

Important

Both row_data and column_data are expected to be BiocFrame objects and will be coerced to a BiocFrame for consistent downstream operations.

In addition, these classes can optionally accept row_names and column_names. Since row_data and column_data may also contain names, the following rules are used in the implementation:

  • On construction, if row_names or column_names are not provided, these are automatically inferred from row_data and column_data objects.

  • On accessors of these objects, the row_names in row_data and column_data are replaced by the equivalents from the SE level.

  • On setters for these attributes, especially with the functional style (set_row_data and set_column_data methods), additional options are available to replace the names in the SE object.

Caution

These rules help avoid unexpected mdifications in names, when either row_data or column_data objects are modified.

To construct a SummarizedExperiment, we’ll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we’ll create a BiocFrame object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.

from random import random
import pandas as pd
import numpy as np
from biocframe import BiocFrame

nrows = 200
ncols = 6
counts = np.random.rand(nrows, ncols)
row_data = BiocFrame(
    {
        "seqnames": [
            "chr1",
            "chr2",
            "chr2",
            "chr2",
            "chr1",
            "chr1",
            "chr3",
            "chr3",
            "chr3",
            "chr3",
        ]
        * 20,
        "starts": range(100, 300),
        "ends": range(110, 310),
        "strand": ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20,
        "score": range(0, 200),
        "GC": [random() for _ in range(10)] * 20,
    }
)

col_data = pd.DataFrame(
    {
        "treatment": ["ChIP", "Input"] * 3,
    }
)

Note

The inputs row_data and column_data are expected to be BiocFrame objects and will be coerced to a BiocFrame if a pandas DataFrame is supplied.

SummarizedExperiment

A SummarizedExperiment is a base class to represent any genomic experiment. This class expects features (row_data) to be either a pandas DataFrame or any variant of BiocFrame.

from summarizedexperiment import SummarizedExperiment

se = SummarizedExperiment(
    assays={"counts": counts}, row_data=row_data, column_data=col_data
)

print(se)
class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

RangedSummarizedExperiment

Similarly, we can use the same information to construct a RangeSummarizedExperiment. We convert feature information into a GenomicRanges object and provide this as row_ranges:

from genomicranges import GenomicRanges
from summarizedexperiment import RangedSummarizedExperiment

gr = GenomicRanges.from_pandas(row_data.to_pandas())

rse = RangedSummarizedExperiment(
    assays={"counts": counts}, row_data=row_data, row_ranges=gr, column_data=col_data
)
print(rse)
class: RangedSummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

Delayed or file-backed arrays

The general idea is that DelayedArray’s are a drop-in replacement for NumPy arrays, at least for BiocPy applications.

For example, we can use the DelayedArray inside a SummarizedExperiment:

import numpy
import delayedarray

# create a delayed array, can also be file-backed
x = numpy.random.rand(100, 20)
d = delayedarray.wrap(x)

# operate over delayed arrays
filtered = d[1:100:2,1:8]
total = filtered.sum(axis=0)
normalized = filtered / total
transformed = numpy.log1p(normalized)

import summarizedexperiment as SE
se_delayed = SE.SummarizedExperiment({ "counts": filtered, "lognorm": transformed })
print(se_delayed)
class: SummarizedExperiment
dimensions: (50, 7)
assays(2): ['counts', 'lognorm']
row_data columns(0): []
row_names(0):  
column_data columns(0): []
column_names(0):  
metadata(0): 

Interop with anndata

Converting a SummarizedExperiment to an AnnData representation is straightforward:

adata = se.to_anndata()
print(adata)
AnnData object with n_obs × n_vars = 6 × 200
    obs: 'treatment', 'rownames'
    var: 'seqnames', 'starts', 'ends', 'strand', 'score', 'GC'
    layers: 'counts'
/home/runner/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

Tip

To convert an AnnData object to a BiocPy representation, utilize the from_anndata method in the SingleCellExperiment class. This minimizes the loss of information when converting between these two representations.

Getters/Setters

Getters are available to access various attributes using either the property notation or functional style.

# access assay names
print("assay names (as property): ", se.assay_names)
print("assay names (functional style): ", se.get_assay_names())

# access row data
print(se.row_data)
assay names (as property):  ['counts']
assay names (functional style):  ['counts']
BiocFrame with 200 rows and 6 columns
      seqnames  starts    ends strand   score                  GC
        <list> <range> <range> <list> <range>              <list>
  [0]     chr1     100     110      -       0  0.5021027032251704
  [1]     chr2     101     111      +       1 0.11315767945779376
  [2]     chr2     102     112      +       2  0.6856567171517574
           ...     ...     ...    ...     ...                 ...
[197]     chr3     297     307      +     197  0.6187942318721821
[198]     chr3     298     308      -     198  0.6006342380871723
[199]     chr3     299     309      -     199  0.5026034089630075

Access an assay

One can access an assay by index or name:

se.assay(0) # same as se.assay("counts")
array([[8.47617084e-01, 1.41818725e-01, 8.80281453e-01, 2.40265063e-01,
        8.68808112e-01, 5.24471696e-02],
       [8.84237314e-01, 3.16558083e-01, 8.74361868e-01, 8.09549643e-01,
        6.17999048e-01, 8.80996819e-01],
       [5.93596681e-01, 9.71742228e-01, 8.98851414e-01, 2.56496425e-01,
        9.22791075e-01, 1.24107551e-01],
       ...,
       [8.16900783e-01, 4.61170709e-02, 2.13884096e-01, 6.58021495e-01,
        9.62391637e-01, 1.59028841e-01],
       [1.47188775e-01, 2.02690708e-01, 4.75456048e-01, 6.48863881e-01,
        2.15161648e-01, 4.61195571e-01],
       [8.26708808e-01, 1.55946421e-01, 5.87897745e-01, 8.17005659e-01,
        6.84922678e-01, 4.46766510e-05]], shape=(200, 6))

Setters

Important

All property-based setters are in_place operations, with further details discussed in functional paradigm section.

modified_column_data = se.column_data.set_column("score", range(10,16))
modified_se = se.set_column_data(modified_column_data)
print(modified_se)
class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(2): ['treatment', 'score']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

Now, lets check the column_data on the original object.

print(se.column_data)
BiocFrame with 6 rows and 1 column
  treatment
     <list>
0      ChIP
1     Input
2      ChIP
3     Input
4      ChIP
5     Input

Subset experiments

You can subset experimental data by using the subset ([]) operator. This operation accepts different slice input types, such as a boolean vector, a slice object, a list of indices, or names (if available) to subset.

In our previous example, we didn’t include row or column names. Let’s create another SummarizedExperiment object that includes names.

row_data = BiocFrame({
    "seqnames": ["chr_5", "chr_3", "chr_2"],
    "start": [100, 200, 300],
    "end": [110, 210, 310],
})

col_data = BiocFrame({
    "sample": ["SAM_1", "SAM_3", "SAM_3"],
    "disease": ["True", "True", "True"],
    },
    row_names=["cell_1", "cell_2", "cell_3"],
)

se_with_names = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3)),
    },
    row_data=row_data,
    column_data=col_data,
    row_names=["HER2", "BRCA1", "TPFK"],
    column_names=["cell_1", "cell_2", "cell_3"],
)

print(se_with_names)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0): 

Subset by index position

A straightforward slice operation:

subset_se = se_with_names[0:10, 0:3]
print(subset_se)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0): 

Subset by row names or column names

Either one or both of the slices can contain names. These names are mapped to row_names and column_names of the SummarizedExperiment object.

subset_se = se_with_names[:2, ["cell_1", "cell_3"]]
print(subset_se)
class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0): 

An Exception is raised if a names does not exist.

Subset by boolean vector

Similarly, you can also slice by a boolean array.

Important

Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.

subset_se_with_bools = se_with_names[[True, True, False], [True, False, True]]
print(subset_se_with_bools)
class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0): 

Subset by empty list

This is a feature not a bug ;), you can specify an empty list to completely remove all rows or samples.

Warning

An empty array ([]) is not the same as an empty slice (:). This helps us avoid unintented operations.

subset = se_with_names[:2, []]
print(subset)
class: SummarizedExperiment
dimensions: (2, 0)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(0): []
metadata(0): 

Range-based operations

Additionally, since RangeSummarizedExperiment contains row_ranges, this allows us to perform a number of range-based operations that are possible on a GenomicRanges object.

For example, to subset RangeSummarizedExperiment with a query set of regions:

from iranges import IRanges
query = GenomicRanges(seqnames=["chr2"], ranges=IRanges([4], [6]), strand=["+"])

result = rse.subset_by_overlaps(query)
print(result)
class: RangedSummarizedExperiment
dimensions: (0, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

Additionally, RSE supports many other interval based operations. Checkout the documentation for more details.

Combining experiments

SummarizedExperiment implements methods for the combine generics from BiocUtils.

These methods enable the merging or combining of multiple SummarizedExperiment objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let’s create multiple SummarizedExperiment objects.

rowData1 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [10293804, 12098948, 20984392],
        "end": [28937947, 3872839, 329837492]
    },
    index=["HER2", "BRCA1", "TPFK"],
)
colData1 = pd.DataFrame(
    {
        "sample": ["SAM_1", "SAM_3", "SAM_3"],
        "disease": ["True", "True", "True"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)
se1 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3))
    },
    row_data=rowData1,
    column_data=colData1,
    metadata={"seq_type": "paired"},
)

rowData2 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [10293804, 12098948, 20984392],
        "end": [28937947, 3872839, 329837492]
    },
    index=["HER2", "BRCA1", "TPFK"],
)
colData2 = pd.DataFrame(
    {
        "sample": ["SAM_4", "SAM_5", "SAM_6"],
        "disease": ["True", "False", "True"],
    },
    index=["cell_4", "cell_5", "cell_6"],
)
se2 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3))
    },
    row_data=rowData2,
    column_data=colData2,
    metadata={"seq_platform": "Illumina NovaSeq 6000"},
)

rowData3 = pd.DataFrame(
    {
        "seqnames": ["chr_7", "chr_1", "chr_Y"],
        "start": [1084390, 1874937, 243879798],
        "end": [243895239, 358908298, 390820395]
    },
    index=["MYC", "BRCA2", "TPFK"],
)
colData3 = pd.DataFrame(
    {
        "sample": ["SAM_7", "SAM_8", "SAM_9"],
        "disease": ["True", "False", "False"],
        "doublet_score": [.15, .62, .18]
    },
    index=["cell_7", "cell_8", "cell_9"],
)
se3 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3)),
        "beta": np.random.beta(a=1, b=1, size=(3, 3))
    },
    row_data=rowData3,
    column_data=colData3,
    metadata={"seq_platform": "Illumina NovaSeq 6000"},
)

print(se1)
print(se2)
print(se3)
class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(1): seq_type

class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_4', 'cell_5', 'cell_6']
metadata(1): seq_platform

class: SummarizedExperiment
dimensions: (3, 3)
assays(3): ['counts', 'lognorm', 'beta']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['MYC', 'BRCA2', 'TPFK']
column_data columns(3): ['sample', 'disease', 'doublet_score']
column_names(3): ['cell_7', 'cell_8', 'cell_9']
metadata(1): seq_platform

Important

The combine_rows or combine_columns operations, expect all experiments to contain the same assay names.

To combine experiments by row:

from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
se_combined = combine_rows(se2, se1)
print(se_combined)
class: SummarizedExperiment
dimensions: (6, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(6): ['HER2', 'BRCA1', 'TPFK', 'HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_4', 'cell_5', 'cell_6']
metadata(1): seq_platform
/home/runner/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/summarizedexperiment/BaseSE.py:91: UserWarning: 'row_data' does not contain unique 'row_names'.
  warn("'row_data' does not contain unique 'row_names'.", UserWarning)

Similarly to combine by column:

se_combined = combine_columns(se2, se1)
print(se_combined)
class: SummarizedExperiment
dimensions: (3, 6)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(6): ['cell_4', 'cell_5', 'cell_6', 'cell_1', 'cell_2', 'cell_3']
metadata(1): seq_platform

Important

You can use relaxed_combine_columns or relaxed_combined_rows when there’s mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay’s are replaced with a masked numpy array.

# se3 contains an additional assay not present in se1
se_relaxed_combine = relaxed_combine_columns(se3, se1)
print(se_relaxed_combine)
class: SummarizedExperiment
dimensions: (3, 6)
assays(3): ['counts', 'lognorm', 'beta']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['MYC', 'BRCA2', 'TPFK']
column_data columns(3): ['sample', 'disease', 'doublet_score']
column_names(6): ['cell_7', 'cell_8', 'cell_9', 'cell_1', 'cell_2', 'cell_3']
metadata(1): seq_platform

Empty experiments

Both these classes can also contain no experimental data, and they tend to be useful when integrated into more extensive data structures but do not contain any data themselves.

To create an empty SummarizedExperiment:

empty_se = SummarizedExperiment()
print(empty_se)
class: SummarizedExperiment
dimensions: (0, 0)
assays(0): []
row_data columns(0): []
row_names(0):  
column_data columns(0): []
column_names(0):  
metadata(0): 

Similarly an empty RangeSummarizedExperiment:

empty_rse = RangedSummarizedExperiment()
print(empty_rse)
class: RangedSummarizedExperiment
dimensions: (0, 0)
assays(0): []
row_data columns(0): []
row_names(0):  
column_data columns(0): []
column_names(0):  
metadata(0):