Represent genomic experimental data¶

This package provides classes to represent genomic experiments, commonly generated by sequencing experiments. It provides similar interfaces and functionality as the Bioconductor equivalent. The package currently provides both SummarizedExperiment & RangedSummarizedExperiment representations. A key difference between these two is the RangedSummarizedExperiment object provides an additional slot to store genomic regions for each feature and is expected to be GenomicRanges (more here).

Important

The design of SummarizedExperiment class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent samples or cells.

Note

These classes follow a functional paradigm for accessing or setting properties, with further details discussed in functional paradigm section.

Construct a `SummarizedExperiment` object¶

A SummarizedExperiment contains three key attributes,

assays: A dictionary of matrices with assay names as keys, e.g. counts, logcounts etc.
row_data: Feature information e.g. genes, transcripts, exons, etc.
column_data: Sample information about the columns of the matrices.

Important

Both row_data and column_data are expected to be BiocFrame objects and will be coerced to a BiocFrame for consistent downstream operations.

In addition, these classes can optionally accept row_names and column_names. Since row_data and column_data may also contain names, the following rules are used in the implementation:

On construction, if row_names or column_names are not provided, these are automatically inferred from row_data and column_data objects.
On accessors of these objects, the row_names in row_data and column_data are replaced by the equivalents from the SE level.
On setters for these attributes, especially with the functional style (set_row_data and set_column_data methods), additional options are available to replace the names in the SE object.

Caution

These rules help avoid unexpected mdifications in names, when either row_data or column_data objects are modified.

To construct a SummarizedExperiment, we’ll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we’ll create a BiocFrame object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.

from random import random
import pandas as pd
import numpy as np
from biocframe import BiocFrame

nrows = 200
ncols = 6
counts = np.random.rand(nrows, ncols)
row_data = BiocFrame(
    {
        "seqnames": [
            "chr1",
            "chr2",
            "chr2",
            "chr2",
            "chr1",
            "chr1",
            "chr3",
            "chr3",
            "chr3",
            "chr3",
        ]
        * 20,
        "starts": range(100, 300),
        "ends": range(110, 310),
        "strand": ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20,
        "score": range(0, 200),
        "GC": [random() for _ in range(10)] * 20,
    }
)

col_data = pd.DataFrame(
    {
        "treatment": ["ChIP", "Input"] * 3,
    }
)

Note

The inputs row_data and column_data are expected to be BiocFrame objects and will be coerced to a BiocFrame if a pandas DataFrame is supplied.

`SummarizedExperiment`¶

A SummarizedExperiment is a base class to represent any genomic experiment. This class expects features (row_data) to be either a pandas DataFrame or any variant of BiocFrame.

from summarizedexperiment import SummarizedExperiment

se = SummarizedExperiment(
    assays={"counts": counts}, row_data=row_data, column_data=col_data
)

print(se)

class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

`RangedSummarizedExperiment`¶

Similarly, we can use the same information to construct a RangeSummarizedExperiment. We convert feature information into a GenomicRanges object and provide this as row_ranges:

from genomicranges import GenomicRanges
from summarizedexperiment import RangedSummarizedExperiment

gr = GenomicRanges.from_pandas(row_data.to_pandas())

rse = RangedSummarizedExperiment(
    assays={"counts": counts}, row_data=row_data, row_ranges=gr, column_data=col_data
)
print(rse)

class: RangedSummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(1): ['treatment']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

Delayed or file-backed arrays¶

The general idea is that DelayedArray’s are a drop-in replacement for NumPy arrays, at least for BiocPy applications.

For example, we can use the DelayedArray inside a SummarizedExperiment:

import numpy
import delayedarray

# create a delayed array, can also be file-backed
x = numpy.random.rand(100, 20)
d = delayedarray.wrap(x)

# operate over delayed arrays
filtered = d[1:100:2,1:8]
total = filtered.sum(axis=0)
normalized = filtered / total
transformed = numpy.log1p(normalized)

import summarizedexperiment as SE
se_delayed = SE.SummarizedExperiment({ "counts": filtered, "lognorm": transformed })
print(se_delayed)

class: SummarizedExperiment
dimensions: (50, 7)
assays(2): ['counts', 'lognorm']
row_data columns(0): []
row_names(0):  
column_data columns(0): []
column_names(0):  
metadata(0): 

Interop with `anndata`¶

Converting a SummarizedExperiment to an AnnData representation is straightforward:

adata = se.to_anndata()
print(adata)

AnnData object with n_obs × n_vars = 6 × 200
    obs: 'treatment', 'rownames'
    var: 'seqnames', 'starts', 'ends', 'strand', 'score', 'GC'
    layers: 'counts'

/home/runner/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

Tip

To convert an AnnData object to a BiocPy representation, utilize the from_anndata method in the SingleCellExperiment class. This minimizes the loss of information when converting between these two representations.

Getters/Setters¶

Getters are available to access various attributes using either the property notation or functional style.

# access assay names
print("assay names (as property): ", se.assay_names)
print("assay names (functional style): ", se.get_assay_names())

# access row data
print(se.row_data)

assay names (as property):  ['counts']
assay names (functional style):  ['counts']
BiocFrame with 200 rows and 6 columns
      seqnames  starts    ends strand   score                  GC
        <list> <range> <range> <list> <range>              <list>
  [0]     chr1     100     110      -       0  0.8614972925399814
  [1]     chr2     101     111      +       1  0.5096117850957297
  [2]     chr2     102     112      +       2 0.36093477315964473
           ...     ...     ...    ...     ...                 ...
[197]     chr3     297     307      +     197  0.6683726191136703
[198]     chr3     298     308      -     198  0.8675023682335099
[199]     chr3     299     309      -     199  0.6613608936111253

Access an assay¶

One can access an assay by index or name:

se.assay(0) # same as se.assay("counts")

array([[0.15799072, 0.87874675, 0.72368664, 0.13449298, 0.69506158,
        0.94369915],
       [0.32906253, 0.1708719 , 0.7096846 , 0.59833207, 0.05899121,
        0.62948518],
       [0.84055262, 0.93887631, 0.29164597, 0.6302085 , 0.42569628,
        0.37791674],
       ...,
       [0.231613  , 0.00457041, 0.14720447, 0.57071099, 0.14315786,
        0.7745697 ],
       [0.07724568, 0.8954333 , 0.75504243, 0.13784545, 0.25808788,
        0.22773342],
       [0.09552369, 0.51561967, 0.43328534, 0.5958457 , 0.5502902 ,
        0.8845494 ]], shape=(200, 6))

Setters¶

Important

All property-based setters are in_place operations, with further details discussed in functional paradigm section.

modified_column_data = se.column_data.set_column("score", range(10,16))
modified_se = se.set_column_data(modified_column_data)
print(modified_se)

class: SummarizedExperiment
dimensions: (200, 6)
assays(1): ['counts']
row_data columns(6): ['seqnames', 'starts', 'ends', 'strand', 'score', 'GC']
row_names(0):  
column_data columns(2): ['treatment', 'score']
column_names(6): ['0', '1', '2', '3', '4', '5']
metadata(0): 

Now, lets check the column_data on the original object.

print(se.column_data)

BiocFrame with 6 rows and 1 column
  treatment
     <list>
0      ChIP
1     Input
2      ChIP
3     Input
4      ChIP
5     Input

Subset experiments¶

You can subset experimental data by using the subset ([]) operator. This operation accepts different slice input types, such as a boolean vector, a slice object, a list of indices, or names (if available) to subset.

In our previous example, we didn’t include row or column names. Let’s create another SummarizedExperiment object that includes names.

row_data = BiocFrame({
    "seqnames": ["chr_5", "chr_3", "chr_2"],
    "start": [100, 200, 300],
    "end": [110, 210, 310],
})

col_data = BiocFrame({
    "sample": ["SAM_1", "SAM_3", "SAM_3"],
    "disease": ["True", "True", "True"],
    },
    row_names=["cell_1", "cell_2", "cell_3"],
)

se_with_names = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3)),
    },
    row_data=row_data,
    column_data=col_data,
    row_names=["HER2", "BRCA1", "TPFK"],
    column_names=["cell_1", "cell_2", "cell_3"],
)

print(se_with_names)

class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0): 

Subset by index position¶

A straightforward slice operation:

subset_se = se_with_names[0:10, 0:3]
print(subset_se)

class: SummarizedExperiment
dimensions: (3, 3)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(3): ['HER2', 'BRCA1', 'TPFK']
column_data columns(2): ['sample', 'disease']
column_names(3): ['cell_1', 'cell_2', 'cell_3']
metadata(0): 

Subset by row names or column names¶

Either one or both of the slices can contain names. These names are mapped to row_names and column_names of the SummarizedExperiment object.

subset_se = se_with_names[:2, ["cell_1", "cell_3"]]
print(subset_se)

class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0): 

An Exception is raised if a names does not exist.

Subset by boolean vector¶

Similarly, you can also slice by a boolean array.

Important

Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.

subset_se_with_bools = se_with_names[[True, True, False], [True, False, True]]
print(subset_se_with_bools)

class: SummarizedExperiment
dimensions: (2, 2)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(2): ['cell_1', 'cell_3']
metadata(0): 

Subset by empty list¶

This is a feature not a bug ;), you can specify an empty list to completely remove all rows or samples.

Warning

An empty array ([]) is not the same as an empty slice (:). This helps us avoid unintented operations.

subset = se_with_names[:2, []]
print(subset)

class: SummarizedExperiment
dimensions: (2, 0)
assays(2): ['counts', 'lognorm']
row_data columns(3): ['seqnames', 'start', 'end']
row_names(2): ['HER2', 'BRCA1']
column_data columns(2): ['sample', 'disease']
column_names(0): []
metadata(0): 

Range-based operations¶

Additionally, since RangeSummarizedExperiment contains row_ranges, this allows us to perform a number of range-based operations that are possible on a GenomicRanges object.

For example, to subset RangeSummarizedExperiment with a query set of regions:

from iranges import IRanges
query = GenomicRanges(seqnames=["chr2"], ranges=IRanges([4], [6]), strand=["+"])

result = rse.subset_by_overlaps(query)
print(result)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 4
      1 from iranges import IRanges
      2 query = GenomicRanges(seqnames=["chr2"], ranges=IRanges([4], [6]), strand=["+"])
----> 4 result = rse.subset_by_overlaps(query)
      5 print(result)

File ~/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/summarizedexperiment/RangedSummarizedExperiment.py:879, in RangedSummarizedExperiment.subset_by_overlaps(self, query, query_type, max_gap, min_overlap, ignore_strand)
    875 _check_gr_or_rse(query)
    877 qranges = _access_granges(query)
--> 879 result = self.row_ranges.find_overlaps(
    880     query=qranges,
    881     query_type=query_type,
    882     max_gap=max_gap,
    883     min_overlap=min_overlap,
    884     ignore_strand=ignore_strand,
    885 )
    887 return self[result.get_column("self_hits"), :]

File ~/work/SummarizedExperiment/SummarizedExperiment/.tox/docs/lib/python3.11/site-packages/genomicranges/GenomicRanges.py:2233, in GenomicRanges.find_overlaps(self, query, query_type, select, max_gap, min_overlap, ignore_strand)
   2222         _query_subset = query[indices]
   2224         res_idx = _sub_subset._ranges.find_overlaps(
   2225             query=_query_subset._ranges,
   2226             query_type=query_type,
   (...)   2230             delete_index=False,
   2231         )
-> 2233         all_qhits = np.concatenate(
   2234             (all_qhits, [indices[j] for j in res_idx.get_column("query_hits")]), dtype=np.int32
   2235         )
   2236         all_shits = np.concatenate(
   2237             (all_shits, [_subset[x] for x in res_idx.get_column("self_hits")]), dtype=np.int32
   2238         )
   2240 order = np.argsort(all_qhits, stable=True)

TypeError: Cannot cast array data from dtype('float64') to dtype('int32') according to the rule 'same_kind'

Additionally, RSE supports many other interval based operations. Checkout the documentation for more details.

Combining experiments¶

SummarizedExperiment implements methods for the combine generics from BiocUtils.

These methods enable the merging or combining of multiple SummarizedExperiment objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let’s create multiple SummarizedExperiment objects.

rowData1 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [10293804, 12098948, 20984392],
        "end": [28937947, 3872839, 329837492]
    },
    index=["HER2", "BRCA1", "TPFK"],
)
colData1 = pd.DataFrame(
    {
        "sample": ["SAM_1", "SAM_3", "SAM_3"],
        "disease": ["True", "True", "True"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)
se1 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3))
    },
    row_data=rowData1,
    column_data=colData1,
    metadata={"seq_type": "paired"},
)

rowData2 = pd.DataFrame(
    {
        "seqnames": ["chr_5", "chr_3", "chr_2"],
        "start": [10293804, 12098948, 20984392],
        "end": [28937947, 3872839, 329837492]
    },
    index=["HER2", "BRCA1", "TPFK"],
)
colData2 = pd.DataFrame(
    {
        "sample": ["SAM_4", "SAM_5", "SAM_6"],
        "disease": ["True", "False", "True"],
    },
    index=["cell_4", "cell_5", "cell_6"],
)
se2 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3))
    },
    row_data=rowData2,
    column_data=colData2,
    metadata={"seq_platform": "Illumina NovaSeq 6000"},
)

rowData3 = pd.DataFrame(
    {
        "seqnames": ["chr_7", "chr_1", "chr_Y"],
        "start": [1084390, 1874937, 243879798],
        "end": [243895239, 358908298, 390820395]
    },
    index=["MYC", "BRCA2", "TPFK"],
)
colData3 = pd.DataFrame(
    {
        "sample": ["SAM_7", "SAM_8", "SAM_9"],
        "disease": ["True", "False", "False"],
        "doublet_score": [.15, .62, .18]
    },
    index=["cell_7", "cell_8", "cell_9"],
)
se3 = SummarizedExperiment(
    assays={
        "counts": np.random.poisson(lam=5, size=(3, 3)),
        "lognorm": np.random.lognormal(size=(3, 3)),
        "beta": np.random.beta(a=1, b=1, size=(3, 3))
    },
    row_data=rowData3,
    column_data=colData3,
    metadata={"seq_platform": "Illumina NovaSeq 6000"},
)

print(se1)
print(se2)
print(se3)

Important

The combine_rows or combine_columns operations, expect all experiments to contain the same assay names.

To combine experiments by row:

from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
se_combined = combine_rows(se2, se1)
print(se_combined)

Similarly to combine by column:

se_combined = combine_columns(se2, se1)
print(se_combined)

Important

You can use relaxed_combine_columns or relaxed_combined_rows when there’s mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay’s are replaced with a masked numpy array.

# se3 contains an additional assay not present in se1
se_relaxed_combine = relaxed_combine_columns(se3, se1)
print(se_relaxed_combine)

Empty experiments¶

Both these classes can also contain no experimental data, and they tend to be useful when integrated into more extensive data structures but do not contain any data themselves.

To create an empty SummarizedExperiment:

empty_se = SummarizedExperiment()
print(empty_se)

Similarly an empty RangeSummarizedExperiment:

empty_rse = RangedSummarizedExperiment()
print(empty_rse)

Represent genomic experimental data¶

Construct a SummarizedExperiment object¶

SummarizedExperiment¶

RangedSummarizedExperiment¶

Delayed or file-backed arrays¶

Interop with anndata¶

Getters/Setters¶

Access an assay¶

Setters¶

Subset experiments¶

Subset by index position¶

Subset by row names or column names¶

Subset by boolean vector¶

Subset by empty list¶

Range-based operations¶

Combining experiments¶

Empty experiments¶

Construct a `SummarizedExperiment` object¶

`SummarizedExperiment`¶

`RangedSummarizedExperiment`¶

Interop with `anndata`¶