cellarr package¶
Submodules¶
cellarr.CellArrDataset module¶
Query the CellArrDataset.
This class provides methods to access the directory containing the TileDB files, usually generated using build_cellarrdataset().
Example
from cellarr import CellArrDataset
cd = CellArrDataset(dataset_path="/path/to/cellar/dir")
gene_list = ["gene_1", "gene_95", "gene_50"]
result1 = cd[0, gene_list]
print(result1)
- class cellarr.CellArrDataset.CellArrDataset(dataset_path, matrix_tdb_uri='counts', gene_annotation_uri='gene_annotation', cell_metadata_uri='cell_metadata', sample_metadata_uri='sample_metadata')[source]¶
Bases:
object
A class that represents a collection of cells and their associated metadata in a TileDB-backed store.
- __getitem__(args)[source]¶
Subset a CellArrDataset. Mostly an alias to get_slice().
- Parameters:
args (
Union
[int
,Sequence
,tuple
]) –Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be extracted.
Alternatively a tuple of length 1. The first entry specifies the rows (or cells) to retain based on their names or indices.
Alternatively a tuple of length 2. The first entry specifies the rows (or cells) to retain, while the second entry specifies the columns (or features/genes) to retain, based on their names or indices.
- Raises:
ValueError – If too many or too few slices provided.
- Return type:
- Returns:
A
CellArrDatasetSlice
object containing the cell_metadata, gene_annotation and the matrix.
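The accepted argument shapes can be illustrated with a small stand-alone sketch. The helper normalize_index below is hypothetical and is not part of cellarr; it only mirrors the documented rules (a bare index or a tuple of length 1 selects cells, a tuple of length 2 selects cells and genes, anything else raises ValueError).

```python
from typing import Any, Tuple


def normalize_index(args: Any) -> Tuple[Any, Any]:
    """Hypothetical helper: turn ``obj[...]`` arguments into (cells, genes)."""
    all_items = slice(None)  # stands for "keep everything" along one axis
    if not isinstance(args, tuple):
        # A bare index (int, slice, list, ...) selects cells only.
        return args, all_items
    if len(args) == 1:
        return args[0], all_items
    if len(args) == 2:
        return args[0], args[1]
    # Empty tuples and tuples longer than 2 are rejected.
    raise ValueError("Expected one or two slices: (cells,) or (cells, genes).")


print(normalize_index(0))
print(normalize_index((slice(0, 10), ["gene_1", "gene_95"])))
```

In this sketch a missing gene selector means "all genes"; whether the library applies the same default is an assumption.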
- __init__(dataset_path, matrix_tdb_uri='counts', gene_annotation_uri='gene_annotation', cell_metadata_uri='cell_metadata', sample_metadata_uri='sample_metadata')[source]¶
Initialize a
CellArrDataset
.- Parameters:
dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path from build_cellarrdataset().
matrix_tdb_uri – Relative path to the matrix store.
gene_annotation_uri (
str
) – Relative path to gene annotation store.cell_metadata_uri (
str
) – Relative path to cell metadata store.sample_metadata_uri (
str
) – Relative path to sample metadata store.
- get_cell_metadata_column(column_name)[source]¶
Access a column from the
cell_metadata
store.- Parameters:
column_name (
str
) – Name of the column or attribute. Usually one of the column names from get_cell_metadata_columns()
.- Return type:
- Returns:
A list of values for this column.
- get_cell_subset(subset, columns=None)[source]¶
Slice the
cell_metadata
store.- Parameters:
subset (
Union
[slice
,QueryCondition
]) –A list of integer indices to subset the
cell_metadata
store.Alternatively, may also provide a
tiledb.QueryCondition
to query the store.columns –
List of specific column names to access.
Defaults to None, in which case all columns are extracted.
- Return type:
- Returns:
A pandas Dataframe of the subset.
- get_gene_annotation_column(column_name)[source]¶
Access a column from the
gene_annotation
store.- Parameters:
column_name (
str
) – Name of the column or attribute. Usually one of the column names from get_gene_annotation_columns()
.- Return type:
- Returns:
A list of values for this column.
- get_gene_subset(subset, columns=None)[source]¶
Slice the gene_annotation store.
- Parameters:
subset (
Union
[slice
,List
[str
],QueryCondition
]) – A list of integer indices to subset the gene_annotation store.
Alternatively, may provide a tiledb.QueryCondition to query the store.
Alternatively, may provide a list of strings to match with the index of the gene_annotation store.
columns –
List of specific column names to access.
Defaults to None, in which case all columns are extracted.
- Return type:
- Returns:
A pandas Dataframe of the subset.
- get_sample_metadata_column(column_name)[source]¶
Access a column from the
sample_metadata
store.- Parameters:
column_name (
str
) – Name of the column or attribute. Usually one of the column names from get_sample_metadata_columns()
.- Return type:
- Returns:
A list of values for this column.
- get_sample_subset(subset, columns=None)[source]¶
Slice the
sample_metadata
store.- Parameters:
subset (
Union
[slice
,QueryCondition
]) –A list of integer indices to subset the
sample_metadata
store.Alternatively, may also provide a
tiledb.QueryCondition
to query the store.columns –
List of specific column names to access.
Defaults to None, in which case all columns are extracted.
- Return type:
- Returns:
A pandas Dataframe of the subset.
- get_slice(cell_subset, gene_subset)[source]¶
Subset a
CellArrDataset
.- Parameters:
cell_subset (
Union
[slice
,QueryCondition
]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the rows (or cells) to retain.
gene_subset – Integer indices, a boolean filter, or (if the current object is named) names specifying the columns (or features/genes) to retain.
- Return type:
- Returns:
A
CellArrDatasetSlice
object containing the cell_metadata, gene_annotation and the matrix for the given slice ranges.
- property shape¶
cellarr.CellArrDatasetSlice module¶
Class that represents a realized subset of the CellArrDataset.
This class provides a slice data class usually generated by the access
methods from
cellarr.CellArrDataset.CellArrDataset()
.
Example
from cellarr import CellArrDataset
cd = CellArrDataset(dataset_path="/path/to/cellar/dir")
gene_list = ["gene_1", "gene_95", "gene_50"]
result1 = cd[0, gene_list]
print(result1)
- class cellarr.CellArrDatasetSlice.CellArrDatasetSlice(cell_metadata, gene_annotation, matrix)[source]¶
Bases:
object
Class that represents a realized subset of the CellArrDataset.
- __annotations__ = {'cell_metadata': <class 'pandas.core.frame.DataFrame'>, 'gene_annotation': <class 'pandas.core.frame.DataFrame'>, 'matrix': typing.Any}¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(cell_metadata, gene_annotation, matrix)¶
- property shape¶
cellarr.build_cellarrdataset module¶
Build the CellArrDataset.
The CellArrDataset method is designed to store single-cell RNA-seq datasets but can be generalized to store any 2-dimensional experimental data.
This method creates four TileDB files in the directory specified by output_path:
- gene_annotation: A TileDB file containing feature/gene annotations.
- sample_metadata: A TileDB file containing sample metadata.
- cell_metadata: A TileDB file containing cell metadata, including the mapping to the samples they are tagged with in sample_metadata.
- A matrix TileDB file named by the layer_matrix_name parameter. This allows the package to store multiple different matrices, e.g. normalized or scaled, for the same cell, gene and sample metadata attributes.
The TileDB matrix file is stored in a cell X gene orientation. This orientation is chosen because the fastest-changing dimension as new files are added to the collection is usually the cells rather than genes.
Process:
1. Scan the Collection: Scan the entire collection of files to create a unique set of feature ids (e.g. gene symbols). Store this set as the gene_annotation TileDB file.
2. Sample Metadata: Store sample metadata in sample_metadata TileDB file. Each file is typically considered a sample, and an automatic mapping is created between files and samples.
3. Store Cell Metadata: Store cell metadata in the cell_metadata TileDB file.
4. Remap and Orient Data: For each dataset in the collection, remap and orient the feature dimension using the feature set from Step 1. This step ensures consistency in gene measurement and order, even if some genes are unmeasured or ordered differently in the original experiments.
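Steps 1 and 4 can be sketched on plain Python lists. The function names build_feature_index and remap_row are hypothetical and not part of cellarr, and sorting the feature ids is an assumption made here only to get a deterministic order; the library may order features differently.

```python
from typing import Dict, List


def build_feature_index(files_features: List[List[str]]) -> Dict[str, int]:
    """Step 1 (sketch): collect the unique feature ids across all files."""
    unique = sorted({f for features in files_features for f in features})
    return {feature: idx for idx, feature in enumerate(unique)}


def remap_row(values: List[int], features: List[str], index: Dict[str, int]) -> List[int]:
    """Step 4 (sketch): place one cell's values into the shared gene order.

    Genes that a file did not measure stay at 0 in the output row.
    """
    row = [0] * len(index)
    for value, feature in zip(values, features):
        row[index[feature]] = value
    return row


# Two files that measure different genes, in different orders.
file_a = ["gene_2", "gene_1"]
file_b = ["gene_3", "gene_1"]
feature_index = build_feature_index([file_a, file_b])
print(feature_index)
print(remap_row([5, 7], file_a, feature_index))
```

After remapping, every cell row has the same length and gene order, regardless of which file it came from.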
Example
import anndata
import numpy as np
import tempfile
from cellarr import build_cellarrdataset, CellArrDataset, MatrixOptions
# Create a temporary directory
tempdir = tempfile.mkdtemp()
# Read AnnData objects
adata1 = anndata.read_h5ad("path/to/object1.h5ad", backed="r")
# or just provide the path
adata2 = "path/to/object2.h5ad"
# Build CellArrDataset
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(dtype=np.float32),
)
- cellarr.build_cellarrdataset.build_cellarrdataset(files, output_path, gene_annotation=None, sample_metadata=None, cell_metadata=None, sample_metadata_options=SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None), cell_metadata_options=CellMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None), gene_annotation_options=GeneAnnotationOptions(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None), matrix_options=MatrixOptions(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts'), optimize_tiledb=True, num_threads=1)[source]¶
Generate the CellArrDataset.
All files are expected to be consistent and any modifications to make them consistent is outside the scope of this function and package.
This process makes a few assumptions:
- If an object in files is an AnnData or H5AD object, it must contain an assay matrix in the layers slot of the object, named as given by the layer_matrix_name parameter.
- Feature information must contain a column, defined by the feature_column parameter of GeneAnnotationOptions, that contains feature ids or gene symbols across all files.
- If no cell_metadata is provided, we scan to count the number of cells and create a simple range index.
- Each file is considered a sample and a mapping between cells and samples is automatically created. Hence the sample information provided must match the number of input files.
- Parameters:
files (
List
[Union
[str
,AnnData
]]) – List of file paths to H5AD orAnnData
objects.output_path (
str
) – Path to where the output TileDB files should be stored.
gene_annotation –
A
DataFrame
containing the feature/gene annotations across all objects.Alternatively, may provide a path to the file containing a concatenated gene annotations across all datasets. In this case, the first row is expected to contain the column names and an index column containing the feature ids or gene symbols.
Alternatively, a list or a dictionary of gene symbols.
Irrespective of the input, the object will be appended with a
cellarr_gene_index
column that contains the gene index across all objects.Defaults to None, then a gene set is generated by scanning all objects in
files
.sample_metadata (
Union
[DataFrame
,str
,None
]) –A
DataFrame
containing the sample metadata for each file in files. Hence the number of rows in the dataframe must match the number of files.
Alternatively, may provide a path to the file containing concatenated sample metadata across all cells. In this case, the first row is expected to contain the column names.
Additionally, the order of rows is expected to be in the same order as the input list of
files
.Irrespective of the input, this object is appended with a
cellarr_original_gene_list
column that contains the original set of feature ids (or gene symbols) from the dataset to differentiate between zero-expressed vs unmeasured genes.Defaults to None, in which case, we create a simple sample metadata dataframe containing the list of datasets. Each dataset is named as
sample_{i}
where i refers to the index position of the object infiles
.cell_metadata (
Union
[DataFrame
,str
,None
]) –A
DataFrame
containing the cell metadata for cells across files. Hence the number of rows in the dataframe must match the number of cells across all files.
Alternatively, may provide a path to the file containing concatenated cell metadata across all cells. In this case, the first row is expected to contain the column names.
Additionally, the order of cells is expected to be in the same order as the input list of
files
. If the input is a path, the file is expected to contain mappings between cells and datasets (or samples).Defaults to None, we scan all files to count the number of cells, then create a simple cell metadata DataFrame containing mappings from cells to their associated datasets. Each dataset is named as
sample_{i}
where i refers to the index position of the object infiles
.sample_metadata_options (
SampleMetadataOptions
) – Optional parameters when generatingsample_metadata
store.cell_metadata_options (
CellMetadataOptions
) – Optional parameters when generatingcell_metadata
store.gene_annotation_options (
GeneAnnotationOptions
) – Optional parameters when generatinggene_annotation
store.matrix_options (
MatrixOptions
) – Optional parameters when generatingmatrix
store.optimize_tiledb (
bool
) – Whether to run TileDB’s vacuum and consolidation (may take a long time).
num_threads (
int
) – Number of threads. Defaults to 1.
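The default metadata described above (one sample per file, and a cell-to-sample mapping derived from per-file cell counts) can be sketched as follows. Both helper names are hypothetical, and the zero-based numbering in sample_{i} is an assumption; the library may start counting at 1.

```python
from typing import List


def default_sample_names(num_files: int) -> List[str]:
    """One sample name per input file (zero-based index is an assumption)."""
    return [f"sample_{i}" for i in range(num_files)]


def default_cell_to_sample(cell_counts: List[int]) -> List[str]:
    """Repeat each sample name once per cell found in the matching file."""
    mapping: List[str] = []
    for name, count in zip(default_sample_names(len(cell_counts)), cell_counts):
        mapping.extend([name] * count)
    return mapping


# Two files holding 2 and 1 cells respectively.
print(default_sample_names(2))
print(default_cell_to_sample([2, 1]))
```

The resulting list has one entry per cell, in the same order as the input files, which matches the ordering requirement stated for cell_metadata.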
- cellarr.build_cellarrdataset.generate_metadata_tiledb_csv(output_uri, input, column_dtype=None, index_col=False, chunksize=1000)[source]¶
Generate a metadata TileDB from csv.
Unlike generate_metadata_tiledb_frame, this function is intended for CSV files that are too large to fit into memory; the file is read in chunks.
- Parameters:
output_uri (
str
) – TileDB URI or path to save the file.input (
str
) – Path to the csv file. The first row is expected to contain the column names.column_dtype (
Optional
[Dict
[str
,dtype
]]) – Dtype for each of the columns. Defaults to None.chunksize – Chunk size to read the dataframe. Defaults to 1000.
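Chunked reading keeps only chunksize rows in memory at a time. The sketch below shows the idea with the standard library only; read_csv_in_chunks is a hypothetical helper, not the function cellarr uses internally.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def read_csv_in_chunks(handle: TextIO, chunksize: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most ``chunksize`` rows; the first CSV row names the columns."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


# A tiny in-memory CSV stands in for a large file on disk.
text = "cell_id,label\nc0,T\nc1,B\nc2,T\n"
chunks = list(read_csv_in_chunks(io.StringIO(text), chunksize=2))
print([len(chunk) for chunk in chunks])
```

Each chunk could then be written to the store at a running row offset, so the full table never needs to be held in memory.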
cellarr.build_options module¶
- class cellarr.build_options.CellMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None)[source]¶
Bases:
object
Optional arguments for the
cell_metadata
store forbuild_cellarrdataset()
.- skip¶
Whether to skip generating cell metadata TileDB. Defaults to False.
- dtype¶
NumPy dtype for the cell dimension. Defaults to np.uint32.
Note: make sure the number of cells fits within the integer limits of unsigned-int32.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “cell_metadata”.
- column_types¶
A dictionary with column names as keys and, as values, the type to use for each column in TileDB.
If None, all columns are cast as ‘ascii’.
- __annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None)¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
uint32
- class cellarr.build_options.GeneAnnotationOptions(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None)[source]¶
Bases:
object
Optional arguments for the
gene_annotation
store forbuild_cellarrdataset()
.- feature_column¶
Column in
var
containing the feature ids (e.g. gene symbols). Defaults to the index of thevar
slot.
- skip¶
Whether to skip generating gene annotation TileDB. Defaults to False.
- dtype¶
NumPy dtype for the gene dimension. Defaults to np.uint32.
Note: make sure the number of genes fits within the integer limits of unsigned-int32.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “gene_annotation”.
- column_types¶
A dictionary with column names as keys and, as values, the type to use for each column in TileDB.
If None, all columns are cast as ‘ascii’.
- __annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'feature_column': <class 'str'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None)¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
uint32
- class cellarr.build_options.MatrixOptions(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts')[source]¶
Bases:
object
Optional arguments for the
matrix
store forbuild_cellarrdataset()
.- matrix_name¶
Matrix name from
layers
slot to add to TileDB. Must be consistent across all objects infiles
.Defaults to “counts”.
- matrix_attr_name¶
Name of the matrix to be stored in the TileDB file. Defaults to “data”.
- consolidate_duplicate_gene_func¶
Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.
Defaults to
sum()
.
- skip¶
Whether to skip generating matrix TileDB. Defaults to False.
- dtype¶
NumPy dtype for the values in the matrix. Defaults to np.uint16.
Note: make sure the matrix values fit within the range limits of unsigned-int16.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “counts”.
- __annotations__ = {'consolidate_duplicate_gene_func': <built-in function callable>, 'dtype': <class 'numpy.dtype'>, 'matrix_attr_name': <class 'str'>, 'matrix_name': <class 'str'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts')¶
- __repr__()¶
Return repr(self).
- consolidate_duplicate_gene_func(start=0)¶
Return the sum of a ‘start’ value (default: 0) plus an iterable of numbers
When the iterable is empty, return the start value. This function is intended specifically for use with numeric values and may reject non-numeric types.
- dtype¶
alias of
uint16
- class cellarr.build_options.SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)[source]¶
Bases:
object
Optional arguments for the
sample
store forbuild_cellarrdataset()
.- skip¶
Whether to skip generating sample TileDB. Defaults to False.
- dtype¶
NumPy dtype for the sample dimension. Defaults to np.uint32.
Note: make sure the number of samples fits within the integer limits of unsigned-int32.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “sample_metadata”.
- column_types¶
A dictionary with column names as keys and, as values, the type to use for each column in TileDB.
If None, all columns are cast as ‘ascii’.
- __annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
uint32
cellarr.buildutils_tiledb_array module¶
- cellarr.buildutils_tiledb_array.create_tiledb_array(tiledb_uri_path, x_dim_length=None, y_dim_length=None, x_dim_name='cell_index', y_dim_name='gene_index', matrix_attr_name='data', x_dim_dtype=<class 'numpy.uint32'>, y_dim_dtype=<class 'numpy.uint32'>, matrix_dim_dtype=<class 'numpy.uint32'>, is_sparse=True)[source]¶
Create a tiledb file with the provided attributes to persistent storage.
This will materialize the array directory and all related schema files.
- Parameters:
tiledb_uri_path (
str
) – Path to create the array tiledb file.x_dim_length (
Optional
[int
]) – Number of entries along the x/fastest-changing dimension. e.g. Number of cells. Defaults to None, in which case, the max integer value ofx_dim_dtype
is used.y_dim_length (
Optional
[int
]) – Number of entries along the y dimension. e.g. Number of genes. Defaults to None, in which case, the max integer value ofy_dim_dtype
is used.x_dim_name (
str
) – Name for the x-dimension. Defaults to “cell_index”.y_dim_name (
str
) – Name for the y-dimension. Defaults to “gene_index”.matrix_attr_name (
str
) – Name for the attribute in the array. Defaults to “data”.x_dim_dtype (
dtype
) – NumPy dtype for the x-dimension. Defaults to np.uint32.y_dim_dtype (
dtype
) – NumPy dtype for the y-dimension. Defaults to np.uint32.matrix_dim_dtype (
dtype
) – NumPy dtype for the values in the matrix. Defaults to np.uint32.is_sparse (
bool
) – Whether the matrix is sparse. Defaults to True.
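The fallback for x_dim_length and y_dim_length, and the related notes about integer limits, can be sketched without NumPy. resolve_dim_length is a hypothetical helper; the 32-bit default mirrors the documented np.uint32 default, and the value returned for None equals np.iinfo(np.uint32).max.

```python
from typing import Optional


def unsigned_max(bits: int) -> int:
    """Largest value representable by an unsigned integer of the given width."""
    return 2**bits - 1


def resolve_dim_length(length: Optional[int], bits: int = 32) -> int:
    """Use the dtype maximum when no length is given; reject lengths that overflow."""
    limit = unsigned_max(bits)
    if length is None:
        return limit
    if not 0 < length <= limit:
        raise ValueError(f"length {length} does not fit in an unsigned {bits}-bit dimension")
    return length


print(resolve_dim_length(None))      # falls back to the unsigned 32-bit maximum
print(resolve_dim_length(1000, 16))  # fits comfortably in 16 bits
```

A check like this is one way to catch a collection whose cell or gene count exceeds the chosen dimension dtype before any data is written.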
- cellarr.buildutils_tiledb_array.optimize_tiledb_array(tiledb_array_uri, verbose=True)[source]¶
Consolidate tiledb fragments.
- cellarr.buildutils_tiledb_array.write_csr_matrix_to_tiledb(tiledb_array_uri, matrix, value_dtype=<class 'numpy.uint32'>, row_offset=0, batch_size=25000)[source]¶
Append and save a
csr_matrix
to tiledb.- Parameters:
tiledb_array_uri (
Union
[str
,SparseArray
]) – Tiledb array object or path to a tiledb object.matrix (
csr_matrix
) – Input matrix to write to tiledb, must be acsr_matrix
matrix.value_dtype (
dtype
) – NumPy dtype to reformat the matrix values. Defaults touint32
.row_offset (
int
) – Offset row number to append to matrix. Defaults to 0.batch_size (
int
) – Batch size. Defaults to 25000.
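The interaction between row_offset and batch_size can be shown with a short sketch. batch_row_ranges is a hypothetical helper that only computes which local rows go into each batch and where that batch would start in the stored array; it does not write anything, and the library's actual batching logic may differ.

```python
from typing import Iterator, Tuple


def batch_row_ranges(
    num_rows: int, row_offset: int = 0, batch_size: int = 25000
) -> Iterator[Tuple[int, int, int]]:
    """Yield (local_start, local_end, global_start) for each batch of rows."""
    for start in range(0, num_rows, batch_size):
        end = min(start + batch_size, num_rows)
        # Rows [start, end) of the input land at row_offset + start in the store.
        yield start, end, row_offset + start


# A 60,000-row matrix appended after 100,000 rows already written.
for local_start, local_end, global_start in batch_row_ranges(60000, row_offset=100000):
    print(local_start, local_end, global_start)
```

When several matrices are appended one after another, each call would use the running total of previously written rows as its row_offset.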
cellarr.buildutils_tiledb_frame module¶
- cellarr.buildutils_tiledb_frame.append_to_tiledb_frame(tiledb_uri_path, frame, row_offset=0)[source]¶
Create a TileDB file with the provided attributes to persistent storage.
This will materialize the array directory and all related schema files.
- cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_chunk(tiledb_uri_path, chunk, column_types)[source]¶
Create a TileDB file from the DataFrame chunk, to persistent storage. This is used by the importer for large datasets stored in csv.
This will materialize the array directory and all related schema files.
- cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_column_names(tiledb_uri_path, column_names, column_types)[source]¶
Create a TileDB file with the provided attributes to persistent storage.
This will materialize the array directory and all related schema files.
- cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_dataframe(tiledb_uri_path, frame, column_types=<class 'dict'>)[source]¶
Create a TileDB file with the provided attributes to persistent storage.
This will materialize the array directory and all related schema files.
- Parameters:
tiledb_uri_path (
str
) – Path to create the metadata TileDB file.column_names – Column names of the data frame.
column_types – Dictionary specifying the column types for each column in the frame.
cellarr.dataloader module¶
A dataloader using TileDB files in the pytorch-lightning framework.
This class provides a dataloader over the TileDB files built using build_cellarrdataset().
Example
from cellarr.dataloader import DataModule
datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    val_studies=["test3"],
    label_column_name="label",
    study_column_name="study",
    batch_size=100,
    lognorm=True,
    target_sum=1e4,
)
dataloader = datamodule.train_dataloader()
batch = next(iter(dataloader))
data, labels, studies = batch
print(data, labels, studies)
- class cellarr.dataloader.DataModule(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='counts', label_column_name='celltype_id', study_column_name='study', val_studies=None, gene_order=None, batch_size=1000, num_workers=0, lognorm=True, target_sum=10000.0, nan_string='nan')[source]¶
Bases:
LightningDataModule
A class that extends a pytorch-lightning LightningDataModule to create pytorch dataloaders using TileDB.
The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.
- __init__(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='counts', label_column_name='celltype_id', study_column_name='study', val_studies=None, gene_order=None, batch_size=1000, num_workers=0, lognorm=True, target_sum=10000.0, nan_string='nan')[source]¶
Initialize a
DataModule
.- Parameters:
dataset_path (
str
) – Path to the directory containing the TileDB stores. Usually theoutput_path
from thebuild_cellarrdataset()
.cell_metadata_uri (
str
) – Relative path to cell metadata store.gene_annotation_uri (
str
) – Relative path to gene annotation store.matrix_uri (
str
) – Relative path to matrix store.label_column_name (
str
) – Column name in cell_metadata_uri containing cell labels.study_column_name (
str
) – Column name in cell_metadata_uri containing study information.val_studies (
Optional
[List
[str
]]) – List of studies to use for validation and test. If None, all studies are used for training.gene_order (
Optional
[List
[str
]]) – List of genes to subset to from the gene space. If None, all genes from the gene_annotation are used for training.batch_size (
int
) – Batch size to use. Defaults to 1000.num_workers (
int
) – The number of worker threads for dataloaders.lognorm (
bool
) – Whether to return log-normalized expression instead of raw counts.target_sum (
float
) – Target sum for log-normalization.nan_string (
str
) – A string representing NaN. Defaults to “nan”.
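The lognorm and target_sum parameters suggest the common single-cell convention of scaling each cell to a fixed total and then applying log(1 + x). The sketch below assumes that convention; it is not taken from the cellarr source, and the function name is hypothetical.

```python
import math
from typing import List


def lognorm_cell(counts: List[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's counts to ``target_sum`` and apply log1p (assumed convention)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(value * scale) for value in counts]


print(lognorm_cell([1, 3], target_sum=4))
print(lognorm_cell([0, 0]))
```

Scaling before the log makes cells with different sequencing depth comparable, which is the usual motivation for a target_sum parameter.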
- collate(batch)[source]¶
Collate tensors.
- Parameters:
batch – Batch to collate.
- Returns:
A Tuple[torch.Tensor, torch.Tensor, list] containing information on the collated tensors.
- get_sampler_weights(labels, studies=None)[source]¶
Get weighted random sampler.
- Parameters:
dataset – Single cell dataset.
- Return type:
WeightedRandomSampler
- Returns:
A WeightedRandomSampler object.
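One common way to sample evenly across labels and studies is inverse-frequency weighting. The sketch below is an assumption about how such weights could be computed, not a description of what get_sampler_weights actually does; inverse_frequency_weights is a hypothetical name.

```python
from collections import Counter
from typing import List, Optional, Sequence


def inverse_frequency_weights(
    labels: Sequence[str], studies: Optional[Sequence[str]] = None
) -> List[float]:
    """Give each cell a weight inversely proportional to how common its label
    (and, if provided, its study) is."""
    label_counts = Counter(labels)
    weights = [1.0 / label_counts[label] for label in labels]
    if studies is not None:
        study_counts = Counter(studies)
        weights = [w / study_counts[study] for w, study in zip(weights, studies)]
    return weights


labels = ["T", "T", "B"]
studies = ["s1", "s2", "s2"]
print(inverse_frequency_weights(labels))
print(inverse_frequency_weights(labels, studies))
```

Weights of this kind could be passed to torch.utils.data.WeightedRandomSampler, which draws indices with probability proportional to the given weights.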
- class cellarr.dataloader.scDataset(data_df, matrix_tdb, matrix_shape, gene_indices, label_column_name, study_column_name, lognorm=True, target_sum=10000.0)[source]¶
Bases:
Dataset
A class that extends pytorch
Dataset
to enumerate cells and cell labels using TileDB.- __init__(data_df, matrix_tdb, matrix_shape, gene_indices, label_column_name, study_column_name, lognorm=True, target_sum=10000.0)[source]¶
Initialize a
scDataset
.- Parameters:
data_df (
DataFrame
) – Pandas dataframe of valid cells.matrix_tdb (
Array
) – TileDB object containing the experimental data, e.g. counts.matrix_shape (
tuple
) – Shape of the counts matrix.label_column_name (
str
) – Column name containing cell labels.study_column_name (
str
) – Column name containing study information.lognorm (
bool
) – Whether to return log-normalized expression instead of raw counts.target_sum (
float
) – Target sum for log-normalization.
- __parameters__ = ()¶
cellarr.queryutils_tiledb_frame module¶
- cellarr.queryutils_tiledb_frame.get_a_column(tiledb_obj, column_name)[source]¶
Access column(s) from the TileDB object.
- cellarr.queryutils_tiledb_frame.get_schema_names_frame(tiledb_obj)[source]¶
Get Attributes from a TileDB object.
- cellarr.queryutils_tiledb_frame.subset_array(tiledb_obj, row_subset, column_subset, shape)[source]¶
Subset a tiledb storing array data.
Uses multi_index to slice.
- Parameters:
- Return type:
- Returns:
A sparse array in a coordinate format.
- cellarr.queryutils_tiledb_frame.subset_frame(tiledb_obj, subset, columns)[source]¶
Subset a TileDB object.
- Parameters:
tiledb_obj (
Array
) – TileDB object to subset.subset (
Union
[slice
,QueryCondition
]) –A
slice
to subset.Alternatively, may provide a
QueryCondition
to subset the object.columns (
list
) – List specifying the attributes from the schema to extract.
- Return type:
- Returns:
A sliced DataFrame or a matrix with the subset.
cellarr.utils_anndata module¶
- cellarr.utils_anndata.consolidate_duplicate_symbols(matrix, feature_ids, consolidate_duplicate_gene_func)[source]¶
Consolidate duplicate gene symbols.
- Parameters:
matrix (
Any
) – data matrix with rows for cells and columns for genes.feature_ids (
List
[str
]) – List of feature ids along the column axis of the matrix.consolidate_duplicate_gene_func (
callable
) –Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.
Defaults to
sum()
.
- Return type:
- Returns:
AnnData object with duplicate gene symbols consolidated.
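Consolidation can be sketched on a small dense matrix stored as a list of rows. consolidate_columns is a hypothetical stand-in that works on plain lists; the real function operates on the matrix of an AnnData object and may differ in detail.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def consolidate_columns(
    matrix: Sequence[Sequence[float]],
    feature_ids: Sequence[str],
    func: Callable = sum,
) -> Tuple[List[List[float]], List[str]]:
    """Merge columns that share a feature id by applying ``func`` per cell."""
    # Group column positions by feature id, keeping first-seen order.
    groups: Dict[str, List[int]] = {}
    for col, feature_id in enumerate(feature_ids):
        groups.setdefault(feature_id, []).append(col)
    unique_ids = list(groups)
    merged = [
        [func([row[col] for col in groups[feature_id]]) for feature_id in unique_ids]
        for row in matrix
    ]
    return merged, unique_ids


# Gene "a" appears twice; its two columns are summed for every cell.
matrix = [[1, 2, 3], [4, 5, 6]]
merged, ids = consolidate_columns(matrix, ["a", "b", "a"])
print(ids, merged)
```

Passing a different callable, such as max, changes how duplicate measurements are combined without touching the grouping logic.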
- cellarr.utils_anndata.extract_anndata_info(h5ad_or_adata, var_feature_column='index', num_threads=1)[source]¶
Extract and generate the list of unique feature identifiers and cell counts across files.
- Parameters:
h5ad_or_adata (
List
[Union
[str
,AnnData
]]) – List of anndata objects or path to h5ad files.var_feature_column (
str
) – Column containing the feature ids (e.g. gene symbols). Defaults to “index”.num_threads (
int
) – Number of threads to use. Defaults to 1.force – Whether to rescan all the files even though the cache exists. Defaults to False.
- cellarr.utils_anndata.remap_anndata(h5ad_or_adata, feature_set_order, var_feature_column='index', layer_matrix_name='counts', consolidate_duplicate_gene_func=<built-in function sum>)[source]¶
Extract and remap the count matrix to the provided feature (gene) set order from the
AnnData
object.- Parameters:
h5ad_or_adata –
Input AnnData object.
Alternatively, may also provide a path to the H5AD file.
The index of the var slot must contain the feature ids for the columns in the matrix.
feature_set_order (
dict
) – A dictionary with the feature ids as keys and their index as value (e.g. gene symbols). The feature ids from theAnnData
object are remapped to the feature order from this dictionary.var_feature_column (
str
) – Column invar
containing the feature ids (e.g. gene symbols). Defaults to the index of thevar
slot.layer_matrix_name (
str
) – Layer containing the matrix to add to TileDB. Defaults to “counts”.consolidate_duplicate_gene_func –
Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.
Defaults to
sum()
.
- Return type:
- Returns:
A
csr_matrix
representation of the assay matrix.
- cellarr.utils_anndata.scan_for_cellcounts(cache)[source]¶
Extract cell counts across files.
Requires calling extract_anndata_info() first.
- Parameters:
cache – Info extracted by typically running
extract_anndata_info()
.- Return type:
- Returns:
List of cell counts across files.
- cellarr.utils_anndata.scan_for_features(cache, unique=True)[source]¶
Extract and generate the list of unique feature identifiers across files.
Requires calling extract_anndata_info() first.
- Parameters:
cache – Info extracted by typically running
extract_anndata_info()
.unique (
bool
) – Whether to reduce the gene list to unique entries.
- Return type:
- Returns:
List of all unique feature ids across all files.