---
file_format: mystnb
kernelspec:
  name: python
---

# Cell Arrays

Cell Arrays is a Python package that provides a TileDB-backed store for large collections of
genomic experimental data, such as millions of cells across multiple single-cell experiment objects.

# Install the package

To get started, install the package from [PyPI](https://pypi.org/project/cellarr/):

```bash
pip install cellarr
```

# Build the `CellArrDataset`

`CellArrDataset` is designed to store single-cell RNA-seq
datasets but can be generalized to store any 2-dimensional experimental data.

The `build_cellarrdataset` function creates four TileDB files in the directory specified by `output_path`:

- `gene_annotation`: A TileDB file containing feature/gene annotations.
- `sample_metadata`: A TileDB file containing sample metadata.
- `cell_metadata`: A TileDB file containing cell metadata, including a mapping to the samples
they are tagged with in `sample_metadata`.
- A matrix TileDB file named by the `layer_matrix_name` parameter. This allows the package
to store multiple matrices (e.g. normalized or scaled values) for the same cell, gene, and sample
metadata attributes.

The TileDB matrix file is stored in a `cell x gene` orientation. This orientation
is chosen because the dimension that grows fastest as new files are added to the
collection is usually the cells rather than the genes.

Process:

1. **Scan the Collection**: Scan the entire collection of files to create
a unique set of feature ids (e.g. gene symbols). Store this set as the
`gene_annotation` TileDB file.

2. **Store Sample Metadata**: Store sample metadata in the `sample_metadata`
TileDB file. Each file is typically considered a sample, and an automatic
mapping is created between files and samples.

3. **Store Cell Metadata**: Store cell metadata in the `cell_metadata`
TileDB file.

4. **Remap and Orient Data**: For each dataset in the collection,
remap and orient the feature dimension using the feature set from Step 1.
This step ensures consistency in gene measurement and order, even if
some genes are unmeasured or ordered differently in the original experiments.
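
Steps 1 and 4 can be illustrated with a small, library-independent sketch. It is not cellarr's implementation: the helpers `build_feature_index` and `remap_rows` are hypothetical, the toy data uses plain Python lists, and the choices to sort the feature ids and to fill unmeasured genes with zeros are assumptions made only for illustration.

```python
def build_feature_index(datasets):
    """Collect the union of feature ids across all datasets (Step 1)."""
    features = set()
    for genes, _ in datasets:
        features.update(genes)
    # Sorting is an assumption; it just gives a deterministic column order.
    return {gene: col for col, gene in enumerate(sorted(features))}


def remap_rows(genes, rows, feature_index, fill=0.0):
    """Reorder each cell's values to the shared feature order (Step 4)."""
    # Column position in the shared feature space for each local gene.
    targets = [feature_index[g] for g in genes]
    remapped = []
    for row in rows:
        out = [fill] * len(feature_index)
        for target, value in zip(targets, row):
            out[target] = value
        remapped.append(out)
    return remapped


# Two toy experiments with different gene sets and gene order.
# Each entry is (gene ids, cell x gene rows).
datasets = [
    (["gene_2", "gene_1"], [[5.0, 1.0], [0.0, 2.0]]),
    (["gene_3", "gene_2"], [[7.0, 3.0]]),
]

feature_index = build_feature_index(datasets)

# Stack all experiments along the cell axis in the shared gene order.
matrix = []
for genes, rows in datasets:
    matrix.extend(remap_rows(genes, rows, feature_index))

print(feature_index)
print(matrix)
```

After remapping, every row has the same length and column order, so experiments can be appended along the cell axis regardless of how their genes were originally ordered.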

:::{note}
Check out the [reference](https://biocpy.github.io/cellarr/api/cellarr#module-cellarr.build_cellarrdataset) documentation to customize the parameters for any of these steps.
:::

![`CellArrDataset` structure](../assets/cellarr.png "CellArrDataset")

First, let's mock a few `AnnData` objects:

```{code-cell}
import anndata
import numpy as np
import pandas as pd

def generate_adata(n, d, k):
    np.random.seed(1)

    z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k))
    w = np.random.normal(size=(d, k))
    y = np.dot(z, w.T)

    gene_index = [f"gene_{i+1}" for i in range(d)]
    var_df = pd.DataFrame({"names": gene_index}, index=gene_index)
    obs_df = pd.DataFrame({"cells": [f"cell1_{j+1}" for j in range(n)]})

    adata = anndata.AnnData(layers={"counts": y}, var=var_df, obs=obs_df)

    return adata

adata1 = generate_adata(1000, 100, 10)
adata2 = generate_adata(100, 1000, 100)

print("datasets")
print(adata1, adata2)
```

To build a `CellArrDataset` from a collection of `H5AD` or `AnnData` objects:

```{code-cell}
import anndata
import numpy as np
import tempfile
from cellarr import build_cellarrdataset, CellArrDataset, MatrixOptions

# Create a temporary directory
tempdir = tempfile.mkdtemp()

# # Read AnnData objects
# adata1 = anndata.read_h5ad("path/to/object1.h5ad", backed="r")
# # or just provide the path
# adata2 = "path/to/object2.h5ad"

# Build CellArrDataset
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(dtype=np.float32),
)

print(dataset)
```

:::{important}
All files are expected to be consistent; any modifications
needed to make them consistent are outside the scope of this function
and package.

This process makes a few assumptions:

- If an object in `files` is an `AnnData`
or H5AD object, it must contain an assay matrix in the
layers slot, named according to the `layer_matrix_name` parameter.

- Feature information must contain a column, defined by the
`feature_column` parameter of `GeneAnnotationOptions`, that
contains feature ids or gene symbols across all files.

- If no ``cell_metadata`` is provided, we scan to count the number of cells
and create a simple range index.

- Each file is considered a sample and a mapping between cells and samples
is automatically created. Hence the sample information provided must match
the number of input files.
:::
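
The last two assumptions (a simple range index over cells, and one sample per file) can be sketched as follows. This is an illustration only: `map_cells_to_samples`, the `cell_index` and `sample` keys, and the sample names are hypothetical and do not reflect the column names cellarr actually writes.

```python
def map_cells_to_samples(cells_per_file, sample_names):
    """Build a range index over all cells and tag each cell with its sample."""
    if len(cells_per_file) != len(sample_names):
        raise ValueError("sample metadata must have one entry per input file")

    cell_metadata = []
    for sample, n_cells in zip(sample_names, cells_per_file):
        for _ in range(n_cells):
            # The cell index is simply its position in the combined collection.
            cell_metadata.append(
                {"cell_index": len(cell_metadata), "sample": sample}
            )
    return cell_metadata


# Two files with 3 and 2 cells respectively (hypothetical sample names).
cell_metadata = map_cells_to_samples([3, 2], ["sample_1", "sample_2"])

for row in cell_metadata:
    print(row)
```

The length check mirrors the requirement that the sample information match the number of input files.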

# Query a `CellArrDataset`

Users can either reuse the `dataset` object returned by the build step or create a new `CellArrDataset` object by pointing it to the path where the files were created.

```{code-cell}
# Create a CellArrDataset object from the existing dataset
dataset = CellArrDataset(dataset_path=tempdir)

# Query data from the dataset
expression_data = dataset[10, ["gene_1", "gene_10", "gene_500"]]

print("matrix slice:")
print(expression_data.matrix)

print("\n\n gene_annotation slice:")
print(expression_data.gene_annotation)

print("\n\n cell_metadata slice:")
print(expression_data.cell_metadata)
```

This returns a `CellArrDatasetSlice` object that contains the matrix and the metadata `DataFrame`s along the cell and gene axes.

Users can easily convert these to analysis-ready representations:

```{code-cell}
print("as anndata:")
print(expression_data.to_anndata())

print("\n\n as summarizedexperiment:")
print(expression_data.to_summarizedexperiment())
```

# A single-cell dataloader

A basic single-cell dataloader can be instantiated by using the `DataModule` class.

```python
from cellarr.dataloader import DataModule

datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    label_column_name="label",
    study_column_name="study",
    batch_size=1000,
    lognorm=True,
    target_sum=1e4,
)
```
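
The `lognorm` and `target_sum` arguments suggest a library-size normalization followed by a log transform. The sketch below shows that common recipe on a single cell, assuming each cell's counts are scaled to sum to `target_sum` and then passed through `log1p`. Whether `DataModule` applies exactly this formula is an assumption, and `lognorm_cell` is a hypothetical helper.

```python
import math


def lognorm_cell(counts, target_sum=1e4):
    """Scale counts so they sum to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros; this avoids division by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


# Raw counts for one cell across four genes (toy data).
cell = [0, 3, 1, 6]
normalized = lognorm_cell(cell, target_sum=1e4)
print(normalized)
```

Scaling before the log transform makes cells with different sequencing depths comparable, and zero counts remain zero after the transform.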

Users can optionally specify a list of studies to use for validation. If none is provided, all studies are used for training.
Users may also provide the gene space used to train their models.

```python
val_studies = ["study1", "study100"]

gene_list = [
    "GPNMB", "TREM2", "LPL", "HLA-DQA1", "CD109",
    "IL6ST", "SDC2", "MSR1", "ALCAM", "SLC1A3",
    "CD9", "CD59", "MRC1", "SLC11A1", "CPM",
    "GPR183", "ITGAX", "HLA-DMB", "NRP2", "SV2C",
    "PTPRJ", "EMP1", "HLA-DQB1", "MERTK", "CD52",
    "CXCL16", "ABCA1", "HLA-DPB1", "OLR1", "CD83"
]

datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    val_studies=val_studies,
    label_column_name="label",
    study_column_name="study",
    gene_order=gene_list,
    batch_size=1000,
    lognorm=True,
    target_sum=1e4,
)
```
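
Conceptually, `val_studies` partitions cells by the value in the study column. The sketch below shows that kind of split, assuming cells whose study is listed go to validation and all others go to training. The helper `split_by_study` and the toy study labels are hypothetical and are not taken from `DataModule` internals.

```python
def split_by_study(cell_studies, val_studies=None):
    """Return (train, val) cell indices based on each cell's study."""
    val_set = set(val_studies or [])
    train_idx, val_idx = [], []
    for idx, study in enumerate(cell_studies):
        (val_idx if study in val_set else train_idx).append(idx)
    return train_idx, val_idx


# Study label for each cell, in cell order (toy data).
cell_studies = ["study1", "study2", "study1", "study100", "study3"]

train_idx, val_idx = split_by_study(cell_studies, ["study1", "study100"])
print(train_idx, val_idx)

# Without validation studies, every cell is used for training.
all_train, no_val = split_by_study(cell_studies)
print(all_train, no_val)
```

Splitting by study rather than by individual cell keeps all cells from a held-out study out of training.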

Users can access training cells by index.

```python
datamodule.train_dataset[100]
```

Batches can be created and examined.

```python
dataloader = datamodule.train_dataloader()

batch = next(iter(dataloader))
expression, labels, studies = batch
```

---

Check out the [documentation](https://biocpy.github.io/cellarr/api/modules.html) for more details.