scrnaseq package

Submodules

scrnaseq.fetch_dataset module

scrnaseq.fetch_dataset.fetch_dataset(name, version, path=None, package='scRNAseq', cache_dir='/home/runner/.cache/gypsum', overwrite=False, realize_assays=False, realize_reduced_dims=True, **kwargs)

Fetch a single-cell dataset from the gypsum backend.

See also

metadata index, on the expected schema for the metadata.

save_dataset() and upload_directory(), to save and upload a dataset.

list_datasets() and list_versions(), to get possible values for name and version.

Example

sce = fetch_dataset(
    "zeisel-brain-2015",
    "2023-12-14",
)

Parameters:

name (str) – Name of the dataset.
version (str) – Version of the dataset.
path (str) – Path to a subdataset, if name contains multiple datasets. Defaults to None.
package (str) – Name of the package. Defaults to “scRNAseq”.
cache_dir (str) – Path to cache directory.
overwrite (bool) – Whether to overwrite existing files. Defaults to False.
realize_assays (bool) – Whether to realize assays into memory. Defaults to False.
realize_reduced_dims (bool) – Whether to realize reduced dimensions into memory. Defaults to True.
**kwargs – Further arguments to pass to read_object().

Return type:

SummarizedExperiment

Returns:

The dataset as a SummarizedExperiment or one of its subclasses.
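
A common pattern is to resolve the newest version first and then fetch it. The sketch below is not part of the package: fetch_latest_dataset is a hypothetical helper, and it takes the two lookup functions as arguments (an assumption made here so the logic can be exercised without network access).

```python
from typing import Any, Callable, Tuple


def fetch_latest_dataset(
    name: str,
    fetch_latest_version: Callable[[str], str],
    fetch_dataset: Callable[..., Any],
    **kwargs: Any,
) -> Tuple[str, Any]:
    """Hypothetical helper: resolve the newest version, then fetch it.

    The resolved version is returned alongside the dataset so that the
    exact version can be recorded for reproducibility.
    """
    version = fetch_latest_version(name)
    # Extra keyword arguments (e.g. realize_assays=True) are forwarded as-is.
    dataset = fetch_dataset(name, version, **kwargs)
    return version, dataset
```

With the package installed, one would presumably pass the package's own fetch_latest_version() and fetch_dataset() functions as the two callables.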

scrnaseq.fetch_dataset.fetch_metadata(name, version, path=None, package='scRNAseq', cache_dir='/home/runner/.cache/gypsum', overwrite=False)

Fetch metadata for a dataset from the gypsum backend.

See also

fetch_dataset(), to fetch a dataset.

Example

meta = fetch_metadata(
    "zeisel-brain-2015",
    "2023-12-14",
)

Parameters:

name (str) – Name of the dataset.
version (str) – Version of the dataset.
path (str) – Path to a subdataset, if name contains multiple datasets. Defaults to None.
package (str) – Name of the package. Defaults to “scRNAseq”.
cache_dir (str) – Path to the cache directory.
overwrite (bool) – Whether to overwrite existing files. Defaults to False.

Returns:

Dictionary containing metadata for the specified dataset.

scrnaseq.list_datasets module

scrnaseq.list_datasets.list_datasets(cache_dir='/home/runner/.cache/gypsum', overwrite=False, latest=True)

List all available datasets.

Example

datasets = (
    list_datasets()
)

Parameters:

cache_dir (str) – Path to cache directory.
overwrite (bool) – Whether to overwrite the database in cache. Defaults to False.
latest (bool) – Whether to only fetch the latest version of each dataset. Defaults to True.

Return type:

BiocFrame

Returns:

A BiocFrame where each row corresponds to a dataset. Each row contains the title and description of the dataset, the number of rows and columns, the organisms and genome builds involved, whether the dataset has any pre-computed reduced dimensions, and so on. More details can be found in the Bioconductor metadata schema.

scrnaseq.list_versions module

scrnaseq.list_versions.fetch_latest_version(name)

Fetch the latest version of a dataset.

Example

version = fetch_latest_version(
    "romanov-brain-2017"
)

Parameters:

name (str) – Name of the dataset.

Return type:

str

Returns:

Latest version name.

scrnaseq.list_versions.list_versions(name)

List all available versions for a dataset.

Example

versions = list_versions(
    "romanov-brain-2017"
)

Parameters:

name (str) – Name of the dataset.

Return type:

List[str]

Returns:

A list of version names.

scrnaseq.polish_dataset module

scrnaseq.polish_dataset.polish_dataset(x, reformat_assay_by_density=0.3, attempt_integer_conversion=True, remove_altexp_coldata=True, forbid_nested_altexp=True)

Optimize dataset for saving.

Prepare a SummarizedExperiment or SingleCellExperiment to be saved with scrnaseq.save_dataset.save_dataset().

This performs minor changes to improve storage efficiency, especially with matrices.

Parameters:

x (Type[SummarizedExperiment]) – A SummarizedExperiment or one of its derivatives.
reformat_assay_by_density (float) –
Density threshold used to optimize assay formats, where density is the proportion of non-zero values. Assays with densities above this number are converted to ordinary dense arrays (if they are not already), while those with lower densities are converted to sparse matrices. Defaults to 0.3.

This can be disabled by setting it to None.
attempt_integer_conversion (bool) –
Whether to convert double-precision assays containing integer values to actually have the integer type.

This can improve efficiency of downstream applications by avoiding the need to operate in double precision.
remove_altexp_coldata (bool) – Whether column data for alternative experiments should be removed. Defaults to True as the alternative experiment column data is usually redundant compared to the main experiment.
forbid_nested_altexp (bool) – Whether nested alternative experiments (i.e., alternative experiments of alternative experiments) should be forbidden.

Return type:

Type[SummarizedExperiment]

Returns:

A modified object with the same type as x.
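
The density rule and the integer check described above can be sketched in plain Python. This is an illustration only, not the package's implementation: the function names are hypothetical, matrices are plain nested lists, and the handling of a density exactly equal to the threshold is an assumption.

```python
from typing import Optional, Sequence


def nonzero_density(matrix: Sequence[Sequence[float]]) -> float:
    """Fraction of entries that are non-zero (0.0 for an empty matrix)."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


def choose_storage(
    matrix: Sequence[Sequence[float]], threshold: Optional[float] = 0.3
) -> str:
    """Pick a storage format from the density of non-zero values.

    Passing None disables the rule, mirroring reformat_assay_by_density=None.
    Treating a density equal to the threshold as 'sparse' is an assumption.
    """
    if threshold is None:
        return "unchanged"
    return "dense" if nonzero_density(matrix) > threshold else "sparse"


def holds_only_integers(matrix: Sequence[Sequence[float]]) -> bool:
    """True if every value is integral, so an integer type would be lossless."""
    return all(float(value).is_integer() for row in matrix for value in row)
```

A count matrix that is mostly zeros would come out as "sparse" here and pass the integer check, which is the case where both optimizations described above would apply.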

scrnaseq.save_dataset module

scrnaseq.save_dataset.save_dataset(x, path, metadata)

Save a dataset to disk.

Parameters:

x (Any) – An object containing single-cell data. May be a derivative of SummarizedExperiment or AnnData.
path – Path to a new directory to save the dataset.
metadata –
Dictionary containing the metadata for this dataset. See the schema returned by fetch_metadata_schema().

Note that the applications.takane property will be automatically added by this function and does not have to be supplied.

See also

metadata index, on the expected schema for the metadata.

polish_dataset(), to polish x before saving it.

upload_dataset(), to upload the saved contents.

Example

import shutil
import tempfile

import scrnaseq

# Fetch an existing dataset, or create your own
# ``SingleCellExperiment`` or ``AnnData`` object.
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14")

# Provide dataset-level metadata for search and findability.
meta = {
    "title": "My dataset made from zeisel brain",
    "description": "This is a copy of the zeisel dataset",
    "taxonomy_id": ["10090"],  # NCBI taxonomy ID (mouse)
    "genome": ["GRCm38"],  # genome build matching the taxonomy ID
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "mogami.shizuka@765pro.com",
}

cache_dir = tempfile.mkdtemp()

# save_dataset() expects a path to a new directory, so remove
# the empty one that mkdtemp() just created.
shutil.rmtree(cache_dir)

# Save the dataset
scrnaseq.save_dataset(sce, cache_dir, meta)
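
Before saving, it can help to check that the metadata dictionary has the fields one expects. The sketch below is a hypothetical pre-check, not part of the package: the key list is copied from the example above, and the schema returned by fetch_metadata_schema() remains the authority on what is actually required.

```python
from typing import Any, Dict, List, Sequence

# Keys taken from the example metadata above. This is an assumption for
# illustration; the real schema may require more or fewer fields.
EXAMPLE_KEYS = (
    "title",
    "description",
    "taxonomy_id",
    "genome",
    "sources",
    "maintainer_name",
    "maintainer_email",
)


def missing_metadata_keys(
    metadata: Dict[str, Any], expected: Sequence[str] = EXAMPLE_KEYS
) -> List[str]:
    """Return the expected keys that are absent, in their listed order."""
    return [key for key in expected if key not in metadata]
```

An empty list means every expected key is present; it says nothing about whether the values themselves satisfy the schema.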

scrnaseq.save_dataset.save_dataset_anndata(x, path, metadata)

Save AnnData to disk.

scrnaseq.save_dataset.save_dataset_sce(x, path, metadata)

Save SingleCellExperiment to disk.

scrnaseq.save_dataset.save_dataset_se(x, path, metadata)

Save SummarizedExperiment to disk.

scrnaseq.search_datasets module

scrnaseq.search_datasets.search_datasets(query, cache_dir='/home/runner/.cache/gypsum', overwrite=False, latest=True)

Search for datasets of interest based on matching text in the associated metadata.

This is a wrapper around search_metadata_text().

The returned BiocFrame contains the title and description for each dataset, the number of rows and columns, the organisms and genome builds involved, whether the dataset has any pre-computed reduced dimensions, and so on.

More details can be found in the Bioconductor metadata index.

See also

list_datasets(), to list all available datasets.

search_metadata_text(), to search metadata.

Examples:

res = search_datasets("brain")

res = search_datasets(define_text_query("Neuro%", partial=True))

res = search_datasets(define_text_query("10090", field="taxonomy_id"))

res = search_datasets(
    define_text_query("GRCm38", field="genome") &
    (define_text_query("neuro%", partial=True) |
        define_text_query("pancrea%", partial=True))
)

Parameters:

query (Union[str, GypsumSearchClause]) – The search query string, or a GypsumSearchClause object for more complex queries.
cache_dir (str) – Path to cache directory.
overwrite (bool) – Whether to overwrite the existing cache. Defaults to False.
latest (bool) – Whether to fetch only the latest versions of datasets. Defaults to True.

Return type:

BiocFrame

Returns:

A BiocFrame where each row corresponds to a dataset, containing various columns of metadata. Some columns may be lists to capture 1:many mappings.
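
The partial=True queries in the examples above use % as a wildcard. Assuming this follows SQL LIKE semantics (which the examples suggest but this page does not state), the matching can be sketched as below. The helper names are hypothetical, case-insensitive matching is an assumption, and the real search runs in the metadata database rather than in Python.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate an SQL-LIKE-style pattern into a compiled regex.

    '%' matches any run of characters and '_' matches exactly one;
    every other character is matched literally.
    """
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    # Case-insensitive matching is an assumption made for this sketch.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like_match(pattern: str, text: str) -> bool:
    """True if the whole of `text` matches the LIKE-style pattern."""
    return like_to_regex(pattern).fullmatch(text) is not None
```

Under these assumptions, "neuro%" matches words that start with "neuro" but not words that merely contain it, which is why the examples place the wildcard at the end of the stem.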

scrnaseq.upload_dataset module

scrnaseq.upload_dataset.upload_dataset(directory, name, version, package='scRNAseq', cache_dir='/home/runner/.cache/gypsum', deduplicate=True, probation=False, url='https://gypsum.artifactdb.com', token=None, concurrent=1, abort_failed=True)

Upload the dataset to the gypsum bucket.

This is a wrapper around upload_directory() specific to the scRNAseq package.

See also

upload_directory(), to upload a directory to the gypsum backend.

Parameters:

name (str) – Dataset name.
version (str) – Version name.
directory (str) – Path to a directory containing the files to be uploaded. This directory is assumed to correspond to a version of an asset.
cache_dir (str) –
Path to the cache for saving files, e.g., in save_version().

Used to convert symbolic links to upload links; see prepare_directory_upload().
deduplicate (bool) – Whether the backend should attempt deduplication of files in the immediately previous version. Defaults to True.
probation (bool) – Whether to perform a probational upload. Defaults to False.
url (str) – URL of the gypsum REST API.
token (str) – GitHub access token to authenticate to the gypsum REST API.
concurrent (int) – Number of concurrent uploads. Defaults to 1.
abort_failed (bool) –
Whether to abort the upload on any failure.

Setting this to False can be helpful for diagnosing upload problems.

Returns:

True if successful, otherwise False.
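
Taken together, polish_dataset(), save_dataset() and upload_dataset() form a publish pipeline. The sketch below chains them in that order. publish_dataset is a hypothetical helper, not part of the package, and the three steps are passed in as callables (an assumption made here so the flow can be tested without credentials or network access).

```python
import os
import tempfile
from typing import Any, Callable, Dict


def publish_dataset(
    x: Any,
    metadata: Dict[str, Any],
    name: str,
    version: str,
    polish: Callable[[Any], Any],
    save: Callable[[Any, str, Dict[str, Any]], Any],
    upload: Callable[..., Any],
    **upload_kwargs: Any,
) -> str:
    """Hypothetical helper chaining polish -> save -> upload.

    Returns the staging directory that was saved and uploaded.
    """
    # save_dataset() expects a path to a *new* directory, so point at a
    # not-yet-existing child of a fresh temporary directory.
    staging = os.path.join(tempfile.mkdtemp(), "dataset")
    polished = polish(x)
    save(polished, staging, metadata)
    # Keyword arguments such as probation=True or token=... are forwarded.
    upload(staging, name, version, **upload_kwargs)
    return staging
```

Passing probation=True through upload_kwargs would request a probational upload, as described in the parameter list above.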

scrnaseq.utils module

scrnaseq.utils.format_object_metadata(x)

Format object-related metadata.

Create object-related metadata to validate against the default schema from fetch_metadata_schema(). This is intended for downstream package developers who are auto-generating metadata documents to be validated by validate_metadata().

Parameters:

x – A Python object, typically an instance of a BiocPy class.

Return type:

dict

Returns:

Dictionary containing metadata for the object.

scrnaseq.utils.realize_array(x)

Realize a ReloadedArray into a dense array or sparse matrix.

Parameters:

x – ReloadedArray object.

Returns:

Realized array or matrix.

scrnaseq.utils.single_cell_load_object(path, metadata=None, scrnaseq_realize_assays=False, scrnaseq_realize_reduced_dims=True, **kwargs)

Load a SummarizedExperiment or SingleCellExperiment object from a file.

Parameters:

path (str) – Path to the dataset.
metadata (dict) – Metadata for the dataset. Defaults to None.
scrnaseq_realize_assays (bool) – Whether to realize assays into memory. Defaults to False.
scrnaseq_realize_reduced_dims (bool) – Whether to realize reduced dimensions into memory. Defaults to True.
**kwargs – Further arguments to pass to read_object().

Returns:

The loaded object, as a SingleCellExperiment or another SummarizedExperiment derivative.
