scrnaseq package
Submodules
scrnaseq.fetch_dataset module
- scrnaseq.fetch_dataset.fetch_dataset(name: str, version: str, path: str | None = None, package: str = 'scRNAseq', cache_dir: str = '/home/runner/gypsum/cache', overwrite: bool = False, realize_assays: bool = False, realize_reduced_dims: bool = True, **kwargs) -> SummarizedExperiment
Fetch a single-cell dataset from the gypsum backend.
See also
metadata index, on the expected schema for the metadata.
save_dataset() and upload_directory(), to save and upload a dataset.
list_datasets() and list_versions(), to get possible values for name and version.
Example
sce = fetch_dataset("zeisel-brain-2015", "2023-12-14")
- Parameters:
name – Name of the dataset.
version – Version of the dataset.
path – Path to a subdataset, if name contains multiple datasets. Defaults to None.
package – Name of the package. Defaults to “scRNAseq”.
cache_dir – Path to cache directory.
overwrite – Whether to overwrite existing files. Defaults to False.
realize_assays – Whether to realize assays into memory. Defaults to False.
realize_reduced_dims – Whether to realize reduced dimensions into memory. Defaults to True.
**kwargs – Further arguments to pass to read_object().
- Returns:
The dataset as a SummarizedExperiment or one of its subclasses.
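The parameters above can be combined in a small wrapper that keeps downloads in a cache directory of your choosing. This is only a sketch: `fetch_into_cache` and its `api` argument are hypothetical names introduced here (the `api` hook exists so the helper can be exercised without network access), and the default cache location is an assumption rather than anything the package prescribes.

```python
import os
import tempfile


def fetch_into_cache(name, version, cache_dir=None, realize_assays=False, api=None):
    """Fetch a dataset using an explicit cache directory (hypothetical helper)."""
    if api is None:
        import scrnaseq as api  # lazy import: only needed for a real fetch

    if cache_dir is None:
        # Assumed default for this sketch, not the package's own default.
        cache_dir = os.path.join(tempfile.gettempdir(), "scrnaseq-cache")
    os.makedirs(cache_dir, exist_ok=True)

    # Only documented keyword arguments of fetch_dataset() are passed through.
    return api.fetch_dataset(
        name,
        version,
        cache_dir=cache_dir,
        realize_assays=realize_assays,
    )
```

With the package installed, `fetch_into_cache("zeisel-brain-2015", "2023-12-14")` would forward to `scrnaseq.fetch_dataset()` with the chosen cache directory.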
- scrnaseq.fetch_dataset.fetch_metadata(name: str, version: str, path: str | None = None, package: str = 'scRNAseq', cache_dir: str = '/home/runner/gypsum/cache', overwrite: bool = False)
Fetch metadata for a dataset from the gypsum backend.
See also
fetch_dataset(), to fetch a dataset.
Example:
meta = fetch_metadata("zeisel-brain-2015", "2023-12-14")
- Parameters:
name – Name of the dataset.
version – Version of the dataset.
path – Path to a subdataset, if name contains multiple datasets. Defaults to None.
package – Name of the package. Defaults to “scRNAseq”.
cache_dir – Path to the cache directory.
overwrite – Whether to overwrite existing files. Defaults to False.
- Returns:
Dictionary containing metadata for the specified dataset.
scrnaseq.list_datasets module
- scrnaseq.list_datasets.list_datasets(cache_dir: str = '/home/runner/gypsum/cache', overwrite: bool = False, latest: bool = True) -> DataFrame
List all available datasets.
Example
datasets = list_datasets()
- Parameters:
cache_dir – Path to cache directory.
overwrite – Whether to overwrite the database in cache. Defaults to False.
latest – Whether to only fetch the latest version of each dataset. Defaults to True.
- Returns:
A DataFrame where each row corresponds to a dataset. Each row contains the title and description of the dataset, the number of rows and columns, the organisms and genome builds involved, whether the dataset has any pre-computed reduced dimensions, and so on. More details can be found in the Bioconductor metadata schema.
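To illustrate what restricting a listing to the latest version of each dataset means, here is a minimal pure-Python sketch. It is not the package's implementation: `latest_versions` is a hypothetical helper, the records other than `zeisel-brain-2015` / `2023-12-14` are made up, and it assumes versions are ISO dates (YYYY-MM-DD), which compare correctly as plain strings.

```python
def latest_versions(records):
    """Keep the newest version of each dataset.

    `records` is an iterable of (name, version) pairs. Versions are assumed
    to be ISO dates, so string comparison orders them chronologically.
    """
    latest = {}
    for name, version in records:
        if name not in latest or version > latest[name]:
            latest[name] = version
    return latest


records = [
    ("zeisel-brain-2015", "2023-12-14"),
    ("zeisel-brain-2015", "2023-11-02"),  # hypothetical older version
    ("hypothetical-dataset", "2024-01-05"),  # hypothetical dataset
]
print(latest_versions(records))
```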
scrnaseq.list_versions module
scrnaseq.polish_dataset module
- scrnaseq.polish_dataset.polish_dataset(x: Type[SummarizedExperiment], reformat_assay_by_density: float = 0.3, attempt_integer_conversion: bool = True, remove_altexp_coldata: bool = True, forbid_nested_altexp: bool = True) -> Type[SummarizedExperiment]
Optimize dataset for saving.
Prepare a SummarizedExperiment or SingleCellExperiment to be saved with scrnaseq.save_dataset.save_dataset(). This performs minor changes to improve storage efficiency, especially with matrices.
- Parameters:
x – A SummarizedExperiment or one of its derivatives.
reformat_assay_by_density –
Density threshold used to optimize assay formats based on the proportion of non-zero values. Assays with densities above this number are converted to ordinary dense arrays (if they are not already), while those with lower densities are converted to sparse matrices. Defaults to 0.3.
This can be disabled by setting it to None.
attempt_integer_conversion –
Whether to convert double-precision assays containing integer values to actually have the integer type.
This can improve efficiency of downstream applications by avoiding the need to operate in double precision.
remove_altexp_coldata – Whether column data for alternative experiments should be removed. Defaults to True as the alternative experiment column data is usually redundant compared to the main experiment.
forbid_nested_altexp – Whether nested alternative experiments (i.e., alternative experiments of alternative experiments) should be forbidden.
- Returns:
A modified object with the same type as x.
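The density rule and the integer-conversion check described above can be sketched in plain Python on a list-of-lists matrix. This is an illustration of the documented behaviour, not the package's code: `density`, `choose_format`, and `is_integer_valued` are hypothetical helpers, and treating a density exactly equal to the threshold as "sparse" is an assumption, since the text only covers values above and below it.

```python
def density(matrix):
    """Fraction of non-zero entries in a rectangular list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


def choose_format(matrix, reformat_assay_by_density=0.3):
    """Return "dense", "sparse", or "unchanged" following the documented rule."""
    if reformat_assay_by_density is None:
        return "unchanged"  # optimization disabled
    if density(matrix) > reformat_assay_by_density:
        return "dense"
    return "sparse"  # assumption: ties at the threshold count as sparse


def is_integer_valued(matrix):
    """True if every entry is a whole number, so an integer type loses nothing."""
    return all(float(value).is_integer() for row in matrix for value in row)


counts = [
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(density(counts))  # 2 non-zero entries out of 12
print(choose_format(counts))
print(is_integer_valued(counts))
```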
scrnaseq.save_dataset module
- scrnaseq.save_dataset.save_dataset(x: Any, path, metadata)
- scrnaseq.save_dataset.save_dataset(x: SingleCellExperiment, path: str, metadata: dict)
- scrnaseq.save_dataset.save_dataset(x: SummarizedExperiment, path: str, metadata: dict)
- scrnaseq.save_dataset.save_dataset(x: AnnData, path: str, metadata: dict)
Save a dataset to disk.
- Parameters:
x – An object containing single-cell data. May be a derivative of SummarizedExperiment or AnnData.
path – Path to a new directory to save the dataset.
metadata –
Dictionary containing the metadata for this dataset. See the schema returned by fetch_metadata_schema().
Note that the applications.takane property will be automatically added by this function and does not have to be supplied.
See also
metadata index, on the expected schema for the metadata.
polish_dataset(), to polish x before saving it.
upload_dataset(), to upload the saved contents.
Example
import shutil
import tempfile

import scrnaseq

# Fetch an existing dataset, or create your own
# ``SingleCellExperiment`` or ``AnnData`` object.
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14")

# Provide dataset-level metadata for search and findability.
meta = {
    "title": "My dataset made from zeisel brain",
    "description": "This is a copy of the zeisel",
    "taxonomy_id": ["10090"],  # NCBI ID (mouse)
    "genome": ["GRCm38"],  # genome build (mouse, to match the taxonomy ID)
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "mogami.shizuka@765pro.com",
}

cache_dir = tempfile.mkdtemp()

# save_dataset() expects a path to a new directory, so remove the
# directory that mkdtemp() just created.
shutil.rmtree(cache_dir)

# Save the dataset.
scrnaseq.save_dataset(sce, cache_dir, meta)
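Before saving, it can help to check that the metadata dictionary is populated. The sketch below is a hypothetical pre-check, not a substitute for the schema returned by fetch_metadata_schema(): `missing_metadata_keys` is an illustrative helper, and the list of expected keys is an assumption that simply mirrors the fields used in the example above.

```python
# Assumed key list, copied from the example metadata; the authoritative
# requirements are defined by the metadata schema, not by this tuple.
EXPECTED_KEYS = (
    "title",
    "description",
    "taxonomy_id",
    "genome",
    "sources",
    "maintainer_name",
    "maintainer_email",
)


def missing_metadata_keys(metadata, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty in `metadata`."""
    return [key for key in expected if not metadata.get(key)]


draft = {
    "title": "My dataset",
    "description": "",  # empty values are reported as missing
    "taxonomy_id": ["10090"],
}
print(missing_metadata_keys(draft))
```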
- scrnaseq.save_dataset.save_dataset_anndata(x: AnnData, path: str, metadata: dict)
Save AnnData to disk.
- scrnaseq.save_dataset.save_dataset_sce(x: SingleCellExperiment, path: str, metadata: dict)
Save SingleCellExperiment to disk.
- scrnaseq.save_dataset.save_dataset_se(x: SummarizedExperiment, path: str, metadata: dict)
Save SummarizedExperiment to disk.
scrnaseq.search_datasets module
- scrnaseq.search_datasets.search_datasets(query: str | GypsumSearchClause, cache_dir: str = '/home/runner/gypsum/cache', overwrite: bool = False, latest: bool = True) -> DataFrame
Search for datasets of interest based on matching text in the associated metadata.
This is a wrapper around search_metadata_text().
The returned DataFrame contains the standard metadata fields, such as the title and description for each dataset, the number of rows and columns, the organisms and genome builds involved, whether the dataset has any pre-computed reduced dimensions, and so on.
More details can be found in the Bioconductor metadata index.
See also
list_datasets(), to list all available datasets.
search_metadata_text(), to search metadata.
Examples:
# define_text_query() comes from the gypsum_client package.
from gypsum_client import define_text_query

res = search_datasets("brain")
res = search_datasets(define_text_query("Neuro%", partial=True))
res = search_datasets(define_text_query("10090", field="taxonomy_id"))
res = search_datasets(
    define_text_query("GRCm38", field="genome")
    & (
        define_text_query("neuro%", partial=True)
        | define_text_query("pancrea%", partial=True)
    )
)
- Parameters:
query – The search query string, or a GypsumSearchClause for more complex queries.
cache_dir – Path to cache directory.
overwrite – Whether to overwrite the existing cache. Defaults to False.
latest – Whether to fetch only the latest versions of datasets. Defaults to True.
- Returns:
A DataFrame where each row corresponds to a dataset, containing various columns of metadata. Some columns may be lists to capture 1:many mappings.
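Combining clauses with `&` and `|` relies on operator overloading. The toy model below shows one way such composition can work; it is not gypsum's implementation. `Clause` and `text_query` are hypothetical names, the sample records are made up, and the matching rules (whole-word, case-insensitive matching, with `%` as a wildcard when `partial=True`) are assumptions made for the sketch.

```python
import re


class Clause:
    """Toy search clause that can be combined with & and |."""

    def __init__(self, predicate, label):
        self.predicate = predicate
        self.label = label

    def __and__(self, other):
        return Clause(
            lambda doc: self.predicate(doc) and other.predicate(doc),
            f"({self.label} AND {other.label})",
        )

    def __or__(self, other):
        return Clause(
            lambda doc: self.predicate(doc) or other.predicate(doc),
            f"({self.label} OR {other.label})",
        )

    def matches(self, doc):
        return bool(self.predicate(doc))


def _tokens(value):
    """Split a metadata value (string or nested list) into word tokens."""
    if isinstance(value, (list, tuple)):
        return [tok for item in value for tok in _tokens(item)]
    return re.findall(r"\w+", str(value))


def text_query(text, field=None, partial=False):
    """Match whole tokens of `text`; with partial=True, % acts as a wildcard."""
    if partial:
        pattern = ".*".join(re.escape(part) for part in text.split("%"))
    else:
        pattern = re.escape(text)
    regex = re.compile(pattern, re.IGNORECASE)

    def predicate(doc):
        values = [doc.get(field, "")] if field is not None else list(doc.values())
        return any(regex.fullmatch(tok) for value in values for tok in _tokens(value))

    return Clause(predicate, f"{field or 'any'}~{text!r}")


# Made-up records standing in for dataset metadata.
datasets = {
    "a": {"title": "Mouse brain neurons", "genome": ["GRCm38"]},
    "b": {"title": "Human pancreas islets", "genome": ["GRCh38"]},
    "c": {"title": "Mouse pancreatic cells", "genome": ["GRCm38"]},
}
query = text_query("GRCm38", field="genome") & (
    text_query("neuro%", partial=True) | text_query("pancrea%", partial=True)
)
hits = [name for name, doc in datasets.items() if query.matches(doc)]
print(hits)
```

Each operator returns a new clause, so arbitrarily nested expressions stay composable, and parentheses control precedence exactly as in ordinary Python expressions.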
scrnaseq.upload_dataset module
- scrnaseq.upload_dataset.upload_dataset(directory: str, name: str, version: str, package: str = 'scRNAseq', cache_dir: str = '/home/runner/gypsum/cache', deduplicate: bool = True, probation: bool = False, url: str = 'https://gypsum.artifactdb.com', token: str | None = None, concurrent: int = 1, abort_failed: bool = True)
Upload the dataset to the gypsum bucket.
This is a wrapper around upload_directory() specific to the scRNAseq package.
See also
upload_directory(), to upload a directory to the gypsum backend.
- Parameters:
name – Dataset name.
version – Version name.
directory – Path to a directory containing the files to be uploaded. This directory is assumed to correspond to a version of an asset.
cache_dir –
Path to the cache for saving files, e.g., in save_version().
Used to convert symbolic links to upload links, see prepare_directory_upload().
deduplicate – Whether the backend should attempt deduplication of files in the immediately previous version. Defaults to True.
probation – Whether to perform a probational upload. Defaults to False.
url – URL of the gypsum REST API.
token – GitHub access token to authenticate to the gypsum REST API.
concurrent – Number of concurrent uploads. Defaults to 1.
abort_failed –
Whether to abort the upload on any failure.
Setting this to False can be helpful for diagnosing upload problems.
- Returns:
True if successful, otherwise False.
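The polish, save, and upload steps documented in this package can be chained into one helper. This is a sketch under assumptions: `publish_dataset` and its `api` argument are hypothetical names, it assumes `polish_dataset`, `save_dataset`, and `upload_dataset` are importable from the top-level `scrnaseq` namespace, and a real upload would additionally need a valid `token`.

```python
import os
import tempfile


def publish_dataset(x, metadata, name, version, probation=False, api=None):
    """Polish, save, and upload `x` in one call (hypothetical helper)."""
    if api is None:
        import scrnaseq as api  # lazy import: only needed for a real upload

    # save_dataset() expects a path to a new directory, so use a path
    # inside a fresh temporary directory that does not exist yet.
    staging_dir = os.path.join(tempfile.mkdtemp(), "dataset")

    polished = api.polish_dataset(x)
    api.save_dataset(polished, staging_dir, metadata)
    return api.upload_dataset(staging_dir, name, version, probation=probation)
```

Passing `probation=True` requests a probational upload, as described in the parameter list above.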
scrnaseq.utils module
- scrnaseq.utils.format_object_metadata(x) -> dict
Format object related metadata.
Create object-related metadata to validate against the default schema from fetch_metadata_schema(). This is intended for downstream package developers who are auto-generating metadata documents to be validated by validate_metadata().
- Parameters:
x – A Python object, typically an instance of a BiocPy class.
- Returns:
Dictionary containing metadata for the object.
- scrnaseq.utils.realize_array(x)
Realize a ReloadedArray into a dense array or sparse matrix.
- Parameters:
x – ReloadedArray object.
- Returns:
Realized array or matrix.
- scrnaseq.utils.single_cell_load_object(path: str, metadata: dict | None = None, scrnaseq_realize_assays: bool = False, scrnaseq_realize_reduced_dims: bool = True, **kwargs)
Load a SummarizedExperiment or SingleCellExperiment object from a file.
- Parameters:
path – Path to the dataset.
metadata – Metadata for the dataset. Defaults to None.
scrnaseq_realize_assays – Whether to realize assays into memory. Defaults to False.
scrnaseq_realize_reduced_dims – Whether to realize reduced dimensions into memory. Defaults to True.
**kwargs – Further arguments to pass to read_object().
- Returns:
A SingleCellExperiment or a SummarizedExperiment derivative of the object.