scranpy.feature_selection package

Submodules

scranpy.feature_selection.choose_hvgs module

class scranpy.feature_selection.choose_hvgs.ChooseHvgsOptions(number=2500)

Bases: object

Optional arguments for choose_hvgs().

number

Number of HVGs to retain. Larger values preserve more biological structure at the cost of increasing computational work and random noise from less-variable genes.

Defaults to 2500.

number: int = 2500

scranpy.feature_selection.choose_hvgs.choose_hvgs(stat, options=ChooseHvgsOptions(number=2500))

Choose highly variable genes for high-dimensional downstream steps such as run_pca(). This ensures that those steps focus on interesting biology, under the assumption that biological variation is larger than random noise.

Parameters:
  • stat (ndarray) – Array of variance modelling statistics, where larger values correspond to higher variability. This usually contains the residuals of the fitted mean-variance trend from model_gene_variances().

  • options (ChooseHvgsOptions) – Optional parameters.

Return type:

ndarray

Returns:

Array of booleans of length equal to stat, specifying whether a given gene is considered to be highly variable.
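The selection described above can be illustrated with a minimal sketch in plain Python. This is not the scranpy implementation: the helper name top_n_mask is hypothetical, the result is a list rather than an ndarray, and the handling of ties (earlier genes win) and of a number larger than the length of stat (everything is kept) are assumptions made for the example.

```python
def top_n_mask(stat, number=2500):
    """Mark the `number` largest values of `stat` with True (hypothetical helper)."""
    n = min(number, len(stat))
    # Order gene indices by decreasing statistic. sorted() is stable, so genes
    # with equal statistics keep their input order (an assumption of this sketch).
    order = sorted(range(len(stat)), key=lambda i: -stat[i])
    keep = set(order[:n])
    # One boolean per gene, in the same order as `stat`.
    return [i in keep for i in range(len(stat))]


# Residuals from a fitted trend for five genes; keep the two most variable.
residuals = [0.5, -0.1, 1.2, 0.0, 0.9]
mask = top_n_mask(residuals, number=2)
```

Here mask is True only for the third and fifth genes, which have the two largest residuals.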

scranpy.feature_selection.model_gene_variances module

class scranpy.feature_selection.model_gene_variances.ModelGeneVariancesOptions(block=None, span=0.3, assay_type='logcounts', feature_names=None, num_threads=1)

Bases: object

Optional arguments for model_gene_variances().

block

Block assignment for each cell. Variance modelling is performed within each block to avoid interference from inter-block differences.

If provided, this should have length equal to the number of cells, where cells have the same value if and only if they are in the same block. Defaults to None, indicating all cells are part of the same block.

span

Span to use for the LOWESS trend fitting. Larger values yield a smoother curve and reduce the risk of overfitting, at the cost of being less responsive to local variations. Defaults to 0.3.

assay_type

Name or index of the assay to use from input if it is a SummarizedExperiment. Defaults to 'logcounts'.

feature_names

Sequence of feature names of length equal to the number of rows in input. If provided, this is used as the row names of the output data frames. Defaults to None.

num_threads

Number of threads to use. Defaults to 1.

assay_type: Union[int, str] = 'logcounts'
block: Optional[Sequence] = None
feature_names: Optional[Sequence[str]] = None
num_threads: int = 1
span: float = 0.3

scranpy.feature_selection.model_gene_variances.model_gene_variances(input, options=ModelGeneVariancesOptions(block=None, span=0.3, assay_type='logcounts', feature_names=None, num_threads=1))

Compute gene variances and model them with a trend to account for non-trivial mean-variance relationships in count data. The residual from the trend can then be used to identify highly variable genes, e.g., with choose_hvgs().

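Parameters:
  • input – Input data with genes in rows. This may be a SummarizedExperiment, in which case the assay given by assay_type in options is used.

  • options (ModelGeneVariancesOptions) – Optional parameters.

Return type: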

BiocFrame

Returns:

Data frame with variance modelling results for each gene, specifically the mean log-expression, the variance, the fitted value of the mean-variance trend and the residual from the trend. Each row of the data frame corresponds to a row of input.

For multiple blocks, the data frame’s columns will represent the average across blocks. An extra per_block column will also be present containing a nested BiocFrame with the same per-block statistics.
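To illustrate why statistics are computed within blocks and then averaged, the following sketch handles a single gene in plain Python. It is not the scranpy implementation: the helper name blockwise_mean_and_variance is hypothetical, the across-block average is assumed to be unweighted, and every block is assumed to contain at least two cells so that a sample variance exists.

```python
from statistics import mean, variance


def blockwise_mean_and_variance(values, block):
    """Per-block mean and sample variance for one gene, plus their
    unweighted averages across blocks (hypothetical helper)."""
    # Group the gene's expression values by block label.
    groups = {}
    for v, b in zip(values, block):
        groups.setdefault(b, []).append(v)
    # Statistics computed separately inside each block.
    per_block = {
        b: {"mean": mean(vs), "variance": variance(vs)}
        for b, vs in groups.items()
    }
    # Assumption of this sketch: every block contributes equally to the average.
    averaged = {
        "mean": mean(s["mean"] for s in per_block.values()),
        "variance": mean(s["variance"] for s in per_block.values()),
    }
    return averaged, per_block


# One gene measured in four cells, two per block, with a large shift between blocks.
values = [1.0, 3.0, 10.0, 14.0]
block = ["A", "A", "B", "B"]
averaged, per_block = blockwise_mean_and_variance(values, block)
pooled = variance(values)  # variance when the block structure is ignored
```

In this toy input the within-block variances are 2 and 8, so the averaged variance is 5, whereas the pooled variance is roughly 36.7 because it also absorbs the shift between blocks. This is the inter-block interference that the block option is meant to avoid.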

Module contents