genomicranges package¶

Subpackages¶

genomicranges.io package

Submodules¶

genomicranges.GenomicRanges module¶

class genomicranges.GenomicRanges.GenomicRanges(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶

Bases: object

GenomicRanges provides a container class to represent and operate over genomic regions and annotations.

Note: The documentation for some of the methods are derived from the GenomicRanges R/Bioconductor package.

__copy__()[source]¶

Returns:: A shallow copy of the current GenomicRanges.

__deepcopy__(memo=None, _nil=[])[source]¶

Returns:: A deep copy of the current GenomicRanges.

__getitem__(subset)[source]¶

Alias to get_subset.

Return type:: GenomicRanges

__init__(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶

Initialize a GenomicRanges object.

Parameters:

seqnames (Sequence[str]) – List of sequence or chromosome names.
ranges (IRanges) – Genomic positions and widths of each position. Must have the same length as seqnames.
strand (Union[Sequence[str], Sequence[int], ndarray, None]) –
Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0 (any strand).

May be provided as a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be mapped to 1, -1 or 0 respectively.
names (Optional[Sequence[str]]) – Names for each genomic range. Defaults to None, which means the ranges are unnamed.
mcols (Optional[BiocFrame]) – A ~py:class:~biocframe.BiocFrame.BiocFrame with the number of rows same as number of genomic ranges, containing per-range annotation. Defaults to None, in which case an empty BiocFrame object is created.
seqinfo (Optional[SeqInfo]) – Sequence information. Defaults to None, in which case a SeqInfo object is created with the unique set of chromosome names from seqnames.
metadata (Optional[dict]) – Additional metadata. Defaults to None, and is assigned to an empty dictionary.
validate (bool) – Internal use only.

__iter__()[source]¶

Iterator over rows.

Return type:: GenomicRangesIter

__len__()[source]¶

Return type:: int
Returns:: Number of rows.

__repr__()[source]¶

Return type:: str
Returns:: A string representation of this GenomicRanges.

__setitem__(args, value)[source]¶

Alias to set_subset.

This operation modifies object in-place.

Return type:: GenomicRanges

binned_average(scorename, bins, outname='binned_average', in_place=False)[source]¶

Calculate average for a column across all regions in bins, then set a column specified by ‘outname’ with those values.

Parameters:

scorename (str) – Score column to compute averages on.
bins (GenomicRanges) – Bins you want to use.
outname (str) – New column name to add to the object.
in_place (bool) – Whether to modify bins in place.

Raises:

ValueError – If scorename column does not exist. scorename is not all ints or floats.
TypeError – If bins is not of type GenomicRanges.

Return type:

GenomicRanges

Returns:

A modified bins object with the computed averages, either as a copy of the original or as a reference to the (in-place-modified) original.

copy()[source]¶: Alias for __copy__().

count_overlaps(query, query_type='any', max_gap=-1, min_overlap=0, ignore_strand=False, num_threads=1)[source]¶

Count overlaps between subject (self) and a query GenomicRanges object.

Parameters:

query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) –
Overlap query type, must be one of
- ”any”: Any overlap is good
- ”start”: Overlap at the beginning of the intervals
- ”end”: Must overlap at the end of the intervals
- ”within”: Fully contain the query interval
Defaults to “any”.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

ndarray

Returns:

NumPy vector with length matching query, value represents the number of overlaps in self for each query.

coverage(shift=0, width=None, weight=1, ignore_strand=True)[source]¶

Calculate coverage for each chromosome. For each position, this method counts the number of ranges that cover it.

Parameters:

shift (int) – Shift all genomic positions. Defaults to 0.
width (Optional[int]) – Restrict the width of all chromosomes. Defaults to None.
weight (int) – Weight to use for each range. Defaults to 1.
ignore_strand (bool) – Whether to ignore strand, effectively combining all strands for coverage calculation. Defaults to True.

Return type:

Dict[str, ndarray]

Returns:

A dictionary with chromosome names as keys and the coverage vector as values. The dictionary is sorted by chromosome name.

disjoin(with_reverse_map=False, ignore_strand=False)[source]¶

Calculate disjoint genomic positions for each distinct (seqname, strand) pair.

Parameters:

with_reverse_map (bool) – Whether to return a map of indices back to the original object. Defaults to False.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges containing disjoint ranges.

disjoint_bins(ignore_strand=False)[source]¶

Split ranges into a set of bins so that the ranges in each bin are disjoint.

Parameters:: ignore_strand (bool) – Whether to ignore strands. Defaults to False.
Return type:: ndarray
Returns:: An ndarray indicating the bin index for each range.

distance(query)[source]¶

Compute the pair-wise distance with intervals in query.

Parameters:: query (Union[GenomicRanges, IRanges]) – Query GenomicRanges or IRanges.
Return type:: ndarray
Returns:: Numpy vector containing distances for each interval in query.

classmethod empty()[source]¶

Create an zero-length GenomicRanges object.

Returns:: same type as caller, in this case a GenomicRanges.

property end: ndarray¶: Alias for get_end.

extract_groups_by_seqnames()[source]¶

find_overlaps(query, query_type='any', select='all', max_gap=-1, min_overlap=0, ignore_strand=False, num_threads=1)[source]¶

Find overlaps between subject (self) and a query GenomicRanges object.

Parameters:

query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) –
Overlap query type, must be one of
- ”any”: Any overlap is good
- ”start”: Overlap at the beginning of the intervals
- ”end”: Must overlap at the end of the intervals
- ”within”: Fully contain the query interval
Defaults to “any”.
select (Literal['all', 'first', 'last', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in subject.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.

Raises:

TypeError – If query is not of type GenomicRanges.

Returns:

query_hits: Indices into query ranges
self_hits: Corresponding indices into self ranges that are upstream

Each row represents a query-self pair where self overlaps query.

Return type:

A BiocFrame with two columns

flank(width, start=True, both=False, ignore_strand=False, in_place=False)[source]¶

Compute flanking ranges for each range. The logic for this comes from the R/GenomicRanges & IRanges packages.

If start is True for a given range, the flanking occurs at the start, otherwise the end. The widths of the flanks are given by the width parameter.

width can be negative, in which case the flanking region is reversed so that it represents a prefix or suffix of the range.

Usage:

gr.flank(3, True), where “x” indicates a range in gr and “-” indicates the resulting flanking region:

—xxxxxxx

If start were False, the range in gr becomes
xxxxxxx—

For negative width, i.e. gr.flank(x, -3, FALSE), where “*” indicates the overlap between “x” and the result:

xxxx***

If both is True, then, for all ranges in “x”, the flanking regions are extended into (or out of, if width is negative) the range, so that the result straddles the given endpoint and has twice the width given by width.

This is illustrated below for gr.flank(3, both=TRUE):
—***xxxx

Parameters:

width (int) – Width to flank by. May be negative.
start (Union[bool, ndarray, List[bool]]) –
Whether to only flank starts. Defaults to True.

Alternatively, you may provide a list of start values, whose length is the same as the number of ranges.
both (bool) – Whether to flank both starts and ends. Defaults to False.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the flanked regions, either as a copy of the original or as a reference to the (in-place-modified) original.

follow(query, select='last', ignore_strand=False, num_threads=1)[source]¶

Search nearest positions only upstream that overlap with each range in query.

Parameters:

query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'last']) – Whether to return “all” hits or just “last”. Defaults to “last”.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.

Returns:

A numpy array of integers with length matching query, containing indices: into self for the closest downstream position of each query range. Value may be None if there are no matches.
If select=”all”:: A BiocFrame with two columns: - query_hits: Indices into query ranges - self_hits: Corresponding indices into self ranges that are upstream Each row represents a query-self pair where self follows query.

Return type:

If select=”last”

classmethod from_pandas(input)[source]¶

Create an GenomicRanges object from a DataFrame.

Parameters:: input – Input data. Must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”. ‘ends’ are expected to be inclusive.
Return type:: GenomicRanges
Returns:: A GenomicRanges object.

classmethod from_polars(input)[source]¶

Create an GenomicRanges object from a DataFrame.

Parameters:: input – Input polars DataFrame. Must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”. ‘ends’ are expected to be inclusive.
Return type:: GenomicRanges
Returns:: A GenomicRanges object.

gaps(start=1, end=None, ignore_strand=False)[source]¶

Identify complemented ranges for each distinct (seqname, strand) pair.

Parameters:

start (int) – Restrict chromosome start position. Defaults to 1.
end (Union[int, Dict[str, int], None]) –
Restrict end position for each chromosome. Defaults to None. If None, extracts sequence information from seqinfo object if available.

Alternatively, you may provide a dictionary with seqnames as keys and the values specifying the ends.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with complement ranges.

get_end()[source]¶

Get all end positions.

Return type:: ndarray
Returns:: NumPy array of 32-bit signed integers containing the end positions for all ranges.

get_mcols()[source]¶

Return type:: BiocFrame
Returns:: A ~py:class:~biocframe.BiocFrame.BiocFrame containing per-range annotations.

get_metadata()[source]¶

Return type:: dict
Returns:: Dictionary of metadata for this object.

get_names()[source]¶

Return type:: Names
Returns:: A list of names for each genomic range.

get_out_of_bound_index()[source]¶

Find indices of genomic ranges that are out of bounds.

Return type:: ndarray
Returns:: A numpy array of integer indices where ranges are out of bounds.

get_ranges()[source]¶

Return type:: IRanges
Returns:: An IRanges object containing the positions.

get_seqinfo()[source]¶

Return type:: SeqInfo
Returns:: A ~py:class:~genomicranges.SeqInfo.SeqInfo containing sequence information.

get_seqlengths()[source]¶

Get sequence lengths for each genomic range.

Return type:: ndarray
Returns:: An ndarray containing the sequence lengths.

get_seqnames(as_type='list')[source]¶

Access sequence names.

Parameters:

as_type (Literal['factor', 'list']) –

Access seqnames as factor tuple, in which case, levels and codes are returned.

If list, then codes are mapped to levels and returned.

Return type:

Union[Factor, List[str]]

Returns:

A biocutils.Factor if as_type=”factor”. Otherwise a list of sequence names.

get_start()[source]¶

Get all start positions.

Return type:: ndarray
Returns:: NumPy array of 32-bit signed integers containing the start positions for all ranges.

get_strand(as_type='numpy')[source]¶

Access strand information.

Parameters:

as_type (Literal['numpy', 'factor', 'list']) –

Access seqnames as factor codes, in which case, a numpy: vector is retuned.

If factor, a tuple with codes as the strand vector and levels a dictionary containing the mapping.

If list, then codes are mapped to levels and returned.

Return type:

Union[Tuple[ndarray, dict], List[str]]

Returns:

A numpy vector representing strand, 0 for any strand, -1 for reverse strand and 1 for forward strand.

A tuple of codes and levels.

A list of “+”, “-”, or “*” for each range.

get_subset(subset)[source]¶

Subset GenomicRanges, based on their indices or names.

Parameters:

subset (Union[str, int, bool, Sequence]) –

Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

Strings may only be used if :py:attr:~names are available (see get_names()). The first occurrence of each string in the names is used for extraction.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the ranges of interest.

get_width()[source]¶

Get width of each genomic range.

Return type:: ndarray
Returns:: NumPy array of 32-bit signed integers containing the width for all ranges.

intersect(other, ignore_strand=False)[source]¶

Find intersecting genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with intersecting ranges.

intersect_ncls(other, delete_index=True, num_threads=1)[source]¶

Find intersecting genomic intervals with other (uses NCLS index).

Parameters:

other (GenomicRanges) – The other GenomicRanges object.
delete_index (bool) – Defaults to True, to delete the cached ncls index. Set to False, to reuse the index across multiple queries.
num_threads (int) – Number of threads to use. Defaults to 1.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with intersecting ranges.

invert_strand(in_place=False)[source]¶

Invert strand for each range.

Conversion map:

“+” map to “-”
“-” becomes “+”
“*” stays the same

Parameters:: in_place (bool) – Whether to modify the object in place. Defaults to False.
Return type:: GenomicRanges
Returns:: A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

is_disjoint(ignore_strand=False)[source]¶

Calculate disjoint genomic positions for each distinct (seqname, strand) pair.

Parameters:: ignore_strand (bool) – Whether to ignore strands. Defaults to False.
Return type:: bool
Returns:: True if all ranges are disjoint, otherwise False.

match(query, ignore_strand=False)[source]¶

Element wise comparison to find exact match ranges.

Parameters:

query (GenomicRanges) – Query GenomicRanges to search for matches.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.

Return type:

ndarray

Returns:

A NumPy array with length matching query containing the matched indices.

property mcols: BiocFrame¶: Alias for get_mcols().

property metadata: dict¶: Alias for get_metadata.

property names: Names¶: Alias for get_names().

narrow(start=None, width=None, end=None, in_place=False)[source]¶

Narrow genomic positions by provided start, width and end parameters.

Important: these parameters are relative shift in positions for each range.

Parameters:

start (Union[int, List[int], ndarray, None]) – Relative start position. Defaults to None.
width (Union[int, List[int], ndarray, None]) – Relative end position. Defaults to None.
end (Union[int, List[int], ndarray, None]) – Relative width of the interval. Defaults to None.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

nearest(query, select='arbitrary', ignore_strand=False, num_threads=1, adjacent_equals_overlap=True)[source]¶

Search nearest positions both upstream and downstream that overlap with each range in query.

Parameters:

query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in query.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.
adjacent_equals_overlap (bool) –
Whether to consider immediately-adjacent subject intervals to be equally “nearest” to the query as an overlapping subject interval.

If true, both overlapping and immediately-adjacent subject intervals (i.e., a gap of zero) will be reported in matches.

Otherwise, immediately-adjacent subjects will only be reported if overlapping subjects are not present.

Returns:

A numpy array of integers with length matching query, containing indices: into self for the closest for each query range. Value may be None if there are no matches.
If select=”all”:: A BiocFrame with two columns: - query_hits: Indices into query ranges - self_hits: Corresponding indices into self ranges that are upstream Each row represents a query-self pair where subject is nearest to query.

Return type:

If select=”arbitrary”

order(decreasing=False)[source]¶

Get the order of indices for sorting.

Order orders the genomic ranges by chromosome and strand. Strand is ordered by reverse first (-1), any strand (0) and forward strand (-1). Then by the start positions and width if two regions have the same start.

Parameters:: decreasing (bool) – Whether to sort in descending order. Defaults to False.
Return type:: ndarray
Returns:: NumPy vector containing index positions in the sorted order.

precede(query, select='first', ignore_strand=False, num_threads=1)[source]¶

Search nearest positions only downstream that overlap with each range in query.

Parameters:

query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'first']) – Whether to return “all” hits or just “first”. Defaults to “first”.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.

Returns:

A numpy array of integers with length matching query, containing indices: into self for the closest upstream position of each query range. Value may be None if there are no matches.
If select=”all”:: A BiocFrame with two columns: - query_hits: Indices into query ranges - self_hits: Corresponding indices into self ranges that are upstream Each row represents a query-self pair where self precedes query.

Return type:

If select=”first”

promoters(upstream=2000, downstream=200, in_place=False)[source]¶

Extend ranges to promoter regions.

Generates promoter ranges relative to the transcription start site (TSS).

Parameters:

upstream (int) – Number of positions to extend in the 5’ direction. Defaults to 2000.
downstream (int) – Number of positions to extend in the 3’ direction. Defaults to 200.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the promoter regions, either as a copy of the original or as a reference to the (in-place-modified) original.

range(with_reverse_map=False, ignore_strand=False)[source]¶

Calculate range bounds for each distinct (seqname, strand) pair.

Parameters:

with_reverse_map (bool) – Whether to return map of indices back to original object. Defaults to False.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the range bounds.

property ranges: IRanges¶: Alias for get_ranges().

rank()[source]¶

Get rank of the GenomicRanges object.

For each range identifies its position is a sorted order.

Return type:: List[int]
Returns:: Numpy vector containing rank.

reduce(with_reverse_map=False, drop_empty_ranges=False, min_gap_width=1, ignore_strand=False)[source]¶

Reduce orders the ranges, then merges overlapping or adjacent ranges.

Parameters:

with_reverse_map (bool) – Whether to return map of indices back to original object. Defaults to False.
drop_empty_ranges (bool) – Whether to drop empty ranges. Defaults to False.
min_gap_width (int) – Ranges separated by a gap of at least min_gap_width positions are not merged. Defaults to 1.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with reduced intervals.

resize(width, fix='start', ignore_strand=False, in_place=False)[source]¶

Resize ranges to the specified width where either the start, end, or center is used as an anchor.

Parameters:

width (Union[int, List[int], ndarray]) – Width to resize, cannot be negative!
fix (Union[Literal['start', 'end', 'center'], List[Literal['start', 'end', 'center']]]) –
Fix positions by “start”, “end”, or “center”. Defaults to “start”.

Alternatively, may provide a list of fix positions for each genomic range.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Raises:

ValueError – If parameter fix is neither start, end, nor center. If width is negative.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the resized regions, either as a copy of the original or as a reference to the (in-place-modified) original.

restrict(start=None, end=None, keep_all_ranges=False)[source]¶

Restrict ranges to a given start and end positions.

Parameters:

start (Union[int, Dict[str, int], ndarray, None]) –
Start position. Defaults to None.

Alternatively may provide a dictionary mapping sequence names to starts, or array of starts.
end (Union[int, Dict[str, int], ndarray, None]) –
End position. Defaults to None.

Alternatively may provide a dictionary mapping sequence names to starts, or array of starts.
keep_all_ranges (bool) – Whether to keep intervals that do not overlap with start and end. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the restricted regions.

sample(k=5)[source]¶

Randomly sample k intervals.

Parameters:: k (int) – Number of intervals to sample. Defaults to 5.
Return type:: GenomicRanges
Returns:: A new GenomicRanges with randomly sampled ranges.

property seqinfo: ndarray¶: Alias for get_seqinfo().

property seqnames: ndarray | List[str]¶: Alias for get_seqnames().

set_mcols(mcols, in_place=False)[source]¶

Set new range metadata.

Parameters:

mcols (Optional[BiocFrame]) –
A ~py:class:~biocframe.BiocFrame.BiocFrame with length same as the number of ranges, containing per-range annotations.

May be None to remove range metadata.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_metadata(metadata, in_place=False)[source]¶

Set additional metadata.

Parameters:

metadata (dict) – New metadata for this object.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_names(names, in_place=False)[source]¶

Set new names.

Parameters:

names (Optional[Sequence[str]]) – Names for each genomic range. May be None to remove names.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_ranges(ranges, in_place=False)[source]¶

Set new ranges.

Parameters:

ranges (IRanges) – IRanges containing the genomic positions and widths. Must have the same length as seqnames.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqinfo(seqinfo, in_place=False)[source]¶

Set new sequence information.

Parameters:

seqinfo (Optional[SeqInfo]) –
A ~py:class:~genomicranges.SeqInfo.SeqInfo object containing information about sequences in seqnames.

May be None to remove sequence information. This would then generate a new sequence information based on the current sequence names.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqnames(seqnames, in_place=False)[source]¶

Set new sequence names.

Parameters:

seqnames (Union[Sequence[str], ndarray]) – List of sequence or chromosome names. Optionally can be a numpy array with indices mapped to :py:attr:~seqinfo.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_strand(strand, in_place=False)[source]¶

Set new strand information.

Parameters:

strand (Union[Sequence[str], Sequence[int], ndarray, None]) –
Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand).

Alternatively, may provide a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be automatically mapped to 1, -1 or 0 respectively.

May be set to None; in which case all genomic ranges are assumed to be 0.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_subset(args, value, in_place=False)[source]¶

Udate positions.

Parameters:

args (Union[Sequence, int, str, bool, slice, range]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be replaced, see normalize_subscript().
value (GenomicRanges) – An GenomicRanges object of length equal to the number of ranges to be replaced, as specified by subset.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

setdiff(other, ignore_strand=False)[source]¶

Find set difference of genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the diff ranges.

shift(shift=0, in_place=False)[source]¶

Shift all intervals.

Parameters:

shift (Union[int, List[int], ndarray]) – Shift interval. If shift is 0, the current object is returned. Defaults to 0.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the shifted regions, either as a copy of the original or as a reference to the (in-place-modified) original.

sliding_windows(width, step=1)[source]¶

Slide along each range by width (intervals with equal width) and step.

Also, checkout tile_genome() for splitting a gneomic into chunks, or tile_by_range().

Parameters:

n – Number of intervals to split into. Defaults to None.
width (int) – Width of each interval. Defaults to None.

Return type:

List[GenomicRanges]

Returns:

A new GenomicRanges with the sliding ranges.

sort(decreasing=False, in_place=False)[source]¶

Get the order of indices for sorting.

Parameters:

decreasing (bool) – Whether to sort in descending order. Defaults to False.
in_place (bool) – Whether to modify the object in place. Defaults to False.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

split(groups)[source]¶

Split the GenomicRanges object into a GenomicRangesList.

Parameters:

groups (list) –

A list specifying the groups or factors to split by.

Must have the same length as the number of genomic elements in the object.

Return type:

GenomicRangesList

Returns:

A GenomicRangesList containing the groups and their corresponding elements.

property start: ndarray¶: Alias for get_start.

property strand: ndarray¶: Alias for get_strand().

subset_by_overlaps(query, query_type='any', max_gap=-1, min_overlap=0, ignore_strand=False, num_threads=1)[source]¶

Subset subject (self) with overlaps in query GenomicRanges object.

Parameters:

query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) –
Overlap query type, must be one of
- ”any”: Any overlap is good
- ”start”: Overlap at the beginning of the intervals
- ”end”: Must overlap at the end of the intervals
- ”within”: Fully contain the query interval
Defaults to “any”.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
num_threads (int) – Number of threads to use. Defaults to 1.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

GenomicRanges

Returns:

A GenomicRanges object containing overlapping ranges.

subtract(other, min_overlap=1, ignore_strand=False)[source]¶

Subtract searches for features in x that overlap self by at least the number of base pairs given by min_overlap.

Parameters:

other (GenomicRanges) – Object to subtract.
min_overlap (int) – Minimum overlap with query. Defaults to 1.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRangesList

Returns:

A GenomicRangesList with the same size as self containing the subtracted regions.

terminators(upstream=2000, downstream=200, in_place=False)[source]¶

Extend ranges to termiantor regions.

Generates terminator ranges relative to the transcription end site (TES).

Parameters:

upstream (int) – Number of positions to extend in the 5’ direction. Defaults to 2000.
downstream (int) – Number of positions to extend in the 3’ direction. Defaults to 200.
in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the promoter regions, either as a copy of the original or as a reference to the (in-place-modified) original.

tile(n=None, width=None)[source]¶

Split each interval by n (number of sub intervals) or width (intervals with equal width).

Note: Either n or width must be provided but not both.

Also, checkout tile_genome() for splitting a genome into chunks.

Parameters:

n (Optional[int]) – Number of intervals to split into. Defaults to None.
width (Optional[int]) – Width of each interval. Defaults to None.

Raises:

ValueError – If both n and width are provided.

Return type:

List[GenomicRanges]

Returns:

List of GenomicRanges with the split ranges.

classmethod tile_genome(seqlengths, ntile=None, tilewidth=None, cut_last_tile_in_chrom=False)[source]¶

Tile genome into approximately equal-sized regions.

Parameters:

seqlengths (Dict[str, int]) – Dictionary of sequence lengths by chromosome.
ntile (Optional[int]) – Number of tiles (exclusive with tilewidth).
tilewidth (Optional[int]) – Width of tiles (exclusive with ntile).
cut_last_tile_in_chrom (bool) – Whether to cut the last tile in each chromosome.

Return type:

GenomicRanges

Returns:

GenomicRanges object with ranges and bin numbers in mcols.

to_pandas()[source]¶

Convert this GenomicRanges to a DataFrame object.

Returns:: A DataFrame object.

to_polars()[source]¶

Convert this GenomicRanges to a DataFrame object.

Returns:: A DataFrame object.

trim(in_place=False)[source]¶

Trim sequences outside of bounds for non-circular chromosomes.

Parameters:: in_place (bool) – Whether to modify the GenomicRanges object in place.
Return type:: GenomicRanges
Returns:: A new GenomicRanges object with the trimmed regions.

union(other, ignore_strand=False)[source]¶

Find union of genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with all ranges.

property width: ndarray¶: Alias for get_width.

class genomicranges.GenomicRanges.GenomicRangesIter(obj)[source]¶

Bases: object

An iterator to a GenomicRanges object.

__init__(obj)[source]¶

Initialize the iterator.

Parameters:: obj (GenomicRanges) – Source object to iterate.

__iter__()[source]¶

__next__()[source]¶

genomicranges.GenomicRangesList module¶

class genomicranges.GenomicRangesList.GenomicRangesList(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶

Bases: object

Just as it sounds, a GenomicRangesList is a named-list like object.

If you are wondering why you need this class, a GenomicRanges object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub regions, e.g. exons. GenomicRangesList allows us to represent this nested structure.

Currently, this class is limited in functionality, purely a read-only class with basic accessors.

Typical usage:

To construct a GenomicRangesList object, simply pass in a list of genomicranges.GenomicRanges.GenomicRanges objects and Optionally names.

a = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr1",
        "chr3",
    ],
    ranges=IRanges(
        [
            1,
            3,
            2,
            4,
        ],
        [
            10,
            30,
            50,
            60,
        ],
    ),
    strand=[
        "-",
        "+",
        "*",
        "+",
    ],
    mcols=BiocFrame(
        {
            "score": [
                1,
                2,
                3,
                4,
            ]
        }
    ),
)

b = GenomicRanges(
    seqnames=[
        "chr2",
        "chr4",
        "chr5",
    ],
    ranges=IRanges(
        [3, 6, 4],
        [
            30,
            50,
            60,
        ],
    ),
    strand=[
        "-",
        "+",
        "*",
    ],
    mcols=BiocFrame(
        {
            "score": [
                2,
                3,
                4,
            ]
        }
    ),
)

grl = GenomicRangesList(
    ranges=[
        gr1,
        gr2,
    ],
    names=[
        "gene1",
        "gene2",
    ],
)

Additionally, you may also provide metadata about the genomic elements in the dictionary using mcols attribute.

__copy__()[source]¶

Returns:: A shallow copy of the current GenomicRangesList.

__deepcopy__(memo=None, _nil=[])[source]¶

Returns:: A deep copy of the current GenomicRangesList.

__getitem__(args)[source]¶

Subset individual genomic elements.

Parameters:

args (Union[str, int, tuple, list, slice]) –

Name of the genomic element to access.

Alternatively, if names of genomic elements are not available, you may provide an index position of the genomic element to access.

Alternatively, args may also specify a list of positions to slice specified either as a list or slice.

A tuple may also be specified along each dimension. Currently if the tuple contains more than one dimension, its ignored.

Raises:

TypeError – If args is not a supported slice argument.

Return type:

Union[GenomicRanges, GenomicRangesList]

Returns:

A new GenomicRangesList of the slice.

__init__(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶

Initialize a GenomicRangesList object.

Parameters:

ranges (Union[GenomicRanges, List[GenomicRanges]]) – List of genomic elements. All elements in this list must be genomicranges.GenomicRanges.GenomicRanges objects.
range_lengths (Optional[Sequence[int]]) – Number of ranges within each genomic element. Defaults to None, and is inferred from ranges.
names (Optional[List[str]]) – Names of the genomic elements. The length of this must match the number of genomic elements in ranges. Defaults to None.
mcols (Optional[BiocFrame]) – Metadata about each genomic element. Defaults to None.
metadata (Optional[dict]) – Additional metadata. Defaults to None.
validate (bool) – Internal use only.

__iter__()[source]¶

Iterator over rows.

Return type:: GenomicRangesListIter

__len__()[source]¶

Return type:: int
Returns:: Number of rows.

__repr__()[source]¶

Return type:: str
Returns:: A string representation of this GenomicRangesList.

as_genomic_ranges()[source]¶

Coerce object to a GenomicRanges.

Return type:: GenomicRanges
Returns:: A GenomicRanges object.

as_granges()[source]¶

Alias to to_genomic_ranges().

Return type:: GenomicRanges

copy()[source]¶: Alias for __copy__().

element_nrows()[source]¶

Get a vector of the length of each element.

Return type:: Dict[str, List[str]]
Returns:: An integer vector where each value corresponds to the length of the contained GenomicRanges object.

classmethod empty(n)[source]¶

Create an empty n-length GenomicRangesList object.

Returns:: same type as caller, in this case a GenomicRangesList.

property end: Dict[str, List[int]]¶

Get all end positions.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

classmethod from_dict(x)[source]¶

Create a GenomicRangesList object from dict.

Returns:: same type as caller, in this case a GenomicRangesList.

generic_accessor(prop, func=False, cast=False)[source]¶

Return type:: Union[Dict[str, list], GenomicRangesList]

get_mcols()[source]¶

Return type:: BiocFrame
Returns:: A ~py:class:~biocframe.BiocFrame.BiocFrame containing per-genomic element annotations.

get_metadata()[source]¶

Return type:: dict
Returns:: Dictionary of metadata for this object.

get_names()[source]¶

Return type:: Names
Returns:: A list of names for each genomic element.

get_range_lengths()[source]¶

Return type:: dict
Returns:: Number of ranges for each genomic element.

get_ranges()[source]¶

Return type:: Union[GenomicRanges, List[GenomicRanges]]
Returns:: List of genomic ranges.

groups(group)[source]¶

Get a genomic element by their name or index position.

Parameters:: group (Union[str, int]) – Name or index of the genomic element to access.
Return type:: GenomicRangesList
Returns:: The genomic element for group i.

property is_circular: Dict[str, List[int]]¶

Get the circularity flag.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

is_empty()[source]¶

Whether GRangesList has no elements or if all its elements are empty.

Return type:: bool
Returns:: True if the object has no elements.

property mcols: BiocFrame¶: Alias for get_mcols().

property metadata: dict¶: Alias for get_metadata.

property names: Names¶: Alias for get_names().

range()[source]¶

Calculate range bounds for each genomic element.

Return type:: GenomicRangesList
Returns:: A new GenomicRangesList object with the range bounds.

property range_lengths: dict¶: Alias for get_range_lengths.

property ranges: GenomicRanges¶: Alias for get_ranges().

property seq_info: Dict[str, List[int]]¶

Get information about the underlying sequences.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

property seqnames: Dict[str, List[str]]¶

Get all sequence names.

Returns:: A list with the same length as keys in the object, each element in the list contains another list of sequence names.

set_mcols(mcols, in_place=False)[source]¶

Set new range metadata.

Parameters:

mcols (Optional[BiocFrame]) – A ~py:class:~biocframe.BiocFrame.BiocFrame with length same as the number of genomic elements, containing per-genomic element annotations.
in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_metadata(metadata, in_place=False)[source]¶

Set additional metadata.

Parameters:

metadata (dict) – New metadata for this object.
in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_names(names, in_place=False)[source]¶

Set new names.

Parameters:

names (Optional[Sequence[str]]) – Names for each genomic element.
in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_ranges(ranges, in_place=False)[source]¶

Set new genomic ranges.

Parameters:

ranges (Union[GenomicRanges, List[GenomicRanges]]) – List of genomic elements. All elements in this list must be genomicranges.GenomicRanges.GenomicRanges objects.
in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

property start: Dict[str, List[int]]¶

Get all start positions.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

property strand: Dict[str, List[int]]¶

Get strand of all regions across all elements.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

to_pandas()[source]¶

Coerce object to a DataFrame.

Return type:: pandas.DataFrame
Returns:: A DataFrame object.

property width: Dict[str, List[int]]¶

Get width of all regions across all elements.

Returns:: A list with the same length as keys in the object, each element in the list contains another list values.

class genomicranges.GenomicRangesList.GenomicRangesListIter(obj)[source]¶

Bases: object

An iterator to a GenomicRangesList object.

__init__(obj)[source]¶

Initialize the iterator.

Parameters:: obj (GenomicRangesList) – source object to iterate.

__iter__()[source]¶

__next__()[source]¶

genomicranges.SeqInfo module¶

class genomicranges.SeqInfo.SeqInfo(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶

Bases: object

Information about the reference sequences, specifically the name and length of each sequence, whether it is a circular, and the identity of the genome from which it was derived.

__copy__()[source]¶

Returns:: A shallow copy of the current SeqInfo.

__deepcopy__(memo=None, _nil=[])[source]¶

Returns:: A deep copy of the current SeqInfo.

__getitem__(subset)[source]¶

Alias to get_subset.

Return type:: SeqInfo

__init__(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶

Parameters:

seqnames (Sequence[str]) – Names of all reference sequences, should be unique.
seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –
Lengths of all sequences in base pairs. This should contain non-negative values and have the same number of elements as seqnames. Entries may also be None if no lengths are available for that sequence.

Alternatively, a dictionary where keys are the sequence names and values are the lengths. If a name is missing from this dictionary, the length of the sequence is set to None.

Alternatively a single integer, if all sequences are of the same length.

Alternatively None, if no length information is available for any sequence.
is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –
Whether each sequence is circular. This should have the same number of elements as seqnames. Entries may also be None if no information is available for that sequence.

Alternatively, a dictionary where keys are the sequence names and values are the circular flags. If a name is missing from this dictionary, the flag for the sequence is set to None.

Alternatively a single boolean, if all sequences have the same circular flag.

Alternatively None, if no flags are available for any sequence.
genome (Union[str, Sequence[str], Dict[str, str], None]) –
The genome build containing each reference sequence. This should have the same number of elements as seqnames. Entries may also be None if no information is available.

Alternatively, a dictionary where keys are the sequence names and values are the genomes. If a name is missing from this dictionary, the genome is set to None.

Alternatively a single string, if all sequences are derived from the same genome.

Alternatively None, if no genome information is available for any sequence.
validate (bool) – Whether to validate the arguments, internal use only.

__iter__()[source]¶

Iterator over sequences.

Return type:: SeqInfoIterator

__len__()[source]¶

Return type:: int
Returns:: Number of sequences in this object.

__repr__()[source]¶

Return type:: str
Returns:: A string representation of this SeqInfo.

copy()[source]¶: Alias for __copy__().

classmethod empty()[source]¶

Create an zero-length SeqInfo object.

Returns:: same type as caller, in this case a SeqInfo.

property genome: List[str]¶

get_genome()[source]¶

Return type:: List[str]
Returns:: A list of strings is returned containing the genome identity for all sequences in get_seqnames().

get_is_circular()[source]¶

Return type:: List[bool]
Returns:: A list of booleans is returned specifying whether each sequence (from get_seqnames()) is circular.

get_seqlengths()[source]¶

Return type:: List[int]
Returns:: A list of integers is returned containing the lengths of all sequences, in the same order as the sequence names from get_seqnames().

get_seqnames()[source]¶

Return type:: List[str]
Returns:: List of all chromosome names.

get_subset(subset)[source]¶

Subset SeqInfo, based on their indices or seqnames.

Parameters:

subset (Union[str, int, bool, Sequence]) –

Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

Strings may only be used if :py:attr:~seqnames are available (see get_seqnames()). The first occurrence of each string in the seqnames is used for extraction.

Return type:

SeqInfo

Returns:

A new SeqInfo object with the sequences of interest.

property is_circular: List[bool]¶

property seqlengths: List[int]¶

property seqnames: List[str]¶

set_genome(genome, in_place=False)[source]¶

Parameters:

genome (Union[str, Sequence[str], Dict[str, str], None]) – List of genomes, of length equal to the number of names in this SeqInfo object. Values may be None or strings.
in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_is_circular(is_circular, in_place=False)[source]¶

Parameters:

is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –
List of circular flags, of length equal to the number of names in this SeqInfo object. Values may be None or booleans.

Alternatively, a dictionary where keys are the sequence names and values are the flags. Not all names need to be present in which case the flag is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqlengths(seqlengths, in_place=False)[source]¶

Parameters:

seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –
List of sequence lengths, of length equal to the number of names in this SeqInfo object. Values may be None or non-negative integers.

Alternatively, a dictionary where keys are the sequence names and values are the lengths. Not all names need to be present in which case the length is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqnames(seqnames, in_place=False)[source]¶

Parameters:

seqnames (Sequence[str]) – List of sequence names, of length equal to the number of names in this SeqInfo object. All names should be unique strings.
in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

class genomicranges.SeqInfo.SeqInfoIterator(obj)[source]¶

Bases: object

An iterator to a SeqInfo object.

__init__(obj)[source]¶

Initialize the iterator.

Parameters:: obj (SeqInfo) – Source object to iterate.

__iter__()[source]¶

__next__()[source]¶

genomicranges.SeqInfo.merge_SeqInfo(objects)[source]¶

Merge multiple SeqInfo objects, taking the union of all reference sequences. If the same reference sequence is present with the same details across objects, only a single instance is present in the final object; if details are contradictory, they are replaced with None.

Parameters:: objects (List[SeqInfo]) – List of SeqInfo objects.
Return type:: SeqInfo
Returns:: A single merged SeqInfo object.

genomicranges.utils module¶

genomicranges.utils.compute_up_down(starts, ends, strands, upstream, downstream, site='TSS')[source]¶

Compute promoter or terminator regions for genomic ranges.

Parameters:

x – GenomicRanges object
upstream – Number of bases upstream (scalar or array)
downstream – Number of bases downstream (scalar or array)
site (str) – “TSS” for promoters or “TES” for terminators

Returns:

New starts and ends.

genomicranges.utils.extract_groups_from_granges(x, ignore_strand=False)[source]¶

genomicranges.utils.group_by_indices(groups)[source]¶

Return type:: dict

genomicranges.utils.sanitize_strand_vector(strand)[source]¶

Create a numpy representation for strand.

Mapping: 1 for “+” (forward strand), 0 for “*” (any strand) and -1 for “-” (reverse strand).

Parameters:: strand (Union[Sequence[str], Sequence[int], ndarray]) – List of strand.
Raises:: ValueError – If strand is None. If strand contains values other than +,- and *. If strand is not a numpy vector, string of integers or strings.
Return type:: ndarray
Returns:: A numpy vector.

genomicranges.utils.slide_intervals(start, end, width, step)[source]¶

Sliding intervals.

Parameters:

start (int) – Start interval.
end (int) – End interval.
step (int) – Step of each interval.
width (int) – Width of each interval.

Return type:

List

Returns:

List of intervals split into bins.

genomicranges.utils.split_intervals(start, end, step)[source]¶

Split an interval range into equal bins.

Parameters:

start (int) – Start interval.
end (int) – End interval.
step (int) – Width or step of each interval.

Return type:

List

Returns:

List of intervals split into bins.

genomicranges.utils.wrapper_follow_precede(args)[source]¶: Processes a single group for precede/follow operations. This function is designed to be called by a multiprocessing pool.

genomicranges package¶

Subpackages¶

Submodules¶

genomicranges.GenomicRanges module¶

genomicranges.GenomicRangesList module¶

genomicranges.SeqInfo module¶

genomicranges.utils module¶

Module contents¶