genomicranges package

Subpackages

Submodules

genomicranges.GenomicRanges module

class genomicranges.GenomicRanges.GenomicRanges(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]

Bases: object

GenomicRanges provides a container class to represent and operate over genomic regions and annotations.

Note: The documentation for some of the methods are derived from the GenomicRanges R/Bioconductor package.

__copy__()[source]
Returns:

A shallow copy of the current GenomicRanges.

__deepcopy__(memo=None, _nil=[])[source]
Returns:

A deep copy of the current GenomicRanges.

__getitem__(subset)[source]

Alias to get_subset.

Return type:

GenomicRanges

__init__(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]

Initialize a GenomicRanges object.

Parameters:
  • seqnames (Sequence[str]) – List of sequence or chromosome names.

  • ranges (IRanges) – Genomic positions and widths of each position. Must have the same length as seqnames.

  • strand (Union[Sequence[str], Sequence[int], ndarray, None]) –

    Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0.

    May be provided as a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be mapped accordingly to 1, -1 or 0.

  • names (Optional[Sequence[str]]) – Names for each genomic range. Defaults to None, which means the ranges are unnamed.

  • mcols (Optional[BiocFrame]) – A ~py:class:~biocframe.BiocFrame.BiocFrame with the number of rows same as number of genomic ranges, containing per-range annotation. Defaults to None, in which case an empty BiocFrame object is created.

  • seqinfo (Optional[SeqInfo]) – Sequence information. Defaults to None, in which case a SeqInfo object is created with the unique set of chromosome names from seqnames.

  • metadata (Optional[dict]) – Additional metadata. Defaults to None, and is assigned to an empty dictionary.

  • validate (bool) – Internal use only.

__iter__()[source]

Iterator over rows.

Return type:

GenomicRangesIter

__len__()[source]
Return type:

int

Returns:

Number of rows.

__repr__()[source]
Return type:

str

Returns:

A string representation of this GenomicRanges.

__setitem__(args, value)[source]

Alias to set_subset.

This operation modifies object in-place.

Return type:

GenomicRanges

binned_average(scorename, bins, outname='binned_average', in_place=False)[source]

Calculate average for a column across all regions in bins, then set a column specified by ‘outname’ with those values.

Parameters:
  • scorename (str) – Score column to compute averages on.

  • bins (GenomicRanges) – Bins you want to use.

  • outname (str) – New column name to add to the object.

  • in_place (bool) – Whether to modify bins in place.

Raises:
  • ValueError – If scorename column does not exist. scorename is not all ints or floats.

  • TypeError – If bins is not of type GenomicRanges.

Return type:

GenomicRanges

Returns:

A modified bins object with the computed averages, either as a copy of the original or as a reference to the (in-place-modified) original.

copy()[source]

Alias for __copy__().

count_overlaps(query, query_type='any', max_gap=-1, min_overlap=1, ignore_strand=False)[source]

Count overlaps between subject (self) and a query GenomicRanges object.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges.

  • query_type (Literal['any', 'start', 'end', 'within']) –

    Overlap query type, must be one of

    • ”any”: Any overlap is good

    • ”start”: Overlap at the beginning of the intervals

    • ”end”: Must overlap at the end of the intervals

    • ”within”: Fully contain the query interval

    Defaults to “any”.

  • max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).

  • min_overlap (int) – Minimum overlap with query. Defaults to 1.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

List[int]

Returns:

A list with the same length as query, containing number of overlapping indices.

coverage(shift=0, width=None, weight=1)[source]

Calculate coverage for each chromosome, For each position, counts the number of ranges that cover it.

Parameters:
  • shift (int) – Shift all genomic positions. Defaults to 0.

  • width (Optional[int]) – Restrict the width of all chromosomes. Defaults to None.

  • weight (int) – Weight to use. Defaults to 1.

Return type:

Dict[str, ndarray]

Returns:

A dictionary with chromosome names as keys and the coverage vector as value.

disjoin(with_reverse_map=False, ignore_strand=False)[source]

Calculate disjoint genomic positions for each distinct (seqname, strand) pair.

Parameters:
  • with_reverse_map (bool) – Whether to return a map of indices back to the original object. Defaults to False.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges containing disjoint ranges.

distance(query)[source]

Compute the pair-wise distance with intervals in query.

Parameters:

query (Union[GenomicRanges, IRanges]) – Query GenomicRanges or IRanges.

Return type:

ndarray

Returns:

Numpy vector containing distances for each interval in query.

classmethod empty()[source]

Create an zero-length GenomicRanges object.

Returns:

same type as caller, in this case a GenomicRanges.

property end: ndarray

Alias for get_end.

find_overlaps(query, query_type='any', select='all', max_gap=-1, min_overlap=1, ignore_strand=False)[source]

Find overlaps between subject (self) and a query GenomicRanges object.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges.

  • query_type (Literal['any', 'start', 'end', 'within']) –

    Overlap query type, must be one of

    • ”any”: Any overlap is good

    • ”start”: Overlap at the beginning of the intervals

    • ”end”: Must overlap at the end of the intervals

    • ”within”: Fully contain the query interval

    Defaults to “any”.

  • select (Literal['all', 'first', 'last', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in subject.

  • max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).

  • min_overlap (int) – Minimum overlap with query. Defaults to 1.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

List[List[int]]

Returns:

A list with the same length as query, containing hits to overlapping indices.

flank(width, start=True, both=False, ignore_strand=False, in_place=False)[source]

Compute flanking ranges for each range. The logic for this comes from the R/GenomicRanges & IRanges packages.

If start is True for a given range, the flanking occurs at the start, otherwise the end. The widths of the flanks are given by the width parameter.

width can be negative, in which case the flanking region is reversed so that it represents a prefix or suffix of the range.

Usage:

gr.flank(3, True), where “x” indicates a range in gr and “-” indicates the resulting flanking region:

—xxxxxxx

If start were False, the range in gr becomes

xxxxxxx—

For negative width, i.e. gr.flank(x, -3, FALSE), where “*” indicates the overlap between “x” and the result:

xxxx***

If both is True, then, for all ranges in “x”, the flanking regions are extended into (or out of, if width is negative) the range, so that the result straddles the given endpoint and has twice the width given by width.

This is illustrated below for gr.flank(3, both=TRUE):

***xxxx

Parameters:
  • width (int) – Width to flank by. May be negative.

  • start (bool) – Whether to only flank starts. Defaults to True.

  • both (bool) – Whether to flank both starts and ends. Defaults to False.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the flanked regions, either as a copy of the original or as a reference to the (in-place-modified) original.

follow(query, select='all', ignore_strand=False)[source]

Search nearest positions only upstream that overlap with each range in query.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges to find nearest positions.

  • select (Literal['all', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in query.

  • ignore_strand (bool) – Whether to ignore strand. Defaults to False.

Return type:

List[List[int]]

Returns:

A List with the same length as query, containing hits to nearest indices.

classmethod from_pandas(input)[source]

Create a GenomicRanges from a DataFrame object.

Parameters:

input (pandas.DataFrame) – Input data. must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”.

Return type:

GenomicRanges

Returns:

A GenomicRanges object.

classmethod from_polars(input)[source]

Create a GenomicRanges from a DataFrame object.

Parameters:

input (polars.DataFrame) – Input polars DataFrame. must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”.

Return type:

GenomicRanges

Returns:

A GenomicRanges object.

gaps(start=1, end=None, ignore_strand=False)[source]

Identify complemented ranges for each distinct (seqname, strand) pair.

Parameters:
  • start (int) – Restrict chromosome start position. Defaults to 1.

  • end (Union[int, Dict[str, int], None]) – Restrict end position for each chromosome. Defaults to None. If None, extracts sequence information from seqinfo object if available.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with complement ranges.

get_end()[source]

Get all end positions.

Return type:

ndarray

Returns:

NumPy array of 32-bit signed integers containing the end positions for all ranges.

get_mcols()[source]
Return type:

BiocFrame

Returns:

A ~py:class:~biocframe.BiocFrame.BiocFrame containing per-range annotations.

get_metadata()[source]
Return type:

dict

Returns:

Dictionary of metadata for this object.

get_names()[source]
Return type:

Names

Returns:

A list of names for each genomic range.

get_ranges()[source]
Return type:

IRanges

Returns:

An IRanges object containing the positions.

get_seqinfo()[source]
Return type:

SeqInfo

Returns:

A ~py:class:~genomicranges.SeqInfo.SeqInfo containing sequence information.

get_seqnames(as_type='list')[source]

Access sequence names.

Parameters:

as_type (Literal['factor', 'list']) –

Access seqnames as factor tuple, in which case, levels and codes are returned.

If list, then codes are mapped to levels and returned.

Return type:

Union[Tuple[ndarray, List[str]], List[str]]

Returns:

List of sequence names.

get_start()[source]

Get all start positions.

Return type:

ndarray

Returns:

NumPy array of 32-bit signed integers containing the start positions for all ranges.

get_strand(as_type='numpy')[source]

Access strand information.

Parameters:

as_type (Literal['numpy', 'factor', 'list']) –

Access seqnames as factor codes, in which case, a numpy

vector is retuned.

If factor, a tuple width levels as a dictionary and

indices to seqinfo.get_seqnames() is returned.

If list, then codes are mapped to levels and returned.

Return type:

Union[Tuple[ndarray, dict], List[str]]

Returns:

A numpy vector representing strand, 0 for any strand, -1 for reverse strand and 1 for forward strand.

A tuple of codes and levels.

A list of “+”, “-”, or “*” for each range.

get_subset(subset)[source]

Subset GenomicRanges, based on their indices or names.

Parameters:

subset (Union[str, int, bool, Sequence]) –

Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

Strings may only be used if :py:attr:~names are available (see get_names()). The first occurrence of each string in the names is used for extraction.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the ranges of interest.

get_width()[source]

Get width of each genomic range.

Return type:

ndarray

Returns:

NumPy array of 32-bit signed integers containing the width for all ranges.

intersect(other)[source]

Find intersecting genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.

Raises:

TypeError – If other is not a GenomicRanges.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with intersecting ranges.

intersect_ncls(other)[source]

Find intersecting genomic intervals with other (uses NCLS index).

Parameters:

other (GenomicRanges) – The other GenomicRanges object.

Raises:

TypeError – If other is not a GenomicRanges.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with intersecting ranges.

invert_strand(in_place=False)[source]

Invert strand for each range.

Conversion map:
  • “+” map to “-”

  • “-” becomes “+”

  • “*” stays the same

Parameters:

in_place (bool) – Whether to modify the object in place. Defaults to False.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

match(query)[source]

Element wise comparison to find exact match ranges.

Parameters:

query (GenomicRanges) – Query GenomicRanges to search for matches.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

List[List[int]]

Returns:

A List with the same length as query, containing hits to matching indices.

property mcols: BiocFrame

Alias for get_mcols().

property metadata: dict

Alias for get_metadata.

property names: Names

Alias for get_names().

narrow(start=None, width=None, end=None, in_place=False)[source]

Narrow genomic positions by provided start, width and end parameters.

Important: these parameters are relative shift in positions for each range.

Parameters:
Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

nearest(query, select='all', ignore_strand=False)[source]

Search nearest positions both upstream and downstream that overlap with each range in query.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges to find nearest positions.

  • select (Literal['all', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in query.

  • ignore_strand (bool) – Whether to ignore strand. Defaults to False.

Return type:

List[List[int]]

Returns:

A list with the same length as query, containing hits to nearest indices.

order(decreasing=False)[source]

Get the order of indices for sorting.

Order orders the genomic ranges by chromosome and strand. Strand is ordered by reverse first (-1), any strand (0) and forward strand (-1). Then by the start positions and width if two regions have the same start.

Parameters:

decreasing (bool) – Whether to sort in descending order. Defaults to False.

Return type:

ndarray

Returns:

NumPy vector containing index positions in the sorted order.

precede(query, select='all', ignore_strand=False)[source]

Search nearest positions only downstream that overlap with each range in query.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges to find nearest positions.

  • select (Literal['all', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in query.

  • ignore_strand (bool) – Whether to ignore strand. Defaults to False.

Return type:

List[List[int]]

Returns:

A List with the same length as query, containing hits to nearest indices.

promoters(upstream=2000, downstream=200, in_place=False)[source]

Extend intervals to promoter regions.

Generates promoter ranges relative to the transcription start site (TSS), where TSS is start(x). The promoter range is expanded around the TSS according to the upstream and downstream arguments. Upstream represents the number of nucleotides in the 5’ direction and downstream the number in the 3’ direction. The full range is defined as, (start(x) - upstream) to (start(x) + downstream - 1).

Parameters:
  • upstream (int) – Number of positions to extend in the 5’ direction. Defaults to 2000.

  • downstream (int) – Number of positions to extend in the 3’ direction. Defaults to 200.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the promoter regions, either as a copy of the original or as a reference to the (in-place-modified) original.

range(with_reverse_map=False, ignore_strand=False)[source]

Calculate range bounds for each distinct (seqname, strand) pair.

Parameters:
  • with_reverse_map (bool) – Whether to return map of indices back to original object. Defaults to False.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the range bounds.

property ranges: IRanges

Alias for get_ranges().

rank()[source]

Get rank of the GenomicRanges object.

For each range identifies its position is a sorted order.

Return type:

List[int]

Returns:

Numpy vector containing rank.

reduce(with_reverse_map=False, drop_empty_ranges=False, min_gap_width=1, ignore_strand=False)[source]

Reduce orders the ranges, then merges overlapping or adjacent ranges.

Parameters:
  • with_reverse_map (bool) – Whether to return map of indices back to original object. Defaults to False.

  • drop_empty_ranges (bool) – Whether to drop empty ranges. Defaults to False.

  • min_gap_width (int) – Ranges separated by a gap of at least min_gap_width positions are not merged. Defaults to 1.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with reduced intervals.

resize(width, fix='start', ignore_strand=False, in_place=False)[source]

Resize ranges to the specified width where either the start, end, or center is used as an anchor.

Parameters:
  • width (Union[int, List[int], ndarray]) – Width to resize, cannot be negative!

  • fix (Literal['start', 'end', 'center']) – Fix positions by “start”, “end”, or “center”. Defaults to “start”.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Raises:

ValueError – If parameter fix is neither start, end, nor center. If width is negative.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the resized regions, either as a copy of the original or as a reference to the (in-place-modified) original.

restrict(start=None, end=None, keep_all_ranges=False, in_place=False)[source]

Restrict ranges to a given start and end positions.

Parameters:
  • start (Union[int, List[int], ndarray, None]) – Start position. Defaults to None.

  • end (Union[int, List[int], ndarray, None]) – End position. Defaults to None.

  • keep_all_ranges (bool) – Whether to keep intervals that do not overlap with start and end. Defaults to False.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the restricted regions, either as a copy of the original or as a reference to the (in-place-modified) original.

sample(k=5)[source]

Randomly sample k intervals.

Parameters:

k (int) – Number of intervals to sample. Defaults to 5.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with randomly sampled ranges.

property seqinfo: ndarray

Alias for get_seqinfo().

property seqnames: ndarray | List[str]

Alias for get_seqnames().

set_mcols(mcols, in_place=False)[source]

Set new range metadata.

Parameters:
  • mcols (Optional[BiocFrame]) – A ~py:class:~biocframe.BiocFrame.BiocFrame with length same as the number of ranges, containing per-range annotations.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_metadata(metadata, in_place=False)[source]

Set additional metadata.

Parameters:
  • metadata (dict) – New metadata for this object.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_names(names, in_place=False)[source]

Set new names.

Parameters:
  • names (Optional[Sequence[str]]) – Names for each genomic range.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_ranges(ranges, in_place=False)[source]

Set new ranges.

Parameters:
  • ranges (IRanges) – IRanges containing the genomic positions and widths. Must have the same length as seqnames.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqinfo(seqinfo, in_place=False)[source]

Set new sequence information.

Parameters:
  • seqinfo (Optional[SeqInfo]) – A ~py:class:~genomicranges.SeqInfo.SeqInfo object contaning information about sequences in seqnames.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqnames(seqnames, in_place=False)[source]

Set new sequence names.

Parameters:
  • seqnames (Union[Sequence[str], ndarray]) – List of sequence or chromosome names. Optionally can be a numpy array with indices mapped to :py:attr:~seqinfo.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_strand(strand, in_place=False)[source]

Set new strand information.

Parameters:
  • strand (Union[Sequence[str], Sequence[int], ndarray, None]) –

    Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0.

    May be provided as a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be mapped accordingly to 1, -1 or 0.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_subset(args, value, in_place=False)[source]

Add or update positions (in-place operation).

Parameters:
  • args (Union[Sequence, int, str, bool, slice, range]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be replaced, see normalize_subscript().

  • value (GenomicRanges) – An GenomicRanges object of length equal to the number of ranges to be replaced, as specified by subset.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object, either as a copy of the original or as a reference to the (in-place-modified) original.

setdiff(other)[source]

Find set difference of genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.

Raises:

TypeError – If other is not of type GenomicRanges.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with the diff ranges.

shift(shift=0, in_place=False)[source]

Shift all intervals.

Parameters:
  • shift (Union[int, List[int], ndarray]) – Shift interval. If shift is 0, the current object is returned. Defaults to 0.

  • in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the shifted regions, either as a copy of the original or as a reference to the (in-place-modified) original.

sliding_windows(width, step=1)[source]

Slide along each range by width (intervals with equal width) and step.

Also, checkout tile_genome() for splitting a gneomic into chunks, or tile_by_range().

Parameters:
  • n – Number of intervals to split into. Defaults to None.

  • width (int) – Width of each interval. Defaults to None.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with the sliding ranges.

sort(decreasing=False, in_place=False)[source]

Get the order of indices for sorting.

Parameters:
  • decreasing (bool) – Whether to sort in descending order. Defaults to False.

  • in_place (bool) – Whether to modify the object in place. Defaults to False.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

split(groups)[source]

Split the GenomicRanges object into a GenomicRangesList.

Parameters:

groups (list) –

A list specifying the groups or factors to split by.

Must have the same length as the number of genomic elements in the object.

Return type:

GenomicRangesList

Returns:

A GenomicRangesList containing the groups and their corresponding elements.

property start: ndarray

Alias for get_start.

property strand: ndarray

Alias for get_strand().

subset_by_overlaps(query, query_type='any', max_gap=-1, min_overlap=1, ignore_strand=False)[source]

Subset subject (self) with overlaps in query GenomicRanges object.

Parameters:
  • query (GenomicRanges) – Query GenomicRanges.

  • query_type (Literal['any', 'start', 'end', 'within']) –

    Overlap query type, must be one of

    • ”any”: Any overlap is good

    • ”start”: Overlap at the beginning of the intervals

    • ”end”: Must overlap at the end of the intervals

    • ”within”: Fully contain the query interval

    Defaults to “any”.

  • max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).

  • min_overlap (int) – Minimum overlap with query. Defaults to 1.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Raises:

TypeError – If query is not of type GenomicRanges.

Return type:

GenomicRanges

Returns:

A GenomicRanges object containing overlapping ranges.

subtract(x, min_overlap=1, ignore_strand=False)[source]

Subtract searches for features in x that overlap self by at least the number of base pairs given by min_overlap.

Parameters:
  • x (GenomicRanges) – Object to subtract.

  • min_overlap (int) – Minimum overlap with query. Defaults to 1.

  • ignore_strand (bool) – Whether to ignore strands. Defaults to False.

Return type:

GenomicRangesList

Returns:

GenomicRangesList with the same size as self containing the subtracted regions.

tile(n=None, width=None)[source]

Split each interval by n (number of sub intervals) or width (intervals with equal width).

Note: Either n or width must be provided but not both.

Also, checkout tile_genome() for splitting a genome into chunks, or tile_by_range().

Parameters:
  • n (Optional[int]) – Number of intervals to split into. Defaults to None.

  • width (Optional[int]) – Width of each interval. Defaults to None.

Raises:

ValueError – If both n and width are provided.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with the split ranges.

tile_by_range(n=None, width=None)[source]

Split each sequence length into chunks by n (number of intervals) or width (intervals with equal width).

Note: Either n or width must be provided, but not both.

Also, checkout tile_genome() for splitting the genome into chunks.

Parameters:
  • n (Optional[int]) – Number of intervals to split into. Defaults to None.

  • width (Optional[int]) – Width of each interval. Defaults to None.

Raises:

ValueError – If both n or width are provided.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with the split ranges.

classmethod tile_genome(seqlengths, n=None, width=None)[source]

Create a new GenomicRanges by partitioning a specified genome.

If n is provided, the region is split into n intervals. The last interval may not contain the same ‘width’ as the other regions.

Alternatively, width may be provided for each interval. Similarly, the last region may be less than width.

Either n or width must be provided but not both.

Parameters:
  • seqlengths (Union[Dict, SeqInfo]) –

    Sequence lengths of each chromosome.

    seqlengths may be a dictionary, where keys specify the chromosome and the value is the length of each chromosome in the genome.

    Alternatively, seqlengths may be an instance of SeqInfo.

  • n (Optional[int]) – Number of intervals to split into. Defaults to None, then ‘width’ of each interval is computed from seqlengths.

  • width (Optional[int]) – Width of each interval. Defaults to None.

Raises:

ValueError – If both n and width are provided.

Return type:

GenomicRanges

Returns:

A new GenomicRanges with the tiled regions.

to_pandas()[source]

Convert this GenomicRanges object into a DataFrame.

Return type:

pandas.DataFrame

Returns:

A DataFrame object.

to_polars()[source]

Convert this GenomicRanges object into a DataFrame.

Return type:

polars.DataFrame

Returns:

A DataFrame object.

trim(in_place=False)[source]

Trim sequences outside of bounds for non-circular chromosomes.

Parameters:

in_place (bool) – Whether to modify the GenomicRanges object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRanges object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.

union(other)[source]

Find union of genomic intervals with other.

Parameters:

other (GenomicRanges) – The other GenomicRanges object.

Raises:

TypeError – If other is not a GenomicRanges.

Return type:

GenomicRanges

Returns:

A new GenomicRanges object with all ranges.

property width: ndarray

Alias for get_width.

class genomicranges.GenomicRanges.GenomicRangesIter(obj)[source]

Bases: object

An iterator to a GenomicRanges object.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (GenomicRanges) – Source object to iterate.

__iter__()[source]
__next__()[source]

genomicranges.GenomicRangesList module

class genomicranges.GenomicRangesList.GenomicRangesList(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]

Bases: object

Just as it sounds, a GenomicRangesList is a named-list like object.

If you are wondering why you need this class, a GenomicRanges object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub regions, e.g. exons. GenomicRangesList allows us to represent this nested structure.

Currently, this class is limited in functionality, purely a read-only class with basic accessors.

Typical usage:

To construct a GenomicRangesList object, simply pass in a list of genomicranges.GenomicRanges.GenomicRanges objects and Optionally names.

a = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr1",
        "chr3",
    ],
    ranges=IRanges(
        [
            1,
            3,
            2,
            4,
        ],
        [
            10,
            30,
            50,
            60,
        ],
    ),
    strand=[
        "-",
        "+",
        "*",
        "+",
    ],
    mcols=BiocFrame(
        {
            "score": [
                1,
                2,
                3,
                4,
            ]
        }
    ),
)

b = GenomicRanges(
    seqnames=[
        "chr2",
        "chr4",
        "chr5",
    ],
    ranges=IRanges(
        [3, 6, 4],
        [
            30,
            50,
            60,
        ],
    ),
    strand=[
        "-",
        "+",
        "*",
    ],
    mcols=BiocFrame(
        {
            "score": [
                2,
                3,
                4,
            ]
        }
    ),
)

grl = GenomicRangesList(
    ranges=[
        gr1,
        gr2,
    ],
    names=[
        "gene1",
        "gene2",
    ],
)

Additionally, you may also provide metadata about the genomic elements in the dictionary using mcols attribute.

__copy__()[source]
Returns:

A shallow copy of the current GenomicRangesList.

__deepcopy__(memo=None, _nil=[])[source]
Returns:

A deep copy of the current GenomicRangesList.

__getitem__(args)[source]

Subset individual genomic elements.

Parameters:

args (Union[str, int, tuple, list, slice]) –

Name of the genomic element to access.

Alternatively, if names of genomic elements are not available, you may provide an index position of the genomic element to access.

Alternatively, args may also specify a list of positions to slice specified either as a list or slice.

A tuple may also be specified along each dimension. Currently if the tuple contains more than one dimension, its ignored.

Raises:

TypeError – If args is not a supported slice argument.

Return type:

Union[GenomicRanges, GenomicRangesList]

Returns:

A new GenomicRangesList of the slice.

__init__(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]

Initialize a GenomicRangesList object.

Parameters:
__iter__()[source]

Iterator over rows.

Return type:

GenomicRangesListIter

__len__()[source]
Return type:

int

Returns:

Number of rows.

__repr__()[source]
Return type:

str

Returns:

A string representation of this GenomicRangesList.

as_genomic_ranges()[source]

Coerce object to a GenomicRanges.

Return type:

GenomicRanges

Returns:

A GenomicRanges object.

as_granges()[source]

Alias to to_genomic_ranges().

Return type:

GenomicRanges

copy()[source]

Alias for __copy__().

element_nrows()[source]

Get a vector of the length of each element.

Return type:

Dict[str, List[str]]

Returns:

An integer vector where each value corresponds to the length of the contained GenomicRanges object.

classmethod empty(n)[source]

Create an empty n-length GenomicRangesList object.

Returns:

same type as caller, in this case a GenomicRangesList.

property end: Dict[str, List[int]]

Get all end positions.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

classmethod from_dict(x)[source]

Create a GenomicRangesList object from dict.

Returns:

same type as caller, in this case a GenomicRangesList.

generic_accessor(prop, func=False, cast=False)[source]
Return type:

Union[Dict[str, list], GenomicRangesList]

get_mcols()[source]
Return type:

BiocFrame

Returns:

A ~py:class:~biocframe.BiocFrame.BiocFrame containing per-genomic element annotations.

get_metadata()[source]
Return type:

dict

Returns:

Dictionary of metadata for this object.

get_names()[source]
Return type:

Names

Returns:

A list of names for each genomic element.

get_range_lengths()[source]
Return type:

dict

Returns:

Number of ranges for each genomic element.

get_ranges()[source]
Return type:

Union[GenomicRanges, List[GenomicRanges]]

Returns:

List of genomic ranges.

groups(group)[source]

Get a genomic element by their name or index position.

Parameters:

group (Union[str, int]) – Name or index of the genomic element to access.

Return type:

GenomicRangesList

Returns:

The genomic element for group i.

property is_circular: Dict[str, List[int]]

Get the circularity flag.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

is_empty()[source]

Whether GRangesList has no elements or if all its elements are empty.

Return type:

bool

Returns:

True if the object has no elements.

property mcols: BiocFrame

Alias for get_mcols().

property metadata: dict

Alias for get_metadata.

property names: Names

Alias for get_names().

range()[source]

Calculate range bounds for each genomic element.

Return type:

GenomicRangesList

Returns:

A new GenomicRangesList object with the range bounds.

property range_lengths: dict

Alias for get_range_lengths.

property ranges: GenomicRanges

Alias for get_ranges().

property seq_info: Dict[str, List[int]]

Get information about the underlying sequences.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

property seqnames: Dict[str, List[str]]

Get all sequence names.

Returns:

A list with the same length as keys in the object, each element in the list contains another list of sequence names.

set_mcols(mcols, in_place=False)[source]

Set new range metadata.

Parameters:
  • mcols (Optional[BiocFrame]) – A ~py:class:~biocframe.BiocFrame.BiocFrame with length same as the number of genomic elements, containing per-genomic element annotations.

  • in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_metadata(metadata, in_place=False)[source]

Set additional metadata.

Parameters:
  • metadata (dict) – New metadata for this object.

  • in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_names(names, in_place=False)[source]

Set new names.

Parameters:
  • names (Optional[Sequence[str]]) – Names for each genomic element.

  • in_place (bool) – Whether to modify the GenomicRangesList object in place.

Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_ranges(ranges, in_place=False)[source]

Set new genomic ranges.

Parameters:
Return type:

GenomicRanges

Returns:

A modified GenomicRangesList object, either as a copy of the original or as a reference to the (in-place-modified) original.

property start: Dict[str, List[int]]

Get all start positions.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

property strand: Dict[str, List[int]]

Get strand of all regions across all elements.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

to_pandas()[source]

Coerce object to a DataFrame.

Return type:

pandas.DataFrame

Returns:

A DataFrame object.

property width: Dict[str, List[int]]

Get width of all regions across all elements.

Returns:

A list with the same length as keys in the object, each element in the list contains another list values.

class genomicranges.GenomicRangesList.GenomicRangesListIter(obj)[source]

Bases: object

An iterator to a GenomicRangesList object.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (GenomicRangesList) – source object to iterate.

__iter__()[source]
__next__()[source]

genomicranges.SeqInfo module

class genomicranges.SeqInfo.SeqInfo(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]

Bases: object

Information about the reference sequences, specifically the name and length of each sequence, whether it is a circular, and the identity of the genome from which it was derived.

__copy__()[source]
Returns:

A shallow copy of the current SeqInfo.

__deepcopy__(memo=None, _nil=[])[source]
Returns:

A deep copy of the current SeqInfo.

__getitem__(subset)[source]

Alias to get_subset.

Return type:

SeqInfo

__init__(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]
Parameters:
  • seqnames (Sequence[str]) – Names of all reference sequences, should be unique.

  • seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –

    Lengths of all sequences in base pairs. This should contain non-negative values and have the same number of elements as seqnames. Entries may also be None if no lengths are available for that sequence.

    Alternatively, a dictionary where keys are the sequence names and values are the lengths. If a name is missing from this dictionary, the length of the sequence is set to None.

    Alternatively a single integer, if all sequences are of the same length.

    Alternatively None, if no length information is available for any sequence.

  • is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –

    Whether each sequence is circular. This should have the same number of elements as seqnames. Entries may also be None if no information is available for that sequence.

    Alternatively, a dictionary where keys are the sequence names and values are the circular flags. If a name is missing from this dictionary, the flag for the sequence is set to None.

    Alternatively a single boolean, if all sequences have the same circular flag.

    Alternatively None, if no flags are available for any sequence.

  • genome (Union[str, Sequence[str], Dict[str, str], None]) –

    The genome build containing each reference sequence. This should have the same number of elements as seqnames. Entries may also be None if no information is available.

    Alternatively, a dictionary where keys are the sequence names and values are the genomes. If a name is missing from this dictionary, the genome is set to None.

    Alternatively a single string, if all sequences are derived from the same genome.

    Alternatively None, if no genome information is available for any sequence.

  • validate (bool) – Whether to validate the arguments, internal use only.

__iter__()[source]

Iterator over sequences.

Return type:

SeqInfoIterator

__len__()[source]
Return type:

int

Returns:

Number of sequences in this object.

__repr__()[source]
Return type:

str

Returns:

A string representation of this SeqInfo.

copy()[source]

Alias for __copy__().

classmethod empty()[source]

Create an zero-length SeqInfo object.

Returns:

same type as caller, in this case a SeqInfo.

property genome: List[str]
get_genome()[source]
Return type:

List[str]

Returns:

A list of strings is returned containing the genome identity for all sequences in get_seqnames().

get_is_circular()[source]
Return type:

List[bool]

Returns:

A list of booleans is returned specifying whether each sequence (from get_seqnames()) is circular.

get_seqlengths()[source]
Return type:

List[int]

Returns:

A list of integers is returned containing the lengths of all sequences, in the same order as the sequence names from get_seqnames().

get_seqnames()[source]
Return type:

List[str]

Returns:

List of all chromosome names.

get_subset(subset)[source]

Subset SeqInfo, based on their indices or seqnames.

Parameters:

subset (Union[str, int, bool, Sequence]) –

Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

Strings may only be used if :py:attr:~seqnames are available (see get_seqnames()). The first occurrence of each string in the seqnames is used for extraction.

Return type:

SeqInfo

Returns:

A new SeqInfo object with the sequences of interest.

property is_circular: List[bool]
property seqlengths: List[int]
property seqnames: List[str]
set_genome(genome, in_place=False)[source]
Parameters:
  • genome (Union[str, Sequence[str], Dict[str, str], None]) – List of genomes, of length equal to the number of names in this SeqInfo object. Values may be None or strings.

  • in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_is_circular(is_circular, in_place=False)[source]
Parameters:
  • is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –

    List of circular flags, of length equal to the number of names in this SeqInfo object. Values may be None or booleans.

    Alternatively, a dictionary where keys are the sequence names and values are the flags. Not all names need to be present in which case the flag is assumed to be None.

  • in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqlengths(seqlengths, in_place=False)[source]
Parameters:
  • seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –

    List of sequence lengths, of length equal to the number of names in this SeqInfo object. Values may be None or non-negative integers.

    Alternatively, a dictionary where keys are the sequence names and values are the lengths. Not all names need to be present in which case the length is assumed to be None.

  • in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_seqnames(seqnames, in_place=False)[source]
Parameters:
  • seqnames (Sequence[str]) – List of sequence names, of length equal to the number of names in this SeqInfo object. All names should be unique strings.

  • in_place (bool) – Whether to modify the SeqInfo object in place.

Return type:

SeqInfo

Returns:

A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.

class genomicranges.SeqInfo.SeqInfoIterator(obj)[source]

Bases: object

An iterator to a SeqInfo object.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (SeqInfo) – Source object to iterate.

__iter__()[source]
__next__()[source]
genomicranges.SeqInfo.merge_SeqInfo(objects)[source]

Merge multiple SeqInfo objects, taking the union of all reference sequences. If the same reference sequence is present with the same details across objects, only a single instance is present in the final object; if details are contradictory, they are replaced with None.

Parameters:

objects (List[SeqInfo]) – List of SeqInfo objects.

Return type:

SeqInfo

Returns:

A single merged SeqInfo object.

genomicranges.utils module

genomicranges.utils.create_np_vector(intervals, with_reverse_map=False, force_size=None, dont_sum=False, value=1)[source]

Represent intervals and calculate coverage.

Parameters:
  • intervals (List[Tuple[int, int]]) – Input interval vector.

  • with_reverse_map (bool) – Return map of indices? Defaults to False.

  • force_size (Optional[int]) – Force size of the array.

  • dont_sum (bool) – Do not sum. Defaults to False.

  • value (int) – Default value to increment. Defaults to 1.

Return type:

Tuple[ndarray, Optional[List]]

Returns:

A numpy array representing coverage from the intervals and optionally a reverse index map.

genomicranges.utils.group_by_indices(groups)[source]
Return type:

dict

genomicranges.utils.sanitize_strand_vector(strand)[source]

Create a numpy representation for strand.

Mapping: 1 for “+” (forward strand), 0 for “*” (any strand) and -1 for “-” (reverse strand).

Parameters:

strand (Union[Sequence[str], Sequence[int], ndarray]) – List of strand.

Raises:

ValueError – If strand is None. If strand contains values other than +,- and *. If strand is not a numpy vector, string of integers or strings.

Return type:

ndarray

Returns:

A numpy vector.

genomicranges.utils.slide_intervals(start, end, width, step)[source]

Sliding intervals.

Parameters:
  • start (int) – Start interval.

  • end (int) – End interval.

  • step (int) – Step of each interval.

  • width (int) – Width of each interval.

Return type:

List

Returns:

List of intervals split into bins.

genomicranges.utils.split_intervals(start, end, step)[source]

Split an interval range into equal bins.

Parameters:
  • start (int) – Start interval.

  • end (int) – End interval.

  • step (int) – Width or step of each interval.

Return type:

List

Returns:

List of intervals split into bins.

Module contents