genomicranges package¶
Subpackages¶
Submodules¶
genomicranges.GenomicRanges module¶
- class genomicranges.GenomicRanges.GenomicRanges(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶
Bases:
object
GenomicRanges
provides a container class to represent and operate over genomic regions and annotations. Note: The documentation for some of the methods is derived from the GenomicRanges R/Bioconductor package.
- __getitem__(subset)[source]¶
Alias to
get_subset
.- Return type:
- __init__(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶
Initialize a
GenomicRanges
object.- Parameters:
seqnames (
Sequence
[str
]) – List of sequence or chromosome names.ranges (
IRanges
) – Genomic positions and widths of each position. Must have the same length as seqnames
.strand (
Union
[Sequence
[str
],Sequence
[int
],ndarray
,None
]) –Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0.
May be provided as a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be mapped accordingly to 1, -1 or 0.
names (
Optional
[Sequence
[str
]]) – Names for each genomic range. Defaults to None, which means the ranges are unnamed.mcols (
Optional
[BiocFrame
]) – A BiocFrame with one row per genomic range, containing per-range annotations. Defaults to None, in which case an empty BiocFrame object is created.
seqinfo (
Optional
[SeqInfo
]) – Sequence information. Defaults to None, in which case aSeqInfo
object is created with the unique set of chromosome names fromseqnames
.metadata (
Optional
[dict
]) – Additional metadata. Defaults to None, and is assigned to an empty dictionary.validate (
bool
) – Internal use only.
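The strand encoding described in the parameters above can be illustrated in plain Python. This is a minimal sketch of the mapping only, not the library's implementation; the helper name `normalize_strand_sketch` is hypothetical.

```python
from typing import List, Optional, Sequence, Union

# Mapping taken from the parameter description: "+" -> 1, "-" -> -1, "*" -> 0.
STRAND_CODES = {"+": 1, "-": -1, "*": 0}


def normalize_strand_sketch(
    strand: Optional[Sequence[Union[str, int]]], n_ranges: int
) -> List[int]:
    """Hypothetical helper: convert strand input to a list of 0, 1 or -1."""
    if strand is None:
        # No strand given: every range is treated as "any strand" (0).
        return [0] * n_ranges
    if len(strand) != n_ranges:
        raise ValueError("strand must have one entry per range")
    codes = []
    for s in strand:
        if isinstance(s, str):
            if s not in STRAND_CODES:
                raise ValueError(f"unknown strand symbol: {s!r}")
            codes.append(STRAND_CODES[s])
        elif s in (0, 1, -1):
            codes.append(int(s))
        else:
            raise ValueError(f"unknown strand code: {s!r}")
    return codes
```

For example, `normalize_strand_sketch(["+", "-", "*"], 3)` yields `[1, -1, 0]`.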
- __setitem__(args, value)[source]¶
Alias to
set_subset
. This operation modifies the object in place.
- Return type:
- binned_average(scorename, bins, outname='binned_average', in_place=False)[source]¶
Calculate average for a column across all regions in
bins
, then set a column specified by ‘outname’ with those values.- Parameters:
scorename (
str
) – Score column to compute averages on.bins (
GenomicRanges
) – Bins you want to use.outname (
str
) – New column name to add to the object.in_place (
bool
) – Whether to modifybins
in place.
- Raises:
ValueError – If
scorename
column does not exist.scorename
is not all ints or floats.TypeError – If
bins
is not of type GenomicRanges.
- Return type:
- Returns:
A modified
bins
object with the computed averages, either as a copy of the original or as a reference to the (in-place-modified) original.
- copy()[source]¶
Alias for
__copy__()
.
- count_overlaps(query, query_type='any', max_gap=-1, min_overlap=1, ignore_strand=False)[source]¶
Count overlaps between subject (self) and a
query
GenomicRanges
object.- Parameters:
query (
GenomicRanges
) – Query GenomicRanges.query_type (
Literal
['any'
,'start'
,'end'
,'within'
]) –Overlap query type, must be one of
”any”: Any overlap is good
”start”: Overlap at the beginning of the intervals
”end”: Must overlap at the end of the intervals
”within”: Fully contain the query interval
Defaults to “any”.
max_gap (
int
) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).min_overlap (
int
) – Minimum overlap with query. Defaults to 1.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If
query
is not of type GenomicRanges.- Return type:
- Returns:
A list with the same length as
query
, containing number of overlapping indices.
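To make the "any" overlap type and the `min_overlap` parameter concrete, here is a plain-Python sketch. It is an illustration under stated assumptions, not the library's algorithm: intervals are `(start, end)` tuples with inclusive ends, all on one sequence, and strand, `max_gap` and the other query types are ignored. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive (assumption of this sketch)


def overlap_width(a: Interval, b: Interval) -> int:
    """Number of shared positions; zero or negative when the intervals are disjoint."""
    return min(a[1], b[1]) - max(a[0], b[0]) + 1


def count_any_overlaps_sketch(
    subject: List[Interval], query: List[Interval], min_overlap: int = 1
) -> List[int]:
    """For each query interval, count subject intervals sharing >= min_overlap positions."""
    return [
        sum(1 for s in subject if overlap_width(s, q) >= min_overlap)
        for q in query
    ]
```

The result has one count per query interval, matching the return shape described above.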
- coverage(shift=0, width=None, weight=1)[source]¶
Calculate coverage for each chromosome. For each position, counts the number of ranges that cover it.
- Parameters:
- Return type:
- Returns:
A dictionary with chromosome names as keys and the coverage vector as value.
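The idea of per-position coverage can be sketched without the library. The version below assumes 1-based start positions and ranges given as start plus width, and it does not model the `shift`, `width` or `weight` parameters; `coverage_sketch` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def coverage_sketch(
    seqnames: Sequence[str], starts: Sequence[int], widths: Sequence[int]
) -> Dict[str, List[int]]:
    """Per-chromosome coverage; index i of each vector holds position i + 1."""
    # The vector for each chromosome extends to the last covered position.
    lengths = defaultdict(int)
    for chrom, start, width in zip(seqnames, starts, widths):
        lengths[chrom] = max(lengths[chrom], start + width - 1)

    cov = {chrom: [0] * length for chrom, length in lengths.items()}
    for chrom, start, width in zip(seqnames, starts, widths):
        for pos in range(start, start + width):
            cov[chrom][pos - 1] += 1  # shift 1-based position to 0-based index
    return cov
```

A production implementation would typically use a difference array or run-length encoding rather than a dense list per chromosome.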
- disjoin(with_reverse_map=False, ignore_strand=False)[source]¶
Calculate disjoint genomic positions for each distinct (seqname, strand) pair.
- Parameters:
- Return type:
- Returns:
A new
GenomicRanges
containing disjoint ranges.
- distance(query)[source]¶
Compute the pair-wise distance with intervals in query.
- Parameters:
query (
Union
[GenomicRanges
,IRanges
]) – Query GenomicRanges or IRanges.- Return type:
- Returns:
Numpy vector containing distances for each interval in query.
- classmethod empty()[source]¶
Create a zero-length GenomicRanges object.
- Returns:
same type as caller, in this case a GenomicRanges.
- find_overlaps(query, query_type='any', select='all', max_gap=-1, min_overlap=1, ignore_strand=False)[source]¶
Find overlaps between subject (self) and a
query
GenomicRanges
object.- Parameters:
query (
GenomicRanges
) – Query GenomicRanges.query_type (
Literal
['any'
,'start'
,'end'
,'within'
]) –Overlap query type, must be one of
”any”: Any overlap is good
”start”: Overlap at the beginning of the intervals
”end”: Must overlap at the end of the intervals
”within”: Fully contain the query interval
Defaults to “any”.
select (
Literal
['all'
,'first'
,'last'
,'arbitrary'
]) – Determine what hit to choose when there are multiple hits for an interval insubject
.max_gap (
int
) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).min_overlap (
int
) – Minimum overlap with query. Defaults to 1.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If
query
is not of type GenomicRanges.- Return type:
- Returns:
A list with the same length as
query
, containing hits to overlapping indices.
- flank(width, start=True, both=False, ignore_strand=False, in_place=False)[source]¶
Compute flanking ranges for each range. The logic for this comes from the R/GenomicRanges & IRanges packages.
If start is True for a given range, the flanking occurs at the start, otherwise at the end. The widths of the flanks are given by the width parameter. width can be negative, in which case the flanking region is reversed so that it represents a prefix or suffix of the range.
Usage:
gr.flank(3, True), where “x” indicates a range in gr and “-” indicates the resulting flanking region:
---xxxxxxx
If start were False, the result becomes:
xxxxxxx---
For negative width, i.e. gr.flank(-3, False), where “*” indicates the overlap between “x” and the result:
xxxx***
If both is True, then, for all ranges in “x”, the flanking regions are extended into (or out of, if width is negative) the range, so that the result straddles the given endpoint and has twice the width given by width.
- Parameters:
width (
int
) – Width to flank by. May be negative.start (
bool
) – Whether to only flank starts. Defaults to True.both (
bool
) – Whether to flank both starts and ends. Defaults to False.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object with the flanked regions, either as a copy of the original or as a reference to the (in-place-modified) original.
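The flank arithmetic described above can be written out for a single unstranded range. This sketch follows the convention of R's IRanges `flank` with inclusive end coordinates, which is an assumption here; strand handling is omitted and the name `flank_sketch` is hypothetical.

```python
from typing import Tuple


def flank_sketch(
    start: int, end: int, width: int, at_start: bool = True, both: bool = False
) -> Tuple[int, int]:
    """Return the (start, end) of the flanking region, ends inclusive."""
    size = abs(width)
    if both:
        # The flank straddles the chosen endpoint and is twice as wide.
        new_start = (start - size) if at_start else (end - size + 1)
        return new_start, new_start + 2 * size - 1
    if at_start:
        # Positive width: region before the start. Negative: prefix of the range.
        new_start = start if width < 0 else start - width
    else:
        # Positive width: region after the end. Negative: suffix of the range.
        new_start = end + width + 1 if width < 0 else end + 1
    return new_start, new_start + size - 1
```

With the range 10 to 16, a width of 3 at the start gives 7 to 9, and a width of -3 at the end gives 14 to 16, which corresponds to the `xxxx***` picture above.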
- follow(query, select='all', ignore_strand=False)[source]¶
Search nearest positions only upstream that overlap with each range in
query
.- Parameters:
query (
GenomicRanges
) – QueryGenomicRanges
to find nearest positions.select (
Literal
['all'
,'arbitrary'
]) – Determine what hit to choose when there are multiple hits for an interval inquery
.ignore_strand (
bool
) – Whether to ignore strand. Defaults to False.
- Return type:
- Returns:
A List with the same length as
query
, containing hits to nearest indices.
- classmethod from_pandas(input)[source]¶
Create a
GenomicRanges
from a DataFrame
object.
- Parameters:
input (pandas.DataFrame) – Input data. Must contain the columns ‘seqnames’ and ‘starts’, plus either ‘widths’ or ‘ends’.
- Return type:
GenomicRanges
- Returns:
A
GenomicRanges
object.
- classmethod from_polars(input)[source]¶
Create a
GenomicRanges
from a DataFrame
object.
- Parameters:
input (polars.DataFrame) – Input polars DataFrame. Must contain the columns ‘seqnames’ and ‘starts’, plus either ‘widths’ or ‘ends’.
- Return type:
GenomicRanges
- Returns:
A
GenomicRanges
object.
- gaps(start=1, end=None, ignore_strand=False)[source]¶
Identify complemented ranges for each distinct (seqname, strand) pair.
- Parameters:
start (
int
) – Restrict chromosome start position. Defaults to 1.end (
Union
[int
,Dict
[str
,int
],None
]) – Restrict end position for each chromosome. Defaults to None. If None, extracts sequence information fromseqinfo
object if available.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
with complement ranges.
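Complementing a set of ranges amounts to collecting the uncovered stretches between `start` and `end`. The sketch below handles one (seqname, strand) group at a time and assumes inclusive end coordinates; it is illustrative only and `gaps_sketch` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def gaps_sketch(
    intervals: List[Tuple[int, int]], start: int = 1, end: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return uncovered (start, end) stretches, ends inclusive."""
    result = []
    cursor = start  # first position not yet known to be covered
    for s, e in sorted(intervals):
        if s > cursor:
            gap_end = s - 1 if end is None else min(s - 1, end)
            if cursor <= gap_end:
                result.append((cursor, gap_end))
        cursor = max(cursor, e + 1)
    # A trailing gap only exists when the sequence end is known.
    if end is not None and cursor <= end:
        result.append((cursor, end))
    return result
```

When `end` is None, no trailing gap is reported, since the sketch has no sequence length to fall back on.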
- get_end()[source]¶
Get all end positions.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the end positions for all ranges.
- get_mcols()[source]¶
- Return type:
- Returns:
A BiocFrame containing per-range annotations.
- get_seqinfo()[source]¶
- Return type:
- Returns:
A SeqInfo containing sequence information.
- get_start()[source]¶
Get all start positions.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the start positions for all ranges.
- get_strand(as_type='numpy')[source]¶
Access strand information.
- Parameters:
as_type (
Literal
['numpy'
,'factor'
,'list'
]) –
If numpy, strand is returned as integer codes in a NumPy vector.
If factor, a tuple of codes and levels is returned.
If list, the codes are mapped to levels and returned.
- Return type:
- Returns:
A numpy vector representing strand, 0 for any strand, -1 for reverse strand and 1 for forward strand.
A tuple of codes and levels.
A list of “+”, “-”, or “*” for each range.
- get_subset(subset)[source]¶
Subset
GenomicRanges
, based on their indices or names.- Parameters:
subset (
Union
[str
,int
,bool
,Sequence
]) –Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by
normalize_subscript()
. Scalars are treated as length-1 sequences.Strings may only be used if :py:attr:
~names
are available (seeget_names()
). The first occurrence of each string in the names is used for extraction.- Return type:
- Returns:
A new
GenomicRanges
object with the ranges of interest.
- get_width()[source]¶
Get width of each genomic range.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the width for all ranges.
- intersect(other)[source]¶
Find intersecting genomic intervals with other.
- Parameters:
other (
GenomicRanges
) – The otherGenomicRanges
object.- Raises:
TypeError – If
other
is not aGenomicRanges
.- Return type:
- Returns:
A new
GenomicRanges
object with intersecting ranges.
- intersect_ncls(other)[source]¶
Find intersecting genomic intervals with other (uses NCLS index).
- Parameters:
other (
GenomicRanges
) – The otherGenomicRanges
object.- Raises:
TypeError – If
other
is not aGenomicRanges
.- Return type:
- Returns:
A new
GenomicRanges
object with intersecting ranges.
- invert_strand(in_place=False)[source]¶
Invert strand for each range.
- Conversion map:
“+” becomes “-”
“-” becomes “+”
“*” stays the same
- Parameters:
in_place (
bool
) – Whether to modify the object in place. Defaults to False.- Return type:
- Returns:
A modified
GenomicRanges
object with the inverted strands, either as a copy of the original or as a reference to the (in-place-modified) original.
- match(query)[source]¶
Element wise comparison to find exact match ranges.
- Parameters:
query (
GenomicRanges
) – QueryGenomicRanges
to search for matches.- Raises:
TypeError – If
query
is not of typeGenomicRanges
.- Return type:
- Returns:
A List with the same length as
query
, containing hits to matching indices.
- property mcols: BiocFrame¶
Alias for
get_mcols()
.
- property metadata: dict¶
Alias for
get_metadata
.
- property names: Names¶
Alias for
get_names()
.
- narrow(start=None, width=None, end=None, in_place=False)[source]¶
Narrow genomic positions by the provided start, width and end parameters.
Important: these parameters are relative shifts in position for each range.
- Parameters:
start (
Union
[int
,List
[int
],ndarray
,None
]) – Relative start position. Defaults to None.width (
Union
[int
,List
[int
],ndarray
,None
]) – Relative width of the interval. Defaults to None.end (
Union
[int
,List
[int
],ndarray
,None
]) – Relative end position. Defaults to None.in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- nearest(query, select='all', ignore_strand=False)[source]¶
Search nearest positions both upstream and downstream that overlap with each range in
query
.- Parameters:
query (
GenomicRanges
) – QueryGenomicRanges
to find nearest positions.select (
Literal
['all'
,'arbitrary'
]) – Determine what hit to choose when there are multiple hits for an interval inquery
.ignore_strand (
bool
) – Whether to ignore strand. Defaults to False.
- Return type:
- Returns:
A list with the same length as
query
, containing hits to nearest indices.
- order(decreasing=False)[source]¶
Get the order of indices for sorting.
Orders the genomic ranges by chromosome and strand. Strands are ordered as reverse strand (-1), any strand (0), then forward strand (1). Ties are broken by start position, and then by width if two regions have the same start.
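The ordering rule can be expressed as a tuple sort key. This sketch assumes chromosomes are compared lexicographically by name, whereas the library may well use the order recorded in `seqinfo`; `order_sketch` is a hypothetical name.

```python
from typing import List, Sequence


def order_sketch(
    seqnames: Sequence[str],
    strands: Sequence[int],
    starts: Sequence[int],
    widths: Sequence[int],
    decreasing: bool = False,
) -> List[int]:
    """Return indices that sort ranges by (chromosome, strand, start, width)."""
    # Integer strand codes sort naturally as -1 (reverse), 0 (any), 1 (forward).
    keys = list(zip(seqnames, strands, starts, widths))
    return sorted(range(len(keys)), key=keys.__getitem__, reverse=decreasing)
```

Indexing the original ranges with the returned list gives them in sorted order.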
- precede(query, select='all', ignore_strand=False)[source]¶
Search nearest positions only downstream that overlap with each range in
query
.- Parameters:
query (
GenomicRanges
) – QueryGenomicRanges
to find nearest positions.select (
Literal
['all'
,'arbitrary'
]) – Determine what hit to choose when there are multiple hits for an interval inquery
.ignore_strand (
bool
) – Whether to ignore strand. Defaults to False.
- Return type:
- Returns:
A List with the same length as
query
, containing hits to nearest indices.
- promoters(upstream=2000, downstream=200, in_place=False)[source]¶
Extend intervals to promoter regions.
Generates promoter ranges relative to the transcription start site (TSS), where TSS is start(x). The promoter range is expanded around the TSS according to the upstream and downstream arguments. Upstream represents the number of nucleotides in the 5’ direction and downstream the number in the 3’ direction. The full range is defined as, (start(x) - upstream) to (start(x) + downstream - 1).
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object with the promoter regions, either as a copy of the original or as a reference to the (in-place-modified) original.
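The promoter formula quoted above is simple enough to check by hand. The sketch below applies it to a single TSS taken as the range start, exactly as the description states; how reverse-strand ranges are treated is not covered here, and `promoter_range_sketch` is a hypothetical name.

```python
from typing import Tuple


def promoter_range_sketch(
    tss: int, upstream: int = 2000, downstream: int = 200
) -> Tuple[int, int]:
    """Return (start, end) computed as (tss - upstream, tss + downstream - 1)."""
    if upstream < 0 or downstream < 0:
        raise ValueError("upstream and downstream must be non-negative")
    return tss - upstream, tss + downstream - 1
```

With the defaults, a TSS at position 5000 gives a promoter spanning 3000 to 5199, a total of `upstream + downstream` positions.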
- range(with_reverse_map=False, ignore_strand=False)[source]¶
Calculate range bounds for each distinct (seqname, strand) pair.
- Parameters:
- Return type:
- Returns:
A new
GenomicRanges
object with the range bounds.
- property ranges: IRanges¶
Alias for
get_ranges()
.
- rank()[source]¶
Get rank of the
GenomicRanges
object. For each range, identifies its position in the sorted order.
- reduce(with_reverse_map=False, drop_empty_ranges=False, min_gap_width=1, ignore_strand=False)[source]¶
Reduce orders the ranges, then merges overlapping or adjacent ranges.
- Parameters:
with_reverse_map (
bool
) – Whether to return map of indices back to original object. Defaults to False.drop_empty_ranges (
bool
) – Whether to drop empty ranges. Defaults to False.min_gap_width (
int
) – Ranges separated by a gap of at least min_gap_width
positions are not merged. Defaults to 1.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with reduced intervals.
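The merge rule, including the effect of `min_gap_width`, can be sketched for one (seqname, strand) group. Intervals are assumed to be `(start, end)` tuples with inclusive ends, so adjacent ranges have a gap of zero; the reverse map and empty-range handling are not modelled, and `reduce_sketch` is a hypothetical name.

```python
from typing import List, Tuple


def reduce_sketch(
    intervals: List[Tuple[int, int]], min_gap_width: int = 1
) -> List[Tuple[int, int]]:
    """Sort intervals, then merge those separated by a gap < min_gap_width."""
    merged: List[Tuple[int, int]] = []
    for s, e in sorted(intervals):
        # Gap between this interval and the last merged one (negative if overlapping).
        if merged and s - merged[-1][1] - 1 < min_gap_width:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

With the default of 1, overlapping and directly adjacent ranges merge; a value of 0 merges only ranges that actually overlap.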
- resize(width, fix='start', ignore_strand=False, in_place=False)[source]¶
Resize ranges to the specified
width
where either thestart
,end
, orcenter
is used as an anchor.- Parameters:
width (
Union
[int
,List
[int
],ndarray
]) – Width to resize, cannot be negative!fix (
Literal
['start'
,'end'
,'center'
]) – Fix positions by “start”, “end”, or “center”. Defaults to “start”.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Raises:
ValueError – If parameter
fix
is neither start, end, nor center. Ifwidth
is negative.- Return type:
- Returns:
A modified
GenomicRanges
object with the resized regions, either as a copy of the original or as a reference to the (in-place-modified) original.
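The three anchoring modes can be compared on a single unstranded range. This sketch assumes inclusive end coordinates and, for "center", rounds the offset down as R's IRanges does; whether the Python library rounds the same way is an assumption. `resize_sketch` is a hypothetical name.

```python
from typing import Tuple


def resize_sketch(start: int, end: int, width: int, fix: str = "start") -> Tuple[int, int]:
    """Return the resized (start, end), ends inclusive."""
    if width < 0:
        raise ValueError("width cannot be negative")
    if fix == "start":
        new_start = start
    elif fix == "end":
        new_start = end - width + 1
    elif fix == "center":
        old_width = end - start + 1
        new_start = start + (old_width - width) // 2
    else:
        raise ValueError("fix must be 'start', 'end' or 'center'")
    return new_start, new_start + width - 1
```

When strand is taken into account, the anchor for reverse-strand ranges would presumably flip, which this sketch does not attempt to show.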
- restrict(start=None, end=None, keep_all_ranges=False, in_place=False)[source]¶
Restrict ranges to a given start and end positions.
- Parameters:
start (
Union
[int
,List
[int
],ndarray
,None
]) – Start position. Defaults to None.end (
Union
[int
,List
[int
],ndarray
,None
]) – End position. Defaults to None.keep_all_ranges (
bool
) – Whether to keep intervals that do not overlap with start and end. Defaults to False.in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object with the restricted regions, either as a copy of the original or as a reference to the (in-place-modified) original.
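Restricting is essentially clamping each range to a window. The sketch below models only the default behaviour, in which ranges that fall entirely outside the window are dropped; `keep_all_ranges` is not modelled. Inclusive ends are assumed and `restrict_sketch` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def restrict_sketch(
    intervals: List[Tuple[int, int]],
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Clamp each (start, end) interval to the window [start, end]."""
    result = []
    for s, e in intervals:
        new_s = s if start is None else max(s, start)
        new_e = e if end is None else min(e, end)
        if new_s <= new_e:  # keep only intervals that still cover a position
            result.append((new_s, new_e))
    return result
```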
- sample(k=5)[source]¶
Randomly sample
k
intervals.- Parameters:
k (
int
) – Number of intervals to sample. Defaults to 5.- Return type:
- Returns:
A new
GenomicRanges
with randomly sampled ranges.
- property seqinfo: ndarray¶
Alias for
get_seqinfo()
.
- property seqnames: ndarray | List[str]¶
Alias for
get_seqnames()
.
- set_mcols(mcols, in_place=False)[source]¶
Set new range metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_metadata(metadata, in_place=False)[source]¶
Set additional metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_names(names, in_place=False)[source]¶
Set new names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_ranges(ranges, in_place=False)[source]¶
Set new ranges.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqinfo(seqinfo, in_place=False)[source]¶
Set new sequence information.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqnames(seqnames, in_place=False)[source]¶
Set new sequence names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_strand(strand, in_place=False)[source]¶
Set new strand information.
- Parameters:
strand (
Union
[Sequence
[str
],Sequence
[int
],ndarray
,None
]) –Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0.
May be provided as a list of strings representing the strand; “+” for forward strand, “-” for reverse strand, or “*” for any strand and will be mapped accordingly to 1, -1 or 0.
in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_subset(args, value, in_place=False)[source]¶
Add or update positions (in-place operation).
- Parameters:
args (
Union
[Sequence
,int
,str
,bool
,slice
,range
]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be replaced, seenormalize_subscript()
.value (
GenomicRanges
) – A GenomicRanges
object of length equal to the number of ranges to be replaced, as specified by args
.in_place (
bool
) – Whether to modify theGenomicRanges
object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- setdiff(other)[source]¶
Find set difference of genomic intervals with other.
- Parameters:
other (
GenomicRanges
) – The otherGenomicRanges
object.- Raises:
TypeError – If
other
is not of typeGenomicRanges
.- Return type:
- Returns:
A new
GenomicRanges
object with the diff ranges.
- shift(shift=0, in_place=False)[source]¶
Shift all intervals.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object with the shifted regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- sliding_windows(width, step=1)[source]¶
Slide along each range by width (intervals with equal width) and step.
Also, check out tile_genome() for splitting a genome into chunks, or tile_by_range().
- Parameters:
n – Number of intervals to split into. Defaults to None.
width (
int
) – Width of each interval. Defaults to None.
- Return type:
- Returns:
A new
GenomicRanges
with the sliding ranges.
- sort(decreasing=False, in_place=False)[source]¶
Sort the GenomicRanges object.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object with the sorted regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- split(groups)[source]¶
Split the GenomicRanges object into a
GenomicRangesList
.- Parameters:
groups (
list
) –A list specifying the groups or factors to split by.
Must have the same length as the number of genomic elements in the object.
- Return type:
GenomicRangesList
- Returns:
A GenomicRangesList containing the groups and their corresponding elements.
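The grouping step behind `split` can be shown with ordinary Python containers. Here a plain dictionary stands in for the resulting GenomicRangesList and arbitrary objects stand in for ranges; `split_by_groups_sketch` is a hypothetical name.

```python
from typing import Any, Dict, Hashable, List, Sequence


def split_by_groups_sketch(
    elements: Sequence[Any], groups: Sequence[Hashable]
) -> Dict[Hashable, List[Any]]:
    """Collect elements under their group label, keeping first-seen group order."""
    if len(elements) != len(groups):
        raise ValueError("groups must have one entry per element")
    out: Dict[Hashable, List[Any]] = {}
    for element, group in zip(elements, groups):
        out.setdefault(group, []).append(element)
    return out
```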
- property strand: ndarray¶
Alias for
get_strand()
.
- subset_by_overlaps(query, query_type='any', max_gap=-1, min_overlap=1, ignore_strand=False)[source]¶
Subset
subject
(self) with overlaps inquery
GenomicRanges object.- Parameters:
query (
GenomicRanges
) – Query GenomicRanges.query_type (
Literal
['any'
,'start'
,'end'
,'within'
]) –Overlap query type, must be one of
”any”: Any overlap is good
”start”: Overlap at the beginning of the intervals
”end”: Must overlap at the end of the intervals
”within”: Fully contain the query interval
Defaults to “any”.
max_gap (
int
) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).min_overlap (
int
) – Minimum overlap with query. Defaults to 1.ignore_strand (
bool
) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If
query
is not of type GenomicRanges.- Return type:
- Returns:
A
GenomicRanges
object containing overlapping ranges.
- subtract(x, min_overlap=1, ignore_strand=False)[source]¶
Subtract searches for features in
x
that overlapself
by at least the number of base pairs given bymin_overlap
.- Parameters:
- Return type:
GenomicRangesList
- Returns:
GenomicRangesList with the same size as
self
containing the subtracted regions.
- tile(n=None, width=None)[source]¶
Split each interval by n (number of sub intervals) or width (intervals with equal width).
Note: Either n or width must be provided, but not both.
Also, check out tile_genome() for splitting a genome into chunks, or tile_by_range().
- Parameters:
- Raises:
ValueError – If both
n
andwidth
are provided.- Return type:
- Returns:
A new
GenomicRanges
with the split ranges.
- tile_by_range(n=None, width=None)[source]¶
Split each sequence length into chunks by n (number of intervals) or width (intervals with equal width).
Note: Either n or width must be provided, but not both.
Also, check out tile_genome() for splitting the genome into chunks.
- Parameters:
- Raises:
ValueError – If both n and width are provided.
- Return type:
- Returns:
A new
GenomicRanges
with the split ranges.
- classmethod tile_genome(seqlengths, n=None, width=None)[source]¶
Create a new GenomicRanges by partitioning a specified genome.
If n is provided, the region is split into n intervals. The last interval may not have the same width as the other regions.
Alternatively, width may be provided for each interval. Similarly, the last region may be narrower than width.
Either n or width must be provided, but not both.
- Parameters:
seqlengths (
Union
[Dict
,SeqInfo
]) –Sequence lengths of each chromosome.
seqlengths
may be a dictionary, where keys specify the chromosome and the value is the length of each chromosome in the genome.Alternatively,
seqlengths
may be an instance ofSeqInfo
.n (
Optional
[int
]) – Number of intervals to split into. Defaults to None, then ‘width’ of each interval is computed fromseqlengths
.width (
Optional
[int
]) – Width of each interval. Defaults to None.
- Raises:
ValueError – If both
n
andwidth
are provided.- Return type:
- Returns:
A new
GenomicRanges
with the tiled regions.
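The tiling behaviour, including the shorter final tile, can be sketched with a dictionary of sequence lengths. Two assumptions are made here that the text does not confirm: `n` is applied to each chromosome separately, and tiles use 1-based inclusive coordinates. `tile_genome_sketch` is a hypothetical name.

```python
import math
from typing import Dict, List, Optional, Tuple


def tile_genome_sketch(
    seqlengths: Dict[str, int], n: Optional[int] = None, width: Optional[int] = None
) -> List[Tuple[str, int, int]]:
    """Return (chromosome, start, end) tiles covering each chromosome."""
    if (n is None) == (width is None):
        raise ValueError("provide exactly one of n or width")
    tiles = []
    for chrom, length in seqlengths.items():
        if length <= 0:
            continue  # nothing to tile
        # With n, derive a tile width that covers the chromosome in at most n tiles.
        tile_width = width if width is not None else math.ceil(length / n)
        for start in range(1, length + 1, tile_width):
            # The last tile is clipped to the chromosome length.
            tiles.append((chrom, start, min(start + tile_width - 1, length)))
    return tiles
```

Because the width is rounded up, this sketch can produce fewer than `n` tiles for short sequences.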
- to_pandas()[source]¶
Convert this
GenomicRanges
object into aDataFrame
.- Return type:
pandas.DataFrame
- Returns:
A
DataFrame
object.
- to_polars()[source]¶
Convert this
GenomicRanges
object into aDataFrame
.- Return type:
polars.DataFrame
- Returns:
A
DataFrame
object.
- trim(in_place=False)[source]¶
Trim sequences outside of bounds for non-circular chromosomes.
- Parameters:
in_place (
bool
) – Whether to modify theGenomicRanges
object in place.- Return type:
- Returns:
A modified
GenomicRanges
object with the trimmed regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- union(other)[source]¶
Find union of genomic intervals with other.
- Parameters:
other (
GenomicRanges
) – The otherGenomicRanges
object.- Raises:
TypeError – If
other
is not aGenomicRanges
.- Return type:
- Returns:
A new
GenomicRanges
object with all ranges.
genomicranges.GenomicRangesList module¶
- class genomicranges.GenomicRangesList.GenomicRangesList(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶
Bases:
object
Just as it sounds, a GenomicRangesList is a named, list-like object.
If you are wondering why you need this class, a GenomicRanges object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub regions, e.g. exons. GenomicRangesList allows us to represent this nested structure.
Currently, this class is limited in functionality, purely a read-only class with basic accessors.
Typical usage:
To construct a GenomicRangesList object, simply pass in a list of
genomicranges.GenomicRanges.GenomicRanges
objects and, optionally, names.

    from biocframe import BiocFrame
    from genomicranges import GenomicRanges, GenomicRangesList
    from iranges import IRanges

    # First genomic element: four ranges with a per-range score.
    gr1 = GenomicRanges(
        seqnames=["chr1", "chr2", "chr1", "chr3"],
        ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
        strand=["-", "+", "*", "+"],
        mcols=BiocFrame({"score": [1, 2, 3, 4]}),
    )

    # Second genomic element: three ranges.
    gr2 = GenomicRanges(
        seqnames=["chr2", "chr4", "chr5"],
        ranges=IRanges([3, 6, 4], [30, 50, 60]),
        strand=["-", "+", "*"],
        mcols=BiocFrame({"score": [2, 3, 4]}),
    )

    # Combine both elements into a named list.
    grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])

Additionally, you may also provide metadata about the genomic elements in the dictionary using mcols attribute.
- __getitem__(args)[source]¶
Subset individual genomic elements.
- Parameters:
args (
Union
[str
,int
,tuple
,list
,slice
]) –Name of the genomic element to access.
Alternatively, if names of genomic elements are not available, you may provide an index position of the genomic element to access.
Alternatively,
args
may also specify a list of positions to slice specified either as alist
orslice
. A tuple may also be specified along each dimension. Currently, if the tuple contains more than one dimension, it is ignored.
- Raises:
TypeError – If
args
is not a supported slice argument.- Return type:
- Returns:
A new
GenomicRangesList
of the slice.
- __init__(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶
Initialize a GenomicRangesList object.
- Parameters:
ranges (
Union
[GenomicRanges
,List
[GenomicRanges
]]) – List of genomic elements. All elements in this list must begenomicranges.GenomicRanges.GenomicRanges
objects.range_lengths (
Optional
[Sequence
[int
]]) – Number of ranges within each genomic element. Defaults to None, and is inferred fromranges
.names (
Optional
[List
[str
]]) – Names of the genomic elements. The length of this must match the number of genomic elements inranges
. Defaults to None.mcols (
Optional
[BiocFrame
]) – Metadata about each genomic element. Defaults to None.metadata (
Optional
[dict
]) – Additional metadata. Defaults to None.validate (
bool
) – Internal use only.
- as_genomic_ranges()[source]¶
Coerce object to a
GenomicRanges
.- Return type:
- Returns:
A
GenomicRanges
object.
- copy()[source]¶
Alias for
__copy__()
.
- classmethod empty(n)[source]¶
Create an empty
n
-length GenomicRangesList object.- Returns:
same type as caller, in this case a GenomicRangesList.
- property end: Dict[str, List[int]]¶
Get all end positions.
- Returns:
A list with the same length as the keys in the object; each element contains another list of values.
- classmethod from_dict(x)[source]¶
Create a GenomicRangesList object from
dict
.- Returns:
same type as caller, in this case a GenomicRangesList.
- get_mcols()[source]¶
- Return type:
- Returns:
A BiocFrame containing per-genomic element annotations.
- groups(group)[source]¶
Get a genomic element by their name or index position.
- property is_circular: Dict[str, List[int]]¶
Get the circularity flag.
- Returns:
A list with the same length as the keys in the object; each element contains another list of values.
- is_empty()[source]¶
Whether
GenomicRangesList
has no elements or if all its elements are empty.- Return type:
- Returns:
True if the object has no elements.
- property mcols: BiocFrame¶
Alias for
get_mcols()
.
- property metadata: dict¶
Alias for
get_metadata
.
- property names: Names¶
Alias for
get_names()
.
- range()[source]¶
Calculate range bounds for each genomic element.
- Return type:
- Returns:
A new
GenomicRangesList
object with the range bounds.
- property range_lengths: dict¶
Alias for
get_range_lengths
.
- property ranges: GenomicRanges¶
Alias for
get_ranges()
.
- property seq_info: Dict[str, List[int]]¶
Get information about the underlying sequences.
- Returns:
A list with the same length as the keys in the object; each element contains another list of values.
- property seqnames: Dict[str, List[str]]¶
Get all sequence names.
- Returns:
A list with the same length as the keys in the object; each element contains another list of sequence names.
- set_mcols(mcols, in_place=False)[source]¶
Set new range metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_metadata(metadata, in_place=False)[source]¶
Set additional metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_names(names, in_place=False)[source]¶
Set new names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_ranges(ranges, in_place=False)[source]¶
Set new genomic ranges.
- Parameters:
ranges (
Union
[GenomicRanges
,List
[GenomicRanges
]]) – List of genomic elements. All elements in this list must begenomicranges.GenomicRanges.GenomicRanges
objects.in_place (
bool
) – Whether to modify theGenomicRangesList
object in place.
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- property start: Dict[str, List[int]]¶
Get all start positions.
- Returns:
A list with the same length as the keys in the object; each element contains another list of values.
- property strand: Dict[str, List[int]]¶
Get strand of all regions across all elements.
- Returns:
A list with the same length as the keys in the object; each element is itself a list of values.
genomicranges.SeqInfo module¶
- class genomicranges.SeqInfo.SeqInfo(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶
Bases: object
Information about the reference sequences, specifically the name and length of each sequence, whether it is circular, and the identity of the genome from which it was derived.
- __getitem__(subset)[source]¶
Alias to get_subset.
- Return type:
- __init__(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶
- Parameters:
seqnames (Sequence[str]) – Names of all reference sequences, should be unique.
seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –
Lengths of all sequences in base pairs. This should contain non-negative values and have the same number of elements as seqnames. Entries may also be None if no lengths are available for that sequence.
Alternatively, a dictionary where keys are the sequence names and values are the lengths. If a name is missing from this dictionary, the length of the sequence is set to None.
Alternatively, a single integer, if all sequences are of the same length.
Alternatively, None, if no length information is available for any sequence.
is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –
Whether each sequence is circular. This should have the same number of elements as seqnames. Entries may also be None if no information is available for that sequence.
Alternatively, a dictionary where keys are the sequence names and values are the circular flags. If a name is missing from this dictionary, the flag for the sequence is set to None.
Alternatively, a single boolean, if all sequences have the same circular flag.
Alternatively, None, if no flags are available for any sequence.
genome (Union[str, Sequence[str], Dict[str, str], None]) –
The genome build containing each reference sequence. This should have the same number of elements as seqnames. Entries may also be None if no information is available.
Alternatively, a dictionary where keys are the sequence names and values are the genomes. If a name is missing from this dictionary, the genome is set to None.
Alternatively, a single string, if all sequences are derived from the same genome.
Alternatively, None, if no genome information is available for any sequence.
validate (bool) – Whether to validate the arguments, internal use only.
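The four accepted forms of seqlengths, is_circular and genome (a scalar, a sequence, a dictionary, or None) all reduce to one entry per sequence name. The following is a minimal sketch of that reduction, assuming a hypothetical helper named normalize_field; it is not the library's own code and only mirrors the rules described above.

```python
def normalize_field(seqnames, value):
    """Hypothetical helper: expand `value` to one entry per sequence name."""
    if value is None:
        # No information for any sequence.
        return [None] * len(seqnames)
    if isinstance(value, dict):
        # Names missing from the dictionary become None.
        return [value.get(name) for name in seqnames]
    if isinstance(value, (int, bool, str)):
        # A single scalar applies to every sequence.
        return [value] * len(seqnames)
    value = list(value)
    if len(value) != len(seqnames):
        raise ValueError("expected one entry per sequence name")
    return value


seqnames = ["chr1", "chr2", "chrM"]
lengths = normalize_field(seqnames, {"chr1": 1000, "chrM": 16})
circular = normalize_field(seqnames, False)
genome = normalize_field(seqnames, None)
```

Here lengths has a None entry for chr2 because that name is absent from the dictionary, while circular repeats the single flag for every sequence.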
- copy()[source]¶
Alias for __copy__().
- classmethod empty()[source]¶
Create a zero-length SeqInfo object.
- Returns:
An object of the same type as the caller, in this case a SeqInfo.
- get_genome()[source]¶
- Return type:
- Returns:
A list of strings containing the genome identity for all sequences in get_seqnames().
- get_is_circular()[source]¶
- Return type:
- Returns:
A list of booleans specifying whether each sequence (from get_seqnames()) is circular.
- get_seqlengths()[source]¶
- Return type:
- Returns:
A list of integers containing the lengths of all sequences, in the same order as the sequence names from get_seqnames().
- get_subset(subset)[source]¶
Subset SeqInfo, based on indices or seqnames.
- Parameters:
subset (Union[str, int, bool, Sequence]) –
Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.
Strings may only be used if seqnames are available (see get_seqnames()). The first occurrence of each string in the seqnames is used for extraction.
- Return type:
- Returns:
A new SeqInfo object with the sequences of interest.
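The subsetting rules described for get_subset (scalars become length-1 sequences, strings resolve to the first matching name) can be illustrated with a small sketch. The function resolve_subset below is hypothetical and written only for this illustration; the library delegates this work to normalize_subscript(), whose exact behavior may differ.

```python
def resolve_subset(seqnames, subset):
    """Hypothetical helper: turn `subset` into a list of integer positions."""
    # Scalars are treated as length-1 sequences.
    if isinstance(subset, (str, int)):
        subset = [subset]
    subset = list(subset)

    # A sequence made only of booleans acts as a mask.
    if subset and all(isinstance(x, bool) for x in subset):
        return [i for i, keep in enumerate(subset) if keep]

    # Strings map to the first occurrence of that name; integers pass through.
    return [seqnames.index(x) if isinstance(x, str) else x for x in subset]


names = ["chr1", "chr2", "chr3"]
by_name = resolve_subset(names, "chr2")
mixed = resolve_subset(names, [2, "chr1"])
masked = resolve_subset(names, [True, False, True])
```

The resulting positions could then be used to pick the matching names, lengths, circular flags and genomes for the new object.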
- set_genome(genome, in_place=False)[source]¶
- Parameters:
- Return type:
- Returns:
A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_is_circular(is_circular, in_place=False)[source]¶
- Parameters:
is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) –
List of circular flags, of length equal to the number of names in this SeqInfo object. Values may be None or booleans.
Alternatively, a dictionary where keys are the sequence names and values are the flags. Not all names need to be present; the flag for any missing name is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.
- Return type:
- Returns:
A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqlengths(seqlengths, in_place=False)[source]¶
- Parameters:
seqlengths (Union[int, Sequence[int], Dict[str, int], None]) –
List of sequence lengths, of length equal to the number of names in this SeqInfo object. Values may be None or non-negative integers.
Alternatively, a dictionary where keys are the sequence names and values are the lengths. Not all names need to be present; the length for any missing name is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.
- Return type:
- Returns:
A modified SeqInfo object, either as a copy of the original or as a reference to the (in-place-modified) original.
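The setters above share one convention: with in_place=False they return a modified copy and leave the original untouched, and with in_place=True they modify and return the original. A minimal sketch of that convention follows, using a hypothetical stand-in class TinySeqInfo rather than the real SeqInfo.

```python
import copy


class TinySeqInfo:
    """Hypothetical stand-in used only to illustrate the in_place convention."""

    def __init__(self, seqnames, seqlengths=None):
        self.seqnames = list(seqnames)
        if seqlengths is None:
            seqlengths = [None] * len(self.seqnames)
        self.seqlengths = list(seqlengths)

    def set_seqlengths(self, seqlengths, in_place=False):
        # Work on the object itself, or on a shallow copy of it.
        target = self if in_place else copy.copy(self)
        if isinstance(seqlengths, dict):
            # Names missing from the dictionary get a length of None.
            target.seqlengths = [seqlengths.get(n) for n in target.seqnames]
        else:
            target.seqlengths = list(seqlengths)
        return target


info = TinySeqInfo(["chr1", "chr2"], [10, 20])
updated = info.set_seqlengths({"chr2": 99})          # copy; `info` unchanged
same = info.set_seqlengths([1, 2], in_place=True)    # modifies `info`
```

Returning the object in both cases lets calls be chained regardless of which mode is used.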
- class genomicranges.SeqInfo.SeqInfoIterator(obj)[source]¶
Bases: object
An iterator over a SeqInfo object.
- genomicranges.SeqInfo.merge_SeqInfo(objects)[source]¶
Merge multiple SeqInfo objects, taking the union of all reference sequences. If the same reference sequence is present with the same details across objects, only a single instance is present in the final object; if details are contradictory, they are replaced with None.
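The merge rule can be sketched with plain dictionaries. The example below assumes each object is represented as a mapping from sequence name to a (length, is_circular, genome) tuple, and that contradictions are resolved per field; both are assumptions made for illustration, and merge_records is a hypothetical name, not the library function.

```python
def merge_records(objects):
    """Hypothetical sketch: union of sequences, conflicting fields become None."""
    merged = {}
    for obj in objects:
        for name, details in obj.items():
            if name not in merged:
                # First time this sequence is seen: keep its details.
                merged[name] = list(details)
            else:
                # Keep a field only if both sides agree on it.
                merged[name] = [
                    old if old == new else None
                    for old, new in zip(merged[name], details)
                ]
    return merged


a = {"chr1": (1000, False, "hg38"), "chr2": (500, False, "hg38")}
b = {"chr2": (500, False, "hg19"), "chr3": (50, True, "hg19")}
merged = merge_records([a, b])
```

In this sketch chr2 keeps its length and circular flag, which agree in both inputs, while its genome becomes None because the two inputs disagree.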
genomicranges.utils module¶
- genomicranges.utils.create_np_vector(intervals, with_reverse_map=False, force_size=None, dont_sum=False, value=1)[source]¶
Represent intervals and calculate coverage.
- Parameters:
- Return type:
- Returns:
A numpy array representing coverage from the intervals and optionally a reverse index map.
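As a rough illustration of computing coverage from intervals, the sketch below counts how many intervals overlap each position. It assumes intervals are 0-based, half-open (start, end) pairs and returns a plain list rather than a NumPy array; the name coverage_vector is hypothetical, and the options of create_np_vector (with_reverse_map, force_size, dont_sum, value) are not modeled here.

```python
def coverage_vector(intervals, size=None):
    """Hypothetical sketch: per-position count of overlapping intervals."""
    if size is None:
        size = max((end for _, end in intervals), default=0)

    # Difference array: +1 where an interval starts, -1 where it ends.
    diff = [0] * (size + 1)
    for start, end in intervals:
        diff[min(start, size)] += 1
        diff[min(end, size)] -= 1

    # A running sum of the differences gives the coverage at each position.
    coverage, running = [], 0
    for delta in diff[:size]:
        running += delta
        coverage.append(running)
    return coverage


cov = coverage_vector([(0, 3), (2, 5)])
clipped = coverage_vector([(0, 3), (2, 5)], size=4)
```

Position 2 is covered by both intervals, so it is the only position with a count of 2.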
- genomicranges.utils.sanitize_strand_vector(strand)[source]¶
Create a numpy representation for strand.
Mapping: 1 for “+” (forward strand), 0 for “*” (any strand) and -1 for “-” (reverse strand).
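The mapping itself is a simple lookup. The sketch below applies it to a list of strand symbols and returns a plain Python list for illustration, whereas the library function is described as producing a numpy representation; strand_to_codes and STRAND_CODES are hypothetical names.

```python
# Mapping taken from the description above.
STRAND_CODES = {"+": 1, "-": -1, "*": 0}


def strand_to_codes(strand):
    """Hypothetical sketch: convert strand symbols to integer codes."""
    try:
        return [STRAND_CODES[s] for s in strand]
    except KeyError as err:
        raise ValueError(f"unknown strand symbol: {err.args[0]!r}") from None


codes = strand_to_codes(["+", "-", "*", "+"])
```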