genomicranges package¶
Subpackages¶
Submodules¶
genomicranges.GenomicRanges module¶
- class genomicranges.GenomicRanges.GenomicRanges(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶
Bases:
object
GenomicRanges
provides a container class to represent and operate over genomic regions and annotations. Note: The documentation for some of the methods is derived from the GenomicRanges R/Bioconductor package.
- __getitem__(subset)[source]¶
Alias to get_subset.
- Return type:
- __init__(seqnames, ranges, strand=None, names=None, mcols=None, seqinfo=None, metadata=None, validate=True)[source]¶
Initialize a GenomicRanges object.
- Parameters:
seqnames (Sequence[str]) – List of sequence or chromosome names.
ranges (IRanges) – Genomic positions and widths of each position. Must have the same length as seqnames.
strand (Union[Sequence[str], Sequence[int], ndarray, None]) – Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand). If None, all genomic ranges are assumed to be 0 (any strand).
May be provided as a list of strings representing the strand: “+” for forward strand, “-” for reverse strand, or “*” for any strand. These are mapped to 1, -1 or 0 respectively.
names (Optional[Sequence[str]]) – Names for each genomic range. Defaults to None, which means the ranges are unnamed.
mcols (Optional[BiocFrame]) – A BiocFrame with the same number of rows as the number of genomic ranges, containing per-range annotation. Defaults to None, in which case an empty BiocFrame object is created.
seqinfo (Optional[SeqInfo]) – Sequence information. Defaults to None, in which case a SeqInfo object is created with the unique set of chromosome names from seqnames.
metadata (Optional[dict]) – Additional metadata. Defaults to None, and is assigned to an empty dictionary.
validate (bool) – Internal use only.
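The strand encoding described above can be illustrated with a small standalone sketch. The helper name normalize_strand is hypothetical and is not part of the genomicranges API; it only mirrors the documented mapping of “+”, “-” and “*” to 1, -1 and 0, and the None default.

```python
from typing import List, Optional, Sequence, Union

# Mapping taken from the parameter description: "+" -> 1, "-" -> -1, "*" -> 0.
STRAND_CODES = {"+": 1, "-": -1, "*": 0}


def normalize_strand(strand: Optional[Sequence[Union[str, int]]], n: int) -> List[int]:
    """Return n integer strand codes (hypothetical helper, not the library's code)."""
    if strand is None:
        # No strand given: every range is treated as "any strand".
        return [0] * n
    if len(strand) != n:
        raise ValueError("strand must have one entry per range")
    codes = []
    for s in strand:
        if isinstance(s, str):
            if s not in STRAND_CODES:
                raise ValueError(f"unknown strand symbol: {s!r}")
            codes.append(STRAND_CODES[s])
        elif s in (-1, 0, 1):
            codes.append(int(s))
        else:
            raise ValueError(f"unknown strand code: {s!r}")
    return codes


print(normalize_strand(["-", "+", "*", "+"], 4))  # [-1, 1, 0, 1]
print(normalize_strand(None, 2))                  # [0, 0]
```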
- __setitem__(args, value)[source]¶
Alias to set_subset.
This operation modifies the object in-place.
- Return type:
- binned_average(scorename, bins, outname='binned_average', in_place=False)[source]¶
Calculate the average of a column across all regions in bins, then set a column specified by outname with those values.
- Parameters:
scorename (str) – Score column to compute averages on.
bins (GenomicRanges) – Bins you want to use.
outname (str) – New column name to add to the object.
in_place (bool) – Whether to modify bins in place.
- Raises:
ValueError – If the scorename column does not exist, or if scorename is not all ints or floats.
TypeError – If bins is not of type GenomicRanges.
- Return type:
- Returns:
A modified
bins
object with the computed averages, either as a copy of the original or as a reference to the (in-place-modified) original.
- copy()[source]¶
Alias for
__copy__()
.
- count_overlaps(query, query_type='any', max_gap=-1, min_overlap=0, ignore_strand=False)[source]¶
Count overlaps between subject (self) and a query GenomicRanges object.
- Parameters:
query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) – Overlap query type, must be one of
“any”: Any overlap is good
“start”: Overlap at the beginning of the intervals
“end”: Must overlap at the end of the intervals
“within”: Fully contain the query interval
Defaults to “any”.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If query is not of type GenomicRanges.
- Return type:
- Returns:
NumPy vector with length matching query, value represents the number of overlaps in self for each query.
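As a rough illustration of what an “any” overlap count means, the following standalone sketch counts, for each query interval, the subject intervals that share at least one position with it. It assumes inclusive (start, end) tuples on a single sequence and ignores strand, max_gap and min_overlap; it is not the library's implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive; an assumed convention


def count_any_overlaps(subject: List[Interval], query: List[Interval]) -> List[int]:
    """For each query interval, count subject intervals sharing at least one position."""
    counts = []
    for q_start, q_end in query:
        hits = sum(
            1
            for s_start, s_end in subject
            if s_start <= q_end and q_start <= s_end
        )
        counts.append(hits)
    return counts


subject = [(1, 10), (5, 20), (30, 40)]
query = [(8, 12), (21, 29), (35, 36)]
print(count_any_overlaps(subject, query))  # [2, 0, 1]
```

The result has one entry per query interval, which matches the shape of the return value described above.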
- coverage(shift=0, width=None, weight=1)[source]¶
Calculate coverage for each chromosome. For each position, count the number of ranges that cover it.
- Parameters:
- Return type:
- Returns:
A dictionary with chromosome names as keys and the coverage vector.
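The idea of per-position coverage can be sketched without the library. The example below assumes (seqname, start, end) tuples with 1-based inclusive coordinates and ignores the shift, width and weight parameters; the function is illustrative only and does not reproduce the library's output type.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def coverage(ranges: List[Tuple[str, int, int]]) -> Dict[str, List[int]]:
    """Count, per sequence, how many ranges cover each position (1-based, inclusive)."""
    max_end: Dict[str, int] = defaultdict(int)
    for seq, _start, end in ranges:
        max_end[seq] = max(max_end[seq], end)

    # Difference arrays: +1 where a range begins, -1 just past where it ends.
    diffs = {seq: [0] * (length + 2) for seq, length in max_end.items()}
    for seq, start, end in ranges:
        diffs[seq][start] += 1
        diffs[seq][end + 1] -= 1

    result = {}
    for seq, diff in diffs.items():
        running, cov = 0, []
        for pos in range(1, max_end[seq] + 1):
            running += diff[pos]
            cov.append(running)
        result[seq] = cov
    return result


print(coverage([("chr1", 1, 3), ("chr1", 2, 5), ("chr2", 2, 2)]))
# {'chr1': [1, 2, 2, 1, 1], 'chr2': [0, 1]}
```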
- disjoin(with_reverse_map=False, ignore_strand=False)[source]¶
Calculate disjoint genomic positions for each distinct (seqname, strand) pair.
- Parameters:
- Return type:
- Returns:
A new
GenomicRanges
containing disjoint ranges.
- disjoint_bins(ignore_strand=False)[source]¶
Split ranges into a set of bins so that the ranges in each bin are disjoint.
- distance(query)[source]¶
Compute the pair-wise distance with intervals in query.
- Parameters:
query (Union[GenomicRanges, IRanges]) – Query GenomicRanges or IRanges.
- Return type:
- Returns:
Numpy vector containing distances for each interval in query.
- classmethod empty()[source]¶
Create a zero-length GenomicRanges object.
- Returns:
same type as caller, in this case a GenomicRanges.
- find_overlaps(query, query_type='any', select='all', max_gap=-1, min_overlap=0, ignore_strand=False)[source]¶
Find overlaps between subject (self) and a query GenomicRanges object.
- Parameters:
query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) – Overlap query type, must be one of
“any”: Any overlap is good
“start”: Overlap at the beginning of the intervals
“end”: Must overlap at the end of the intervals
“within”: Fully contain the query interval
Defaults to “any”.
select (Literal['all', 'first', 'last', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in subject.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If query is not of type GenomicRanges.
- Returns:
A BiocFrame with two columns:
query_hits: Indices into query ranges
self_hits: Corresponding indices into self ranges that overlap
Each row represents a query-self pair where self overlaps query.
- flank(width, start=True, both=False, ignore_strand=False, in_place=False)[source]¶
Compute flanking ranges for each range. The logic for this comes from the R/GenomicRanges & IRanges packages.
If start is True for a given range, the flanking occurs at the start, otherwise the end. The widths of the flanks are given by the width parameter. width can be negative, in which case the flanking region is reversed so that it represents a prefix or suffix of the range.
Usage:
gr.flank(3, True), where “x” indicates a range in gr and “-” indicates the resulting flanking region:
---xxxxxxx
If start were False, the flanking region follows the range in gr instead:
xxxxxxx---
For negative width, i.e. gr.flank(-3, False), where “*” indicates the overlap between “x” and the result:
xxxx***
If both is True, then, for all ranges in “x”, the flanking regions are extended into (or out of, if width is negative) the range, so that the result straddles the given endpoint and has twice the width given by width.
- Parameters:
width (int) – Width to flank by. May be negative.
start (Union[bool, ndarray, List[bool]]) – Whether to only flank starts. Defaults to True.
Alternatively, you may provide a list of start values, whose length is the same as the number of ranges.
both (bool) – Whether to flank both starts and ends. Defaults to False.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object with the flanked regions, either as a copy of the original or as a reference to the (in-place-modified) original.
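The flanking rules can be worked through on a single unstranded range. The sketch below follows the R/IRanges flank logic as understood here, using inclusive (start, end) coordinates; the function name and the at_start argument are illustrative, and the Python library's own coordinate handling and strand treatment may differ.

```python
from typing import Tuple


def flank(start: int, end: int, width: int,
          at_start: bool = True, both: bool = False) -> Tuple[int, int]:
    """Return the inclusive (start, end) of the flank of one unstranded range."""
    w = abs(width)
    if both:
        # Straddle the chosen endpoint: w positions on each side.
        anchor = start if at_start else end + 1
        new_start = anchor - w
        return new_start, new_start + 2 * w - 1
    if at_start:
        # Negative width gives a prefix of the range, positive the region before it.
        new_start = start if width < 0 else start - width
    else:
        # Negative width gives a suffix of the range, positive the region after it.
        new_start = end + width + 1 if width < 0 else end + 1
    return new_start, new_start + w - 1


print(flank(10, 16, 3))                   # (7, 9)    ---xxxxxxx
print(flank(10, 16, 3, at_start=False))   # (17, 19)  xxxxxxx---
print(flank(10, 16, -3, at_start=False))  # (14, 16)  xxxx***
print(flank(10, 16, 3, both=True))        # (7, 12)
```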
- follow(query, select='last', ignore_strand=False)[source]¶
Search nearest positions only upstream that overlap with each range in query.
- Parameters:
query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'last']) – Whether to return “all” hits or just “last”. Defaults to “last”.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
- Returns:
If select=“last”: A numpy array of integers with length matching query, containing indices into self for the closest upstream position of each query range. Value may be None if there are no matches.
If select=“all”: A BiocFrame with two columns:
query_hits: Indices into query ranges
self_hits: Corresponding indices into self ranges that are upstream
Each row represents a query-self pair where self follows query.
- classmethod from_pandas(input)[source]¶
Create a GenomicRanges object from a pandas DataFrame.
- Parameters:
input – Input data. Must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”.
- Return type:
- Returns:
A
GenomicRanges
object.
- classmethod from_polars(input)[source]¶
Create a GenomicRanges object from a polars DataFrame.
- Parameters:
input – Input polars DataFrame. Must contain columns ‘seqnames’, ‘starts’ and ‘widths’ or “ends”.
- Return type:
- Returns:
A
GenomicRanges
object.
- gaps(start=1, end=None, ignore_strand=False)[source]¶
Identify complemented ranges for each distinct (seqname, strand) pair.
- Parameters:
start (int) – Restrict chromosome start position. Defaults to 1.
end (Union[int, Dict[str, int], None]) – Restrict end position for each chromosome. Defaults to None. If None, extracts sequence information from the seqinfo object if available.
Alternatively, you may provide a dictionary with seqnames as keys and the values specifying the ends.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
with complement ranges.
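To make the notion of complemented ranges concrete, here is a minimal sketch for a single sequence and strand. It assumes inclusive (start, end) tuples; the function name and the clipping behaviour at end are assumptions for illustration, not a description of the library's internals.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def gaps(intervals: List[Interval], start: int = 1,
         end: Optional[int] = None) -> List[Interval]:
    """Return the positions in [start, end] not covered by any interval."""
    covered_to = start - 1  # last position known to be covered (or before start)
    out = []
    for s, e in sorted(intervals):
        if end is not None and covered_to >= end:
            break
        if s > covered_to + 1:
            gap_end = s - 1 if end is None else min(s - 1, end)
            out.append((covered_to + 1, gap_end))
        covered_to = max(covered_to, e)
    # Trailing gap up to the sequence end, when the end is known.
    if end is not None and covered_to < end:
        out.append((covered_to + 1, end))
    return out


print(gaps([(5, 10), (8, 12), (20, 25)], start=1, end=30))
# [(1, 4), (13, 19), (26, 30)]
```

Without an end (for example, when no sequence length is known), the sketch simply omits the trailing gap.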
- get_end()[source]¶
Get all end positions.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the end positions for all ranges.
- get_mcols()[source]¶
- Return type:
- Returns:
A BiocFrame containing per-range annotations.
- get_out_of_bound_index()[source]¶
Find indices of genomic ranges that are out of bounds.
- Return type:
- Returns:
A numpy array of integer indices where ranges are out of bounds.
- get_seqinfo()[source]¶
- Return type:
- Returns:
A SeqInfo containing sequence information.
- get_seqlengths()[source]¶
Get sequence lengths for each genomic range.
- Return type:
- Returns:
An ndarray containing the sequence lengths.
- get_seqnames(as_type='list')[source]¶
Access sequence names.
- Parameters:
as_type (Literal['factor', 'list']) – Access seqnames as a factor, in which case levels and codes are returned.
If list, then codes are mapped to levels and returned.
- Return type:
- Returns:
A biocutils.Factor if as_type=“factor”. Otherwise a list of sequence names.
- get_start()[source]¶
Get all start positions.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the start positions for all ranges.
- get_strand(as_type='numpy')[source]¶
Access strand information.
- Parameters:
as_type (Literal['numpy', 'factor', 'list']) – If numpy, access strand as factor codes, in which case a numpy vector is returned.
If factor, a tuple with codes as the strand vector and levels, a dictionary containing the mapping.
If list, then codes are mapped to levels and returned.
- Return type:
- Returns:
Depending on as_type: a numpy vector representing strand, 0 for any strand, -1 for reverse strand and 1 for forward strand; a tuple of codes and levels; or a list of “+”, “-”, or “*” for each range.
- get_subset(subset)[source]¶
Subset GenomicRanges, based on their indices or names.
- Parameters:
subset (Union[str, int, bool, Sequence]) – Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.
Strings may only be used if names are available (see get_names()). The first occurrence of each string in the names is used for extraction.
- Return type:
- Returns:
A new
GenomicRanges
object with the ranges of interest.
- get_width()[source]¶
Get width of each genomic range.
- Return type:
- Returns:
NumPy array of 32-bit signed integers containing the width for all ranges.
- intersect(other, ignore_strand=False)[source]¶
Find intersecting genomic intervals with other.
- Parameters:
other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with intersecting ranges.
- intersect_ncls(other)[source]¶
Find intersecting genomic intervals with other (uses NCLS index).
- Parameters:
other (GenomicRanges) – The other GenomicRanges object.
- Return type:
- Returns:
A new
GenomicRanges
object with intersecting ranges.
- invert_strand(in_place=False)[source]¶
Invert strand for each range.
- Conversion map:
“+” becomes “-”
“-” becomes “+”
“*” stays the same
- Parameters:
in_place (bool) – Whether to modify the object in place. Defaults to False.
- Return type:
- Returns:
A modified GenomicRanges object with the inverted strands, either as a copy of the original or as a reference to the (in-place-modified) original.
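The conversion map amounts to a fixed lookup. A minimal standalone sketch, operating on a plain list of strand symbols rather than a GenomicRanges object (the function below is illustrative only):

```python
from typing import List

# Conversion map from the description: "+" <-> "-", "*" unchanged.
INVERTED = {"+": "-", "-": "+", "*": "*"}


def invert_strand(strands: List[str]) -> List[str]:
    """Return a new list with every strand symbol inverted."""
    return [INVERTED[s] for s in strands]


print(invert_strand(["+", "-", "*", "+"]))  # ['-', '+', '*', '-']
```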
- is_disjoint(ignore_strand=False)[source]¶
Check whether the ranges are disjoint, i.e. whether no two ranges within the same (seqname, strand) pair overlap.
- match(query, ignore_strand=False)[source]¶
Element wise comparison to find exact match ranges.
- Parameters:
query (GenomicRanges) – Query GenomicRanges to search for matches.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
- Return type:
- Returns:
A NumPy array with length matching query containing the matched indices.
- property mcols: BiocFrame¶
Alias for
get_mcols()
.
- property metadata: dict¶
Alias for
get_metadata
.
- property names: Names¶
Alias for
get_names()
.
- narrow(start=None, width=None, end=None, in_place=False)[source]¶
Narrow genomic positions by provided start, width and end parameters.
Important: these parameters are relative shifts in position for each range.
- Parameters:
start (Union[int, List[int], ndarray, None]) – Relative start position. Defaults to None.
width (Union[int, List[int], ndarray, None]) – Relative width of the interval. Defaults to None.
end (Union[int, List[int], ndarray, None]) – Relative end position. Defaults to None.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A modified GenomicRanges object with the narrowed regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- nearest(query, select='arbitrary', ignore_strand=False)[source]¶
Search nearest positions both upstream and downstream that overlap with each range in query.
- Parameters:
query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'arbitrary']) – Determine what hit to choose when there are multiple hits for an interval in query.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
- Returns:
If select=“arbitrary”: A numpy array of integers with length matching query, containing indices into self for the closest range for each query range. Value may be None if there are no matches.
If select=“all”: A BiocFrame with two columns:
query_hits: Indices into query ranges
self_hits: Corresponding indices into self ranges that are nearest
Each row represents a query-self pair where self is nearest to query.
- order(decreasing=False)[source]¶
Get the order of indices for sorting.
Orders the genomic ranges by chromosome and strand. Strand is ordered by reverse first (-1), any strand (0) and forward strand (1). Then by the start positions and width if two regions have the same start.
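The ordering rule can be expressed as a tuple sort key. The sketch below assumes each range is a (seqname, strand_code, start, width) tuple and sorts sequence names lexicographically; the library may order sequence names differently (for example, by their order in seqinfo), so treat this as an illustration of the key only.

```python
from typing import List, Tuple

# (seqname, strand_code, start, width); strand codes are -1, 0 or 1.
Range = Tuple[str, int, int, int]


def order_indices(ranges: List[Range], decreasing: bool = False) -> List[int]:
    """Return the indices that would sort the ranges by seqname, strand, start, width."""
    return sorted(range(len(ranges)), key=lambda i: ranges[i], reverse=decreasing)


ranges = [
    ("chr2", 1, 5, 10),
    ("chr1", 1, 3, 4),
    ("chr1", -1, 8, 2),
    ("chr1", 1, 3, 2),
]
print(order_indices(ranges))                   # [2, 3, 1, 0]
print(order_indices(ranges, decreasing=True))  # [0, 1, 3, 2]
```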
- precede(query, select='first', ignore_strand=False)[source]¶
Search nearest positions only downstream that overlap with each range in query.
- Parameters:
query (GenomicRanges) – Query GenomicRanges to find nearest positions.
select (Literal['all', 'first']) – Whether to return “all” hits or just “first”. Defaults to “first”.
ignore_strand (bool) – Whether to ignore strand. Defaults to False.
- Returns:
If select=“first”: A numpy array of integers with length matching query, containing indices into self for the closest downstream position of each query range. Value may be None if there are no matches.
If select=“all”: A BiocFrame with two columns:
query_hits: Indices into query ranges
self_hits: Corresponding indices into self ranges that are downstream
Each row represents a query-self pair where self precedes query.
- promoters(upstream=2000, downstream=200, in_place=False)[source]¶
Extend ranges to promoter regions.
Generates promoter ranges relative to the transcription start site (TSS).
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object with the promoter regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- range(with_reverse_map=False, ignore_strand=False)[source]¶
Calculate range bounds for each distinct (seqname, strand) pair.
- Parameters:
- Return type:
- Returns:
A new
GenomicRanges
object with the range bounds.
- property ranges: IRanges¶
Alias for
get_ranges()
.
- rank()[source]¶
Get rank of the GenomicRanges object.
For each range, identifies its position in the sorted order.
- reduce(with_reverse_map=False, drop_empty_ranges=False, min_gap_width=1, ignore_strand=False)[source]¶
Reduce orders the ranges, then merges overlapping or adjacent ranges.
- Parameters:
with_reverse_map (bool) – Whether to return map of indices back to original object. Defaults to False.
drop_empty_ranges (bool) – Whether to drop empty ranges. Defaults to False.
min_gap_width (int) – Ranges separated by a gap of at least min_gap_width positions are not merged. Defaults to 1.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with reduced intervals.
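The effect of min_gap_width is easiest to see on a small example. The following sketch merges inclusive (start, end) intervals on a single sequence and strand: two ranges are merged when the gap between them is smaller than min_gap_width. The function name is hypothetical and the reverse map and empty-range handling are left out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def reduce_intervals(intervals: List[Interval], min_gap_width: int = 1) -> List[Interval]:
    """Sort intervals, then merge those separated by fewer than min_gap_width positions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        # Number of uncovered positions between the previous range and this one.
        if merged and start - merged[-1][1] - 1 < min_gap_width:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


data = [(1, 5), (3, 8), (9, 12), (14, 20)]
print(reduce_intervals(data))                   # [(1, 12), (14, 20)]
print(reduce_intervals(data, min_gap_width=2))  # [(1, 20)]
print(reduce_intervals(data, min_gap_width=0))  # [(1, 8), (9, 12), (14, 20)]
```

With the default of 1, overlapping and directly adjacent ranges are merged, while ranges separated by one or more uncovered positions stay apart.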
- resize(width, fix='start', ignore_strand=False, in_place=False)[source]¶
Resize ranges to the specified width where either the start, end, or center is used as an anchor.
- Parameters:
width (Union[int, List[int], ndarray]) – Width to resize to; cannot be negative.
fix (Union[Literal['start', 'end', 'center'], List[Literal['start', 'end', 'center']]]) – Fix positions by “start”, “end”, or “center”. Defaults to “start”.
Alternatively, may provide a list of fix positions for each genomic range.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Raises:
ValueError – If parameter fix is neither start, end, nor center, or if width is negative.
- Return type:
- Returns:
A modified
GenomicRanges
object with the resized regions, either as a copy of the original or as a reference to the (in-place-modified) original.
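The three anchors can be compared on one range described by its start and width. This sketch ignores strand (on the reverse strand the roles of start and end would swap) and uses the floor-division centering found in R/IRanges, which is an assumption about how the Python library behaves.

```python
from typing import Tuple


def resize(start: int, width: int, new_width: int, fix: str = "start") -> Tuple[int, int]:
    """Return (new_start, new_width) for one unstranded range."""
    if new_width < 0:
        raise ValueError("width cannot be negative")
    if fix == "start":
        new_start = start
    elif fix == "end":
        # Keep the last position fixed: shift the start by the change in width.
        new_start = start + width - new_width
    elif fix == "center":
        new_start = start + (width - new_width) // 2
    else:
        raise ValueError("fix must be 'start', 'end' or 'center'")
    return new_start, new_width


print(resize(10, 10, 4, fix="start"))   # (10, 4)
print(resize(10, 10, 4, fix="end"))     # (16, 4)
print(resize(10, 10, 4, fix="center"))  # (13, 4)
```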
- restrict(start=None, end=None, keep_all_ranges=False)[source]¶
Restrict ranges to a given start and end positions.
- Parameters:
start (Union[int, Dict[str, int], ndarray, None]) – Start position. Defaults to None.
Alternatively may provide a dictionary mapping sequence names to starts, or array of starts.
end (Union[int, Dict[str, int], ndarray, None]) – End position. Defaults to None.
Alternatively may provide a dictionary mapping sequence names to ends, or array of ends.
keep_all_ranges (bool) – Whether to keep intervals that do not overlap with start and end. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with the restricted regions.
- sample(k=5)[source]¶
Randomly sample k intervals.
- Parameters:
k (int) – Number of intervals to sample. Defaults to 5.
- Return type:
- Returns:
A new
GenomicRanges
with randomly sampled ranges.
- property seqinfo: ndarray¶
Alias for
get_seqinfo()
.
- property seqnames: ndarray | List[str]¶
Alias for
get_seqnames()
.
- set_mcols(mcols, in_place=False)[source]¶
Set new range metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_metadata(metadata, in_place=False)[source]¶
Set additional metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_names(names, in_place=False)[source]¶
Set new names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_ranges(ranges, in_place=False)[source]¶
Set new ranges.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqinfo(seqinfo, in_place=False)[source]¶
Set new sequence information.
- Parameters:
seqinfo (Optional[SeqInfo]) – A SeqInfo object containing information about sequences in seqnames.
May be None to remove sequence information. This would then generate a new sequence information based on the current sequence names.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqnames(seqnames, in_place=False)[source]¶
Set new sequence names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_strand(strand, in_place=False)[source]¶
Set new strand information.
- Parameters:
strand (Union[Sequence[str], Sequence[int], ndarray, None]) – Strand information for each genomic range. This should be 0 (any strand), 1 (forward strand) or -1 (reverse strand).
Alternatively, may provide a list of strings representing the strand: “+” for forward strand, “-” for reverse strand, or “*” for any strand. These are automatically mapped to 1, -1 or 0 respectively.
May be set to None, in which case all genomic ranges are assumed to be 0.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_subset(args, value, in_place=False)[source]¶
Update positions.
- Parameters:
args (Union[Sequence, int, str, bool, slice, range]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be replaced, see normalize_subscript().
value (GenomicRanges) – A GenomicRanges object of length equal to the number of ranges to be replaced, as specified by args.
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A modified
GenomicRanges
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- setdiff(other, ignore_strand=False)[source]¶
Find set difference of genomic intervals with other.
- Parameters:
other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with the diff ranges.
- shift(shift=0, in_place=False)[source]¶
Shift all intervals.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRanges
object with the shifted regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- sliding_windows(width, step=1)[source]¶
Slide along each range by width (intervals with equal width) and step.
Also, check out tile_genome() for splitting a genome into chunks, or tile_by_range().
- Parameters:
width (int) – Width of each interval.
step (int) – Step to slide by. Defaults to 1.
- Return type:
- Returns:
A new
GenomicRanges
with the sliding ranges.
- sort(decreasing=False, in_place=False)[source]¶
Sort the GenomicRanges object.
- Parameters:
- Return type:
- Returns:
A modified GenomicRanges object with the sorted regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- split(groups)[source]¶
Split the GenomicRanges object into a GenomicRangesList.
- Parameters:
groups (list) – A list specifying the groups or factors to split by.
Must have the same length as the number of genomic elements in the object.
- Return type:
GenomicRangesList
- Returns:
A GenomicRangesList containing the groups and their corresponding elements.
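Conceptually, splitting assigns each range to the bucket named by its entry in groups. The standalone sketch below shows that grouping step on plain indices; the helper name is hypothetical, and building the actual GenomicRangesList from the buckets is left to the library.

```python
from typing import Dict, Hashable, List, Sequence


def split_indices(groups: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Map each group label to the indices of the elements that carry it."""
    buckets: Dict[Hashable, List[int]] = {}
    for index, label in enumerate(groups):
        buckets.setdefault(label, []).append(index)
    return buckets


print(split_indices(["a", "b", "a", "c"]))  # {'a': [0, 2], 'b': [1], 'c': [3]}
```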
- property strand: ndarray¶
Alias for
get_strand()
.
- subset_by_overlaps(query, query_type='any', max_gap=-1, min_overlap=0, ignore_strand=False)[source]¶
Subset subject (self) with overlaps in the query GenomicRanges object.
- Parameters:
query (GenomicRanges) – Query GenomicRanges.
query_type (Literal['any', 'start', 'end', 'within']) – Overlap query type, must be one of
“any”: Any overlap is good
“start”: Overlap at the beginning of the intervals
“end”: Must overlap at the end of the intervals
“within”: Fully contain the query interval
Defaults to “any”.
max_gap (int) – Maximum gap allowed in the overlap. Defaults to -1 (no gap allowed).
min_overlap (int) – Minimum overlap with query. Defaults to 0.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Raises:
TypeError – If query is not of type GenomicRanges.
- Return type:
- Returns:
A
GenomicRanges
object containing overlapping ranges.
- subtract(other, min_overlap=1, ignore_strand=False)[source]¶
Subtract searches for features in other that overlap self by at least the number of base pairs given by min_overlap.
- Parameters:
- Return type:
GenomicRangesList
- Returns:
A GenomicRangesList with the same size as
self
containing the subtracted regions.
- terminators(upstream=2000, downstream=200, in_place=False)[source]¶
Extend ranges to terminator regions.
Generates terminator ranges relative to the transcription end site (TES).
- Parameters:
- Return type:
- Returns:
A modified GenomicRanges object with the terminator regions, either as a copy of the original or as a reference to the (in-place-modified) original.
- tile(n=None, width=None)[source]¶
Split each interval by n (number of sub intervals) or width (intervals with equal width).
Note: Either n or width must be provided but not both.
Also, check out tile_genome() for splitting a genome into chunks.
- Parameters:
- Raises:
ValueError – If both n and width are provided.
- Return type:
- Returns:
List of
GenomicRanges
with the split ranges.
- classmethod tile_genome(seqlengths, ntile=None, tilewidth=None, cut_last_tile_in_chrom=False)[source]¶
Tile genome into approximately equal-sized regions.
- Parameters:
- Return type:
- Returns:
GenomicRanges object with ranges and bin numbers in mcols.
- trim(in_place=False)[source]¶
Trim sequences outside of bounds for non-circular chromosomes.
- Parameters:
in_place (bool) – Whether to modify the GenomicRanges object in place.
- Return type:
- Returns:
A new
GenomicRanges
object with the trimmed regions.
- union(other, ignore_strand=False)[source]¶
Find union of genomic intervals with other.
- Parameters:
other (GenomicRanges) – The other GenomicRanges object.
ignore_strand (bool) – Whether to ignore strands. Defaults to False.
- Return type:
- Returns:
A new
GenomicRanges
object with all ranges.
genomicranges.GenomicRangesList module¶
- class genomicranges.GenomicRangesList.GenomicRangesList(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶
Bases:
object
Just as it sounds, a GenomicRangesList is a named-list like object.
If you are wondering why you need this class, a GenomicRanges object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub regions, e.g. exons. GenomicRangesList allows us to represent this nested structure.
Currently, this class is limited in functionality, purely a read-only class with basic accessors.
Typical usage:
To construct a GenomicRangesList object, simply pass in a list of genomicranges.GenomicRanges.GenomicRanges objects and optionally names.

```python
from biocframe import BiocFrame
from genomicranges import GenomicRanges, GenomicRangesList
from iranges import IRanges

# First genomic element: four ranges with a per-range score.
a = GenomicRanges(
    seqnames=["chr1", "chr2", "chr1", "chr3"],
    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
    strand=["-", "+", "*", "+"],
    mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)

# Second genomic element: three ranges with a per-range score.
b = GenomicRanges(
    seqnames=["chr2", "chr4", "chr5"],
    ranges=IRanges([3, 6, 4], [30, 50, 60]),
    strand=["-", "+", "*"],
    mcols=BiocFrame({"score": [2, 3, 4]}),
)

# Combine the two GenomicRanges objects into a named list.
grl = GenomicRangesList(ranges=[a, b], names=["gene1", "gene2"])
```
Additionally, you may also provide metadata about the genomic elements in the dictionary using mcols attribute.
- __getitem__(args)[source]¶
Subset individual genomic elements.
- Parameters:
args (Union[str, int, tuple, list, slice]) – Name of the genomic element to access.
Alternatively, if names of genomic elements are not available, you may provide an index position of the genomic element to access.
Alternatively, args may also specify a list of positions to slice, specified either as a list or slice.
A tuple may also be specified along each dimension. Currently, if the tuple contains more than one dimension, it is ignored.
- Raises:
TypeError – If args is not a supported slice argument.
- Return type:
- Returns:
A new
GenomicRangesList
of the slice.
- __init__(ranges, range_lengths=None, names=None, mcols=None, metadata=None, validate=True)[source]¶
Initialize a GenomicRangesList object.
- Parameters:
ranges (Union[GenomicRanges, List[GenomicRanges]]) – List of genomic elements. All elements in this list must be genomicranges.GenomicRanges.GenomicRanges objects.
range_lengths (Optional[Sequence[int]]) – Number of ranges within each genomic element. Defaults to None, and is inferred from ranges.
names (Optional[List[str]]) – Names of the genomic elements. The length of this must match the number of genomic elements in ranges. Defaults to None.
mcols (Optional[BiocFrame]) – Metadata about each genomic element. Defaults to None.
metadata (Optional[dict]) – Additional metadata. Defaults to None.
validate (bool) – Internal use only.
- as_genomic_ranges()[source]¶
Coerce object to a GenomicRanges.
- Return type:
- Returns:
A
GenomicRanges
object.
- copy()[source]¶
Alias for
__copy__()
.
- classmethod empty(n)[source]¶
Create an empty n-length GenomicRangesList object.
- Returns:
same type as caller, in this case a GenomicRangesList.
- property end: Dict[str, List[int]]¶
Get all end positions.
- Returns:
A list with the same length as keys in the object, each element in the list contains another list values.
- classmethod from_dict(x)[source]¶
Create a GenomicRangesList object from a dict.
- Returns:
same type as caller, in this case a GenomicRangesList.
- get_mcols()[source]¶
- Return type:
- Returns:
A BiocFrame containing per-genomic element annotations.
- groups(group)[source]¶
Get a genomic element by their name or index position.
- property is_circular: Dict[str, List[int]]¶
Get the circularity flag.
- Returns:
A list with the same length as keys in the object, each element in the list contains another list values.
- is_empty()[source]¶
Whether the GenomicRangesList has no elements or if all its elements are empty.
- Return type:
- Returns:
True if the object has no elements.
- property mcols: BiocFrame¶
Alias for
get_mcols()
.
- property metadata: dict¶
Alias for
get_metadata
.
- property names: Names¶
Alias for
get_names()
.
- range()[source]¶
Calculate range bounds for each genomic element.
- Return type:
- Returns:
A new
GenomicRangesList
object with the range bounds.
- property range_lengths: dict¶
Alias for
get_range_lengths
.
- property ranges: GenomicRanges¶
Alias for
get_ranges()
.
- property seq_info: Dict[str, List[int]]¶
Get information about the underlying sequences.
- Returns:
A list with the same length as keys in the object, each element in the list contains another list values.
- property seqnames: Dict[str, List[str]]¶
Get all sequence names.
- Returns:
A list with the same length as keys in the object, each element in the list contains another list of sequence names.
- set_mcols(mcols, in_place=False)[source]¶
Set new range metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_metadata(metadata, in_place=False)[source]¶
Set additional metadata.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_names(names, in_place=False)[source]¶
Set new names.
- Parameters:
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_ranges(ranges, in_place=False)[source]¶
Set new genomic ranges.
- Parameters:
ranges (Union[GenomicRanges, List[GenomicRanges]]) – List of genomic elements. All elements in this list must be genomicranges.GenomicRanges.GenomicRanges objects.
in_place (bool) – Whether to modify the GenomicRangesList object in place.
- Return type:
- Returns:
A modified
GenomicRangesList
object, either as a copy of the original or as a reference to the (in-place-modified) original.
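The copy-or-modify behaviour shared by the setters can be sketched generically. `ElementList` below is a hypothetical stand-in written for illustration; it is not the package's class and only mirrors the `in_place` contract described in the Returns text.

```python
import copy


class ElementList:
    """Hypothetical container with an in_place-aware setter."""

    def __init__(self, ranges):
        self._ranges = list(ranges)

    def set_ranges(self, ranges, in_place=False):
        # Work on self when in_place is True, otherwise on a shallow copy,
        # so the original object is left untouched by default.
        target = self if in_place else copy.copy(self)
        target._ranges = list(ranges)
        return target


original = ElementList(["a", "b"])
updated = original.set_ranges(["c"])                # new object, original unchanged
unchanged_snapshot = list(original._ranges)
same = original.set_ranges(["d"], in_place=True)    # returns the original itself
```

Returning the object in both cases lets calls be chained regardless of which mode is used.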
- property start: Dict[str, List[int]]¶
Get all start positions.
- Returns:
A dictionary with one entry per key in the object; each value is a list of start positions for that element.
- property strand: Dict[str, List[int]]¶
Get strand of all regions across all elements.
- Returns:
A dictionary with one entry per key in the object; each value is a list of strands for that element.
genomicranges.SeqInfo module¶
- class genomicranges.SeqInfo.SeqInfo(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶
Bases:
object
Information about the reference sequences, specifically the name and length of each sequence, whether it is circular, and the identity of the genome from which it was derived.
- __getitem__(subset)[source]¶
Alias to get_subset.
- Return type:
- __init__(seqnames, seqlengths=None, is_circular=None, genome=None, validate=True)[source]¶
- Parameters:
seqnames (Sequence[str]) – Names of all reference sequences, should be unique.
seqlengths (Union[int, Sequence[int], Dict[str, int], None]) – Lengths of all sequences in base pairs. This should contain non-negative values and have the same number of elements as seqnames. Entries may also be None if no lengths are available for that sequence.
Alternatively, a dictionary where keys are the sequence names and values are the lengths. If a name is missing from this dictionary, the length of the sequence is set to None.
Alternatively a single integer, if all sequences are of the same length.
Alternatively None, if no length information is available for any sequence.
is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) – Whether each sequence is circular. This should have the same number of elements as seqnames. Entries may also be None if no information is available for that sequence.
Alternatively, a dictionary where keys are the sequence names and values are the circular flags. If a name is missing from this dictionary, the flag for the sequence is set to None.
Alternatively a single boolean, if all sequences have the same circular flag.
Alternatively None, if no flags are available for any sequence.
genome (Union[str, Sequence[str], Dict[str, str], None]) – The genome build containing each reference sequence. This should have the same number of elements as seqnames. Entries may also be None if no information is available.
Alternatively, a dictionary where keys are the sequence names and values are the genomes. If a name is missing from this dictionary, the genome is set to None.
Alternatively a single string, if all sequences are derived from the same genome.
Alternatively None, if no genome information is available for any sequence.
validate (bool) – Whether to validate the arguments, internal use only.
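The four accepted input shapes for seqlengths, is_circular and genome (sequence, dictionary, single value, None) can be reduced to one list per field. The sketch below is a minimal illustration of that expansion; `normalize_field` is a hypothetical helper and is not how the package necessarily implements it.

```python
def normalize_field(seqnames, value):
    """Expand a scalar, sequence, dict or None to one entry per sequence name."""
    if value is None:
        # No information for any sequence.
        return [None] * len(seqnames)
    if isinstance(value, dict):
        # Names missing from the dictionary are filled with None.
        return [value.get(name) for name in seqnames]
    if isinstance(value, (int, bool, str)):
        # A single value applies to every sequence.
        return [value] * len(seqnames)
    value = list(value)
    if len(value) != len(seqnames):
        raise ValueError("expected one entry per sequence name")
    return value


names = ["chr1", "chr2", "chrM"]
lengths = normalize_field(names, {"chr1": 1000, "chrM": 16})
circular = normalize_field(names, False)
genome = normalize_field(names, "hg38")
```

Checking for `str` before treating the value as a sequence matters, since a string would otherwise be split into characters.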
- copy()[source]¶
Alias for
__copy__()
.
- classmethod empty()[source]¶
Create a zero-length SeqInfo object.
- Returns:
same type as caller, in this case a SeqInfo.
- get_genome()[source]¶
- Return type:
- Returns:
A list of strings is returned containing the genome identity for all sequences in
get_seqnames()
.
- get_is_circular()[source]¶
- Return type:
- Returns:
A list of booleans is returned specifying whether each sequence (from
get_seqnames()
) is circular.
- get_seqlengths()[source]¶
- Return type:
- Returns:
A list of integers is returned containing the lengths of all sequences, in the same order as the sequence names from
get_seqnames()
.
- get_subset(subset)[source]¶
Subset SeqInfo based on indices or seqnames.
- Parameters:
subset (Union[str, int, bool, Sequence]) – Indices to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.
Strings may only be used if seqnames are available (see get_seqnames()). The first occurrence of each string in the seqnames is used for extraction.
- Return type:
- Returns:
A new
SeqInfo
object with the sequences of interest.
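To make the subscript rules concrete, here is a simplified sketch that resolves a subscript to integer positions. `resolve_subscript` is a hypothetical helper covering only integers, strings and boolean masks; it is not the `normalize_subscript()` implementation and skips cases such as scalar booleans and slices.

```python
def resolve_subscript(subset, seqnames):
    """Turn an int, str, or a sequence of ints/strs/bools into positions."""
    # Scalars are treated as length-1 sequences.
    if isinstance(subset, (str, int)) and not isinstance(subset, bool):
        subset = [subset]
    subset = list(subset)

    # A boolean mask selects the positions flagged True.
    if subset and all(isinstance(s, bool) for s in subset):
        if len(subset) != len(seqnames):
            raise ValueError("boolean mask must match the number of sequences")
        return [i for i, keep in enumerate(subset) if keep]

    positions = []
    for s in subset:
        if isinstance(s, str):
            # list.index returns the first occurrence of the name.
            positions.append(seqnames.index(s))
        else:
            positions.append(s)
    return positions


names = ["chr1", "chr2", "chr1", "chrM"]
by_name = resolve_subscript(["chrM", "chr1"], names)
by_mask = resolve_subscript([False, True, False, True], names)
```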
- set_genome(genome, in_place=False)[source]¶
- Parameters:
- Return type:
- Returns:
A modified
SeqInfo
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_is_circular(is_circular, in_place=False)[source]¶
- Parameters:
is_circular (Union[bool, Sequence[bool], Dict[str, bool], None]) – List of circular flags, of length equal to the number of names in this SeqInfo object. Values may be None or booleans.
Alternatively, a dictionary where keys are the sequence names and values are the flags. Not all names need to be present, in which case the flag is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.
- Return type:
- Returns:
A modified
SeqInfo
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- set_seqlengths(seqlengths, in_place=False)[source]¶
- Parameters:
seqlengths (Union[int, Sequence[int], Dict[str, int], None]) – List of sequence lengths, of length equal to the number of names in this SeqInfo object. Values may be None or non-negative integers.
Alternatively, a dictionary where keys are the sequence names and values are the lengths. Not all names need to be present, in which case the length is assumed to be None.
in_place (bool) – Whether to modify the SeqInfo object in place.
- Return type:
- Returns:
A modified
SeqInfo
object, either as a copy of the original or as a reference to the (in-place-modified) original.
- class genomicranges.SeqInfo.SeqInfoIterator(obj)[source]¶
Bases:
object
An iterator to a
SeqInfo
object.
- genomicranges.SeqInfo.merge_SeqInfo(objects)[source]¶
Merge multiple SeqInfo objects, taking the union of all reference sequences. If the same reference sequence is present with the same details across objects, only a single instance is present in the final object; if details are contradictory, they are replaced with None.
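The merge rule can be illustrated on plain dictionaries. The sketch below assumes that contradictions are resolved field by field (length, circular flag, genome); that granularity is an assumption, and `merge_records` is a hypothetical helper rather than the package function.

```python
def merge_records(objects):
    """Merge dicts mapping seqname -> (length, is_circular, genome).

    The result is the union of all names; any field that disagrees
    between inputs is replaced with None.
    """
    merged = {}
    for obj in objects:
        for name, details in obj.items():
            if name not in merged:
                merged[name] = tuple(details)
            else:
                merged[name] = tuple(
                    old if old == new else None
                    for old, new in zip(merged[name], details)
                )
    return merged


a = {"chr1": (100, False, "hg38"), "chr2": (200, False, "hg38")}
b = {"chr2": (250, False, "hg38"), "chrM": (16, True, "hg38")}
merged = merge_records([a, b])
```

Here `chr2` keeps its circular flag and genome but loses its length, because the two inputs disagree on it.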
genomicranges.utils module¶
- genomicranges.utils.compute_up_down(starts, ends, strands, upstream, downstream, site='TSS')[source]¶
Compute promoter or terminator regions for genomic ranges.
- Parameters:
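starts – Start positions of the genomic ranges
ends – End positions of the genomic ranges
strands – Strand of each genomic range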
upstream – Number of bases upstream (scalar or array)
downstream – Number of bases downstream (scalar or array)
site (
str
) – “TSS” for promoters or “TES” for terminators
- Returns:
New starts and ends.
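For orientation, the sketch below computes such regions under the convention used by Bioconductor's `promoters()`: 1-based inclusive coordinates, with the TSS at the start of a forward-strand range and at the end of a reverse-strand range. Both the coordinate convention and the treatment of "TES" as the mirror case are assumptions, `flank_site` is a hypothetical helper, and only scalar `upstream`/`downstream` values are handled; the package's own function may differ.

```python
def flank_site(starts, ends, strands, upstream, downstream, site="TSS"):
    """Return (new_starts, new_ends) around the TSS or TES of each range."""
    if site not in ("TSS", "TES"):
        raise ValueError("site must be 'TSS' or 'TES'")

    new_starts, new_ends = [], []
    for start, end, strand in zip(starts, ends, strands):
        minus = strand == -1
        # Pick the anchor position: transcription start or end site.
        if site == "TSS":
            anchor = end if minus else start
        else:
            anchor = start if minus else end

        if minus:
            # On the reverse strand, upstream lies at higher coordinates.
            new_starts.append(anchor - downstream + 1)
            new_ends.append(anchor + upstream)
        else:
            new_starts.append(anchor - upstream)
            new_ends.append(anchor + downstream - 1)
    return new_starts, new_ends


promoters = flank_site([100, 100], [200, 200], [1, -1], upstream=10, downstream=5)
terminators = flank_site([100, 100], [200, 200], [1, -1], 10, 5, site="TES")
```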
- genomicranges.utils.sanitize_strand_vector(strand)[source]¶
Create a numpy representation for strand.
Mapping: 1 for “+” (forward strand), 0 for “*” (any strand) and -1 for “-” (reverse strand).
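The mapping itself is small enough to show directly. The sketch below returns a plain Python list rather than a numpy array to stay dependency-free; `encode_strand` is a hypothetical helper, and its error handling is an assumption rather than the package's documented behaviour.

```python
STRAND_CODES = {"+": 1, "-": -1, "*": 0}


def encode_strand(strand):
    """Map "+", "-", "*" (or already-encoded 1, -1, 0) to integer codes."""
    codes = []
    for value in strand:
        if isinstance(value, str):
            if value not in STRAND_CODES:
                raise ValueError(f"unknown strand symbol: {value!r}")
            codes.append(STRAND_CODES[value])
        elif value in (1, 0, -1):
            codes.append(int(value))
        else:
            raise ValueError(f"unknown strand value: {value!r}")
    return codes


encoded = encode_strand(["+", "-", "*", "+"])
```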