Tutorial¶
The package provide classes to represent genomic intervals and methods to perform interval based arithmetic operations.
Intervals are inclusive on both ends and starts at 1.
The implementation details for the classes follow the Bioconductor’s equivalent R/GenomicRanges package.
Construct a GenomicRanges
object¶
To construct a GenomicRanges object, simply pass in the column representation as a dictionary. This dictionary must contain “seqnames”, “starts”, “ends” columns and optionally specify “strand”. If “strand” column is not provided, “*” is used as the default value for each genomic interval.
Pandas DataFrame¶
A common representation in Python is a pandas DataFrame for all tabular datasets. One can convert this DataFrame into GenomicRanges
if it contains the necessary columns.
Note: The DataFrame must contain columns seqnames
, starts
and ends
to represent genomic coordinates.
from genomicranges import GenomicRanges
import pandas as pd
df = pd.DataFrame(
{
"seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"],
"starts": [101, 102, 103, 104, 109],
"ends": [112, 103, 128, 134, 111],
"strand": ["*", "-", "*", "+", "-"],
"score": range(0, 5),
"GC": [random() for _ in range(5)],
}
)
gr = GenomicRanges.from_pandas(df)
From UCSC or GTF file¶
Methods are available to download, parse and access genomes from UCSC or load a genome annotation from GTF file.
import genomicranges
gr = genomicranges.read_gtf(<PATH TO GTF>)
# OR
gr = genomicranges.read_ucsc(genome="hg19")
from IRanges
(Preferred way)¶
If you have all relevant information to create a GenomicRanges
object
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
seqnames=[
"chr1",
"chr2",
"chr3",
"chr2",
"chr3",
],
ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
strand=["*", "-", "*", "+", "-"],
mcols=BiocFrame(
{
"score": range(0, 5),
"GC": [random() for _ in range(5)],
}
),
)
Set sequence information¶
The package also provides a SeqInfo
class to update or modify sequence information stored in the object. Read more about this class in GenomeInfoDb package.
from genomicranges import SeqInfo
seq_obj = {
"seqnames": ["chr1", "chr2", "chr3",],
"seqlengths": range(100, 103),
"is_circular": [random() < 0.5 for _ in range(3)],
"genome": "hg19",
}
seq = SeqInfo(seq_obj)
gr.seq_info = seq
Get and Set methods¶
Getters are available to access various properties.
# access sequence names
gr.seqnames
# access all start positions
gr.start
# access annotation information if available
gr.seq_info
# compute and return the widths of each region
gr.width
# access metadata columns, everything other than genomic locations
gr.mcols
Setters¶
All property based setters are in_place
operations. Methods are available to get and set properties on GenomicRanges.
gr.mcols = gr.mcols.set_column("score", range(1,6))
# or use an in-place operation
gr.mcols.set_column("score", range(1,6), in_place=True)
Access any column¶
Aside from the default getters, column
methods provides a way to quickly access any column in the object.
gr.mcols.column("score")
Access ranges¶
ranges()
is a generic method to access only the genomic locations as dictionary, pandas DataFrame
or something else. you can use any container representation based on a dictionary.
gr.ranges
Slice operations¶
You can slice a GenomicRange
object using the subset ([]
) operator. This operation accepts different slice input types, you can either specify a boolean vector, a `slice`` object, a list of indices, or row/column names to subset.
# slice the first 3 rows
gr[:3]
# slice 1, 3 and 2nd rows
gr[[1,3,2]]
Iterate over intervals¶
You can iterate over the intervals of a GenomicRanges
object. rowname
is None if the object does not have any row names.
for rowname, row in gr:
print(rowname, row)
Intra-range transformations¶
For detailed description of these methods, refer to Bioconductor’s GenomicRanges documentation
flank: flank the intervals based on start or end or both.
shift: shifts all the ranges specified by the shift argument.
resize: resizes the ranges to the specified width where either the start, end, or center is used as an anchor
narrow: narrows the ranges
promoters: promoters generates promoter ranges for each range relative to the TSS.The promoter range is expanded around the TSS according to the upstream and downstream parameters.
restrict: restricts the ranges to the interval(s) specified by the start and end arguments
trim: trims out-of-bound ranges located on non-circular sequences whose length is not NA.
# flank
flanked_gr = gr.flank(width=10, start=False, both=True)
# shift
shifted_gr = gr.shift(shift=10)
# resize
resized_gr = gr.resize(width=10, fix="end", ignore_strand=True)
# narrow
narrow_gr = gr.narrow(end=4, width=3)
# promoters
prom_gr = gr.promoters()
# restrict
restrict_gr = gr.restrict(start=114, end=140, keep_all_ranges=True)
# trim
trimmed_gr = gr.trim()
Inter-range methods¶
range: returns a new GenomicRanges object containing range bounds for each distinct (seqname, strand) pairing.
reduce: returns a new GenomicRanges object containing reduced bounds for each distinct (seqname, strand) pairing.
gaps: Finds gaps in the GenomicRanges object for each distinct (seqname, strand) pairing
disjoin: Finds disjoint intervals across all locations for each distinct (seqname, strand) pairing.
# range
range_gr = gr.range()
# reduce
reduced_gr = gr.reduce(min_gap_width=10, with_reverse_map=True)
# gaps
gapped_gr = gr.gaps(start=103) # OR
gapped_gr = gr.gaps(end={"chr1": 120, "chr2": 120, "chr3": 120})
# disjoin
disjoin_gr = gr.disjoin()
Set operations on genomic ranges¶
union: compute union of intervals across object
intersect: compute intersection or finds overlapping intervals
setdiff: compute set difference
g_src = GenomicRanges(
seqnames = ["chr1", "chr2", "chr1", "chr3", "chr2"],
ranges = IRanges(start =[101, 102, 103, 104, 109], width=[112, 103, 128, 134, 111]),
strand = ["*", "-", "*", "+", "-"]
)
g_tgt = GenomicRanges(
seqnames = ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"],
ranges = IRanges(start =range(101, 111), width=range(121, 131)),
strand = ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"]
)
# intersection
int_gr = g_src.intersect(g_tgt)
# set diff
diff_gr = g_src.setdiff(g_tgt)
# union
union_gr = g_src.union(g_tgt)
Compute over bins¶
Summary stats for column¶
one can use Pandas for this
pd.Series(gr.mcols.get_column("score")).describe()
binned_average
¶
Compute binned average for different positions
bins = pd.DataFrame({"seqnames": ["chr1"], "starts": [101], "ends": [109],})
bins_gr = GenomicRanges.from_pandas(bins)
subject = GenomicRanges(
seqnames= ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"],
ranges=IRanges(range(101, 111), range(121, 131)),
strand= ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"],
mcols=BiocFrame({
"score": range(0, 10),
})
)
# Compute binned average
binned_avg_gr = subject.binned_average(bins=bins_gr, scorename="score", outname="binned_score")
binned_avg_gr
now you might wonder how can I generate these bins?
Generate tiles or bins from GenomicRanges
¶
tile: Splits each genomic region by n (number of regions) or by width (maximum width of each tile)
sliding_windows: Generates sliding windows within each range, by width and step.
# tiles
tiles = gr.tile(n=2)
# slidingwindows
tiles = gr.sliding_windows(width=10)
Generate tiles from Genome¶
tile_genome
returns a set of genomic regions that form a partitioning of the specified genome.
seqlengths = {"chr1": 100, "chr2": 75, "chr3": 200}
tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, n=10)
Coverage¶
Computes number of ranges that overlap for each position in the range.
res_vector = gr.coverage(shift=10, width=5)
Overlap based methods¶
find_overlaps: find overlaps between two
GenomicRanges
objectcount_overlaps: count overlaps between two
GenomicRanges
objectsubset_by_overlaps: subset a
GenomicRanges
object if it overlaps with the ranges in the query
subject = GenomicRanges(
seqnames= ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"],
ranges=IRanges(range(101, 111), range(121, 131)),
strand= ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"],
mcols=BiocFrame({
"score": range(0, 10),
})
)
df_query = pd.DataFrame(
{"seqnames": ["chr2",], "starts": [4], "ends": [6], "strand": ["+"]}
)
query = GenomicRanges.from_pandas(df_query)
# findOverlaps
res = subject.find_overlaps(query, query_type="within")
# countOverlaps
res = subject.count_overlaps(query)
# subsetByOverlaps
res = subject.subset_by_overlaps(query)
Search operations¶
nearest: Performs nearest neighbor search along any direction (both upstream and downstream)
follow: Performs nearest neighbor search only along downstream
precede: Performs nearest neighbor search only along upstream
find_regions = GenomicRanges(
seqnames= ["chr1", "chr2", "chr3"],
ranges=IRanges([200, 105, 1190],[203, 106, 1200]),
)
query_hits = gr.nearest(find_regions)
query_hits = gr.precede(find_regions)
query_hits = gr.follow(find_regions)
Comparison, rank and order operations¶
match: Element wise comparison to find exact match intervals.
order: Get the order of indices for sorting.
sort: Sort the GenomicRanges object.
rank: for each interval identifies its position is a sorted order
# match
query_hits = gr.match(gr[2:5])
# order
order = gr.order()
# sort
sorted_gr = gr.sort()
# rank
rank = gr.rank()
Combine GenomicRanges
objects by rows¶
Use the combine
generic from biocutils to concatenate multiple GenomicRanges objects.
from biocutils.combine import combine
combined_gr = combine(gr, gr1, gr2, ...)
or use the `combine`` method,
combined_gr = gr.combine(gr1, gr2, gr3, ...)
Misc operations¶
invert_strand: flip the strand for each interval
sample: randomly choose k intervals
# invert strand
inv_gr = gr.invert_strand()
# sample
samp_gr = gr.sample(k=4)
Construct a GenomicRangesList
object.¶
Just as it sounds, a GenomicRangesList
is a named-list like object.
If you are wondering why you need this class, a GenomicRanges
object lets us specify multiple
genomic elements, usually where the genes start and end. Genes are themselves made of many sub
regions, e.g. exons. GenomicRangesList
allows us to represent this nested structure.
Currently, this class is limited in functionality, purely a read-only class with basic accessors.
Note: This is a work in progress and the functionality is limited.
from genomicranges import GenomicRangesList
a = GenomicRanges(
seqnames=["chr1", "chr2", "chr1", "chr3"],
ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
strand=["-", "+", "*", "+"],
mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)
b = GenomicRanges(
seqnames=["chr2", "chr4", "chr5"],
ranges=IRanges([3, 6, 4], [30, 50, 60]),
strand=["-", "+", "*"],
mcols=BiocFrame({"score": [2, 3, 4]}),
)
grl = GenomicRangesList(ranges=[a,b], names=["gene1", "gene2"])
grl
Properties¶
grl.start
grl.width
Combine GenomicRangeslist
object¶
Similar to the combine function from GenomicRanges,
grla = GenomicRangesList(ranges=[a], names=["a"])
grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"])
# or use the combine generic
from biocutils.combine import combine
cgrl = combine(grla, grlb)
and that’s all for now! Check back later for more updates.