Tutorial 1: Perform range-based analyses using GenomicRanges
Genomic range operations are fundamental to many bioinformatics analyses. They allow us to work with intervals of genomic coordinates, which is crucial for understanding the relationships between different genomic features such as genes, regulatory elements, and experimental data like ChIP-seq peaks. In this tutorial, we’ll explore how to work with genomic interval data using BiocPy’s GenomicRanges package, which provides a Python implementation of the R/Bioconductor GenomicRanges package.
Outline
In this workshop, we’ll walk through several aspects of working with genomic ranges in Python:
Load genomic data: we’ll start by reading in genomic data from RDS files, including exon positions grouped by transcripts.
Basic genomic operations: we’ll cover fundamental operations like finding transcription start sites (TSS) and promoter regions.
Overlap analysis: we’ll learn how to find overlaps between different genomic features, a common task in many analyses.
Advanced operations: we’ll explore more complex operations like finding peaks within specific regions and resizing genomic intervals.
Prerequisites
Before we begin, please ensure that you have the following prerequisites installed:
Python 3.8 or later with dependencies listed here.
Insead of reinventing the wheel to access references in Python, we’ll use existing available Bioconductor resources that provide access to genome annotations. AnnotationHub is a great resource providing access to genomic reference annotations. It’s super convenient to search for a reference from AnnotationHub and download the genome of interest.
Let’s search the latest ensembl database for the human reference genome using Bioconductor’s AnnotationHub.
suppressMessages(library(AnnotationHub))ah <-AnnotationHub()ensdb <-query(ah, "Ensembl 112 EnsDb for Homo sapiens")[[1]]
We will then extract the exon positions and group them by transcript. Additionally, we provide a list of column names we would like to be available in mcols for our analysis. For the purpose of this tutorial, we’ll limit ourselves to the exons from chromosome 22.
from rds2py import read_rdshg38_robject = read_rds("./hg38_exons_by_tx.rds")# Only printing the keysprint("Keys of the object:", hg38_robject.keys())print("Class name of the object:", hg38_robject["class_name"], "from package:", hg38_robject["package_name"])
Keys of the object: dict_keys(['data', 'package_name', 'class_name', 'attributes'])
Class name of the object: CompressedGRangesList from package: GenomicRanges
This dictionary object (hg38_robject) contains 4 keys:
class_name: class name of the object
package_name: name of the package containing the class definition
data: contains the value if the object is a scalar
attributes: if the object is an S4 class, contains various attributes and their values
This dictionary can then be coerced into a Python GenomicRangesList class.
Currently this is a two step process, we are working on simplifying this to a single step for supported Bioconductor classes.
3. Define promoters and TSS
Now, let’s perform some basic operations like finding transcription start sites (TSS) and promoter regions. These operations help us identify key regulatory regions of the genome.
3.1 Find transcription start sites (TSS)
Transcription start sites (TSS) are the locations where transcription of a gene begins. Identifying TSS is crucial for understanding gene regulation, as many regulatory elements are located near the TSS.
First, we use the range() method to get the full extent of each transcript, i.e. from the start of the first exon to the end of the last exon. This should give us exactly one range per transcript.
Here, we’re defining promoters as the region 2000 base pairs upstream to 200 base pairs downstream of each TSS. This definition can vary depending on the specific analysis, but this range often captures important regulatory elements.
A common task in genomic analysis is finding overlaps between different genomic features. This helps us understand the relationships between various elements in the genome and can provide insights into gene regulation and function.
4.1 Load ChIP-seq peaks
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a method used to identify binding sites of DNA-associated proteins. The peaks represent regions where a protein of interest is likely bound to the DNA.
For the purpose of this tutorial, let’s download a bed file containing peaks from a ChIP-seq experiment on human B cells to identify EZH2 binding sites (from ENCODE) and catalogued in bedbase.org.
from geniml.bbclient import BBClientbbclient = BBClient(cache_folder="cache", bedbase_api="https://api.bedbase.org")bedfile_id ="be4054acf6e3feeb4dc490e6430e358e"bedfile = bbclient.load_bed(bedfile_id)
Our friends at bedbase (Nathan Sheffield et al.) provide methods to easily coerce these objects to GenomicRanges. Again, we’re focusing on chromosome 22 for this example to keep the dataset manageable.
Here, we are identifying ChIP-seq peaks that overlap with TSS. This analysis can help us understand if the protein of interest tends to bind near the start of genes, which could suggest a role in transcription initiation.
overlaps = peaks_chr22.find_overlaps(tss)print("Peak indices that overlap with TSS between 30-40:")print(overlaps[30:40])
Peak indices that overlap with TSS between 30-40:
[[], [], [], [], [], [], [], [], [1157], []]
Note
find_overlaps returns a list with the same length as TSS, indicating which indices from peaks overlap with each of the TSS. Ideally, we would want to return a Hits object similar to the Bioconductor implementation.
TODO: Future plans to convert this into a Hits object.
Let’s identify the peaks that overlap with any TSS.
This operation finds ChIP-seq peaks that overlap with any of our defined promoter regions. If a significant number of peaks fall within promoters, it might suggest that the protein plays a role in gene regulation.
# Find peaks overlapping with any exonpeaks_by_exons = peaks_chr22.subset_by_overlaps(all_exons)print("Peaks overlapping with exons:")print(peaks_by_exons)# Calculate the percentage of peaks that overlap with exonspercent_overlapping = (len(peaks_by_exons) /len(peaks_chr22)) *100print(f"Percentage of peaks overlapping with exons: {percent_overlapping:.2f}%")
Let’s explore some more complex operations that are often used in genomic analyses.
5.1 Compare exonic vs. intronic binding
Let’s first identify intronic regions. There are two ways to find introns:
Find introns for each gene, i.e. regions within each gene’s transcript body that do not overlap any of that gene’s exons (using psetdiff in R/Bioconductor).
Find intronic regions globally, i.e. regions that do not overlap with any exon (using subtract) for any gene. To find these positions, we ignore strand information, because there could be genes that overlap on different strands.
We will find intronic regions globally (2) for our tutorial today.
Let’s first get all transcript ranges, following the steps in Section 3.1:
# Get the full extent of each transcripttx_ranges = by_tx.range().as_genomic_ranges()
We now subtract any exons that overlaps within each transcript by ignoring the strand. The result is a GenomicRangesList containing intronic regions for each transcript. We simplify this by coercing this into a GenomicRanges object.
# Find peaks overlapping with intronspeaks_by_introns = peaks_chr22.subset_by_overlaps(introns)print("Peaks overlapping with introns:")print(peaks_by_introns)# Calculate percentagespercent_exonic = (len(peaks_by_exons) /len(peaks_chr22)) *100percent_intronic = (len(peaks_by_introns) /len(peaks_chr22)) *100print(f"Percentage of peaks overlapping with exons: {percent_exonic:.2f}%")print(f"Percentage of peaks overlapping with introns: {percent_intronic:.2f}%")
Peaks overlapping with introns:
GenomicRanges with 1000 ranges and 0 metadata columns
seqnames ranges strand
<str> <IRanges> <ndarray[int8]>
[0] chr22 19766788 - 19767078 *
[1] chr22 17369888 - 17370178 *
[2] chr22 19756445 - 19756735 *
... ... ...
[997] chr22 36626768 - 36627058 *
[998] chr22 48833277 - 48833567 *
[999] chr22 49663362 - 49663652 *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX
Percentage of peaks overlapping with exons: 19.36%
Percentage of peaks overlapping with introns: 69.40%
Note
These percentages may or may not add up to 100%. Some peaks may overlap both introns and exons depending on how wide they are. In our case its because of ignoring strands and finding global intronic regions. Ideally, you may want to filter the peaks based on preference as you annotate them with TSS, promoters, etc.
This comparison can help determine if the protein of interest shows a preference for binding in exonic or intronic regions, which could suggest different functional roles (e.g., splicing regulation for exonic binding vs. potential enhancer activity for intronic binding).
5.2 Find overlaps with the first exon
Note
The rationale for this analysis may vary, but we are mostly showcasing complex genomic operations that can be performed.
Let’s first put together a GenomicRanges object containing the first exon for each transcript.
Then we combine all the individual genomic elements. The biocutils package provides utilities for convenient aspects of R that aren’t provided by base Python and generics. One of these generics is the combine_sequences operation that merges or concatenates 1-dimensional Bioconductor classes.
from biocutils import combine_sequencesfirst_exons = combine_sequences(*all_first)print(first_exons)
Resizing and shifting genomic ranges can be useful in various contexts. For example:
Narrowing peaks might help focus on the center of ChIP-seq binding sites.
Shifting ranges can be used to look at regions adjacent to your features of interest. e.g., defining the predicted CRISPR cleavage site based on the position of the CRISPR gRNA sequence.
narrow_peaks = peaks_chr22.narrow(start=10, width=100)shifted_peaks = narrow_peaks.shift(10)print("Narrowed and Shifted Peaks:")print(shifted_peaks)
These operations demonstrate the flexibility of genomic range manipulations, which can be useful for fine-tuning analyses or testing hypotheses about the spatial relationships between genomic features.
6. Exercises
Split the input genome reference by gene_name, e.g. a field that contains gene symbols.
Calculate the average width of the ChIP-seq peaks on chromosome 22.
Compute the percentage of promoter regions that have at least one overlapping ChIP-seq peak.
Conclusion
In this tutorial, we’ve explored how to use BiocPy’s genomic ranges functionality to perform various genomic analyses. These tools and techniques provide a powerful way to work with genomic interval data in Python, mirroring the capabilities from Bioconductor. They form the foundation for many more complex genomic analyses and can be applied to a wide range of biological questions.
Note
Refer to the BiocPy documentation for more detailed information on these packages and their functionalities.