Tutorial 1: Perform range-based analyses using GenomicRanges

Genomic range operations are fundamental to many bioinformatics analyses. They allow us to work with intervals of genomic coordinates, which is crucial for understanding the relationships between different genomic features such as genes, regulatory elements, and experimental data like ChIP-seq peaks. In this tutorial, we’ll explore how to work with genomic interval data using BiocPy’s GenomicRanges package, which provides a Python implementation of the R/Bioconductor GenomicRanges package.

Outline

In this workshop, we’ll walk through several aspects of working with genomic ranges in Python:

  1. Load genomic data: we’ll start by reading in genomic data from RDS files, including exon positions grouped by transcripts.
  2. Basic genomic operations: we’ll cover fundamental operations like finding transcription start sites (TSS) and promoter regions.
  3. Overlap analysis: we’ll learn how to find overlaps between different genomic features, a common task in many analyses.
  4. Advanced operations: we’ll explore more complex operations like finding peaks within specific regions and resizing genomic intervals.

Prerequisites

Before we begin, please ensure that you have the following prerequisites installed:

  • Python 3.8 or later with dependencies listed here.
  • R 4.4.0 and Bioconductor packages listed here.

Install the Python packages from PyPI:

pip install -U biocutils genomicranges rds2py numpy pandas geniml

Install the R packages using BiocManager:

BiocManager::install(c("AnnotationHub"))

1. Download reference annotation

Insead of reinventing the wheel to access references in Python, we’ll use existing available Bioconductor resources that provide access to genome annotations. AnnotationHub is a great resource providing access to genomic reference annotations. It’s super convenient to search for a reference from AnnotationHub and download the genome of interest.

Let’s search the latest ensembl database for the human reference genome using Bioconductor’s AnnotationHub.

suppressMessages(library(AnnotationHub))

ah <- AnnotationHub()
ensdb <- query(ah, "Ensembl 112 EnsDb for Homo sapiens")[[1]]

We will then extract the exon positions and group them by transcript. Additionally, we provide a list of column names we would like to be available in mcols for our analysis. For the purpose of this tutorial, we’ll limit ourselves to the exons from chromosome 22.

exons_by_tx <- exonsBy(ensdb, 
    by = "tx", filter = SeqNameFilter(c("22")), 
    columns= c("exon_id", "tx_name", "tx_id", "gene_name", "gene_id"))

Finally, save the object as an RDS file.

saveRDS(exons_by_tx, "hg38_exons_by_tx.rds")

2. Load genome annotation in Python

We now read the above RDS annotation object into our Python session using the rds2py Python package. This is a two step process.

The first step represents the data stored in the RDS file as a python dictionary

from rds2py import read_rds
hg38_robject = read_rds("./hg38_exons_by_tx.rds")

# Only printing the keys
print("Keys of the object:", hg38_robject.keys())
print("Class name of the object:", hg38_robject["class_name"], "from package:", hg38_robject["package_name"])
Keys of the object: dict_keys(['data', 'package_name', 'class_name', 'attributes'])
Class name of the object: CompressedGRangesList from package: GenomicRanges

This dictionary object (hg38_robject) contains 4 keys:

  1. class_name: class name of the object
  2. package_name: name of the package containing the class definition
  3. data: contains the value if the object is a scalar
  4. attributes: if the object is an S4 class, contains various attributes and their values

This dictionary can then be coerced into a Python GenomicRangesList class.

from rds2py.granges import as_granges_list
by_tx = as_granges_list(hg38_robject)

print("Exons by transcript:")
print(by_tx)
Exons by transcript:
GenomicRangesList with 5387 ranges and 0 metadata columns
 
Name: ENST00000006251 
GenomicRanges with 9 ranges and 6 metadata columns
    seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank
       <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>
[0]    chr22 44677057 - 44677241               + | ENSE00001838743 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         1
[1]    chr22 44702492 - 44702609               + | ENSE00003647870 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         2
[2]    chr22 44714591 - 44714672               + | ENSE00003614159 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         3
[3]    chr22 44725244 - 44725293               + | ENSE00003568825 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         4
[4]    chr22 44726577 - 44726635               + | ENSE00003465556 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         5
[5]    chr22 44731730 - 44731822               + | ENSE00003642381 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         6
[6]    chr22 44732251 - 44732392               + | ENSE00003658491 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         7
[7]    chr22 44735027 - 44735163               + | ENSE00003692865 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         8
[8]    chr22 44736772 - 44737681               + | ENSE00001846334 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         9
------
seqinfo(1 sequences): chr22
 
Name: ENST00000008876 
GenomicRanges with 10 ranges and 6 metadata columns
    seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank
       <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>
[0]    chr22 50603133 - 50603499               + | ENSE00003608148 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         1
[1]    chr22 50603626 - 50603720               + | ENSE00003768317 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         2
[2]    chr22 50603841 - 50605065               + | ENSE00003772801 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         3
[3]    chr22 50605368 - 50605444               + | ENSE00003773674 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         4
[4]    chr22 50605562 - 50605735               + | ENSE00003765641 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         5
[5]    chr22 50605825 - 50605935               + | ENSE00003763622 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         6
[6]    chr22 50606658 - 50606766               + | ENSE00003773228 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         7
[7]    chr22 50606921 - 50606992               + | ENSE00003769486 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         8
[8]    chr22 50610212 - 50610311               + | ENSE00003772161 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         9
[9]    chr22 50610707 - 50613982               + | ENSE00003731955 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735        10
------
seqinfo(1 sequences): chr22
 
Name: ENST00000043402 
GenomicRanges with 2 ranges and 6 metadata columns
    seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank
       <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>
[0]    chr22 20268071 - 20268319               - | ENSE00001358408 ENST00000043402 ENST00000043402     RTN4R ENSG00000040608         1
[1]    chr22 20241415 - 20243111               - | ENSE00001557601 ENST00000043402 ENST00000043402     RTN4R ENSG00000040608         2
------
seqinfo(1 sequences): chr22
 
Name:  
GenomicRanges with 15 ranges and 6 metadata columns
     seqnames              ranges          strand        exon_id   tx_name     tx_id gene_name gene_id exon_rank
        <str>           <IRanges> <ndarray[int8]>         <list>    <list>    <list>    <list>  <list>    <list>
 [0]    chr22 33919995 - 33920477               - |  LRG_856t1e1 LRG_856t2 LRG_856t2    LARGE1 LRG_856         1
 [1]    chr22 33761371 - 33761559               - |  LRG_856t1e3 LRG_856t2 LRG_856t2    LARGE1 LRG_856         2
 [2]    chr22 33650367 - 33650669               - |  LRG_856t1e4 LRG_856t2 LRG_856t2    LARGE1 LRG_856         3
          ...                 ...             ... |          ...       ...       ...       ...     ...       ...
[12]    chr22 33283202 - 33283349               - | LRG_856t1e14 LRG_856t2 LRG_856t2    LARGE1 LRG_856        13
[13]    chr22 33277060 - 33277256               - | LRG_856t1e15 LRG_856t2 LRG_856t2    LARGE1 LRG_856        14
[14]    chr22 33272509 - 33274625               - | LRG_856t1e16 LRG_856t2 LRG_856t2    LARGE1 LRG_856        15
------
seqinfo(1 sequences): chr22
 
Name: LRG_856t2 
GenomicRanges with 7 ranges and 6 metadata columns
    seqnames              ranges          strand      exon_id  tx_name    tx_id gene_name gene_id exon_rank
       <str>           <IRanges> <ndarray[int8]>       <list>   <list>   <list>    <list>  <list>    <list>
[0]    chr22 37244114 - 37244266               - | LRG_97t1e1 LRG_97t1 LRG_97t1      RAC2  LRG_97         1
[1]    chr22 37241587 - 37241659               - | LRG_97t1e2 LRG_97t1 LRG_97t1      RAC2  LRG_97         2
[2]    chr22 37232801 - 37232919               - | LRG_97t1e3 LRG_97t1 LRG_97t1      RAC2  LRG_97         3
[3]    chr22 37231932 - 37231995               - | LRG_97t1e4 LRG_97t1 LRG_97t1      RAC2  LRG_97         4
[4]    chr22 37231231 - 37231391               - | LRG_97t1e5 LRG_97t1 LRG_97t1      RAC2  LRG_97         5
[5]    chr22 37226671 - 37226804               - | LRG_97t1e6 LRG_97t1 LRG_97t1      RAC2  LRG_97         6
[6]    chr22 37225270 - 37226040               - | LRG_97t1e7 LRG_97t1 LRG_97t1      RAC2  LRG_97         7
------
seqinfo(1 sequences): chr22
 
Name: LRG_97t1 
GenomicRanges with 21 ranges and 6 metadata columns
     seqnames              ranges          strand        exon_id   tx_name     tx_id gene_name gene_id exon_rank
        <str>           <IRanges> <ndarray[int8]>         <list>    <list>    <list>    <list>  <list>    <list>
 [0]    chr22 20982297 - 20982572               + |  LRG_989t1e1 LRG_989t1 LRG_989t1     LZTR1 LRG_989         1
 [1]    chr22 20983027 - 20983090               + |  LRG_989t1e2 LRG_989t1 LRG_989t1     LZTR1 LRG_989         2
 [2]    chr22 20985841 - 20985898               + |  LRG_989t1e3 LRG_989t1 LRG_989t1     LZTR1 LRG_989         3
          ...                 ...             ... |          ...       ...       ...       ...     ...       ...
[18]    chr22 20996696 - 20996802               + | LRG_989t1e19 LRG_989t1 LRG_989t1     LZTR1 LRG_989        19
[19]    chr22 20996886 - 20996967               + | LRG_989t1e20 LRG_989t1 LRG_989t1     LZTR1 LRG_989        20
[20]    chr22 20997232 - 20999033               + | LRG_989t1e21 LRG_989t1 LRG_989t1     LZTR1 LRG_989        21
------
seqinfo(1 sequences): chr22
 
Note

Currently this is a two step process, we are working on simplifying this to a single step for supported Bioconductor classes.

3. Define promoters and TSS

Now, let’s perform some basic operations like finding transcription start sites (TSS) and promoter regions. These operations help us identify key regulatory regions of the genome.

3.1 Find transcription start sites (TSS)

Transcription start sites (TSS) are the locations where transcription of a gene begins. Identifying TSS is crucial for understanding gene regulation, as many regulatory elements are located near the TSS.

First, we use the range() method to get the full extent of each transcript, i.e. from the start of the first exon to the end of the last exon. This should give us exactly one range per transcript.

ranges_by_tx = by_tx.range()

print("Transcript ranges:")
print(ranges_by_tx)
Transcript ranges:
GenomicRangesList with 5387 ranges and 0 metadata columns
 
Name: ENST00000006251 
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 44677057 - 44737681               +
------
seqinfo(1 sequences): chr22
 
Name: ENST00000008876 
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 50603133 - 50613982               +
------
seqinfo(1 sequences): chr22
 
Name: ENST00000043402 
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 20241415 - 20268319               -
------
seqinfo(1 sequences): chr22
 
Name:  
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 33272509 - 33920477               -
------
seqinfo(1 sequences): chr22
 
Name: LRG_856t2 
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 37225270 - 37244266               -
------
seqinfo(1 sequences): chr22
 
Name: LRG_97t1 
GenomicRanges with 1 range and 0 metadata columns
    seqnames              ranges          strand
       <str>           <IRanges> <ndarray[int8]>
[0]    chr22 20982297 - 20999033               +
------
seqinfo(1 sequences): chr22
 

Since the range() gives us exactly one range per transcript, so we can simplify our list to a GenomicRanges object. This is similar to unlist in R.

gr_by_tx = ranges_by_tx.as_genomic_ranges()

print("as GenomicRanges:")
print(gr_by_tx)
as GenomicRanges:
GenomicRanges with 5387 ranges and 0 metadata columns
                seqnames              ranges          strand
                   <str>           <IRanges> <ndarray[int8]>
ENST00000006251    chr22 44677057 - 44737681               +
ENST00000008876    chr22 50603133 - 50613982               +
ENST00000043402    chr22 20241415 - 20268319               -
                     ...                 ...             ...
      LRG_856t2    chr22 33272509 - 33920477               -
       LRG_97t1    chr22 37225270 - 37244266               -
      LRG_989t1    chr22 20982297 - 20999033               +
------
seqinfo(1 sequences): chr22

Then we resize to a width of 1 base pair at the start of each range to pinpoint the TSS.

tss = gr_by_tx.resize(width=1, fix="start")

print("Transcript Start Sites:")
print(tss)
Transcript Start Sites:
GenomicRanges with 5387 ranges and 0 metadata columns
                seqnames              ranges          strand
                   <str>           <IRanges> <ndarray[int8]>
ENST00000006251    chr22 44677057 - 44677058               +
ENST00000008876    chr22 50603133 - 50603134               +
ENST00000043402    chr22 20268318 - 20268319               -
                     ...                 ...             ...
      LRG_856t2    chr22 33920476 - 33920477               -
       LRG_97t1    chr22 37244265 - 37244266               -
      LRG_989t1    chr22 20982297 - 20982298               +
------
seqinfo(1 sequences): chr22

3.2 Define promoter regions

Here, we’re defining promoters as the region 2000 base pairs upstream to 200 base pairs downstream of each TSS. This definition can vary depending on the specific analysis, but this range often captures important regulatory elements.

promoters = tss.promoters(upstream=2000, downstream=200)

print("Promoter Regions:")
print(promoters)
Promoter Regions:
GenomicRanges with 5387 ranges and 0 metadata columns
                seqnames              ranges          strand
                   <str>           <IRanges> <ndarray[int8]>
ENST00000006251    chr22 44675057 - 44677257               +
ENST00000008876    chr22 50601133 - 50603333               +
ENST00000043402    chr22 20268119 - 20270319               -
                     ...                 ...             ...
      LRG_856t2    chr22 33920277 - 33922477               -
       LRG_97t1    chr22 37244066 - 37246266               -
      LRG_989t1    chr22 20980297 - 20982497               +
------
seqinfo(1 sequences): chr22

4. Overlap with ChIP-seq peaks

A common task in genomic analysis is finding overlaps between different genomic features. This helps us understand the relationships between various elements in the genome and can provide insights into gene regulation and function.

4.1 Load ChIP-seq peaks

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a method used to identify binding sites of DNA-associated proteins. The peaks represent regions where a protein of interest is likely bound to the DNA.

For the purpose of this tutorial, let’s download a bed file containing peaks from a ChIP-seq experiment on human B cells to identify EZH2 binding sites (from ENCODE) and catalogued in bedbase.org.

from geniml.bbclient import BBClient

bbclient = BBClient(cache_folder="cache", bedbase_api="https://api.bedbase.org")
bedfile_id = "be4054acf6e3feeb4dc490e6430e358e" 
bedfile = bbclient.load_bed(bedfile_id)

Our friends at bedbase (Nathan Sheffield et al.) provide methods to easily coerce these objects to GenomicRanges. Again, we’re focusing on chromosome 22 for this example to keep the dataset manageable.

peaks = bedfile.to_granges()

filter_chr22 = [x == "chr22" for x in peaks.get_seqnames()]
peaks_chr22 = peaks[filter_chr22]

print(peaks_chr22)
GenomicRanges with 1441 ranges and 0 metadata columns
       seqnames              ranges          strand
          <str>           <IRanges> <ndarray[int8]>
   [0]    chr22 19766788 - 19767078               *
   [1]    chr22 17369888 - 17370178               *
   [2]    chr22 19756445 - 19756735               *
            ...                 ...             ...
[1438]    chr22 27212058 - 27212348               *
[1439]    chr22 49201359 - 49201649               *
[1440]    chr22 49663362 - 49663652               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

4.2 Find overlaps with TSS

Here, we are identifying ChIP-seq peaks that overlap with TSS. This analysis can help us understand if the protein of interest tends to bind near the start of genes, which could suggest a role in transcription initiation.

overlaps = peaks_chr22.find_overlaps(tss)

print("Peak indices that overlap with TSS between 30-40:")
print(overlaps[30:40])
Peak indices that overlap with TSS between 30-40:
[[], [], [], [], [], [], [], [], [1157], []]
Note

find_overlaps returns a list with the same length as TSS, indicating which indices from peaks overlap with each of the TSS. Ideally, we would want to return a Hits object similar to the Bioconductor implementation.

TODO: Future plans to convert this into a Hits object.

Let’s identify the peaks that overlap with any TSS.

import itertools

all_indices = list(set(itertools.chain.from_iterable(overlaps)))
peaks_by_tss = peaks_chr22[all_indices]
print(peaks_by_tss)
GenomicRanges with 75 ranges and 0 metadata columns
     seqnames              ranges          strand
        <str>           <IRanges> <ndarray[int8]>
 [0]    chr22 19756445 - 19756735               *
 [1]    chr22 36816145 - 36816435               *
 [2]    chr22 38467935 - 38468225               *
          ...                 ...             ...
[72]    chr22 50270553 - 50270843               *
[73]    chr22 19131257 - 19131547               *
[74]    chr22 19014170 - 19014460               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

Alternatively, we can use subset_by_overlaps method to more conveniently subset the peaks that overlap with any TSS:

peaks_by_tss2 = peaks_chr22.subset_by_overlaps(tss)
print(peaks_by_tss2)
GenomicRanges with 75 ranges and 0 metadata columns
     seqnames              ranges          strand
        <str>           <IRanges> <ndarray[int8]>
 [0]    chr22 19756445 - 19756735               *
 [1]    chr22 36816145 - 36816435               *
 [2]    chr22 38467935 - 38468225               *
          ...                 ...             ...
[72]    chr22 50270553 - 50270843               *
[73]    chr22 19131257 - 19131547               *
[74]    chr22 19014170 - 19014460               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

4.3 Find overlaps with promoters

This operation finds ChIP-seq peaks that overlap with any of our defined promoter regions. If a significant number of peaks fall within promoters, it might suggest that the protein plays a role in gene regulation.

peaks_by_promoters = peaks_chr22.subset_by_overlaps(promoters)

print("Peaks Overlapping with Promoters:")
print(peaks_by_promoters)
Peaks Overlapping with Promoters:
GenomicRanges with 344 ranges and 0 metadata columns
      seqnames              ranges          strand
         <str>           <IRanges> <ndarray[int8]>
  [0]    chr22 19756445 - 19756735               *
  [1]    chr22 37427967 - 37428257               *
  [2]    chr22 19169462 - 19169752               *
           ...                 ...             ...
[341]    chr22 42368884 - 42369174               *
[342]    chr22 21630789 - 21631079               *
[343]    chr22 17368148 - 17368438               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

4.4 Find overlaps with exons

Let’s find overlaps with any exon. We unlist our GenomicRangesList object to get all exon positions.

# Combine all exons into a single GenomicRanges object
all_exons = by_tx.as_granges()

print("All exons:")
print(all_exons)
All exons:
GenomicRanges with 34967 ranges and 6 metadata columns
                seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank
                   <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>
ENST00000006251    chr22 44677057 - 44677241               + | ENSE00001838743 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         1
ENST00000006251    chr22 44702492 - 44702609               + | ENSE00003647870 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         2
ENST00000006251    chr22 44714591 - 44714672               + | ENSE00003614159 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         3
                     ...                 ...             ... |             ...             ...             ...       ...             ...       ...
      LRG_989t1    chr22 20996696 - 20996802               + |    LRG_989t1e19       LRG_989t1       LRG_989t1     LZTR1         LRG_989        19
      LRG_989t1    chr22 20996886 - 20996967               + |    LRG_989t1e20       LRG_989t1       LRG_989t1     LZTR1         LRG_989        20
      LRG_989t1    chr22 20997232 - 20999033               + |    LRG_989t1e21       LRG_989t1       LRG_989t1     LZTR1         LRG_989        21
------
seqinfo(1 sequences): chr22

We can then find peaks that overlap with any of these regions:

# Find peaks overlapping with any exon
peaks_by_exons = peaks_chr22.subset_by_overlaps(all_exons)

print("Peaks overlapping with exons:")
print(peaks_by_exons)

# Calculate the percentage of peaks that overlap with exons
percent_overlapping = (len(peaks_by_exons) / len(peaks_chr22)) * 100

print(f"Percentage of peaks overlapping with exons: {percent_overlapping:.2f}%")
Peaks overlapping with exons:
GenomicRanges with 279 ranges and 0 metadata columns
      seqnames              ranges          strand
         <str>           <IRanges> <ndarray[int8]>
  [0]    chr22 19766788 - 19767078               *
  [1]    chr22 17369888 - 17370178               *
  [2]    chr22 19756445 - 19756735               *
           ...                 ...             ...
[276]    chr22 29307104 - 29307394               *
[277]    chr22 35552420 - 35552710               *
[278]    chr22 37931897 - 37932187               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX
Percentage of peaks overlapping with exons: 19.36%

5. Advanced Operations

Let’s explore some more complex operations that are often used in genomic analyses.

5.1 Compare exonic vs. intronic binding

Let’s first identify intronic regions. There are two ways to find introns:

  1. Find introns for each gene, i.e. regions within each gene’s transcript body that do not overlap any of that gene’s exons (using psetdiff in R/Bioconductor).
  2. Find intronic regions globally, i.e. regions that do not overlap with any exon (using subtract) for any gene. To find these positions, we ignore strand information, because there could be genes that overlap on different strands.

We will find intronic regions globally (2) for our tutorial today.

Let’s first get all transcript ranges, following the steps in Section 3.1:

# Get the full extent of each transcript
tx_ranges = by_tx.range().as_genomic_ranges()

We now subtract any exons that overlaps within each transcript by ignoring the strand. The result is a GenomicRangesList containing intronic regions for each transcript. We simplify this by coercing this into a GenomicRanges object.

introns = tx_ranges.subtract(all_exons, ignore_strand=True).as_granges()

print("Intron regions:")
print(introns)
Intron regions:
GenomicRanges with 5403 ranges and 0 metadata columns
                seqnames              ranges          strand
                   <str>           <IRanges> <ndarray[int8]>
ENST00000006251    chr22 44677057 - 44737681               +
ENST00000008876    chr22 50603133 - 50613982               +
ENST00000043402    chr22 20241415 - 20268319               -
                     ...                 ...             ...
      LRG_856t2    chr22 33272509 - 33920477               -
       LRG_97t1    chr22 37225270 - 37244266               -
      LRG_989t1    chr22 20982297 - 20999033               +
------
seqinfo(1 sequences): chr22

We can compare the proportion of peaks overlapping with exons to those overlapping with introns:

# Find peaks overlapping with introns
peaks_by_introns = peaks_chr22.subset_by_overlaps(introns)

print("Peaks overlapping with introns:")
print(peaks_by_introns)

# Calculate percentages
percent_exonic = (len(peaks_by_exons) / len(peaks_chr22)) * 100
percent_intronic = (len(peaks_by_introns) / len(peaks_chr22)) * 100

print(f"Percentage of peaks overlapping with exons: {percent_exonic:.2f}%")
print(f"Percentage of peaks overlapping with introns: {percent_intronic:.2f}%")
Peaks overlapping with introns:
GenomicRanges with 1000 ranges and 0 metadata columns
      seqnames              ranges          strand
         <str>           <IRanges> <ndarray[int8]>
  [0]    chr22 19766788 - 19767078               *
  [1]    chr22 17369888 - 17370178               *
  [2]    chr22 19756445 - 19756735               *
           ...                 ...             ...
[997]    chr22 36626768 - 36627058               *
[998]    chr22 48833277 - 48833567               *
[999]    chr22 49663362 - 49663652               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX
Percentage of peaks overlapping with exons: 19.36%
Percentage of peaks overlapping with introns: 69.40%
Note

These percentages may or may not add up to 100%. Some peaks may overlap both introns and exons depending on how wide they are. In our case its because of ignoring strands and finding global intronic regions. Ideally, you may want to filter the peaks based on preference as you annotate them with TSS, promoters, etc.

This comparison can help determine if the protein of interest shows a preference for binding in exonic or intronic regions, which could suggest different functional roles (e.g., splicing regulation for exonic binding vs. potential enhancer activity for intronic binding).

5.2 Find overlaps with the first exon

Note

The rationale for this analysis may vary, but we are mostly showcasing complex genomic operations that can be performed.

Let’s first put together a GenomicRanges object containing the first exon for each transcript.

all_first = []
for txid, grl in by_tx:
    strand = grl.get_strand(as_type = "list")[0]
    if strand == "-":
        all_first.append(grl.sort()[-1])
    else:
        all_first.append(grl.sort()[0])

print(all_first[:3])
[GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([44677057], dtype=int32), width=array([184], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00001838743'], 'tx_name': ['ENST00000006251'], 'tx_id': ['ENST00000006251'], 'gene_name': ['PRR5'], 'gene_id': ['ENSG00000186654'], 'exon_rank': [1]}, number_of_rows=1, row_names=['0'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([50603133], dtype=int32), width=array([366], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00003608148'], 'tx_name': ['ENST00000008876'], 'tx_id': ['ENST00000008876'], 'gene_name': ['MAPK8IP2'], 'gene_id': ['ENSG00000008735'], 'exon_rank': [1]}, number_of_rows=1, row_names=['9'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([20268071], dtype=int32), width=array([248], dtype=int32)), strand=[-1], mcols=BiocFrame(data={'exon_id': ['ENSE00001358408'], 'tx_name': ['ENST00000043402'], 'tx_id': ['ENST00000043402'], 'gene_name': ['RTN4R'], 'gene_id': ['ENSG00000040608'], 'exon_rank': [1]}, number_of_rows=1, row_names=['19'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38']))]

Then we combine all the individual genomic elements. The biocutils package provides utilities for convenient aspects of R that aren’t provided by base Python and generics. One of these generics is the combine_sequences operation that merges or concatenates 1-dimensional Bioconductor classes.

from biocutils import combine_sequences
first_exons = combine_sequences(*all_first)

print(first_exons)
GenomicRanges with 5387 ranges and 6 metadata columns
       seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank
          <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>
   [0]    chr22 44677057 - 44677241               + | ENSE00001838743 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         1
   [1]    chr22 50603133 - 50603499               + | ENSE00003608148 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         1
   [2]    chr22 20268071 - 20268319               - | ENSE00001358408 ENST00000043402 ENST00000043402     RTN4R ENSG00000040608         1
            ...                 ...             ... |             ...             ...             ...       ...             ...       ...
[5384]    chr22 33919995 - 33920477               - |     LRG_856t1e1       LRG_856t2       LRG_856t2    LARGE1         LRG_856         1
[5385]    chr22 37244114 - 37244266               - |      LRG_97t1e1        LRG_97t1        LRG_97t1      RAC2          LRG_97         1
[5386]    chr22 20982297 - 20982572               + |     LRG_989t1e1       LRG_989t1       LRG_989t1     LZTR1         LRG_989         1
------
seqinfo(1 sequences): chr22

We can now subset peaks that overlap with the first exon:

peaks_with_first_exons = peaks_chr22.subset_by_overlaps(first_exons)
print(peaks_with_first_exons)
GenomicRanges with 153 ranges and 0 metadata columns
      seqnames              ranges          strand
         <str>           <IRanges> <ndarray[int8]>
  [0]    chr22 17369888 - 17370178               *
  [1]    chr22 19756445 - 19756735               *
  [2]    chr22 45975507 - 45975797               *
           ...                 ...             ...
[150]    chr22 49500975 - 49501265               *
[151]    chr22 19131257 - 19131547               *
[152]    chr22 29307104 - 29307394               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

5.3 Resize and shift peaks

Resizing and shifting genomic ranges can be useful in various contexts. For example:

  • Narrowing peaks might help focus on the center of ChIP-seq binding sites.
  • Shifting ranges can be used to look at regions adjacent to your features of interest. e.g., defining the predicted CRISPR cleavage site based on the position of the CRISPR gRNA sequence.
narrow_peaks = peaks_chr22.narrow(start=10, width=100)
shifted_peaks = narrow_peaks.shift(10)

print("Narrowed and Shifted Peaks:")
print(shifted_peaks)
Narrowed and Shifted Peaks:
GenomicRanges with 1441 ranges and 0 metadata columns
       seqnames              ranges          strand
          <str>           <IRanges> <ndarray[int8]>
   [0]    chr22 19766807 - 19766907               *
   [1]    chr22 17369907 - 17370007               *
   [2]    chr22 19756464 - 19756564               *
            ...                 ...             ...
[1438]    chr22 27212077 - 27212177               *
[1439]    chr22 49201378 - 49201478               *
[1440]    chr22 49663381 - 49663481               *
------
seqinfo(46 sequences): chr1 chr10 chr11 ... chrUn_KI270750v1 chrUn_KI270752v1 chrX

These operations demonstrate the flexibility of genomic range manipulations, which can be useful for fine-tuning analyses or testing hypotheses about the spatial relationships between genomic features.

6. Exercises

  1. Split the input genome reference by gene_name, e.g. a field that contains gene symbols.
  2. Calculate the average width of the ChIP-seq peaks on chromosome 22.
  3. Compute the percentage of promoter regions that have at least one overlapping ChIP-seq peak.

Conclusion

In this tutorial, we’ve explored how to use BiocPy’s genomic ranges functionality to perform various genomic analyses. These tools and techniques provide a powerful way to work with genomic interval data in Python, mirroring the capabilities from Bioconductor. They form the foundation for many more complex genomic analyses and can be applied to a wide range of biological questions.

Note

Refer to the BiocPy documentation for more detailed information on these packages and their functionalities.

Back to top