orgdb¶

OrgDb provides an interface to access and query Organism Database (OrgDb) SQLite files in Python. It mirrors functionality from the R/Bioconductor AnnotationDbi package, enabling seamless integration of organism-wide gene annotation into Python workflows.

[!NOTE]

If you are looking to access TxDb databases, check out the txdb package.

Install¶

To get started, install the package from PyPI:

pip install orgdb

Usage¶

Using OrgDbRegistry¶

The registry downloads AnnotationHub's metadata SQLite file and filters it for all available OrgDb databases, so you can fetch standard organism databases directly through the registry.

from orgdb import OrgDbRegistry

# Initialize registry and list available organisms
registry = OrgDbRegistry()
available = registry.list_orgdb()
print(available[:5])
# ["org.'Caballeronia_concitans'.eg", "org.'Chlorella_vulgaris'_C-169.eg", ...]

# Load the database for Homo sapiens (downloads and caches automatically)
db = registry.load_db("org.Hs.eg.db")
print(db.species)
# 'Homo sapiens'
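To make the filtering step concrete, here is a minimal sketch of selecting OrgDb entries from a metadata SQLite file. The `resources` table, its `title` and `rdataclass` columns, and the `list_orgdb_titles` helper are assumptions made for illustration; the actual AnnotationHub schema and the registry's internals may look quite different.

```python
import sqlite3

# Hypothetical, simplified stand-in for a metadata file; the real
# AnnotationHub schema is not described in this document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (title TEXT, rdataclass TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [
        ("org.Hs.eg.db", "OrgDb"),
        ("org.Mm.eg.db", "OrgDb"),
        ("TxDb.Hsapiens.UCSC.hg38.knownGene", "TxDb"),
    ],
)


def list_orgdb_titles(connection):
    """Return the titles of all rows whose class is 'OrgDb', sorted by title."""
    rows = connection.execute(
        "SELECT title FROM resources WHERE rdataclass = ? ORDER BY title",
        ("OrgDb",),
    )
    return [title for (title,) in rows]


orgdb_titles = list_orgdb_titles(conn)
print(orgdb_titles)
# ['org.Hs.eg.db', 'org.Mm.eg.db']
```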

Inspecting metadata¶

Explore the available columns and key types in the database.

# List available columns (and keytypes)
cols = db.columns()
print(cols[:5])
# ['ENTREZID', 'PFAM', 'IPI', 'PROSITE', 'ACCNUM']

# Check available keys for a specific keytype
entrez_ids = db.keys("ENTREZID")
print(entrez_ids[:5])
# ['1', '2', '9', '10', '11']

Querying Annotations (using `select`)¶

The select method retrieves data as a BiocFrame. It automatically handles complex joins across tables.

# Retrieve Gene Symbols and Gene Names for a list of Entrez IDs
res = db.select(
    keys=["1", "10"],
    columns=["SYMBOL", "GENENAME"],
    keytype="ENTREZID"
)

print(res)
# BiocFrame with 2 rows and 3 columns
#                    GENENAME ENTREZID SYMBOL
#                      <list>   <list> <list>
# [0] alpha-1-B glycoprotein        1   A1BG
# [1]  N-acetyltransferase 2       10   NAT2
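As a rough illustration of the kind of join that `select` takes care of, the sketch below runs the equivalent query by hand against an in-memory SQLite database. The `genes` and `gene_info` tables, the `_id` linking column, and the `select_symbols` helper are assumptions modelled loosely on Bioconductor OrgDb files; this is not the package's actual implementation.

```python
import sqlite3

# Tiny in-memory stand-in for an OrgDb file (table layout is an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT);
    CREATE TABLE gene_info (_id INTEGER, gene_name TEXT, symbol TEXT);
    INSERT INTO genes VALUES (1, '1'), (2, '10');
    INSERT INTO gene_info VALUES
        (1, 'alpha-1-B glycoprotein', 'A1BG'),
        (2, 'N-acetyltransferase 2', 'NAT2');
    """
)


def select_symbols(connection, entrez_ids):
    """Join genes to gene_info and return (ENTREZID, SYMBOL, GENENAME) rows."""
    # One "?" placeholder per key keeps the query parameterized.
    placeholders = ", ".join("?" for _ in entrez_ids)
    query = f"""
        SELECT g.gene_id, i.symbol, i.gene_name
        FROM genes AS g
        JOIN gene_info AS i ON i._id = g._id
        WHERE g.gene_id IN ({placeholders})
        ORDER BY g._id
    """
    return connection.execute(query, list(entrez_ids)).fetchall()


rows = select_symbols(conn, ["1", "10"])
for row in rows:
    print(row)
# ('1', 'A1BG', 'alpha-1-B glycoprotein')
# ('10', 'NAT2', 'N-acetyltransferase 2')
```

With `select`, none of this SQL has to be written by hand; you name the columns and the key type, and the result comes back as a BiocFrame.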

[!NOTE]

If you request “GO” columns, the result will automatically expand to include “EVIDENCE” and “ONTOLOGY” columns, matching Bioconductor behavior.

go_res = db.select(
    keys="1",
    columns=["GO"],
    keytype="ENTREZID"
)

print(go_res)
# BiocFrame with 12 rows and 4 columns
#        ONTOLOGY ENTREZID         GO EVIDENCE
#          <list>   <list>     <list>   <list>
#  [0]       BP        1 GO:0002764      IBA
#  [1]       CC        1 GO:0005576      HDA
#  [2]       CC        1 GO:0005576      IDA
#           ...      ...        ...      ...
#  [9]       CC        1 GO:0070062      HDA
# [10]       CC        1 GO:0072562      HDA
# [11]       CC        1 GO:1904813      TAS
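The column expansion described in the note can be pictured with a small standalone function. The `expand_columns` helper and the `COMPANION_COLUMNS` mapping are hypothetical names used only for this sketch; how the package performs the expansion internally is not documented here.

```python
# Columns that bring companion columns along (illustrative mapping only).
COMPANION_COLUMNS = {"GO": ["EVIDENCE", "ONTOLOGY"]}


def expand_columns(columns):
    """Return the requested columns plus any companions, without duplicates."""
    expanded = []
    for col in columns:
        for name in [col, *COMPANION_COLUMNS.get(col, [])]:
            if name not in expanded:
                expanded.append(name)
    return expanded


print(expand_columns(["GO"]))
# ['GO', 'EVIDENCE', 'ONTOLOGY']
print(expand_columns(["SYMBOL", "GO", "EVIDENCE"]))
# ['SYMBOL', 'GO', 'EVIDENCE', 'ONTOLOGY']
```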

Accessing Genomic Ranges¶

Extract gene coordinates as a GenomicRanges object (requires the chromosome_locations table in the OrgDb database).

gr = db.genes()
print(gr)
# GenomicRanges with 52232 ranges and 1 metadata column
#           seqnames                ranges          strand     gene_id
#              <str>             <IRanges> <ndarray[int8]>      <list>
#         1       19 -58345182 - -58336872               * |         1
#         2       12   -9067707 - -9019495               * |         2
#         2       12   -9067707 - -9019185               * |         2
#                ...                   ...             ... |       ...
# 116804918       11 121024101 - 121191490               * | 116804918
# 117779438        1   20154213 - 20160568               * | 117779438
# 118142757        6   42155405 - 42180056               * | 118142757
# ------
# seqinfo(369 sequences): 1 10 10_GL383545v1_alt ... X_KI270913v1_alt Y Y_KZ208924v1_fix
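The negative coordinates in the output above are consistent with the Bioconductor OrgDb convention of storing antisense-strand locations with a leading minus sign. Whether `genes()` is intended to translate them into strand information is not stated here. The following is a minimal sketch of how such signed values could be converted by hand; `to_unsigned_interval` is a hypothetical helper, not part of orgdb, and the coordinates are made up.

```python
def to_unsigned_interval(start, end):
    """Convert signed start/end locations to (start, end, strand).

    Assumes a negative location marks the antisense strand.
    """
    strand = "-" if start < 0 else "+"
    # Drop the sign and order the two positions so that start <= end.
    lo, hi = sorted((abs(start), abs(end)))
    return lo, hi, strand


print(to_unsigned_interval(-1000, -1500))
# (1000, 1500, '-')
print(to_unsigned_interval(200, 450))
# (200, 450, '+')
```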

Note¶

This project has been set up using BiocSetup and PyScaffold.