
orgdb

OrgDb provides an interface to access and query Organism Database (OrgDb) SQLite files in Python. It mirrors functionality from the R/Bioconductor AnnotationDbi package, enabling seamless integration of organism-wide gene annotation into Python workflows.

[!NOTE]

If you are looking to access TxDb databases, check out the txdb package.

Install

To get started, install the package from PyPI:

pip install orgdb

Usage

Using OrgDbRegistry

The registry downloads AnnotationHub's metadata SQLite file and filters it for all available OrgDb databases. You can then fetch standard organism databases through the registry.

from orgdb import OrgDbRegistry

# Initialize registry and list available organisms
registry = OrgDbRegistry()
available = registry.list_orgdb()
print(available[:5])
# ["org.'Caballeronia_concitans'.eg", "org.'Chlorella_vulgaris'_C-169.eg", ...]

# Load the database for Homo sapiens (downloads and caches automatically)
db = registry.load_db("org.Hs.eg.db")
print(db.species)
# 'Homo sapiens'
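
To make the filtering step concrete, here is a minimal sketch of the idea using only the standard library. The `resources` table, its `title` and `rdataclass` columns, and the `list_orgdb_titles` helper are hypothetical stand-ins for illustration; the real AnnotationHub schema and the registry's actual query may differ.

```python
import sqlite3

# Hypothetical stand-in for the AnnotationHub metadata file; the real schema
# and the query used by OrgDbRegistry may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (title TEXT, rdataclass TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [
        ("org.Hs.eg.db.sqlite", "OrgDb"),
        ("TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite", "TxDb"),
        ("org.Mm.eg.db.sqlite", "OrgDb"),
    ],
)


def list_orgdb_titles(connection):
    """Return the titles of rows whose class is 'OrgDb', sorted by title."""
    rows = connection.execute(
        "SELECT title FROM resources WHERE rdataclass = ? ORDER BY title",
        ("OrgDb",),
    )
    return [title for (title,) in rows]


orgdb_titles = list_orgdb_titles(conn)
print(orgdb_titles)
# ['org.Hs.eg.db.sqlite', 'org.Mm.eg.db.sqlite']
```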

Inspecting metadata

Explore the available columns and key types in the database.

# List available columns (and keytypes)
cols = db.columns()
print(cols[:5])
# ['ENTREZID', 'PFAM', 'IPI', 'PROSITE', 'ACCNUM']

# Check available keys for a specific keytype
entrez_ids = db.keys("ENTREZID")
print(entrez_ids[:5])
# ['1', '2', '9', '10', '11']

Querying Annotations (using select)

The select method retrieves data as a BiocFrame. It automatically handles complex joins across tables.

# Retrieve Gene Symbols and Gene Names for a list of Entrez IDs
res = db.select(
    keys=["1", "10"],
    columns=["SYMBOL", "GENENAME"],
    keytype="ENTREZID"
)

print(res)
# BiocFrame with 2 rows and 3 columns
#                   GENENAME ENTREZID SYMBOL
#                     <list>   <list> <list>
# [0] alpha-1-B glycoprotein        1   A1BG
# [1]  N-acetyltransferase 2       10   NAT2
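
For readers curious what "joins across tables" means here, the following is a rough sketch of the kind of query involved, written against an in-memory SQLite database. The `genes` and `gene_info` tables and the `select_symbols` helper are assumptions modelled on the layout of Bioconductor OrgDb files; this is not the package's actual implementation.

```python
import sqlite3

# Tiny in-memory imitation of an OrgDb file. Table and column names are
# assumptions for illustration; real files contain many more tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT);
    CREATE TABLE gene_info (_id INTEGER, gene_name TEXT, symbol TEXT);
    INSERT INTO genes VALUES (1, '1'), (2, '10');
    INSERT INTO gene_info VALUES
        (1, 'alpha-1-B glycoprotein', 'A1BG'),
        (2, 'N-acetyltransferase 2', 'NAT2');
    """
)


def select_symbols(connection, entrez_ids):
    """Join genes to gene_info on the internal _id for the given Entrez IDs."""
    placeholders = ", ".join("?" for _ in entrez_ids)
    query = (
        "SELECT g.gene_id, i.symbol, i.gene_name "
        "FROM genes AS g JOIN gene_info AS i ON g._id = i._id "
        f"WHERE g.gene_id IN ({placeholders}) ORDER BY g._id"
    )
    return connection.execute(query, list(entrez_ids)).fetchall()


rows = select_symbols(conn, ["1", "10"])
for row in rows:
    print(row)
# ('1', 'A1BG', 'alpha-1-B glycoprotein')
# ('10', 'NAT2', 'N-acetyltransferase 2')
```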

[!NOTE]

If you request “GO” columns, the result will automatically expand to include “EVIDENCE” and “ONTOLOGY” columns, matching Bioconductor behavior.

go_res = db.select(
    keys="1",
    columns=["GO"],
    keytype="ENTREZID"
)

print(go_res)
# BiocFrame with 12 rows and 4 columns
#       ONTOLOGY ENTREZID         GO EVIDENCE
#         <list>   <list>     <list>   <list>
#  [0]       BP        1 GO:0002764      IBA
#  [1]       CC        1 GO:0005576      HDA
#  [2]       CC        1 GO:0005576      IDA
#           ...      ...        ...      ...
#  [9]       CC        1 GO:0070062      HDA
# [10]       CC        1 GO:0072562      HDA
# [11]       CC        1 GO:1904813      TAS
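
Because a single gene produces one row per GO term and evidence code, it can be handy to collapse the rows afterwards. The sketch below works on plain tuples copied from the rows displayed above; how to pull rows out of the BiocFrame is left out, and `evidence_by_term` is a hypothetical helper rather than part of the package.

```python
from collections import defaultdict

# (ONTOLOGY, GO, EVIDENCE) tuples copied from the rows shown above.
rows = [
    ("BP", "GO:0002764", "IBA"),
    ("CC", "GO:0005576", "HDA"),
    ("CC", "GO:0005576", "IDA"),
    ("CC", "GO:0070062", "HDA"),
    ("CC", "GO:0072562", "HDA"),
    ("CC", "GO:1904813", "TAS"),
]


def evidence_by_term(records):
    """Group evidence codes by ontology, then by GO term."""
    grouped = defaultdict(dict)
    for ontology, go_id, evidence in records:
        grouped[ontology].setdefault(go_id, []).append(evidence)
    return dict(grouped)


terms = evidence_by_term(rows)
print(terms["CC"]["GO:0005576"])
# ['HDA', 'IDA']
```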

Accessing Genomic Ranges

Extract gene coordinates as a GenomicRanges object (requires the chromosome_locations table in the OrgDb database).

gr = db.genes()
print(gr)
# GenomicRanges with 52232 ranges and 1 metadata column
#           seqnames                ranges          strand     gene_id
#              <str>             <IRanges> <ndarray[int8]>      <list>
#         1       19 -58345182 - -58336872               * |         1
#         2       12   -9067707 - -9019495               * |         2
#         2       12   -9067707 - -9019185               * |         2
#                ...                   ...             ... |       ...
# 116804918       11 121024101 - 121191490               * | 116804918
# 117779438        1   20154213 - 20160568               * | 117779438
# 118142757        6   42155405 - 42180056               * | 118142757
# ------
# seqinfo(369 sequences): 1 10 10_GL383545v1_alt ... X_KI270913v1_alt Y Y_KZ208924v1_fix
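
The negative coordinates above appear to follow the Bioconductor CHRLOC convention, in which a leading minus sign marks a location on the antisense strand. The README does not say whether the package normalizes these values, so the following is only a sketch of how a signed (start, end) pair from such a table could be decoded. `decode_signed_location` is a hypothetical helper and the input numbers are made up.

```python
def decode_signed_location(start, end):
    """Convert a signed (start, end) pair into (start, end, strand).

    Assumes the CHRLOC convention: both values negative means the antisense
    ("-") strand, both values positive means the sense ("+") strand.
    """
    if (start < 0) != (end < 0):
        raise ValueError("start and end must carry the same sign")
    strand = "-" if start < 0 else "+"
    low, high = sorted((abs(start), abs(end)))
    return low, high, strand


# Illustrative values only, not taken from any database.
print(decode_signed_location(-1000, -1500))
# (1000, 1500, '-')
print(decode_signed_location(2000, 2600))
# (2000, 2600, '+')
```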

Note

This project has been set up using BiocSetup and PyScaffold.