BiocPy: Facilitate Bioconductor Workflows in Python

Published

January 16, 2024

Welcome

Bioconductor is an open-source software project that provides tools for the analysis and comprehension of genomic data. One of its main advantages is the availability of standard data representations and a large number of analysis tools tailored for genomic experiments. These tools allow researchers to store, manipulate, and analyze data across multiple tools and workflows in R.

Inspired by Bioconductor, BiocPy aims to facilitate bioinformatics workflows in Python. To achieve this goal, we developed several core data structures that align closely with the Bioconductor implementations. These structures include BiocFrame, which provides a Bioconductor-like data frame class, and GenomicRanges, which represents genomic regions and supports their analysis. They are the foundational building blocks for larger and more complex representations. For instance, container classes like SummarizedExperiment, SingleCellExperiment, and MultiAssayExperiment represent single- or multi-omic experimental data and metadata.
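To illustrate why such container classes are useful, here is a minimal sketch of the underlying idea: assay matrices and their row and column annotations travel together, so subsetting one always subsets the others. The `MiniExperiment` class and its fields are hypothetical and written only for illustration; this is not the SummarizedExperiment API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MiniExperiment:
    """Hypothetical toy container; NOT the BiocPy SummarizedExperiment API."""

    assays: Dict[str, List[List[float]]]  # assay name -> features x samples matrix
    row_names: List[str]  # one entry per feature (row)
    column_names: List[str]  # one entry per sample (column)

    def __post_init__(self) -> None:
        # Every assay must agree with the annotation dimensions.
        for name, matrix in self.assays.items():
            if len(matrix) != len(self.row_names):
                raise ValueError(f"assay '{name}' has the wrong number of rows")
            if any(len(row) != len(self.column_names) for row in matrix):
                raise ValueError(f"assay '{name}' has the wrong number of columns")

    def subset(self, rows: List[int], columns: List[int]) -> "MiniExperiment":
        # Slice assays and annotations together so they never drift apart.
        return MiniExperiment(
            assays={
                name: [[matrix[r][c] for c in columns] for r in rows]
                for name, matrix in self.assays.items()
            },
            row_names=[self.row_names[r] for r in rows],
            column_names=[self.column_names[c] for c in columns],
        )


experiment = MiniExperiment(
    assays={"counts": [[5, 0, 2], [1, 3, 0]]},
    row_names=["geneA", "geneB"],
    column_names=["s1", "s2", "s3"],
)
subset = experiment.subset(rows=[1], columns=[0, 2])
print(subset.row_names, subset.column_names, subset.assays["counts"])
```

The real container classes carry much richer metadata, but the design choice is the same: keeping data and annotations in one object avoids the bookkeeping errors that arise when they are subset separately.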

Moreover, BiocPy introduces a range of data type classes that represent atomic entities, including float, string, and integer lists, as well as named lists. These classes, together with generics and utilities, are provided by the BiocUtils package. Delayed and file-backed array operations are provided by DelayedArray and its derivatives (HDF5Array, TileDbArray). To our knowledge, BiocPy is the first Python framework to provide seamless, well-integrated data structures and representations for genomic data analysis.
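The idea behind delayed operations can be sketched in a few lines: operations are recorded rather than executed, and the result is computed only when explicitly requested. The `DelayedVector` class below is a hypothetical illustration of that concept under simplifying assumptions (one-dimensional data, element-wise operations); it does not reproduce the DelayedArray API.

```python
from typing import Callable, List


class DelayedVector:
    """Hypothetical toy example of delayed evaluation; NOT the DelayedArray API."""

    def __init__(self, seed: List[float]) -> None:
        self._seed = seed  # the original data is never modified
        self._operations: List[Callable[[float], float]] = []

    def apply(self, operation: Callable[[float], float]) -> "DelayedVector":
        # Record the operation instead of computing it right away.
        delayed = DelayedVector(self._seed)
        delayed._operations = self._operations + [operation]
        return delayed

    def realize(self) -> List[float]:
        # Run every queued operation, element by element, only when asked.
        result = []
        for value in self._seed:
            for operation in self._operations:
                value = operation(value)
            result.append(value)
        return result


vector = DelayedVector([1.0, 2.0, 3.0])
scaled = vector.apply(lambda x: x * 10).apply(lambda x: x + 1)
print(scaled.realize())  # [11.0, 21.0, 31.0]
```

Because nothing is computed until `realize()` is called, the seed could in principle live on disk and be read in chunks, which is the motivation for file-backed arrays.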

For convenient access to experimental data stored in RDS files, the rds2py package provides bindings to the rds2cpp library. This enables direct reading of RDS files in Python, without additional data conversion tools or intermediate formats, and simplifies moving analyses between R and Python.

Although not covered in this tutorial, the scranpy package provides bindings to libscran and various other single-cell analysis methods to support the analysis of multi-modal single-cell datasets. BiocPy also integrates the singleR algorithm, which annotates cell types by matching cells to known references based on their expression profiles.

All packages within the BiocPy ecosystem are published to the Python Package Index (PyPI).
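As a quick check of which packages are already available in an environment, the sketch below queries installed distribution metadata using only the standard library. The distribution names in `ASSUMED_NAMES` are assumptions (lowercase forms of the package names); confirm the exact names on PyPI before relying on them.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed PyPI distribution names; verify them before installing.
ASSUMED_NAMES = ["biocframe", "genomicranges", "summarizedexperiment"]

for name, version in installed_versions(ASSUMED_NAMES).items():
    print(name, version or "not installed")
```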

Selected packages

For a complete list of all packages, please visit the GitHub:BiocPy repository.

Core representations

  • BiocFrame (GitHub, Docs): Bioconductor-like dataframes in Python.
  • IRanges (GitHub, Docs): Python implementation of the IRanges package to support interval arithmetic.
  • GenomicRanges (GitHub, Docs, BioC): Container class to represent genomic locations and support genomic analysis.
  • SummarizedExperiment (GitHub, Docs): Container class to represent genomic experiments, following Bioconductor’s SummarizedExperiment.
  • SingleCellExperiment (GitHub, Docs): Container class to represent single-cell experiments; follows Bioconductor’s SingleCellExperiment.
  • MultiAssayExperiment (GitHub, Docs): Container class to represent multiple experiments and assays performed over a set of samples; follows Bioconductor’s MultiAssayExperiment.
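The interval arithmetic that IRanges and GenomicRanges support can be illustrated with a small overlap search. The sketch below is plain Python written for illustration, not the API of either package; it assumes half-open intervals `(start, end)` and uses a simple quadratic scan, whereas real implementations typically rely on faster index structures.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open: start <= position < end


def find_overlaps(query: List[Interval], subject: List[Interval]) -> List[Tuple[int, int]]:
    """Return (query_index, subject_index) pairs for intervals sharing a position."""
    hits = []
    for qi, (q_start, q_end) in enumerate(query):
        for si, (s_start, s_end) in enumerate(subject):
            # Two half-open intervals overlap when each starts before the other ends.
            if q_start < s_end and s_start < q_end:
                hits.append((qi, si))
    return hits


genes = [(100, 200), (300, 400)]  # hypothetical gene coordinates
peaks = [(150, 180), (200, 250), (390, 450)]  # hypothetical peak coordinates
print(find_overlaps(peaks, genes))  # [(0, 0), (2, 1)]
```

Note that the second peak starts exactly where the first gene ends, so under the half-open convention the two do not overlap.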

Analysis packages

  • scranpy (GitHub, Docs): Python bindings to the single-cell analysis methods from libscran and related C++ libraries.
  • singler (GitHub, Docs): Python bindings to the singleR algorithm to annotate cell types from known references.

Interoperability with R

  • rds2py (GitHub, Docs): Read RDS files directly in Python. Supports Bioconductor’s SummarizedExperiment and SingleCellExperiment in addition to matrices, data frames and vectors.

Utility packages

  • BiocUtils (GitHub, Docs): Common utilities for use across packages, mostly to mimic convenient aspects of base R.
  • mopsy (GitHub, Docs): Helper functions to perform row or column operations over NumPy and SciPy matrices. Provides an interface similar to base R matrix methods and matrixStats.
  • pyBiocFileCache (GitHub, Docs): File-system-based cache for resources and metadata.

Further reading

Many online resources offer detailed information on Bioconductor data structures.

Notes

This is a reproducible Quarto book with reusable snippets. To learn more about Quarto books visit https://quarto.org/docs/books. Check out Reproduce me for more information.