---
file_format: mystnb
kernelspec:
  name: python
---

# `BiocFrame` - Bioconductor-like data frames

The `BiocFrame` class is a Bioconductor-friendly data frame class. Its primary advantage lies in not making assumptions about the types of the columns: as long as an object has a length (`__len__`) and supports slicing (`__getitem__`), it can be used inside a `BiocFrame`. This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.

Also check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based.

These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html) section.

# Advantages of `BiocFrame`

One of the core principles guiding the implementation of the `BiocFrame` class is "**_what you put is what you get_**". Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Two key differences highlight the advantages of `BiocFrame`: it does not modify column types, and it handles nested data frames.

## Inadvertent modification of types

As an example, Pandas `DataFrame` modifies the types of the input data. Such conversions may cause issues when interoperating between R and Python.

```{code-cell}
import pandas as pd
import numpy as np
from array import array

df = pd.DataFrame({
    "numpy_vec": np.zeros(10),
    "list_vec": [1] * 10,
    "native_array_vec": array('d', [3.14] * 10)  # less used but native python arrays
})

print("type of numpy_vector column:", type(df["numpy_vec"]), df["numpy_vec"].dtype)
print("type of list_vector column:", type(df["list_vec"]), df["list_vec"].dtype)
print("type of native_array_vector column:", type(df["native_array_vec"]), df["native_array_vec"].dtype)

print(df)
```

With `BiocFrame`, no assumptions are made, and the input data is not cast into (un)expected types:

```{code-cell}
from biocframe import BiocFrame
import numpy as np
from array import array

bframe_types = BiocFrame({
    "numpy_vec": np.zeros(10),
    "list_vec": [1] * 10,
    "native_array_vec": array('d', [3.14] * 10)
})

print("type of numpy_vector column:", type(bframe_types["numpy_vec"]))
print("type of list_vector column:", type(bframe_types["list_vec"]))
print("type of native_array_vector column:", type(bframe_types["native_array_vec"]))

print(bframe_types)
```

This behavior remains consistent when extracting, slicing, combining, or performing any supported operation on `BiocFrame` objects.
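
The "no assumptions" rule amounts to a duck-typed contract: a column only needs a length (`__len__`) and indexing or slicing support (`__getitem__`). The plain-Python sketch below illustrates that contract without importing `biocframe`. Both `RunLengthColumn` and `check_column` are hypothetical names invented for this illustration, and the validation that `BiocFrame` actually performs may differ.

```python
from array import array


class RunLengthColumn:
    """Hypothetical column type that stores (value, run_length) pairs."""

    def __init__(self, runs):
        self._runs = list(runs)

    def __len__(self):
        # Total number of elements once every run is expanded.
        return sum(count for _, count in self._runs)

    def __getitem__(self, index):
        # Expand on demand; accepts both integers and slices.
        expanded = [value for value, count in self._runs for _ in range(count)]
        return expanded[index]


def check_column(column, number_of_rows):
    """Hypothetical helper: does `column` satisfy the stated contract?"""
    has_protocol = hasattr(column, "__len__") and hasattr(column, "__getitem__")
    return has_protocol and len(column) == number_of_rows


columns = {
    "list_vec": [1] * 10,
    "native_array_vec": array("d", [3.14] * 10),
    "custom_vec": RunLengthColumn([("a", 4), ("b", 6)]),
}

for name, column in columns.items():
    # Each column keeps its own type; nothing is coerced.
    print(name, type(column).__name__, check_column(column, 10))
```

Because the contract is checked through behavior and not through a list of approved types, a custom class passes the same check as a built-in `list` or `array`.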

## Handling complex nested frames

Pandas `DataFrame` does not support nested structures; therefore, running the snippet below will result in an error:

```python
df = pd.DataFrame({
    "ensembl": ["ENS00001", "ENS00002", "ENS00002"],
    "symbol": ["MAP1A", "BIN1", "ESR1"],
    "ranges": pd.DataFrame({
        "chr": ["chr1", "chr2", "chr3"],
        "start": [1000, 1100, 5000],
        "end": [1100, 4000, 5500]
    }),
})
print(df)
```

However, it is handled seamlessly with `BiocFrame`:

```{code-cell}
bframe_nested = BiocFrame({
    "ensembl": ["ENS00001", "ENS00002", "ENS00002"],
    "symbol": ["MAP1A", "BIN1", "ESR1"],
    "ranges": BiocFrame({
        "chr": ["chr1", "chr2", "chr3"],
        "start": [1000, 1100, 5000],
        "end": [1100, 4000, 5500]
    }),
})

print(bframe_nested)
```

Nested columns are likewise preserved when extracting, slicing, combining, or performing any other supported operation on `BiocFrame` objects.

# Construction

Creating a `BiocFrame` object is straightforward; just provide the `data` as a dictionary.

```{code-cell}
from biocframe import BiocFrame

obj = {
    "ensembl": ["ENS00001", "ENS00002", "ENS00003"],
    "symbol": ["MAP1A", "BIN1", "ESR1"],
}
bframe = BiocFrame(obj)
print(bframe)
```

You can specify complex objects as columns, as long as they have some "length" equal to the number of rows. For example, we can embed a `BiocFrame` within another `BiocFrame`.

```{code-cell}
obj = {
    "ensembl": ["ENS00001", "ENS00002", "ENS00002"],
    "symbol": ["MAP1A", "BIN1", "ESR1"],
    "ranges": BiocFrame({
        "chr": ["chr1", "chr2", "chr3"],
        "start": [1000, 1100, 5000],
        "end": [1100, 4000, 5500]
    }),
}

bframe2 = BiocFrame(obj, row_names=["row1", "row2", "row3"])
print(bframe2)
```

The `row_names` parameter is analogous to the index in pandas and should not contain missing strings.

Additionally, you may provide:

- `column_data`: A `BiocFrame` object containing metadata about the columns. This must have the same number of rows as the number of columns.
- `metadata`: Additional metadata about the object, usually a dictionary.
- `column_names`: The column names, if different from the keys in `data`. If not provided, these are automatically extracted from the keys in `data`.

# With other `DataFrame` libraries

## Pandas

While `BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R, many users may prefer working with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved:

```{code-cell}
from biocframe import BiocFrame

bframe3 = BiocFrame(
    {
        "foo": ["A", "B", "C", "D", "E"],
        "bar": [True, False, True, False, True]
    }
)

df = bframe3.to_pandas()
print(type(df))
print(df)
```

Converting back to a `BiocFrame` is similarly straightforward:

```{code-cell}
out = BiocFrame.from_pandas(df)
print(out)
```

## Polars

Similarly, you can easily go back and forth between `BiocFrame` and a polars `DataFrame`:

```{code-cell}
from biocframe import BiocFrame

bframe3 = BiocFrame(
    {
        "foo": ["A", "B", "C", "D", "E"],
        "bar": [True, False, True, False, True]
    }
)

# Named `pl_df` so it does not shadow the conventional `pl` alias for polars.
pl_df = bframe3.to_polars()
print(pl_df)
```

# Extracting data

BiocPy classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section.

Properties can be directly accessed from the object:

```{code-cell}
print("shape:", bframe.shape)
print("column names (functional style):", bframe.get_column_names())
print("column names (as property):", bframe.column_names)  # same as above
```

We can fetch individual columns:

```{code-cell}
print("functional style:", bframe.get_column("ensembl"))
print("w/ accessor", bframe["ensembl"])
```

And we can get individual rows as a dictionary:

```{code-cell}
bframe.get_row(2)
```

::: {.callout}
To retrieve a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator.
This operator accepts different subsetting arguments, such as a boolean vector, a `slice` object, a sequence of indices, or row/column names.
:::

```{code-cell}
# `bframe` has two columns, so the boolean vector has one entry per column.
sliced_with_bools = bframe[1:2, [True, False]]
print("Subset using booleans: \n", sliced_with_bools)

sliced_with_names = bframe[[0,2], ["symbol", "ensembl"]]
print("\nSubset using column names: \n", sliced_with_names)

# Short-hand to get a single column:
print("\nShort-hand to get a single column: \n", bframe["ensembl"])
```

# Setting data

## Preferred approach

For setting properties, we encourage a **functional style** of programming to avoid mutating the object directly. This helps prevent inadvertent modifications of `BiocFrame` instances within larger data structures.

```{code-cell}
modified = bframe.set_column_names(["column1", "column2"])
print(modified)
```

Now let's check the column names of the original object:

```{code-cell}
# Original is unchanged:
print(bframe.get_column_names())
```

To add new columns, or replace existing ones:

```{code-cell}
modified = bframe.set_column("symbol", ["A", "B", "C"])
print(modified)

modified = bframe.set_column("new_col_name", range(2, 5))
print(modified)
```

Change the row or column names:

```{code-cell}
modified = bframe.\
    set_column_names(["FOO", "BAR"]).\
    set_row_names(['alpha', 'bravo', 'charlie'])
print(modified)
```

***The functional style allows you to chain multiple operations.***

We also support Bioconductor's metadata concepts, either along the columns or for the entire object:

```{code-cell}
modified = bframe.\
    set_metadata({ "author": "Jayaram Kancherla" }).\
    set_column_data(BiocFrame({"column_source": ["Ensembl", "HGNC" ]}))
print(modified)
```

## The not-preferred-way

Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate `BiocFrame` instances inside other data structures.

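
The hazard of in-place modification can be sketched in plain Python. `TinyFrame` is a hypothetical stand-in written for this illustration, not the real `BiocFrame`; its `in_place` flag is modeled on the one the `set_*()` methods accept, and the `experiment` dictionary is likewise an assumption standing in for any larger data structure.

```python
from copy import copy


class TinyFrame:
    """Hypothetical stand-in for a frame with functional setters."""

    def __init__(self, column_names):
        self._column_names = list(column_names)

    def get_column_names(self):
        return list(self._column_names)

    def set_column_names(self, names, in_place=False):
        # Functional by default: modify a shallow copy and return it.
        target = self if in_place else copy(self)
        target._column_names = list(names)
        return target


frame = TinyFrame(["ensembl", "symbol"])
experiment = {"row_data": frame}  # a larger structure holding the same object

# Functional style: the frame stored in `experiment` is untouched.
renamed = frame.set_column_names(["column1", "column2"])
print(experiment["row_data"].get_column_names())  # ['ensembl', 'symbol']

# In-place style: the change leaks into `experiment` without any signal.
frame.set_column_names(["FOO", "BAR"], in_place=True)
print(experiment["row_data"].get_column_names())  # ['FOO', 'BAR']
```

The second call changes what `experiment` sees even though `experiment` was never touched directly, which is the kind of silent mutation the functional style avoids.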

Nonetheless, direct assignment works as follows:

```{code-cell}
testframe = BiocFrame({
    "A": [1,2,3],
    "B": [4,5,6]
})
testframe.column_names = ["column1", "column2" ]
print(testframe)
```

Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`. It is best to do this only if the `BiocFrame` object is not being used anywhere else; otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`.

Similarly, we could set or replace columns directly:

```{code-cell}
testframe["column2"] = ["A", "B", "C"]
testframe[1:3, ["column1","column2"]] = BiocFrame({"x":[4, 5], "y":["E", "F"]})
```

# Iterate over rows

You can iterate over the rows of a `BiocFrame` object. Each iteration yields a `(name, row)` pair; `name` is `None` if the object does not contain any `row_names`. To iterate over the first two rows:

```{code-cell}
for name, row in bframe[:2,]:
    print(name, row)
```

# Combining objects

`BiocFrame` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). For example, to combine by row:

```{code-cell}
import biocutils

bframe1 = BiocFrame({
    "odd": [1, 3, 5, 7, 9],
    "even": [0, 2, 4, 6, 8],
})

bframe2 = BiocFrame({
    "odd": [11, 33, 55, 77, 99],
    "even": [0, 22, 44, 66, 88],
})

combined = biocutils.combine_rows(bframe1, bframe2)
print(combined)
```

Similarly, to combine by column:

```{code-cell}
bframe3 = BiocFrame({
    "foo": ["A", "B", "C", "D", "E"],
    "bar": [True, False, True, False, True]
})

combined = biocutils.combine_columns(bframe1, bframe3)
print(combined)
```

## Relaxed combine operation

By default, the combine methods assume that the number and identity of columns (for `combine_rows()`) or rows (for `combine_columns()`) are the same across objects.
In situations where this is not the case, such as having different columns across objects, we can use `relaxed_combine_rows()` instead:

```{code-cell}
from biocframe import relaxed_combine_rows

modified2 = bframe2.set_column("foo", ["A", "B", "C", "D", "E"])

combined = biocutils.relaxed_combine_rows(bframe1, modified2)
print(combined)
```

## SQL-like join operation

Similarly, if the rows are different, we can use `BiocFrame`'s `merge` function. This function uses the `row_names` as the index to perform this operation; you can specify an alternative set of keys through the `by` parameter.

```{code-cell}
from biocframe import merge

modified1 = bframe1.set_row_names(["A", "B", "C", "D", "E"])
modified3 = bframe3.set_row_names(["C", "D", "E", "F", "G"])

combined = merge([modified1, modified3], by=None, join="outer")
print(combined)
```

# Empty Frames

We can create empty `BiocFrame` objects that only specify the number of rows. This is beneficial in scenarios where `BiocFrame` objects are incorporated into larger data structures but do not contain any data themselves.

```{code-cell}
empty = BiocFrame(number_of_rows=100)
print(empty)
```

Most operations detailed in this document can be performed on an empty `BiocFrame` object.

```{code-cell}
print("Column names:", empty.column_names)

subset_empty = empty[1:10,:]
print("\nSubsetting an empty BiocFrame: \n", subset_empty)
```

## Further reading

- Explore more details in [the reference documentation](https://biocpy.github.io/BiocFrame/).
- Additionally, take a look at Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class upon which `BiocFrame` was built.