biocframe package

Subpackages

Submodules

biocframe.BiocFrame module

class biocframe.BiocFrame.BiocFrame(data=None, number_of_rows=None, row_names=None, column_names=None, column_data=None, metadata=None, validate=True)[source]

Bases: object

BiocFrame is an alternative to DataFrame, with support for nested and flexible column types. Inspired by the DFrame class from Bioconductor’s S4Vectors package. Any object may be used as a column, provided it has:

  • Some concept of “height”, as defined by get_height() from BiocUtils. This defaults to the length as defined by __len__.

  • The ability to be sliced by integer indices, as implemented by subset() from BiocUtils. This defaults to calling __getitem__.

  • The ability to be combined with other objects, as implemented in combine() from BiocUtils.

  • The ability to perform an assignment, as implemented in assign() from BiocUtils.

This allows BiocFrame to accept arbitrarily complex classes (such as nested BiocFrame instances) as columns.
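
As a rough illustration of the requirements above, the following plain-Python sketch defines a small custom column type. The class IntervalColumn and its methods are hypothetical names chosen for this example. They only mimic the documented defaults (height via __len__, integer slicing via __getitem__) and are not the BiocUtils generics themselves.

```python
from typing import List, Sequence, Tuple


class IntervalColumn:
    """Hypothetical column type holding (start, end) pairs."""

    def __init__(self, intervals: Sequence[Tuple[int, int]]):
        self._intervals = list(intervals)

    def __len__(self) -> int:
        # The "height" of the column defaults to its length.
        return len(self._intervals)

    def __getitem__(self, indices: Sequence[int]) -> "IntervalColumn":
        # Slicing by a sequence of integer indices returns the same type.
        return IntervalColumn([self._intervals[i] for i in indices])

    def combine(self, *others: "IntervalColumn") -> "IntervalColumn":
        # Concatenate this column with other columns of the same type.
        merged = list(self._intervals)
        for other in others:
            merged.extend(other._intervals)
        return IntervalColumn(merged)

    def to_list(self) -> List[Tuple[int, int]]:
        return list(self._intervals)


col = IntervalColumn([(1, 10), (20, 35), (40, 41)])
subset = col[[0, 2]]            # keeps the first and third interval
combined = col.combine(subset)  # five intervals in total
```

A class like this has a length and can be subset by integer indices, which covers the first two requirements through their defaults. Combination and assignment would still have to be made available through the corresponding BiocUtils functions.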

__array_ufunc__(func, method, *inputs, **kwargs)[source]

Interface for NumPy array methods.

Note: This is a very primitive implementation and needs tests to support different types.

Return type:

BiocFrame

Returns:

An object with the same type as the caller.

__copy__()[source]
Returns:

A shallow copy of the current BiocFrame.

__deepcopy__(memo=None, _nil=[])[source]
Returns:

A deep copy of the current BiocFrame.

__delitem__(name)[source]

Alias for remove_column with in_place = True.

As this mutates the original object, a warning is raised.

__getitem__(args)[source]

Wrapper around get_column and get_slice to obtain a slice of a BiocFrame or any of its columns.

Parameters:

args (Union[int, str, Sequence, tuple]) –

A sequence or a scalar integer or string, specifying the columns to retain based on their names or indices.

Alternatively a tuple of length 1. The first entry specifies the rows to retain based on their names or indices.

Alternatively a tuple of length 2. The first entry specifies the rows to retain, while the second entry specifies the columns to retain, based on their names or indices.

Return type:

Union[BiocFrame, Any]

Returns:

If args is a scalar, the specified column is returned. This is achieved internally by calling get_column.

If args is a sequence, a new BiocFrame is returned containing only the specified columns. This is achieved by just calling get_slice with no row slicing.

If args is a tuple of length 1, a new BiocFrame is returned containing the specified rows. This is achieved by just calling get_slice with no column slicing.

If args is a tuple of length 2, a new BiocFrame is returned containing the specified rows and columns. This is achieved by just calling get_slice with the specified arguments.
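
The dispatch rules above can be summarised in a short plain-Python sketch. The function describe_getitem is a hypothetical helper written for this example. It only reports which documented method would handle a given subscript and is not the library's implementation.

```python
from typing import Any, Tuple


def describe_getitem(args: Any) -> Tuple[str, Any, Any]:
    """Return (method, rows, columns) for a subscript.

    None stands for "no slicing along that dimension".
    """
    # Tuples are checked first because a tuple is also a sequence.
    if isinstance(args, tuple):
        if len(args) == 1:
            return ("get_slice", args[0], None)
        if len(args) == 2:
            return ("get_slice", args[0], args[1])
        raise ValueError("tuple subscripts must have length 1 or 2")
    # A scalar name or index selects a single column.
    if isinstance(args, (int, str)):
        return ("get_column", None, args)
    # Any other sequence selects several columns and keeps all rows.
    return ("get_slice", None, args)


single_column = describe_getitem("score")
several_columns = describe_getitem(["score", "label"])
rows_only = describe_getitem(([0, 1],))
rows_and_columns = describe_getitem(([0, 1], "score"))
```

Note that only the scalar case returns the column contents; every other form returns a new BiocFrame.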

__init__(data=None, number_of_rows=None, row_names=None, column_names=None, column_data=None, metadata=None, validate=True)[source]

Initialize a BiocFrame object from columns.

Parameters:
  • data (Mapping) –

    Dictionary of column names as keys and their values. All columns must have the same length. Defaults to an empty dictionary.

    Alternatively, a Mapping object may be provided, for example a NamedList that can be coerced into a dictionary.

  • number_of_rows (Optional[int]) – Number of rows. If not specified, inferred from data. This needs to be provided if data is empty and row_names are not present.

  • row_names (Optional[List]) – Row names. This should not contain missing strings.

  • column_names (Optional[List[str]]) – Column names. If not provided, inferred from the data. This may be in a different order than the keys of data. This should not contain missing strings.

  • column_data (Optional[BiocFrame]) – Metadata about columns. Must have the same number of rows as the length of column_names. Defaults to None.

  • metadata (Optional[dict]) – Additional metadata. Defaults to an empty dictionary.

  • validate (bool) – Internal use only.
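
The relationship between data, number_of_rows and row_names can be sketched in plain Python as follows. The function infer_number_of_rows is a hypothetical helper for illustration. In particular, raising an error when nothing is supplied is an assumption of this sketch, and the library may behave differently in that case.

```python
from typing import Any, Mapping, Optional, Sequence


def infer_number_of_rows(
    data: Optional[Mapping[str, Any]] = None,
    number_of_rows: Optional[int] = None,
    row_names: Optional[Sequence[str]] = None,
) -> int:
    columns = {} if data is None else dict(data)
    heights = {name: len(col) for name, col in columns.items()}

    # All columns must have the same length.
    if len(set(heights.values())) > 1:
        raise ValueError(f"columns differ in length: {heights}")

    if number_of_rows is None:
        if heights:
            # Inferred from the data when columns are present.
            number_of_rows = next(iter(heights.values()))
        elif row_names is not None:
            number_of_rows = len(row_names)
        else:
            # Assumption of this sketch: refuse to guess.
            raise ValueError("number_of_rows is needed without data or row_names")

    # Whatever was supplied must agree with the final row count.
    if heights and next(iter(heights.values())) != number_of_rows:
        raise ValueError("columns do not match number_of_rows")
    if row_names is not None and len(row_names) != number_of_rows:
        raise ValueError("row_names do not match number_of_rows")
    return number_of_rows


from_data = infer_number_of_rows({"a": [1, 2, 3], "b": ["x", "y", "z"]})
from_names = infer_number_of_rows(row_names=["r1", "r2"])
explicit = infer_number_of_rows(number_of_rows=5)
```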

__iter__()[source]

Iterator over rows.

Return type:

BiocFrameIter

__len__()[source]
Return type:

int

Returns:

Number of rows.

__repr__()[source]
Return type:

str

Returns:

A string representation of this BiocFrame.

__setitem__(args, value)[source]

Wrapper around set_column and set_slice to modify a slice of a BiocFrame or any of its columns. As this modifies the original object in place, a warning is raised.

If args is a string, it is assumed to be a column name and value is expected to be the column contents; these are passed onto set_column with in_place = True.

If args is a tuple, it is assumed to contain row and column indices. value is expected to be a BiocFrame containing replacement values. These are passed to set_slice with in_place = True.

property colnames: Names

Alias for get_column_names, provided for back-compatibility only.

column(column)[source]

Alias for get_column(), provided for back-compatibility only.

Return type:

Any

property column_data: None | BiocFrame

Alias for get_column_data.

property column_names: Names

Alias for get_column_names.

property columns: Names

Alias for get_column_names, provided for compatibility with pandas.

combine(*other)[source]

Wrapper around relaxed_combine_rows(), provided for back-compatibility only.

combine_columns(*other)[source]

Wrapper around combine_columns().

combine_rows(*other)[source]

Wrapper around combine_rows().

copy()[source]

Alias for __copy__().

property data: Dict[str, Any]

Alias for get_data.

property dims: Tuple[int, int]

Alias for shape.

flatten(as_type='dict', delim='.')[source]

Flatten a nested BiocFrame object.

Parameters:
  • as_type (Literal['dict', 'biocframe']) – Return type of the result. Either a dict or a BiocFrame object.

  • delim (str) – Delimiter to join nested column names. Defaults to “.”.

Return type:

BiocFrame

Returns:

An object with the type specified by as_type argument. If as_type is dict, an additional column “rownames” is added if the object contains rownames.
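
The effect of delim on nested column names can be illustrated with a plain-Python sketch in which nested dictionaries stand in for nested BiocFrame columns. The function flatten_columns is hypothetical and does not model the extra "rownames" column. It only shows how nested names might be joined.

```python
from typing import Any, Dict


def flatten_columns(columns: Dict[str, Any], delim: str = ".", prefix: str = "") -> Dict[str, Any]:
    flat: Dict[str, Any] = {}
    for name, value in columns.items():
        # Prepend the names of all enclosing columns, joined by delim.
        full_name = f"{prefix}{delim}{name}" if prefix else name
        if isinstance(value, dict):
            # A nested "frame": recurse and carry the joined name along.
            flat.update(flatten_columns(value, delim=delim, prefix=full_name))
        else:
            flat[full_name] = value
    return flat


nested = {
    "gene": ["A", "B"],
    "stats": {"mean": [1.0, 2.0], "qc": {"passed": [True, False]}},
}
flat = flatten_columns(nested)
flat_underscore = flatten_columns(nested, delim="_")
```

With the default delimiter, the sketch produces the keys gene, stats.mean and stats.qc.passed.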

classmethod from_pandas(input)[source]

Create a BiocFrame from a DataFrame object.

Parameters:

input (pandas.DataFrame) – Input data.

Return type:

BiocFrame

Returns:

A BiocFrame object.

classmethod from_polars(input)[source]

Create a BiocFrame from a DataFrame object.

Parameters:

input (polars.DataFrame) – Input data.

Return type:

BiocFrame

Returns:

A BiocFrame object.

get_column(column)[source]
Parameters:

column (Union[str, int]) –

Name of the column, which must exist in get_column_names.

Alternatively, the integer index of the column of interest.

Return type:

Any

Returns:

The contents of the specified column.

get_column_data(with_names=True)[source]
Parameters:

with_names (bool) – Whether to set the column names of this BiocFrame as the row names of the column data BiocFrame.

Return type:

Optional[BiocFrame]

Returns:

The annotations for each column. This may be None if no annotation is present, or is a BiocFrame where each row corresponds to a column and contains that column’s metadata.

get_column_names()[source]
Return type:

Names

Returns:

A list of column names.

get_data()[source]
Return type:

Dict[str, Any]

Returns:

Dictionary of columns and their values.

get_metadata()[source]
Return type:

dict

Returns:

Dictionary of metadata for this object.

get_row(row)[source]
Parameters:

row (Union[str, int]) –

Integer index of the row to access.

If row names are available (see get_row_names), a string may be supplied instead. The first occurrence of the string in the row names is used.

Return type:

dict

Returns:

A dictionary where the keys are column names and the values are the contents of the columns at the specified row.
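
A minimal plain-Python sketch of this lookup, using a dictionary of lists in place of a BiocFrame. The function get_row_sketch is a hypothetical name. It shows the "first occurrence" rule for row names that are not unique.

```python
from typing import Any, Dict, List, Optional, Union


def get_row_sketch(
    columns: Dict[str, List[Any]],
    row_names: Optional[List[str]],
    row: Union[str, int],
) -> Dict[str, Any]:
    if isinstance(row, str):
        if row_names is None:
            raise ValueError("a string can only be used when row names exist")
        # list.index returns the position of the first match.
        row = row_names.index(row)
    # One entry per column, holding that column's value at the row.
    return {name: values[row] for name, values in columns.items()}


columns = {"x": [1, 2, 3], "y": ["a", "b", "c"]}
row_names = ["r1", "dup", "dup"]

by_index = get_row_sketch(columns, None, 0)
by_name = get_row_sketch(columns, row_names, "dup")  # uses the first "dup"
```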

get_row_names()[source]
Return type:

Optional[Names]

Returns:

List of row names, or None if no row names are available.

get_slice(rows, columns)[source]

Slice BiocFrame along the rows and/or columns, based on their indices or names.

Parameters:
  • rows (Union[str, int, bool, Sequence]) –

    Rows to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

    Strings may only be used if row names are available (see get_row_names). The first occurrence of each string in the row names is used for extraction.

  • columns (Union[str, int, bool, Sequence]) – Columns to be extracted. This may be an integer, boolean, string, or any sequence thereof, as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

Return type:

BiocFrame

Returns:

A BiocFrame with the specified rows and columns.
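
To make the accepted subscript forms concrete, the plain-Python sketch below converts a subscript into integer positions. The function to_indices is a hypothetical stand-in written for this example, not normalize_subscript() itself. It leaves out scalar booleans, mixed boolean sequences and negative indices.

```python
from typing import List, Optional, Sequence, Union

Subscript = Union[int, str, Sequence[Union[int, str, bool]]]


def to_indices(subscript: Subscript, length: int, names: Optional[List[str]] = None) -> List[int]:
    if isinstance(subscript, bool):
        raise NotImplementedError("scalar booleans are not covered by this sketch")
    # Scalars are treated as length-1 sequences.
    if isinstance(subscript, (int, str)):
        subscript = [subscript]
    items = list(subscript)

    # A sequence of booleans acts as a mask over the whole dimension.
    if items and all(isinstance(item, bool) for item in items):
        if len(items) != length:
            raise ValueError("boolean mask must match the dimension length")
        return [i for i, keep in enumerate(items) if keep]

    indices: List[int] = []
    for item in items:
        if isinstance(item, bool):
            raise TypeError("mixed boolean subscripts are not covered by this sketch")
        if isinstance(item, str):
            if names is None:
                raise ValueError("strings require names along this dimension")
            # The first occurrence of each name is used.
            indices.append(names.index(item))
        else:
            if not 0 <= item < length:
                raise IndexError(f"index {item} is out of range")
            indices.append(item)
    return indices


names = ["a", "b", "b"]
from_scalar = to_indices(1, 3)
from_mask = to_indices([True, False, True], 3)
from_names = to_indices(["b", "a"], 3, names=names)
```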

has_column(name)[source]
Parameters:

name (str) – Name of the column.

Return type:

bool

Returns:

Whether a column with the specified name exists in this object.

property index: Names | None

Alias for get_row_names, provided for compatibility with pandas.

merge(*other, by=None, join='left', rename_duplicate_columns=False)[source]

Wrapper around merge().

property metadata: dict

Alias for get_metadata.

relaxed_combine_columns(*other)[source]

Wrapper around relaxed_combine_columns().

relaxed_combine_rows(*other)[source]

Wrapper around relaxed_combine_rows().

remove_column(column, in_place=False)[source]

Remove a column. This is a convenience wrapper around remove_columns.

Parameters:
  • column (Union[int, str]) – Name or positional index of the column to remove.

  • in_place (bool) – Whether to modify the object in place. Defaults to False.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

remove_columns(columns, in_place=False)[source]

Remove any number of existing columns.

Parameters:
  • columns (Sequence[Union[int, str]]) – Names or indices of the columns to remove.

  • in_place (bool) – Whether to modify the object in place. Defaults to False.

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

Return type:

BiocFrame

row(row)[source]

Alias for get_row, provided for back-compatibility only.

Return type:

dict

property row_names: Names | None

Alias for get_row_names.

property rownames: Names | None

Alias for get_row_names, provided for back-compatibility.

set_column(column, value, in_place=False)[source]

Modify an existing column or add a new column. This is a convenience wrapper around set_columns.

Parameters:
  • column (Union[int, str]) – Name of an existing or new column. Alternatively, an index specifying the position of an existing column.

  • value (Any) – Value of the new column. This should have the same height as the number of rows in the current object.

  • in_place (bool) – Whether to modify the object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.
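
The in_place convention shared by the setter methods can be sketched with a small stand-in class. TinyFrame is a hypothetical class for illustration only. It shows the difference between returning a modified copy and returning a reference to the modified original.

```python
from typing import Any, Dict, List


class TinyFrame:
    """Hypothetical stand-in holding columns as a dictionary of lists."""

    def __init__(self, columns: Dict[str, List[Any]]):
        # dict() creates a new mapping, so copies do not share it.
        self._columns = dict(columns)

    @property
    def column_names(self) -> List[str]:
        return list(self._columns)

    def set_column(self, name: str, value: List[Any], in_place: bool = False) -> "TinyFrame":
        # The new column must have the same height as the existing ones.
        heights = {len(col) for col in self._columns.values()}
        if heights and len(value) not in heights:
            raise ValueError("value must have one entry per row")
        # Either modify this object or a copy, then return the one modified.
        target = self if in_place else TinyFrame(self._columns)
        target._columns[name] = value
        return target


original = TinyFrame({"a": [1, 2]})
updated = original.set_column("b", [3, 4])               # original is unchanged
same = original.set_column("c", [5, 6], in_place=True)   # original is modified
```

Returning the object in both cases allows calls to be chained regardless of the value of in_place.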

set_column_data(column_data, in_place=False)[source]
Parameters:
  • column_data (Optional[BiocFrame]) – New column data. This should either be a BiocFrame with the number of rows equal to the number of columns in the current object, or None to remove existing column data.

  • in_place (bool) – Whether to modify the BiocFrame object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_column_names(names, in_place=False)[source]
Parameters:
  • names (List[str]) – List of unique strings, of length equal to the number of columns in this BiocFrame.

  • in_place (bool) – Whether to modify the BiocFrame object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_columns(columns, in_place=False)[source]

Modify existing columns or add new columns.

Parameters:
  • columns (Dict[str, Any]) – Contents of the columns to set. Keys may be strings containing new or existing column names, or integers containing the position of the column. Values should be the contents of each column.

  • in_place (bool) – Whether to modify the object in place. Defaults to False.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_metadata(metadata, in_place=False)[source]
Parameters:
  • metadata (dict) – New metadata for this object.

  • in_place (bool) – Whether to modify the BiocFrame object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_row_names(names, in_place=False)[source]
Parameters:
  • names (Optional[List]) – List of strings. This should have length equal to the number of rows in the current BiocFrame.

  • in_place (bool) – Whether to modify the BiocFrame object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

set_slice(rows, columns, value, in_place=True)[source]

Replace a slice of the BiocFrame given the row and columns of the slice.

Parameters:
  • rows (Union[int, str, bool, Sequence]) –

    Rows to be replaced. This may be any sequence of strings, integers, or booleans (or mixture thereof), as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

    Strings may only be used if row names are available (see get_row_names). The first occurrence of each string in the row names is used for replacement.

  • columns (Union[int, str, bool, Sequence]) – Columns to be replaced. This may be any sequence of strings, integers, or booleans (or mixture thereof), as supported by normalize_subscript(). Scalars are treated as length-1 sequences.

  • value (BiocFrame) – A BiocFrame containing replacement values. Each row corresponds to a row in rows, while each column corresponds to a column in columns. Note that the replacement is based on position, so row and column names in value are ignored.

  • in_place (bool) – Whether to modify the BiocFrame object in place.

Return type:

BiocFrame

Returns:

A modified BiocFrame object, either as a copy of the original or as a reference to the (in-place-modified) original.

property shape: Tuple[int, int]

Returns: Tuple containing the number of rows and columns in this BiocFrame.

slice(rows, columns)[source]

Alias for __getitem__, for back-compatibility.

Return type:

BiocFrame

split(column_name, only_indices=False)[source]

Split the object by a column.

Parameters:
  • column_name (str) – Name of the column to split by.

  • only_indices (bool) – Whether to only return indices. Defaults to False.

Return type:

Dict[str, Union[BiocFrame, List[int]]]

Returns:

A dictionary of BiocFrame objects, where each key is a group and each value is the BiocFrame containing the rows of that group.

If only_indices is True, the values are instead the row indices that belong to each group.
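
The grouping performed by split can be sketched in plain Python, with a dictionary of lists standing in for a BiocFrame. The functions split_indices and split_columns are hypothetical names for this example. They mirror the two return modes (row indices only, or the sliced contents per group).

```python
from typing import Any, Dict, List, Sequence


def split_indices(values: Sequence[Any]) -> Dict[Any, List[int]]:
    # Collect the row positions of every distinct value.
    groups: Dict[Any, List[int]] = {}
    for position, value in enumerate(values):
        groups.setdefault(value, []).append(position)
    return groups


def split_columns(columns: Dict[str, List[Any]], column_name: str) -> Dict[Any, Dict[str, List[Any]]]:
    # Slice every column by the row positions of each group.
    return {
        group: {name: [vals[i] for i in rows] for name, vals in columns.items()}
        for group, rows in split_indices(columns[column_name]).items()
    }


columns = {"group": ["a", "b", "a"], "x": [1, 2, 3]}
indices = split_indices(columns["group"])
frames = split_columns(columns, "group")
```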

to_pandas()[source]

Convert the BiocFrame into a DataFrame object.

Returns:

A DataFrame object. Column names of the resulting DataFrame may be different if the BiocFrame is nested.

to_polars()[source]

Convert the BiocFrame into a DataFrame object.

Returns:

A DataFrame object. Column names of the resulting DataFrame may be different if the BiocFrame is nested.

class biocframe.BiocFrame.BiocFrameIter(obj)[source]

Bases: object

An iterator to a BiocFrame object.

Parameters:

obj (BiocFrame) – Source object to iterate.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (BiocFrame) – Source object to iterate.

__iter__()[source]

__next__()[source]

biocframe.BiocFrame.merge(x, by=None, join='left', rename_duplicate_columns=False)[source]

Merge multiple BiocFrame objects together by common columns or row names, yielding a combined object with a union of columns across all objects.

Parameters:
  • x (Sequence[BiocFrame]) – Sequence of BiocFrame objects. Each object may have any number and identity of rows and columns.

  • by (Union[None, str, Sequence]) –

    If string, the name of column containing the keys. Each entry of x is assumed to have this column.

    If integer, the index of column containing the keys. The same index is used for each entry of x.

    If None, keys are assumed to be present in the row names.

    Alternatively a sequence of strings, integers or None, specifying the location of the keys in each entry of x.

  • join (Literal['inner', 'left', 'right', 'outer']) – Strategy for the merge. For left and right joins, we consider the keys for the first and last object in x, respectively.

  • rename_duplicate_columns (bool) – Whether duplicated non-key columns across x should be automatically renamed in the merged object. If False, an error is raised instead.

Returns:

A BiocFrame containing the merged contents.

If by = None, the keys are stored in the row names.

If by is a string, keys are stored in the column of the same name.

If by is a sequence, keys are stored in the row names if by[0] = None, otherwise they are stored in the column named by[0].

Return type:

BiocFrame
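
The join strategies can be illustrated with a plain-Python sketch in which each table is a dictionary mapping a key to that row's values. The names Table, choose_keys and merge_sketch are hypothetical. The ordering of keys, the use of None for missing values and the error on duplicated columns are assumptions of this sketch rather than a description of the library's internals.

```python
from typing import Any, Dict, List, Sequence

Table = Dict[str, Dict[str, Any]]  # key -> {column name -> value}


def choose_keys(tables: Sequence[Table], join: str) -> List[str]:
    if join == "left":   # keys of the first object
        return list(tables[0])
    if join == "right":  # keys of the last object
        return list(tables[-1])
    if join == "inner":  # keys present in every object
        return [k for k in tables[0] if all(k in t for t in tables)]
    if join == "outer":  # every key, in order of first appearance
        seen: List[str] = []
        for table in tables:
            seen.extend(k for k in table if k not in seen)
        return seen
    raise ValueError(f"unknown join: {join}")


def merge_sketch(tables: Sequence[Table], join: str = "left") -> Table:
    all_columns: List[str] = []
    for table in tables:
        own = {name for row in table.values() for name in row}
        clash = own.intersection(all_columns)
        if clash:
            # Comparable to rename_duplicate_columns=False.
            raise ValueError(f"duplicated columns: {sorted(clash)}")
        all_columns.extend(sorted(own))

    merged: Table = {}
    for key in choose_keys(tables, join):
        row = {name: None for name in all_columns}  # placeholder values
        for table in tables:
            row.update(table.get(key, {}))
        merged[key] = row
    return merged


first = {"g1": {"expr": 1.0}, "g2": {"expr": 2.0}}
last = {"g2": {"label": "b"}, "g3": {"label": "c"}}
left = merge_sketch([first, last], join="left")
outer = merge_sketch([first, last], join="outer")
```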

biocframe.BiocFrame.relaxed_combine_columns(*x)[source]

Wrapper around merge() that performs a left join on the row names.

Return type:

BiocFrame

biocframe.BiocFrame.relaxed_combine_rows(*x)[source]

A relaxed version of the combine_rows() method for BiocFrame objects. Whereas combine_rows expects that all objects have the same columns, relaxed_combine_rows allows for different columns. Absent columns in any object are filled in with appropriate placeholder values before combining.

Parameters:

x (BiocFrame) – One or more BiocFrame objects, possibly with differences in the number and identity of their columns.

Return type:

BiocFrame

Returns:

A BiocFrame that combines all x along their rows and contains the union of all columns. Columns absent in any x are filled in with placeholders consisting of Nones or masked NumPy values.
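
The filling of absent columns can be sketched in plain Python, with dictionaries of lists standing in for BiocFrame objects. The function relaxed_combine_rows_sketch is a hypothetical name. It only uses None as the placeholder and does not model masked NumPy values.

```python
from typing import Any, Dict, List


def relaxed_combine_rows_sketch(*frames: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    # Union of all column names, in order of first appearance.
    all_columns: List[str] = []
    for frame in frames:
        all_columns.extend(name for name in frame if name not in all_columns)

    combined: Dict[str, List[Any]] = {name: [] for name in all_columns}
    for frame in frames:
        # Number of rows in this frame, taken from any of its columns.
        n_rows = len(next(iter(frame.values()))) if frame else 0
        for name in all_columns:
            # Absent columns are filled with one placeholder per row.
            combined[name].extend(frame.get(name, [None] * n_rows))
    return combined


a = {"x": [1, 2], "y": ["a", "b"]}
b = {"x": [3], "z": [True]}
result = relaxed_combine_rows_sketch(a, b)
```

Every column of the result has three entries, with None wherever the source object lacked that column.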

Module contents