Tutorial: Structure and Client Interface

This tutorial explains the directory structure generated by wobbegongify and how to query the data efficiently.

1. The Wobbegong Format

Wobbegong flattens complex Bioconductor objects into simple binary files accompanied by a JSON summary.

summary.json: Contains metadata (dimensions, types) and byte offsets. This is the map clients use to figure out where data lives.
content: A binary file containing compressed chunks of data.
stats: (For matrices) A binary file containing pre-calculated statistics like row sums.
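
The role of the byte offsets can be illustrated with a small self-contained sketch. It is not the actual Wobbegong schema: the key names ("chunks", "start", "length") and the use of zlib are assumptions made for illustration. The idea shown is only that a summary records where each compressed chunk lives, so a client can decompress one chunk without touching the rest.

```python
import json
import zlib

# Hypothetical payload: two independent chunks, each compressed on its own.
# The zlib codec and the JSON key names below are assumptions, not the real schema.
chunks = [b"gene_A,gene_B", b"1,2"]
compressed = [zlib.compress(c) for c in chunks]

# Build the "content" bytes and record where each chunk starts and how long it is.
content = b"".join(compressed)
offsets = []
position = 0
for blob in compressed:
    offsets.append({"start": position, "length": len(blob)})
    position += len(blob)
summary = json.dumps({"chunks": offsets})


def read_chunk(content_bytes: bytes, summary_text: str, index: int) -> bytes:
    """Decompress a single chunk using the offsets stored in the summary."""
    entry = json.loads(summary_text)["chunks"][index]
    start = entry["start"]
    end = start + entry["length"]
    return zlib.decompress(content_bytes[start:end])


second = read_chunk(content, summary, 1)
```

In a real client the slice would more likely be a file seek or an HTTP range request than an in-memory slice, but the lookup logic would be the same.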

Running wobbegongify(obj, "my_study") creates a structured hierarchy, for example:

my_study/
├── summary.json          # Top-level metadata
├── assays/               # Matrix data
│   ├── 0/
│   │   ├── summary.json
│   │   ├── content
│   │   └── stats
│   └── ...
└── reduced_dimensions/   # Reduced dims (stored as DataFrames)
    ├── 0/
    │   ├── summary.json
    │   └── content
    └── ...
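
Because every component in the tree above carries its own summary.json, a client can discover what is available by walking the directory. The following is a minimal sketch using only the standard library. It builds a throwaway directory with empty placeholder summaries, and the helper name find_summaries is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_summaries(root: Path) -> list[str]:
    """Return the relative path of every summary.json below root, sorted."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("summary.json"))


# Build a temporary directory that mimics the layout (placeholder files only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_study"
    for sub in ["", "assays/0", "reduced_dimensions/0"]:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "summary.json").write_text(json.dumps({}))
    found = find_summaries(root)
```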

2. Supported Objects

BiocFrame

Saved as a series of compressed columns.

from biocframe import BiocFrame
from wobbegong import wobbegongify
df = BiocFrame({"gene": ["A", "B"], "val": [1, 2]})
wobbegongify(df, "data/df")

Matrices (Dense & Sparse)

Matrices are saved row-wise, so a single row can be read without loading the rest of the matrix. This layout suits genomic viewers that need to show the expression of a specific gene across all cells.

Dense: Rows are written sequentially.
Sparse: Values and Indices (delta-encoded) are written for each row.
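
Delta encoding of the indices can be sketched as follows. This is a generic illustration of the technique, not the library's own code. It assumes the first index is stored relative to zero, and the function names are hypothetical. Storing gaps instead of absolute positions keeps the numbers small, which tends to compress better.

```python
def delta_encode(indices: list[int]) -> list[int]:
    """Store the first index as-is, then the gap to each following index."""
    return [cur - prev for prev, cur in zip([0] + indices[:-1], indices)]


def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the absolute indices with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Column positions of the non-zero entries in one hypothetical sparse row.
columns = [3, 10, 11, 250]
encoded = delta_encode(columns)
decoded = delta_decode(encoded)
```

Here encoded holds the gaps between successive positions, and decoding restores the original column positions exactly.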

SingleCellExperiment

Recursively converts all supported components:

assays -> Matrices
row_data / col_data -> BiocFrames
reduced_dims -> BiocFrames (Column-wise)
alternative_experiments -> Nested SingleCellExperiments

3. Client Interface

The wobbegong.load() function acts as a factory, returning the appropriate reader object based on the summary.json.
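
The factory idea can be sketched in a few lines. This is not the implementation of wobbegong.load(): the "type" field, its values, the reader class names, and load_sketch itself are all hypothetical. It only shows the general pattern of reading summary.json and dispatching to a reader class.

```python
import json
import tempfile
from pathlib import Path


class MatrixReader:
    """Placeholder reader for matrix directories (hypothetical)."""

    def __init__(self, path: Path, summary: dict):
        self.path = path
        self.summary = summary


class DataFrameReader:
    """Placeholder reader for data frame directories (hypothetical)."""

    def __init__(self, path: Path, summary: dict):
        self.path = path
        self.summary = summary


# Assumed mapping from a "type" field in summary.json to a reader class.
READERS = {"matrix": MatrixReader, "data_frame": DataFrameReader}


def load_sketch(path: str):
    """Read summary.json in the directory and return the matching reader."""
    folder = Path(path)
    summary = json.loads((folder / "summary.json").read_text())
    kind = summary.get("type")
    if kind not in READERS:
        raise ValueError(f"unsupported object type: {kind!r}")
    return READERS[kind](folder, summary)


# Try it on a temporary directory holding a minimal summary.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "summary.json").write_text(json.dumps({"type": "matrix"}))
    reader = load_sketch(tmp)
```

Keeping the dispatch table in one place means callers never need to know which reader class they want; they pass a path and work with whatever comes back.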

Accessing Matrices:

import wobbegong

mat = wobbegong.load("data/matrix")

# Get expression for the 5th gene (row indices are 0-based)
row_vec = mat.get_row(4)

# Get pre-calculated statistics (read from the stats file, no recomputation)
total_counts = mat.get_statistic("row_sum")

Accessing DataFrames:

df = wobbegong.load("data/metadata")

# Get a specific column
cell_ids = df.get_column("cell_id")

Check out the R package for more details.