Developer notes#

As mentioned elsewhere: ideally, we would use dask directly, as it supports all of the desired operations and provides a NumPy-compatible interface. We could then reach into the .dask attribute to recover the HighLevelGraph and parse it to determine the sequence of delayed operations. This would save us the trouble of writing the DelayedArray classes.

Unfortunately, parsing the call graph exposes BiocPy packages to dask internals like the task specification. While this is not too complicated (usually), it’s a risk that could cause our packages to break with different versions of dask. Indeed, from their own documentation:

Working directly with dask graphs is rare, though, unless you intend to develop new modules with Dask. Even then, dask.delayed is often a better choice. If you are a core developer, then you should start here.

… which does not give me confidence that third-party developers parsing the call graph will be a priority for the dask team when they’re thinking about backwards compatibility. That’s fair enough.

Similarly, the deeper parts of the task specification are not well-documented. For example, each task is represented by a callable object, but these are dask-internal objects that do not have clear descriptions of their attributes. We need to do some further digging to actually get the function involved (log2); who knows what the __dask_blockwise__0 means.

import numpy
import dask.array as da
X = da.from_array(numpy.random.rand(100, 20))
Y = numpy.log2(X)

# Finding the log2 step:
steps = list(Y.dask.get_all_external_keys())
key = None
for x in steps:
    if x[0].startswith("log2-"):
        key = x

key
## ('log2-3999c8c7866c0ec86d68c8ed4da7d784', 0, 0)

# Retrieving the callable:
logstep = Y.dask[key]
logstep[0]
## subgraph_callable-38677753-1f25-4846-ac5c-0fd63671121f

# Actually unpacking the callable:
logstep[0].dsk
{'log2-3999c8c7866c0ec86d68c8ed4da7d784': (<ufunc 'log2'>, '__dask_blockwise__0')}

Also, there are occasions when the dask does something magical to the call graph, usually when we’re operating on multiple arrays. In the example below, the all_sums array is silently split and distributed to the two submatrices (i.e., the rechunk-* operations) when it is used to divide the combined matrix. This is transparent to end users but surprising for developers who expect to recover the same sequence of delayed operations that they put in.

import numpy
import dask.array as da
X = da.from_array(numpy.random.rand(100, 20))
Y = da.from_array(numpy.random.rand(100, 20))
combined = numpy.concatenate((X, Y), axis=1)
all_sums = combined.sum(axis=0).compute()
normalized = combined / all_sums

for x in normalized.dask.keys():
    print(x)
## ('truediv-78719db305246fd474d0136dd804a738', 0, 0)
## ('truediv-78719db305246fd474d0136dd804a738', 0, 1)
## ('concatenate-fea8d498137cb283a1ccd462d1f978aa', 0, 0)
## ('concatenate-fea8d498137cb283a1ccd462d1f978aa', 0, 1)
## ('array-eb5604e3cb2e184e5a2e5b6769b87871', 0, 0)
## ('array-81bcf7d160e27bf3266f48572db16a6e', 0, 0)
## ('array-1f6097f7707042f07edf65a6a9eb020e', 0)
## ('rechunk-merge-22938068fb344dacd51a37e23fefd574', 0)
## ('rechunk-merge-22938068fb344dacd51a37e23fefd574', 1)
## ('rechunk-split-22938068fb344dacd51a37e23fefd574', 0)
## ('rechunk-split-22938068fb344dacd51a37e23fefd574', 1)

These experiences were the motivation for the DelayedArray package, which captures delayed operations in a much simpler format for robust BiocPy package development. And hey, at least we control the representation, so we can coordinate changes across the BiocPy suite of packages. Note that we still use dask to do some Python-level compute (e.g., sums, variances) via the NumPy interface; this should be fine as it’s more stable than the internals.