hdf5 bootstrap representation #44

mtmorgan · 2015-05-19T22:33:21Z

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort order, currently these sort as bs0, bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset, especially by id. This would seem to be a common use case.

ttriche · 2015-05-19T23:25:15Z

that's cool but if you do

hdf5 <- paste0(path.expand(path), "/", h5file)
bootstraps <- h5read(hdf5, "aux/num_bootstrap")

if bootstraps are found, summarize them...

if (bootstraps > 0) {
message("Found ", bootstraps, " bootstraps for ", path, ",
summarizing...")
# grab a column vector for each bootstrap, then bind them all into a
Matrix
bsidx <- paste0("bootstrap/bs", seq(0, bootstraps - 1))
boots <- do.call(cbind, lapply(bsidx, function(bs) h5read(hdf5, bs)))
res <- Matrix(cbind(rowMedians(boots),
rowMads(boots),
h5read(hdf5, "aux/eff_lengths")),
dimnames=list(transcript=h5read(hdf5, "aux/ids"),
c("est_count", "eff_length",
"est_count_mad")))
} else {
message("No bootstraps found for ", path, ", using recorded
est_counts...")
res <- Matrix(cbind(h5read(hdf5, "est_counts"),
h5read(hdf5, "aux/eff_lengths")),
dimnames=list(transcript=h5read(hdf5, "aux/ids"),
c("est_count", "eff_length")))
}

it's pretty easy to handle them the way they are

Also if you plot a few of the bootstraps their error typically (not
always) ends up looking Gaussian at most detectable thresholds. Sometimes
there are obvious multimodal distributions, perhaps transcript-level mclust
would usefully inform downstream analysis. Anyways... it is not so hard
as-is

Statistics is the grammar of science.
Karl Pearson http:https://en.wikipedia.org/wiki/The_Grammar_of_Science

On Tue, May 19, 2015 at 3:33 PM, Martin Morgan [email protected]
wrote:

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort
order, currently these sort as bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset,
especially by id. This would seem to be a common use case.

—
Reply to this email directly or view it on GitHub
#44.

mtmorgan · 2015-05-20T00:10:20Z

simplify2array(h5read("foo.h5", "/bootstrap")) imports all bootstraps as a matrix so no need to iterate through (see readBootstrap); one is interested in subsets when managing larger data.

pimentel · 2015-05-20T23:19:37Z

Hi guys,

Thanks for the input.

@mtmorgan I don't see a super compelling reason (other than it being more visually appealing) to use the left padded numbers, and as @ttriche this is why we store num_bootstrap in aux. Is there another reason other than being more visually appealing when sorted?

When we were developing kallisto, I did some non-comprehensive testing and there was little to no difference in storing them in vectors vs storing them in a huge matrix and reading into R. The existing C code is much simpler as currently written.

Which direction do you want to subset the HDF5 in (e.g. by transcript or bootstrap round)?

BTW, I'm actually on vacation right now, but when I return next week I'll post the code we have for reading bootstraps and aggregating kallisto results (this is all part of the sleuth project).

mtmorgan · 2015-05-20T23:42:41Z

I guess I'm just scared by having my chromosomes sort chr1, chr10, instead of chr1, chr2, ... One can easily imagine thinking that the results of the third bootstrap were interesting, but then confusing bs2 with bs3 with bs10, all of which are 'third' in some sense (if the main consumer of these files is envisioned to be R, then it would also make sense to number from 1 rather than 0).

I think one would often access by bootstrap for some purposes (e.g., normalization-like activities) and by transcript for others (e.g., drilling down on 'interesting' results); this two-dimensional access and the homogeneity of data type makes one think of a matrix, rather than list-of-vectors. It also seems less error-prone to rely on existing software to subset an hdf5 matrix in hdf5, rather than to come up with ad hoc solutions (like here) for sub-setting a list-of-vectors in a client language. A matrix would also make padding bootstrap ids moot.

Obviously these are not earth-shaking features, so please ignore if you like...

pimentel added the feature label May 20, 2015

pimentel added the discussion label Jun 19, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hdf5 bootstrap representation #44

hdf5 bootstrap representation #44

mtmorgan commented May 19, 2015

ttriche commented May 19, 2015

mtmorgan commented May 20, 2015

pimentel commented May 20, 2015

mtmorgan commented May 20, 2015

hdf5 bootstrap representation #44

hdf5 bootstrap representation #44

Comments

mtmorgan commented May 19, 2015

ttriche commented May 19, 2015

if bootstraps are found, summarize them...

mtmorgan commented May 20, 2015

pimentel commented May 20, 2015

mtmorgan commented May 20, 2015