Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hdf5 bootstrap representation #44

Open
mtmorgan opened this issue May 19, 2015 · 4 comments
Open

hdf5 bootstrap representation #44

mtmorgan opened this issue May 19, 2015 · 4 comments

Comments

@mtmorgan
Copy link

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort order, currently these sort as bs0, bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset, especially by id. This would seem to be a common use case.

@ttriche
Copy link

ttriche commented May 19, 2015

that's cool but if you do

hdf5 <- paste0(path.expand(path), "/", h5file)
bootstraps <- h5read(hdf5, "aux/num_bootstrap")

if bootstraps are found, summarize them...

if (bootstraps > 0) {
message("Found ", bootstraps, " bootstraps for ", path, ",
summarizing...")
# grab a column vector for each bootstrap, then bind them all into a
Matrix
bsidx <- paste0("bootstrap/bs", seq(0, bootstraps - 1))
boots <- do.call(cbind, lapply(bsidx, function(bs) h5read(hdf5, bs)))
res <- Matrix(cbind(rowMedians(boots),
rowMads(boots),
h5read(hdf5, "aux/eff_lengths")),
dimnames=list(transcript=h5read(hdf5, "aux/ids"),
c("est_count", "eff_length",
"est_count_mad")))
} else {
message("No bootstraps found for ", path, ", using recorded
est_counts...")
res <- Matrix(cbind(h5read(hdf5, "est_counts"),
h5read(hdf5, "aux/eff_lengths")),
dimnames=list(transcript=h5read(hdf5, "aux/ids"),
c("est_count", "eff_length")))
}

it's pretty easy to handle them the way they are

Also if you plot a few of the bootstraps their error typically (not
always) ends up looking Gaussian at most detectable thresholds. Sometimes
there are obvious multimodal distributions, perhaps transcript-level mclust
would usefully inform downstream analysis. Anyways... it is not so hard
as-is

Statistics is the grammar of science.
Karl Pearson http:https://en.wikipedia.org/wiki/The_Grammar_of_Science

On Tue, May 19, 2015 at 3:33 PM, Martin Morgan [email protected]
wrote:

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort
order, currently these sort as bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset,
especially by id. This would seem to be a common use case.


Reply to this email directly or view it on GitHub
#44.

@mtmorgan
Copy link
Author

simplify2array(h5read("foo.h5", "/bootstrap")) imports all bootstraps as a matrix so no need to iterate through (see readBootstrap); one is interested in subsets when managing larger data.

@pimentel
Copy link
Contributor

Hi guys,

Thanks for the input.

@mtmorgan I don't see a super compelling reason (other than it being more visually appealing) to use the left padded numbers, and as @ttriche this is why we store num_bootstrap in aux. Is there another reason other than being more visually appealing when sorted?

When we were developing kallisto, I did some non-comprehensive testing and there was little to no difference in storing them in vectors vs storing them in a huge matrix and reading into R. The existing C code is much simpler as currently written.

Which direction do you want to subset the HDF5 in (e.g. by transcript or bootstrap round)?

BTW, I'm actually on vacation right now, but when I return next week I'll post the code we have for reading bootstraps and aggregating kallisto results (this is all part of the sleuth project).

@mtmorgan
Copy link
Author

I guess I'm just scared by having my chromosomes sort chr1, chr10, instead of chr1, chr2, ... One can easily imagine thinking that the results of the third bootstrap were interesting, but then confusing bs2 with bs3 with bs10, all of which are 'third' in some sense (if the main consumer of these files is envisioned to be R, then it would also make sense to number from 1 rather than 0).

I think one would often access by bootstrap for some purposes (e.g., normalization-like activities) and by transcript for others (e.g., drilling down on 'interesting' results); this two-dimensional access and the homogeneity of data type makes one think of a matrix, rather than list-of-vectors. It also seems less error-prone to rely on existing software to subset an hdf5 matrix in hdf5, rather than to come up with ad hoc solutions (like here) for sub-setting a list-of-vectors in a client language. A matrix would also make padding bootstrap ids moot.

Obviously these are not earth-shaking features, so please ignore if you like...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants