Writing a nan value should leave a gap #132

Open
rcasero opened this issue Sep 27, 2022 · 7 comments

@rcasero

rcasero commented Sep 27, 2022

Hi,

Not sure whether to consider this a bug, but given that bw.values("chr1", start, end) produces nan values for missing values, it seems that bw.addEntry() with nan values should produce gaps in the output file, rather than writing them as nan as it currently does.

For a numpy array x that contains nan values, there's a compact way to create the gaps:

valid_idx = ~np.isnan(x)
bw.addEntries(chrom, np.where(valid_idx)[0], values=x[valid_idx], span=1)

but perhaps this is something that should be enforced for consistency?

@dpryan79
Collaborator

That seems reasonable. Under the hood I have to iterate over the vector anyway, so testing for and skipping NaNs shouldn't be a big deal (just need to look up the C API for numpy and find how to do that).

@rcasero
Author

rcasero commented Sep 29, 2022

On a related note, if zeros are provided in addEntries(), are they written or also left as gaps? I imagine they have to be written because otherwise they'd be read as nans.

@dpryan79
Collaborator

They're written, exactly for the reason you suspected.

@rcasero
Author

rcasero commented Sep 29, 2022

Some more details. pyBigWig can save the whole vector of a chromosome,

bw.addEntries(chrom, 0, values=x[chrom], span=1, step=1)

but then it saves the NaNs as np.nan in the file, which can then break applications like peak callers.
Alternatively, it can save only the non-NaN values with their locations (indices),

valid_idx = ~np.isnan(x[chrom])
bw.addEntries(chrom, np.where(valid_idx)[0], values=x[chrom][valid_idx], span=1)

which is relatively fast, but this produces a larger bigwig file.

It is also possible to look for segments g of consecutive values that are not NaNs in the chromosome, and write those with

bw.addEntries(chrom, pos, values=g, span=1, step=1)

This is much slower, but the resulting file is smaller too. An example of times/file size:

| Method | File size | Processing time |
| --- | --- | --- |
| Using indices of non-NaN values | 9.6 GB | 14 min 30 s (+ flush time) |
| Iterating segments of non-NaN values | 5.3 GB | 1 h 8 min (+ flush time) |

@dpryan79
Collaborator

I'm surprised there's a difference in size and that the performance difference is so large. The actual data should be the same in either case as are the zoom levels created when the file is closed. I don't fully understand what's going on there. Do you have any data handy that I could use for testing? I have some higher priority things on my plate at the moment but this is vexing me.

@ahertweck

Following up on @rcasero's observation, I'm providing a reproducible example showing that even when only the indices of the intervals with non-NaN values are passed to bw.addEntries, the gaps are filled up with NaN values.

  1. Saving the array by creating gaps for NaN values
import pyBigWig
import numpy as np

# Initialise array as fake genomic signal for one chromosome
data = np.ones(1000)
# Replace half of the array with NaNs
data[500:1000] = np.nan
# Create dictionary from array
data_dict = dict()
data_dict['chr1'] = data

def to_bigwig_create_gaps(x, filename):
    """Save dictionary to bigWig by creating gaps
    """
    with pyBigWig.open(filename, 'wb') as bw:
        # add header (list of (chromosome, length) tuples)
        bw.addHeader([(k, len(v)) for k, v in x.items()])
        
        for chrom in x.keys():
            valid_idx = ~np.isnan(x[chrom])
            if np.all(valid_idx):
                # no NaN values, so the vector can be written in one go
                bw.addEntries(chrom, 0, values=x[chrom], span=1, step=1)
            else:
                # skip the NaN values by giving the location of every non-NaN
                # value and treating each one as a segment of span=1
                bw.addEntries(chrom, np.where(valid_idx)[0], values=x[chrom][valid_idx], span=1)

to_bigwig_create_gaps(data_dict, "by_gaps.bigwig")

As pointed out by @rcasero above, the array can also be saved segment-wise, resulting in files without NaN values, but at the cost of speed.

  2. Saving the array segment-wise
import itertools

def to_bigwig_segment_wise(x, filename):
    """Save dictionary to bigWig segment-wise
    """
    with pyBigWig.open(filename, 'wb') as bw:
        # add header (list of (chromosome, length) tuples)
        bw.addHeader([(k, len(v)) for k, v in x.items()])
        
        for chrom in x.keys():
            # first segment is at the beginning of the chromosome
            pos = 0
            # group values into segments using as criterion whether they are NaN or not
            for k, g in itertools.groupby(x[chrom], lambda a: ~np.isnan(a)):
                # values in current segment
                g = np.array(list(g))

                # if this is a segment of valid numbers, write it to file
                if k:
                    bw.addEntries(chrom, pos, values=g, span=1, step=1)
                # move pointer to the beginning of next segment
                pos += len(g)

to_bigwig_segment_wise(data_dict, "segment_wise.bigwig")

The resulting file size of by_gaps.bigwig is 1,528 bytes while segment_wise.bigwig is 757 bytes in size.

@rcasero
Author

rcasero commented Dec 2, 2022

Another colleague figured out what was going on with the different file sizes. Basically, the different call signatures of bw.addEntries() write different types of wiggle blocks (stored in binary form), each consisting of a header followed by the values. Depending on the kind of data you have (the same value repeated over long segments vs. lots of different values), you want to choose the bw.addEntries() signature that is most efficient for that data. In practice, this means writing a function that traverses the data, finds segments of repeated values and segments of changing values, and then chooses the bw.addEntries() call accordingly.

I think the package would benefit from that being documented, specifying what kind of wiggle block each bw.addEntries() syntax writes.
