Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

Open
mpiersonsmela opened this issue Jan 19, 2024 · 8 comments

Comments

@mpiersonsmela
Copy link

I am consistently getting the following error while running permutations. It doesn't always happen on the same permutation.
Example output:

Beginning permutation 1
...Chromosome chr1: Smoothed (1.26 min). 38 CpG(s) excluded due to zero coverage. 72 regions scored (0.07 min).
...Chromosome chr2: Smoothed (1.87 min). 10 regions scored (0.04 min).
...Chromosome chr3: Smoothed (1.34 min). 6 regions scored (0.03 min).
...Chromosome chr4: Smoothed (0.68 min). 11 regions scored (0.04 min).
...Chromosome chr5: Smoothed (1.65 min). 2 CpG(s) excluded due to zero coverage. Error in .start_as_unnamed_integer(start) :
each range must have a start that is < 2^31 and > - 2^31
Calls: dmrseq ... .new_IRanges_from_start_end -> .start_as_unnamed_integer
In addition: Warning messages:
1: In FUN(X[[i]], ...) : no non-missing arguments to min; returning Inf
2: In FUN(X[[i]], ...) : no non-missing arguments to max; returning -Inf
Execution halted

I am using the latest version of all packages.

@mpiersonsmela
Copy link
Author

The input command was:
naive_new_reg <- dmrseq(bs = sort(oog_naive_new), testCovariate=testCovariate, block=TRUE, minInSpan=150, bpSpan=5000, maxGapSmooth=10000)

I previously ran it with default values for minInSpan, bpSpan, and maxGapSmooth but it gave a warning to increase them. (Also, what values are recommended for running with block=TRUE?)

@kdkorthauer
Copy link
Owner

Hi @mpiersonsmela,

Thanks for reporting this issue. I'm not able to determine the root cause from the output alone. Can you let me know what (if any) filtering you've done on the input object (e.g. removing loci with low coverage across samples)? If you're able to share a small portion of your input data that reproduces the output (e.g. chromosome 5 in the example above - you can use set.seed() to provide a reproducible example), that would greatly help. You may upload the small example input here.

Regarding your question about parameter settings when using block=TRUE, you may refer to the documentation section here which has some guidance on how to interpret the parameters. The bpSpan parameter governs the size of the window over which methylation values are smoothed, as long as they have at least minInSpan covered loci that are spaced apart my a max of maxGapSmooth. Your settings seem like a reasonable starting point, but in general I'd recommend examining the output candidate regions after scaling up with blocks - if you still see many candidate regions that are relatively close together, where you'd otherwise interpret them as one contiguous region, then I'd recommend to scale up all of those parameters further to make sure the smoothing is at a more relevant scale. Hope that helps!

@mpiersonsmela
Copy link
Author

OK, I'll upload the example data tomorrow. Regarding the filtering, I'm removing all the loci that have zero reads for all samples in a group. This filtering strategy worked OK when not running block=TRUE.

This is the filtering code I'm using:

#Read in packages
suppressPackageStartipMessages(library(tidyverse))
suppressPackageStartipMessages(library(magrittr))
suppressPackageStartipMessages(library(bsseq))
suppressPackageStartipMessages(library(dmrseq))
suppressPackageStartipMessages(library(BiocParallel))
suppressPackageStartipMessages(library(argparse))
register(MulticoreParam(16)) 

# Make sure metadata has the following structure:
#    Files   | Group
# -----------------------
# <file_name>| <int 1 or 2>

args = commandArgs(trailingOnly=TRUE)

metadata_path <- args[1]

# Read in data
metadata <- read.csv(metadata_path)

# get file names
cov_files <- metadata %>%
  pull(Files)

# make BSseq object
bismark_cov <- read.bismark(files = cov_files, rmZeroCov = TRUE)

# trt = vector of condition labels for each sample
trt <- metadata %>%
  pull(Group)
pData(bismark_cov)$Condition <- trt

# filter for loci that are not 0 in all samples of a condition
## get group 1 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 1)
cov_temp_1 <- bismark_cov[, sample_1]
loci_1_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_1, type="Cov")) > 0)

## get group 2 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 2)
cov_temp_2 <- bismark_cov[, sample_2]
loci_2_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_2, type="Cov")) > 0)

# only keep loci that have at least 1 read in both sample groups 
loci.idx <- sort(unique(intersect(loci_1_keep, loci_2_keep)))
bismark_cov_filt <- bismark_cov[loci.idx,]

testCovariate <- "Condition"
bismark_cov_dmr <- dmrseq(bs = sort(bismark_cov_filt),
                          testCovariate = testCovariate, minInSpan=150, bpSpan=5000, maxGapSmooth=10000, block=TRUE)

@mpiersonsmela
Copy link
Author

OK, I uploaded the files to your Dropbox link. I wasn't able to figure out how to take a small portion that replicates the results, so I just uploaded the whole thing.

@kdkorthauer
Copy link
Owner

Thanks again for reporting this issue and for sending me your files to debug. It took me a while to track it down, but I found the implicated line that didn't properly filter loci with zero coverage in all samples of one permuted condition in the permutations during the construction of candidate blocks. I suspect I never encountered this before because I typically use more stringent filters for coverage. I pushed a fix, and was able to run your code successfully. You can access this version on the devel branch of GitHub immediately (BiocManager::install_github("kdkorthauer/dmrseq", "devel"), and it will also be available in Bioconductor Devel (version 3.19) in either today or Monday's build. Assuming it passes all checks, I'll port it over to the current Bioc release version (version 3.18) early next week.

As a side note, I noticed that the number of loci in your dataset and the high number of consecutive genomic coordinates look like you may be including loci on both strands separately. You may consider collapsing strands. Happy to provide some more information on that if you're interested.

Thanks again for helping me to improve the software.

@mpiersonsmela
Copy link
Author

Thanks! I'll follow up with you by email about collapsing strands.

@mpiersonsmela
Copy link
Author

mpiersonsmela commented Feb 17, 2024 via email

@kdkorthauer
Copy link
Owner

Yes, exactly - collapsing would treat each CpG as one observation. If you are interested in hemimethylation, then it would not be appropriate to collapse.

kdkorthauer added a commit that referenced this issue Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants