md5checksum shows example dataset analysis fails #13

Open
RishiDeKayne opened this issue Mar 16, 2021 · 12 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@RishiDeKayne

RishiDeKayne commented Mar 16, 2021

Hi, I've been trying to run dentist on the provided example dataset, but a number of the md5 checksums fail after it finishes running, with no other errors that I can find.

I installed snakemake v6.0.0 and singularity v3.6.3 through conda and ran through the example dataset as follows:

wget https://bds.mpi-cbg.de/hillerlab/DENTIST/dentist-example.v1.0.1.tar.gz
tar -xzf ./dentist-example.v1.0.1.tar.gz
cd dentist-example

# run the workflow
SKIP_LACHECK=1 snakemake --configfile=snakemake.yaml --use-singularity --cores=4 

# validate the files
md5sum -c checksum.md5

but the checksum output was as follows:

gap-closed.fasta: FAILED
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED
workdir/gap-closed-preliminary.reads.las: FAILED
workdir/reads.db: OK
md5sum: WARNING: 15 computed checksums did NOT match

any advice on how to get the example dataset running would be greatly appreciated,
Thanks,
Rishi

@a-ludi
Owner

a-ludi commented Mar 16, 2021

Hi Rishi, could you share one of the logs/process.*.log files? Somebody else experienced failing md5sums like you do and the reason was that one of the auxiliary tools crashed in most of the calls for a yet unknown reason. Could you also share some more information about your system?

lsb_release -a
free -h

@a-ludi a-ludi self-assigned this Mar 16, 2021
@a-ludi a-ludi added bug Something isn't working help wanted Extra attention is needed labels Mar 16, 2021
@RishiDeKayne
Author

Sure, I have attached process.1.log and the system info is as follows:

lsb_release -a

output:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
free -h

output:

              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

process.1.log

@a-ludi
Owner

a-ludi commented Mar 17, 2021

As I suspected, it is the same memory-associated error:

$ jq 'select((.exitStatus // 0) != 0)' process.1.log | head -n50
{
  "thread": 140513968151344,
  "logLevel": "diagnostic",
  "state": "post",
  "command": [
    "computeintrinsicqv",
    "-d19",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.db",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.pileup-55b-56f-chained-filtered.las"
  ],
  "output": [
    "allocation failure: Invalid argument cachelinesize=0 requested size is 24",
    "AutoArray<unsigned long,alloc_type_memalign_cacheline> failed to allocate 3 elements (24 bytes)",
    "current total allocation 467987",
    "",
    ""
  ],
  "exitStatus": 1,
  "timestamp": 637514242850290800,
  "action": "execute",
  "type": "command"
}
... (many more instances with the same signature)

The problem is clearly not related to a lack of memory. Since I have no in-depth knowledge of computeintrinsicqv, I will ask the author for help.

In the meantime, you may try running it on a different machine.
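
To see at a glance which auxiliary tools are failing, the `jq` filter above can be extended to count non-zero exits per tool. This is only a sketch: it assumes every log entry is a JSON object with the `command` and `exitStatus` fields shown in the excerpt, and the function name `count_failed_tools` is made up for illustration.

```shell
# Count non-zero exits per tool across DENTIST process logs.
# Assumption: each log is a stream of JSON objects carrying "command"
# (an array) and "exitStatus", as in the excerpt above.
# The function name is hypothetical, not part of DENTIST.
count_failed_tools() {
    jq -r 'select((.exitStatus // 0) != 0) | (.command[0] // "unknown")' "$@" \
        | sort | uniq -c | sort -rn
}

# Usage (log path as mentioned earlier in this thread):
#   count_failed_tools logs/process.*.log
```

If a single tool such as `computeintrinsicqv` accounts for nearly all failures, that narrows down where to look.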

@a-ludi
Owner

a-ludi commented Mar 17, 2021

Information from the other user who hit the same error:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        15Gi       2.5Gi       4.1Gi       989Gi       982Gi
Swap:            0B          0B          0B
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"

BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

@RishiDeKayne
Author

Hi again,
Weirdly, I reran the example set on each of our computing nodes: it failed on every one of our big-memory machines but ran on our regular machines. I did the same system checks as above but can't find anything obviously different between the two, so I'm still not sure what could be causing it. In case it is helpful:

$ lsb_release -a

##WORKED - regular 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

##FAILED - big-memory 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

$ free -h

##WORKED - regular
              total        used        free      shared  buff/cache   available
Mem:           360G        5.2G        332G        4.4M         21G        354G
Swap:          8.0G        8.0G         88K

##FAILED - big memory  
              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

On the regular machines, all checksum outputs now say 'OK'.
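
Since `lsb_release` and `free` look identical on both node types, a few more host properties might be worth comparing. The failing log reports `cachelinesize=0`, so the sketch below also prints the L1 data cache line size as seen by `getconf`. This is a guess at a useful diagnostic, not something confirmed in this thread, and the value seen inside the Singularity container may differ from the host's. The function name `node_summary` is hypothetical.

```shell
# Print a few host properties that could be compared between nodes.
# LEVEL1_DCACHE_LINESIZE is a glibc getconf variable; on other systems it
# may be missing, in which case "unknown" is printed instead.
node_summary() {
    cacheline=$(getconf LEVEL1_DCACHE_LINESIZE 2>/dev/null || true)
    printf 'kernel=%s\n' "$(uname -r)"
    printf 'cpus=%s\n' "$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
    printf 'l1d_cacheline=%s\n' "${cacheline:-unknown}"
}

node_summary
```

Running this on one working and one failing node and diffing the output would show whether any of these values differ.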

@a-ludi
Owner

a-ludi commented Mar 22, 2021

Hmm, interesting. I will try running the example on a 1TB memory machine as well. Maybe there is some bug related to large pointers.

@shri1984

shri1984 commented Apr 6, 2021

Hi,
I have the same issue. md5sum -c checksum.md5 failed (15 cases). I am using a machine with 2 TB RAM (Ubuntu).

@a-ludi
Owner

a-ludi commented Apr 19, 2021

I tried it on one of our big memory machines and it worked as expected:

# submit job with 8 cores
$ sbatch -c8 -pbigmem --wrap='snakemake --configfile=snakemake.yaml --use-singularity --cores=$SLURM_JOB_CPUS_PER_NODE'
# memory information about the machine
$ ssh r01n03 free -h
              total        used        free      shared  buff/cache   available
Mem:           1.0T        964G         40G        1.6G        2.5G         39G
Swap:            0B          0B          0B
# OS information about the machine
$ ssh r01n03 lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:        7.4.1708
Codename:       Core

So I conjecture (:smile:) that it is not just the amount of total or available memory that causes the bug. But I still have no clue what's going on. Also, I have not heard anything from the author of daccord (see this issue). I will keep digging.

@a-ludi
Owner

a-ludi commented Jun 22, 2021

@shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2). I would be very happy if you could test the example again and see if it works.

The issue (likely) was that I used Alpine Linux in the Container which has its own libc implementation that is not 100% compatible with glibc used in common Distros like Ubuntu. I switched to Ubuntu and the error went away on one of my machines.
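
For anyone who wants to check which C library a given container or host provides, one rough approach is to classify the text printed by `ldd --version`. The sketch below assumes that glibc builds mention "GLIBC" or "GNU libc" in that output and that musl (the libc Alpine ships) prints "musl libc". The function name `detect_libc` is made up for illustration.

```shell
# Classify the C library from the text printed by `ldd --version`.
# Assumption: glibc output mentions "GLIBC" or "GNU libc"; musl output
# mentions "musl". Anything else is reported as unknown.
# The function name is hypothetical.
detect_libc() {
    case "$1" in
        *musl*) echo musl ;;
        *GLIBC*|*"GNU libc"*) echo glibc ;;
        *) echo unknown ;;
    esac
}

# Usage, on a host or inside a container shell (musl prints to stderr,
# hence the redirect):
#   detect_libc "$(ldd --version 2>&1)"
```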

@shri1984

Thanks @a-ludi. The example dataset ran fine, including the md5sum check. The latest version helped. I am now trying dentist on my Hi-C scaffolded HiFi assembly. I will post an update here.

@lizhao007

> @shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2). I would be very happy if you could test the example again and see if it works.
>
> The issue (likely) was that I used Alpine Linux in the Container which has its own libc implementation that is not 100% compatible with glibc used in common Distros like Ubuntu. I switched to Ubuntu and the error went away on one of my machines.

Thanks for your work, but I get the same issue with the example data using the latest version (v4.0.0): md5sum -c checksum.md5 failed (15 cases). The information about the machine is:

              total        used        free      shared  buff/cache   available
Mem:           2.0T        535G        1.4T         56M        5.7G        1.4T
Swap:          4.0G        2.1G        1.9G

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.5.1804 (Core) 
Release:	7.5.1804
Codename:	Core

@a-ludi
Owner

a-ludi commented Dec 5, 2022

Hi @lizhao007 ,

could you please share the list of files that failed the checksum test? I need it to get an idea what went wrong.
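
One way to produce that list is to filter the `md5sum -c` output down to the entries marked `FAILED`. The following is a minimal sketch, assuming the `checksum.md5` file from the example archive; the function name `list_failed_checksums` is hypothetical.

```shell
# Print only the file names from a checksum file that do not verify.
# md5sum -c marks mismatches as "<file>: FAILED"; the sed expression keeps
# those lines and strips the status suffix. Warnings on stderr are dropped.
# The function name is hypothetical.
list_failed_checksums() {
    md5sum -c "$1" 2>/dev/null | sed -n 's/: FAILED.*$//p'
}

# Usage, from inside the dentist-example directory:
#   list_failed_checksums checksum.md5 > failed-files.txt
```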
