Exit status 250 and SIGBUS (0x7) errors on GATK processes #1030

Closed
lfearnley opened this issue May 22, 2023 · 11 comments
Labels
bug Something isn't working

@lfearnley

Description of the bug

I'm encountering the error described in #1024.

Briefly, I'm running nf-sarek using standard parameters on an HPC using singularity. I encounter this error on GATK components intermittently - some steps succeed on resubmission.

I've had a look at the following during debugging:

  • The /tmp directories for the machines on which nf-sarek is running seem to have space.
  • Adding a config.config with process.scratch = false doesn't fix the issue.
  • Cloning the git repo and adding gatk --java-options "-Djava.io.tmpdir=. -Xmx4g" to set the tmp dir still results in the error.

Any thoughts on how best to debug?

Command used and terminal output

../20230420_Garvan_SLOW5_conversion/nextflow run nf-core/sarek -profile wehi -work-dir /vast/scratch/users/fearnley.l/NGS_SAREK/ --step mapping --tools deepvariant,haplotypecaller,strelka,freebayes,manta,merge,cnvkit --save_mapped True --igenomes_base /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/references/ -resume --input /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/20230518_31666_sarek_manifest.csv --outdir /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/20230518_sarek_output/31666/ -c config.config

Results in

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR (31666)` terminated with an error exit status (250)

Command executed:

  gatk --java-options "-Xmx4g" BaseRecalibrator  \
      --input 31666.md.cram \
      --output 31666_chr17_491112-21795850.recal.table \
      --reference Homo_sapiens_assembly38.fasta \
      --intervals chr17_491112-21795850.bed \
      --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz \
      --tmp-dir . \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  250

Command output:
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  #  SIGBUS (0x7) at pc=0x00002b04cce71b0d, pid=93, tid=94
  #
  # JRE version:  (11.0.15) (build )
  # Java VM: OpenJDK 64-Bit Server VM (11.0.15-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
  # Problematic frame:
  # C  [libc.so.6+0x15cb0d]
  #
  # Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h" (or dumping to core.93)
  #
  # An error report file with more information is saved as:
  # hs_err_pid93.log
  #
  #

Command error:
  WARNING: While bind mounting '/vast:/vast': destination is already in the mount point list
  Using GATK jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.io.tmpdir=. -Xmx4g -jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar BaseRecalibrator --input 31666.md.cram --output 31666_chr17_491112-21795850.recal.table --reference Homo_sapiens_assembly38.fasta --intervals chr17_491112-21795850.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --tmp-dir .

Work dir:
  /vast/scratch/users/fearnley.l/NGS_SAREK/69/9f297ba2d5a7a7c16e2ab8304e0184

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
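
To see whether other tasks hit the same crash, the JVM error reports named in the output above (hs_err_pid<pid>.log) can be searched for across the work directory. This is a minimal sketch, assuming the reports are left in each task's work dir; `find_sigbus_reports` is a hypothetical helper name, not part of Nextflow or Sarek.

```shell
# List JVM crash reports under a Nextflow work dir that mention SIGBUS.
# find_sigbus_reports is a hypothetical helper, not part of Nextflow or Sarek.
find_sigbus_reports() {
    local workdir="${1:-.}"
    # hs_err_pid<pid>.log is the report name shown in the JVM output above.
    find "$workdir" -type f -name 'hs_err_pid*.log' \
        -exec grep -l 'SIGBUS' {} + 2>/dev/null | sort
}
```

For example, `find_sigbus_reports /vast/scratch/users/fearnley.l/NGS_SAREK` would print the path of each matching report, one per line.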

Relevant files

No response

System information

nextflow version 23.04.1.5866
HPC
slurm
Singularity
CentOS
Sarek 3.1.2

@lfearnley added the bug label May 22, 2023
@lfearnley
Author

I pulled Sarek's git repo and tried GATK 4.4.0; the same issue is occurring for me.

@lfearnley
Author

lfearnley commented May 23, 2023

nf-sarek was mounting /tmp on the executing host into the container's /tmp. --tmp-dir for GATK sets the tmp directory correctly, except the JVM is still storing hotspot performance data (hsperfdata) in /tmp at /tmp/hsperfdata_<username>/<pid>. This looks like known GATK behaviour - see this old GATK thread for reference.

When multiple singularity containers are running on the same host it looks like multiple containers are trying to map the same area in the host's /tmp - similar to the issue reported for Bazel here.
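
One way to check this on an affected node is to list the perf-data files that currently exist under the shared /tmp. The sketch below assumes only the /tmp/hsperfdata_<username>/<pid> layout described above; `list_hsperfdata` is a hypothetical helper name.

```shell
# List JVM perf-data files (one per JVM PID) under a tmp directory.
# list_hsperfdata is a hypothetical helper; the layout it assumes is
# <tmp>/hsperfdata_<username>/<pid>, as described in this thread.
list_hsperfdata() {
    local tmp="${1:-/tmp}"
    find "$tmp" -maxdepth 2 -type f -path "$tmp/hsperfdata_*/*" 2>/dev/null | sort
}
```

Running `list_hsperfdata` on the host while several GATK tasks are active shows which PID-named files the containers are creating in the shared directory.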

Modifying all java-options passed to GATK in nf-sarek to use -XX:-UsePerfData seems to fix the problem with no further issues. I don't think hsperfdata is ever being used in the pipeline, and this may also slightly improve performance.
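
As a rough illustration of that change, the helper below appends -XX:-UsePerfData to an existing Java options string unless it is already present. This is a sketch, not the actual patch to the modules; `add_no_perfdata` is a hypothetical name.

```shell
# Append -XX:-UsePerfData to a Java options string unless already present.
# add_no_perfdata is a hypothetical helper, not part of GATK or Sarek.
add_no_perfdata() {
    local opts="$1"
    case " $opts " in
        *" -XX:-UsePerfData "*) printf '%s\n' "$opts" ;;
        *) printf '%s\n' "${opts:+$opts }-XX:-UsePerfData" ;;
    esac
}
```

It could be used as `gatk --java-options "$(add_no_perfdata "-Xmx4g")" BaseRecalibrator ...`, with the remaining arguments unchanged from the command shown earlier.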

Tagging @pontus and @FriederikeHanssen in case there's something obvious I'm missing here in turning this off?

@pontus

pontus commented May 23, 2023

I don't see any problem with turning it off, but I also don't consider the linked issue really similar. That one seems to be running with Docker, which by default creates a PID namespace but, by contrast, doesn't bring the host's /tmp into the container.

So, my understanding is that the crash in that issue comes from a mapped file being truncated, which upsets the other processes that have it mapped.

For the Singularity case, the PID collisions that are almost guaranteed with Docker will be very, very unusual.

So, no objection to a PR to add that option by default, but my guess would be that it's not this change that lets your jobs pass (at least not because of the reason in the linked issue). If memory serves, disabling this made no difference to the memory-related issues we spent a lot of time troubleshooting, but that was quite long ago now.

@FriederikeHanssen
Contributor

@pontus are you referring to the issue where, once upon a time, all intermediate files were written to /tmp no matter what? That was solved by adding --tmp-dir . to the tool parameters.

I don't know enough about the JVM to judge whether this option is good or not. Either way, I would suggest making the change in nf-core/modules rather than in sarek directly, as all GATK modules are shared and it would benefit many other pipelines & developers. Can you open an issue/PR here: https://github.com/nf-core/modules so we can discuss with more people? :)

@pontus

pontus commented May 24, 2023

Sorry if that was unclear. I was trying to communicate that I don't see any problem with disabling perfdata collection, but I also don't think it likely that this crash has the same cause as the Bazel one in the linked issue (the collisions should be /very/ rare, with Singularity defaulting to a shared PID space and Docker defaulting to not sharing /tmp).

I agree that if this should be brought in, it should be done in the modules repo.

@lfearnley
Author

lfearnley commented May 24, 2023 via email

@ffmmulder

After running into this issue on our cluster as well (on just about every run, with data ranging from a test set to production data), implementing the fix suggested by @lfearnley indeed seems to resolve it. Everything has been running stably so far (knocks on wood).

To confirm: Adding ' -XX:-UsePerfData' to the --java-options in the GATK modules has fixed the SIGBUS GATK crashes for me.

@maxulysse
Member

@ffmmulder Can you add a comment to the issue nf-core/modules#3455?

@PatrickMaclean

Thank you all for looking into this. I've had some success overcoming a SIGBUS error by changing the GATK processes' Java options parameter to:

--java-options "-Xmx${avail_mem}M -XX:-UsePerfData"

@lfearnley
Author

This is an ongoing issue that also applies to other nextflow/nf-core pipelines that run Java. It also affects nf-raredisease runs; I'll be updating nf-core/modules#3455.

@pontus

pontus commented Sep 19, 2023

Hopefully fixed by #1240, closing. Probably best to collect in nf-core/modules#3455 if it didn't help as expected.

@pontus closed this as completed Sep 19, 2023