Exit status 250 and SIGBUS (0x7) errors on GATK processes #1030

lfearnley · 2023-05-22T02:11:54Z

Description of the bug

I'm encountering the error described in #1024.

Briefly, I'm running nf-sarek using standard parameters on an HPC using singularity. I encounter this error on GATK components intermittently - some steps succeed on resubmission.

I've had a look at the following during debugging:

The /tmp directories for the machines on which nf-sarek is running seem to have space.
Adding a config.config with process.scratch = false doesn't fix the issue.
Cloning the git repo and adding gatk --java-options "-Djava.io.tmpdir=. -Xmx4g" to set the tmp dir still results in the error.
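
For the first check above, a minimal sketch of one way to confirm the free space, assuming a POSIX `df` and `awk` are available. `tmp_avail_kb` is a hypothetical helper name, not part of sarek or Nextflow.

```shell
# tmp_avail_kb is a hypothetical helper: print the available space, in KiB,
# on the filesystem that holds the given directory (default: /tmp).
tmp_avail_kb() {
    # -P forces the single-line POSIX output format; column 4 is "Available".
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

tmp_avail_kb /tmp   # the node's /tmp
tmp_avail_kb .      # the current directory, e.g. a task work dir
```

The same `df` call can be run both on the host and inside the container, since the two may not see the same /tmp mount.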

Any thoughts on how best to debug?

Command used and terminal output

../20230420_Garvan_SLOW5_conversion/nextflow run nf-core/sarek -profile wehi -work-dir /vast/scratch/users/fearnley.l/NGS_SAREK/ --step mapping --tools deepvariant,haplotypecaller,strelka,freebayes,manta,merge,cnvkit --save_mapped True --igenomes_base /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/references/ -resume --input /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/20230518_31666_sarek_manifest.csv --outdir /vast/projects/fearnleyl_ukbiobank/20230518_NGS_Trios/20230518_sarek_output/31666/ -c config.config

Results in

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR (31666)` terminated with an error exit status (250)

Command executed:

  gatk --java-options "-Xmx4g" BaseRecalibrator  \
      --input 31666.md.cram \
      --output 31666_chr17_491112-21795850.recal.table \
      --reference Homo_sapiens_assembly38.fasta \
      --intervals chr17_491112-21795850.bed \
      --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz \
      --tmp-dir . \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  250

Command output:
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  #  SIGBUS (0x7) at pc=0x00002b04cce71b0d, pid=93, tid=94
  #
  # JRE version:  (11.0.15) (build )
  # Java VM: OpenJDK 64-Bit Server VM (11.0.15-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
  # Problematic frame:
  # C  [libc.so.6+0x15cb0d]
  #
  # Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h" (or dumping to core.93)
  #
  # An error report file with more information is saved as:
  # hs_err_pid93.log
  #
  #

Command error:
  WARNING: While bind mounting '/vast:/vast': destination is already in the mount point list
  Using GATK jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.io.tmpdir=. -Xmx4g -jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar BaseRecalibrator --input 31666.md.cram --output 31666_chr17_491112-21795850.recal.table --reference Homo_sapiens_assembly38.fasta --intervals chr17_491112-21795850.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --tmp-dir .

Work dir:
  /vast/scratch/users/fearnley.l/NGS_SAREK/69/9f297ba2d5a7a7c16e2ab8304e0184

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

nextflow version 23.04.1.5866
HPC
slurm
Singularity
CentOS
Sarek 3.1.2

lfearnley · 2023-05-22T09:24:25Z

I pulled Sarek's git repo and tried GATK 4.4.0; the same issue is occurring for me.

lfearnley · 2023-05-23T08:03:29Z

nf-sarek was mounting /tmp on the executing host into the container's /tmp. --tmp-dir for GATK sets the tmp directory correctly, except the JVM is still storing hotspot performance data (hsperfdata) in /tmp at /tmp/hsperfdata_<username>/<pid>. This looks like known GATK behaviour - see this old GATK thread for reference.

When multiple singularity containers are running on the same host it looks like multiple containers are trying to map the same area in the host's /tmp - similar to the issue reported for Bazel here.
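
To see whether JVMs on a node are writing perf data into a shared /tmp, something like the following sketch can help. It only assumes the `/tmp/hsperfdata_<username>` layout described above; `list_hsperfdata` is a hypothetical helper name.

```shell
# list_hsperfdata is a hypothetical helper: print every hsperfdata_* directory
# found directly under a tmp root (default: /tmp).
list_hsperfdata() {
    root=${1:-/tmp}
    for d in "$root"/hsperfdata_*; do
        # If nothing matches, the glob stays literal and this test skips it.
        [ -d "$d" ] && printf '%s\n' "$d"
    done
    return 0
}

list_hsperfdata /tmp
```

If this prints a directory for your user while GATK tasks are running, the JVMs are still using the host's /tmp for perf data even though `--tmp-dir .` is set.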

Modifying all java-options passed to GATK in nf-sarek to use -XX:-UsePerfData seems to fix the problem with no further issues. I don't think hsperfdata is ever being used in the pipeline, and this may also slightly improve performance.
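
A minimal sketch of that change at the shell level, assuming the options string is assembled before `gatk` is called. `add_no_perfdata` is a hypothetical helper, not something provided by GATK or sarek; `-XX:-UsePerfData` is the HotSpot flag that disables the perf-data file.

```shell
# add_no_perfdata is a hypothetical helper: append -XX:-UsePerfData to a JVM
# option string unless the flag is already present.
add_no_perfdata() {
    case " $1 " in
        *" -XX:-UsePerfData "*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${1:+$1 }-XX:-UsePerfData" ;;
    esac
}

JAVA_OPTS=$(add_no_perfdata "-Xmx4g")
echo "$JAVA_OPTS"   # prints: -Xmx4g -XX:-UsePerfData

# The result would then be passed through unchanged, e.g.:
# gatk --java-options "$JAVA_OPTS" BaseRecalibrator ... --tmp-dir .
```

The check for an existing flag keeps the helper safe to apply more than once, for example when a module and a user config both add it.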

Tagging @pontus and @FriederikeHanssen in case there's something obvious I'm missing here in turning this off?

pontus · 2023-05-23T08:23:43Z

I don't see any problem with turning it off, but I don't consider the linked issue really similar: that seems to be running with docker, which by default creates a pid namespace (but, by contrast, doesn't bring the host /tmp into the container).

So, my understanding is that the crash in that issue comes from a mapped file being truncated, which upsets the other processes that have it mapped.

For the singularity case, those pid collisions that are almost guaranteed with docker will be very, very unusual.

So, no objection to a PR to add that option by default, but my guess would be that it's not this change that lets your jobs pass (at least not because of the reason in the linked issue; and if memory serves, disabling this made no difference with the memory-related issues we spent a lot of time troubleshooting, but that was quite long ago now).

FriederikeHanssen · 2023-05-24T08:09:08Z

@pontus are you referring to the issue where, once upon a time, all intermediate files were written to /tmp no matter what? That was solved by adding --tmp-dir . to the tool parameters.

I don't know enough about JVM to judge if this option is good or not. Either way though, I would suggest not updating in sarek directly but in nf-core/modules as all GATK modules are shared and it would benefit many other pipelines & developers. Can you open an issue/PR here: https://github.com/nf-core/modules and we can discuss with more people? :)

pontus · 2023-05-24T08:26:23Z

Sorry if that was unclear. I was trying to say that I didn't see any problem with disabling perfdata collection, but also didn't think it likely that you're seeing the crash for the same reason as Bazel did in the linked issue (the collisions should be /very/ rare, with singularity defaulting to a shared pid space and docker defaulting to not sharing /tmp).

I agree that if this should be brought in, it should be done in the modules repo.

lfearnley · 2023-05-24T11:02:16Z

No problem! I'll open an issue at the modules repo once I verify that running sarek with PerfData disabled is stable - I'll put an assortment of samples through and see if I can make it fault.

ffmmulder · 2023-08-17T11:07:44Z

After running into this issue on our cluster as well (on just about every run, from a test set to production data), implementing the fix suggested by @lfearnley indeed seems to fix this. Everything has been running stably so far (knocks on wood).

To confirm: Adding ' -XX:-UsePerfData' to the --java-options in the GATK modules has fixed the SIGBUS GATK crashes for me.

maxulysse · 2023-08-17T12:23:18Z

@ffmmulder Can you add a comment to the issue at nf-core/modules#3455?

PatrickMaclean · 2023-09-15T12:32:25Z

Thank you all for looking into this. I've had some success overcoming a SIGBUS error by changing the GATK processes' parameter to:

--java_options "-Xmx${avail_mem}M -XX:-UsePerfData"

lfearnley · 2023-09-17T02:59:30Z

This is an ongoing issue which applies to other nextflow/nf-core pipelines when running Java. It's also impacting running nf-raredisease; I'll be updating nf-core/modules#3455.

pontus · 2023-09-19T06:27:33Z

Hopefully fixed by #1240, closing. Probably best to collect in nf-core/modules#3455 if it didn't help as expected.

lfearnley added the bug label May 22, 2023

lfearnley mentioned this issue May 24, 2023

[FEATURE] Disabling JVM Hotspot in modules for JAVA tools nf-core/modules#3455

Open

maxulysse mentioned this issue Sep 18, 2023

FIX: Disable JVM hotspot in gatk4 modules #1240

Merged

pontus closed this as completed Sep 19, 2023
