
Bus error on version 0.2.3 #246

Closed
garrison opened this issue Oct 10, 2020 · 18 comments

@garrison commented Oct 10, 2020

I have been experiencing intermittent Bus errors, I believe since upgrading to v0.2.3.

Here's a typical traceback:

      From worker 27:	signal (7): Bus error
      From worker 27:	in expression starting at none:0
      From worker 27:	unsafe_store! at ./pointer.jl:118 [inlined]
      From worker 27:	unsafe_store! at ./pointer.jl:118 [inlined]
      From worker 27:	unsafe_store! at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/misc.jl:15 [inlined]
      From worker 27:	_write at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/mmapio.jl:190 [inlined]
      From worker 27:	write at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/misc.jl:27 [inlined]
      From worker 27:	write_object_header_and_dataspace_message at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/datasets.jl:442
      From worker 27:	write_dataset at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/datasets.jl:385
      From worker 27:	write_dataset at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/inlineunion.jl:44 [inlined]
      From worker 27:	write at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/groups.jl:119
      From worker 27:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 27:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 27:	write at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/JLD2.jl:339
      From worker 27:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 27:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 27:	#48 at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/fileio.jl:11
      From worker 27:	#jldopen#35 at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/loadsave.jl:4
      From worker 27:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 27:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 27:	jldopen at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/loadsave.jl:2
      From worker 27:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 27:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 27:	#save#47 at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/fileio.jl:5
      From worker 27:	save at /home/jrgarr/.julia/packages/JLD2/hIQpq/src/fileio.jl:5
      From worker 27:	unknown function (ip: 0x7f67cac25695)
      From worker 27:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 27:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 27:	jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
      From worker 27:	do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
      From worker 27:	jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
      From worker 27:	#invokelatest#1 at ./essentials.jl:710 [inlined]
      From worker 27:	invokelatest at ./essentials.jl:709 [inlined]
      From worker 27:	#save#31 at /home/jrgarr/.julia/packages/FileIO/wN5rD/src/loadsave.jl:230
      From worker 27:	save at /home/jrgarr/.julia/packages/FileIO/wN5rD/src/loadsave.jl:214 [inlined]
      From worker 27:	#save#17 at /home/jrgarr/.julia/packages/FileIO/wN5rD/src/loadsave.jl:136 [inlined]
      From worker 27:	save at /home/jrgarr/.julia/packages/FileIO/wN5rD/src/loadsave.jl:136 [inlined]

Unfortunately I don't have a minimal test case, but I have now noticed this on two different (unrelated) code paths, so I believe it is a problem with JLD2 (perhaps combined with some aspect of my configuration). Both code paths only began experiencing errors recently; I was using v0.2.1 previously.

@JonasIsensee (Collaborator) commented Oct 10, 2020

Hi @garrison,
thanks for reporting this!

Can you tell me more about the system and what you are trying to do? (e.g. file size, number of workers)

According to Wikipedia, bus errors tend to be quite rare and on Linux systems mostly show up when there are problems with memory mapping (so if you use iotype=IOStream there shouldn't be any problem).

Taking a wild guess: can you make sure that your disk quota is not exhausted? Are you certain that different workers don't accidentally try to write to or remove the same files?

Edit: The stack trace in #127 looks quite similar to yours.
This makes me wonder whether this could indeed be a sign of the machine running out of memory. I think the memory footprint will generally be larger when using MmapIO rather than IOStream; the whole idea is to write to (fast) memory instead of (slow) disk and let the OS take care of syncing later.
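
For reference, here is a minimal sketch of that IOStream workaround (hypothetical file name and data; it just passes the iotype keyword to jldopen):

      # Open the file with a plain IOStream backend instead of the default
      # memory-mapped MmapIO backend (file name and data are placeholders).
      using JLD2

      jldopen("results.jld2", "w"; iotype=IOStream) do file
          file["data"] = rand(100)   # whatever needs to be saved
      end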

@garrison (Author)

Hi @JonasIsensee. I am experiencing this error with exclusive access to a Regular Shared Memory node on Bridges, a machine at the Pittsburgh Supercomputing Center. The code uses only a small portion of the available RAM. I am using one worker per available core, so 28 workers. Different workers are not accessing the same files. I do not expect it is a quota issue, as I am well below quota, and the error is only intermittent. The successfully saved files are ~16 megabytes.
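
Roughly, the setup looks like the following (hypothetical file names and stand-in data; each worker writes only to its own file):

      using Distributed
      addprocs(28)                       # one worker per core
      @everywhere using FileIO, JLD2

      pmap(1:nworkers()) do id
          results = rand(10^6)           # stand-in for the Monte Carlo output
          # every worker saves to a distinct file, so no file is shared
          save("results_worker_$(id).jld2", Dict("results" => results))
      end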

I can try iotype=IOStream. On the other hand, I may try rolling back to 0.2.1 to see if the error surfaces there.

@JonasIsensee (Collaborator)

Oh, I could hardly have been more wrong, then,
but I'll stick to my hypothesis that this is an mmap problem.

I had a look at the diffs between the latest patch releases of JLD2, and the changes that might have caused this (at least my best guess) were part of v0.2.1 (#231).

So your errors should persist on v0.2.1.

@MichaelHatherly

I've hit this twice today on JLD2 0.2.3 and Julia 1.5.3-pre, though I'm now not able to reproduce it. I'm pretty sure I wasn't mmapping anything at the time. Here's the backtrace in case it helps. I'll also try to produce a reduced case if I manage to hit it again.

      From worker 4:	
      From worker 4:	signal (7): Bus error
      From worker 4:	in expression starting at none:0
      From worker 4:	unsafe_store! at ./pointer.jl:118 [inlined]
      From worker 4:	unsafe_store! at ./pointer.jl:118 [inlined]
      From worker 4:	unsafe_store! at /home/mike/.julia/packages/JLD2/4AJn8/src/misc.jl:15 [inlined]
      From worker 4:	_write at /home/mike/.julia/packages/JLD2/4AJn8/src/mmapio.jl:183 [inlined]
      From worker 4:	write at /home/mike/.julia/packages/JLD2/4AJn8/src/misc.jl:27 [inlined]
      From worker 4:	commit at /home/mike/.julia/packages/JLD2/4AJn8/src/datatypes.jl:259
      From worker 4:	h5fieldtype at /home/mike/.julia/packages/JLD2/4AJn8/src/data.jl:837
      From worker 4:	unknown function (ip: 0x7f13cda0f09d)
      From worker 4:	h5type at /home/mike/.julia/packages/JLD2/4AJn8/src/data.jl:843 [inlined]
      From worker 4:	commit at /home/mike/.julia/packages/JLD2/4AJn8/src/data.jl:223
      From worker 4:	unknown function (ip: 0x7f13cda0dbba)
      From worker 4:	h5type at /home/mike/.julia/packages/JLD2/4AJn8/src/data.jl:159
      From worker 4:	unknown function (ip: 0x7f13cda0b672)
      From worker 4:	h5type at /home/mike/.julia/packages/JLD2/4AJn8/src/data.jl:168
      From worker 4:	write_dataset at /home/mike/.julia/packages/JLD2/4AJn8/src/datasets.jl:525
      From worker 4:	write at /home/mike/.julia/packages/JLD2/4AJn8/src/groups.jl:119
      From worker 4:	write at /home/mike/.julia/packages/JLD2/4AJn8/src/JLD2.jl:327
      From worker 4:	#33 at /home/mike/.julia/packages/JLD2/4AJn8/src/loadsave.jl:126
      From worker 4:	#jldopen#29 at /home/mike/.julia/packages/JLD2/4AJn8/src/loadsave.jl:4
      From worker 4:	jldopen at /home/mike/.julia/packages/JLD2/4AJn8/src/loadsave.jl:2
      From worker 4:	#save#32 at /home/mike/.julia/packages/JLD2/4AJn8/src/loadsave.jl:120
      From worker 4:	save at /home/mike/.julia/packages/JLD2/4AJn8/src/loadsave.jl:120
      From worker 4:	unknown function (ip: 0x7f13cd9fdbb5)
      From worker 4:	jl_apply at /home/mike/usr/julia/stable/src/julia.h:1690 [inlined]
      From worker 4:	do_apply at /home/mike/usr/julia/stable/src/builtins.c:655
      From worker 4:	jl_f__apply_latest at /home/mike/usr/julia/stable/src/builtins.c:705
      From worker 4:	#invokelatest#1 at ./essentials.jl:710 [inlined]
      From worker 4:	invokelatest at ./essentials.jl:709 [inlined]
      From worker 4:	#save#31 at /home/mike/.julia/packages/FileIO/wN5rD/src/loadsave.jl:230
      From worker 4:	save at /home/mike/.julia/packages/FileIO/wN5rD/src/loadsave.jl:214 [inlined]
      From worker 4:	#save#17 at /home/mike/.julia/packages/FileIO/wN5rD/src/loadsave.jl:136 [inlined]
      From worker 4:	save at /home/mike/.julia/packages/FileIO/wN5rD/src/loadsave.jl:136 [inlined]
      From worker 4:	#call_cached#2 at /home/mike/.julia/dev/ProvenanceTracking/src/caching.jl:95
      From worker 4:	unknown function (ip: 0x7f13cd9ec956)
      From worker 4:	call_cached at /home/mike/.julia/dev/ProvenanceTracking/src/caching.jl:58 [inlined]
      From worker 4:	#10#11 at /home/mike/.julia/dev/ProvenanceTracking/src/plans.jl:139 [inlined]
      From worker 4:	#10 at /home/mike/.julia/dev/ProvenanceTracking/src/plans.jl:139 [inlined]
      From worker 4:	#47 at ./threadingconstructs.jl:169
      From worker 4:	unknown function (ip: 0x7f13cd9e7b2c)
      From worker 4:	jl_apply at /home/mike/usr/julia/stable/src/julia.h:1690 [inlined]
      From worker 4:	start_task at /home/mike/usr/julia/stable/src/task.c:707
      From worker 4:	unknown function (ip: (nil))
      From worker 4:	Allocations: 54092525 (Pool: 54077006; Big: 15519); GC: 53

@garrison (Author)

I have yet to experience the error since downgrading to 0.2.1. I was hitting it fairly regularly on 0.2.3.

I haven't tried git-bisect yet, but I'll note that #230 did change a line involving MmapIO.

Another, less likely possibility is that this change to JLD2 is correct but somehow caused some other underlying issue in Julia to surface.

@garrison (Author)

P.S. I am using Julia v1.5.2.

@JonasIsensee (Collaborator) commented Oct 11, 2020

Thanks to both of you for trying to help figure this out.

You're right, #230 did change some lines in the MmapIO struct/file, but I don't really see how this could create such errors.
Btw, here's the diff for v0.2.1...v0.2.3.

Apparently, in some cases misused or poorly implemented locks can cause bus errors (but this doesn't really seem to fit the stack traces).
Do you by any chance happen to be using a lot of threads?

There is JuliaLang/julia#28245, but I'd call that misuse rather than a bug in Julia.

@JonasIsensee (Collaborator)

Btw, is this reproducible for the exact same file?
(e.g. it's only intermittent in general, but if you run your code again, will the exact same job die?)

In that case, it could be that something is wrong in the byte number calculation and we end up trying to write to a region that is not mapped.

@garrison (Author) commented Oct 11, 2020

Btw, is this reproducible for the exact same file?

[...] it could be that something is wrong in the byte number calculation and we end up trying to write to a region that is not mapped.

These are Monte Carlo simulations, so the data is different in each (different random seed), but the dimensions are the same in each file. So these files are all identical as far as byte number calculations are concerned. I am experiencing failures on roughly 1 out of 100 save attempts.

@garrison (Author)

the dimensions are the same in each file.

Never mind; there is one Vector whose length varies throughout the simulation and which is among the things saved at the end. I wonder if this might be behind the intermittent behavior...

@JonasIsensee (Collaborator)

I'm not sure how to proceed from here.
I've looked at the changes since v0.2.0 and can't find anything that might explain what is going wrong here.
Since I can't reproduce the error, there is very little I can do now.

These are Monte Carlo simulations, so the data is different in each (different random seed), but the dimensions are the same in each file. So these files are all identical as far as byte number calculations are concerned. I am experiencing failures on roughly 1 out of 100 save attempts.

Can you tell which seeds produced the error and, e.g., rerun for the exact same seed? If that can be used to reproduce the error, would you consider privately sharing that data with me?
(i.e. use IOStream as the backend to actually store the data without an error, and then see if loading + saving with Mmap produces an error again)
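
Roughly, the test I have in mind (hypothetical file names):

      using JLD2, FileIO

      # 1. the data was stored without error using the IOStream backend
      data = load("run_failing_seed.jld2")            # Dict of name => value

      # 2. write it back with the default MmapIO backend and see whether
      #    the bus error shows up again
      jldopen("run_failing_seed_mmap.jld2", "w") do f
          for (name, value) in data
              f[name] = value
          end
      end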

@maartenvd

The easiest way to get a bus error (JLD2, same stack trace) is to have 2 different processes writing to the same .jld file, but then that is a dangerous thing to do anyway.
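
(For illustration only, a rough sketch of that anti-pattern with made-up names; do not actually do this:)

      using Distributed
      addprocs(2)
      @everywhere using JLD2

      # both workers open and append to the SAME file concurrently --
      # exactly the kind of thing that can end in corruption or a bus error
      pmap(1:2) do id
          jldopen("shared.jld2", "a+") do f
              f["entry_from_worker_$id"] = rand(10)
          end
      end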

@JonasIsensee (Collaborator)

The easiest way to get a bus error (JLD2, same stack trace) is to have 2 different processes writing to the same .jld file, but then that is a dangerous thing to do anyway.

Fair enough. But as you say, that is an evil thing to do and there is absolutely no reason to expect that to work.

@maartenvd

I agree, but maybe that is what OP was doing.

@JonasIsensee (Collaborator)

I've been able to reproduce the error myself now.

I was writing from 32 different worker processes (on the same machine) to independent files. No multithreading involved.
So somehow the presence of the other mmappings invalidates them when JLD2 doesn't expect it?

@JonasIsensee (Collaborator)

Hi @garrison,

do you still get this error?
I wrote above that I managed to reproduce it once in November but since then it hasn't happened again.

@garrison (Author) commented Jul 29, 2021

I have pinned to 0.2.1 (EDIT: actually, I am using the latest JLD2 with iotype=IOStream as described above) and have not recently been doing the large calculations that caused this to show up intermittently. So I don't know, and investigating by reproducing exactly what I did before would require a fair amount of cluster time. I might have a reason to run a similar calculation in the next few months, though, and if so I can try it with the latest JLD2.

@JonasIsensee (Collaborator)

Since this hasn't occurred for me in ages, and apparently not for anyone else either, I'll close this for now.
