
Deadlock bug in FFTW 1.2.4 #163

Closed
Octogonapus opened this issue Aug 27, 2020 · 12 comments · Fixed by #205

Comments

@Octogonapus
Contributor

Octogonapus commented Aug 27, 2020

I have observed a deadlock involving FFTW v1.2.4, using the same codebase as in #161. The bug appears to depend on the number of threads given to Julia (FFTW itself is still set to 1 thread). With 4 threads, the deadlock occurs and the program freezes. With 16 threads, it does not occur (possibly the likelihood is just much smaller, so it never happens during the ~6-minute runtime). Furthermore, while debugging I found that adding enough print statements slows the program down enough that the deadlock no longer occurs.

I am confident that the bug was introduced in v1.2.4 because I cannot reproduce the problem with either v1.2.2 or v1.2.3. I can reproduce it with Julia v1.4 and v1.5, reliably, on multiple machines.

Looking at the changes introduced in v1.2.4, could it be that this loop is waiting indefinitely? v1.2.3...v1.2.4

Edit: It may also be worth noting that when the deadlock occurs, all threads appear to be busy. The CPU usage on the machine jumps to exactly the correct percentage for the number of threads I give Julia (e.g., on an 8 thread machine with 4 Julia threads, the CPU usage jumps to 50% and stays there indefinitely).

@stevengj
Member

stevengj commented Aug 30, 2020

Can you create a small program that reproduces the problem?

@Octogonapus
Contributor Author

Octogonapus commented Aug 31, 2020

Turns out your original example reproduces the problem.

using FFTW
FFTW.set_num_threads(1)
Threads.@threads for i = 1:10000
    a = rand(100)
    fft(a)
    GC.gc()
    println(i)
end

Sorry I did not catch this before. Without printing the loop index, it's not possible to distinguish the deadlock from a successful computation by looking at CPU usage alone.

On my system, using FFTW v1.2.4, the loop hangs on the last printed number (8754) indefinitely while using 100% CPU (sometimes it takes a while, but it always happens eventually). With FFTW v1.2.2, the issue does not occur.

@Octogonapus
Contributor Author

Octogonapus commented Aug 31, 2020

I changed this line, while !trylock(deferred_destroy_lock); end, to this:

    while !trylock(deferred_destroy_lock)
        wait(Timer(0.01))
    end

and I have been running the example code (with an increased number of iterations) for an hour now without issue.

Edit: This example must not be a complete reproducer, because I get the same successful behavior using lock(deferred_destroy_lock), which produces a separate issue in our real code.

@stevengj
Member

I don't see why it should be necessary to sleep in the spinlock here.

@Octogonapus
Contributor Author

Octogonapus commented Aug 31, 2020

The current v1.2.4 release seems to work fine with FFTW.set_num_threads(2), just not with FFTW.set_num_threads(1). However, with any number of FFTW threads other than 1, my CPU usage stays at 100%. Is this a separate bug? I would expect roughly 13% CPU usage, not 100%.

Edit: It just took a long time to reproduce the problem with 2 threads.

@Octogonapus
Contributor Author

Hi Steven, I was looking into this issue again and it still occurs with Julia 1.6.1 and FFTW 1.4.1, using the original code to reproduce it:

using FFTW
FFTW.set_num_threads(1)
Threads.@threads for i = 1:10000
    a = rand(100)
    fft(a)
    GC.gc()
    println(i)
end

I have a core dump of the Julia program when it hangs on the code above. Would it be of any help to you if I sent you the core dump? I can also send you anything else you might need (e.g. my Julia build, etc.). Here is a brief view of the state of the program when it deadlocks: https://pastebin.com/raw/UiJp9cfD.

@Octogonapus
Contributor Author

I also want to keep track of the other threads related to this issue. Some discussion happened here #157 and here #161.

Some discussion also happened at JuliaCon, which @IanButterworth can elaborate on if he wants to.

@Octogonapus
Contributor Author

I noticed that Julia's docs state that finalizers are treated like leaf locks and must not try to acquire any other locks (source). Maybe I am misunderstanding this, but it makes it sound like the finalizer registered here is not allowed to acquire any locks. If my understanding is correct, then it looks difficult to address this issue because of the requirement to hold fftwlock when destroying a plan. Perhaps thread safety would need to be handled below FFTW.jl?
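For reference, the pattern the Julia manual recommends for a finalizer that needs a lock is: try the lock, and if it cannot be acquired, re-register the finalizer so the cleanup is retried later. A minimal sketch, assuming hypothetical names (planlock and destroy_plan_unsafe stand in for FFTW.jl's actual internals):

```julia
# Sketch of the trylock-and-defer pattern from the Julia manual.
# `planlock` and `destroy_plan_unsafe` are hypothetical stand-ins,
# not FFTW.jl's actual source.
const planlock = ReentrantLock()

destroy_plan_unsafe(p) = nothing  # placeholder for the real cleanup

function plan_finalizer(p)
    if trylock(planlock)
        try
            destroy_plan_unsafe(p)
        finally
            unlock(planlock)
        end
    else
        # A finalizer must not block on a lock; re-register itself
        # so the GC retries the next time finalizers run.
        finalizer(plan_finalizer, p)
    end
    nothing
end

mutable struct Plan end            # finalizers require a mutable object
p = finalizer(plan_finalizer, Plan())
```

The key property is that the finalizer never blocks: it either gets the lock immediately or defers its own cleanup.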

@Octogonapus
Contributor Author

@vtjnash I would love to get your insight on this issue if you have any bandwidth to spare.

@vtjnash
Contributor

vtjnash commented Jun 1, 2021

Finalizers don't run until after releasing that lock

@vtjnash
Contributor

vtjnash commented Jun 1, 2021

It looks like there is a coding bug in the spin lock: as currently written, it blocks forward progress. There should be a GC.safepoint() call in there to permit forward progress.
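A sketch of the kind of fix being described, using a stand-in lock (this mirrors the trylock loop discussed above, but is an illustration, not FFTW.jl's actual source):

```julia
# Spinning on trylock without a safepoint can starve the GC: if the
# task holding the lock is itself waiting on the GC, neither can proceed.
const deferred_lock = ReentrantLock()  # stand-in for deferred_destroy_lock

function spin_acquire!(lk)
    while !trylock(lk)
        GC.safepoint()  # let the GC (and hence the lock holder) make progress
    end
end

spin_acquire!(deferred_lock)
unlock(deferred_lock)
```

GC.safepoint() gives the runtime a chance to stop this thread for garbage collection, which is what allows the thread that will eventually release the lock to run.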

@Octogonapus
Contributor Author

Thank you so much @vtjnash, it looks like this addressed the issue!
