WIP: Multithreading Experiment #6741

tknopp · 2014-05-04T08:14:33Z

This is an experiment on running Julia code in multiple threads (see #1790). While the code is certainly pretty broken I still like to get feedback whether the approach might be feasible.

The idea is to first precompile the function that is supposed to be run in the thread. To this end I use jl_get_specialization, which seems to do what I want. The hope is that the interaction with the Julia environment are minimal after precompilation so that race conditions are less likely to happen.

I have implemented two functions parapply and parapply_jl that take a function that is supposed to be run several times with the last argument being an index that covers a predefined range. While par apply implements the scheduling in C, parapply_jl implements it in Julia and exploits a direct thread API that is exposed to Julia. For the examples I have tried parapply works while parapply_jl crashes sometimes.

While this is a pretty naiv solution the real work is now to determine the places where race conditions can occur. Any help on this would be really appreciated.

This was developed on OSX with clang. For gcc one needs other compiler flags to enable c++11 threads.

JeffBezanson · 2014-05-04T17:13:56Z

Interesting. jl_get_specialization does move compilation earlier, but it doesn't really guarantee much. You'd probably still need a lock around code generation. Disabling GC also doesn't do much since there will be race conditions in allocation. In general I don't think it will be realistic to try to eliminate interaction with the runtime system; there can always be allocation or code generation. Though of course you can get lucky and have some functions that do pure computing with no allocation etc.

tknopp · 2014-05-04T17:36:23Z

Thanks Jeff. Could you give me a hint in which function the actual code generation is performed. Allocation are the functions in gc.c (e.g. jl_gc_managed_malloc), right?

You are right that its does not seem feasible to minimize the interaction with the runtime to zero. But my impression is that in a "hot loop" a lot less interaction happens than compared to for instance in Python.

And please note that I am absolutely aware that this work is pretty naiv. On the other hand one can learn a lot about the Julia internals when debugging race conditions :-)

timholy · 2014-05-04T17:54:21Z

@tknopp, ages ago I started treading down a similar path, implementing direct wrapping of pthread. I was able to dig up an old commit, in case it's at all useful to you. I don't remember the status I left this in, but "segfaults" comes to mind 😄.

------------------------------- src/codegen.cpp -------------------------------
index 6e606bb..a648f8f 100644
@@ -74,6 +74,7 @@
 #include <set>
 #include <cstdio>
 #include <cassert>
+#include <pthread.h>
 using namespace llvm;

 extern "C" {
@@ -199,6 +200,12 @@ static Function *jlputs_func;
 static Function *resetstkoflw_func;
 #endif

+// locking (thread-safety)
+pthread_mutex_t codegen_lock;
+static inline void init_lock(pthread_mutex_t &mutex) { pthread_mutex_init(&mutex, 0); }
+static inline void lock(pthread_mutex_t &mutex) { pthread_mutex_lock(&mutex); }
+static inline void unlock(pthread_mutex_t &mutex) { pthread_mutex_unlock(&mutex); }
+
 static void jl_rethrow_with_add(const char *fmt, ...)
 {
     if (jl_typeis(jl_exception_in_transit, jl_errorexception_type)) {
@@ -222,6 +229,8 @@ static Function *emit_function(jl_lambda_info_t *lam, bool cstyle);
 //static int n_compile=0;
 static Function *to_function(jl_lambda_info_t *li, bool cstyle)
 {
+    if (!nested_compile)
+        lock(codegen_lock);
     JL_SIGATOMIC_BEGIN();
     assert(!li->inInference);
     BasicBlock *old = nested_compile ? builder.GetInsertBlock() : NULL;
@@ -243,6 +252,8 @@ static Function *to_function(jl_lambda_info_t *li, bool cstyle)
             builder.SetCurrentDebugLocation(olddl);
         }
         JL_SIGATOMIC_END();
+   if (!last_n_c)
+            unlock(codegen_lock);
         jl_rethrow_with_add("error compiling %s", li->name->name);
     }
     assert(f != NULL);
@@ -261,6 +272,8 @@ static Function *to_function(jl_lambda_info_t *li, bool cstyle)
         builder.SetCurrentDebugLocation(olddl);
     }
     JL_SIGATOMIC_END();
+    if (!last_n_c)
+   unlock(codegen_lock);
     return f;
 }

@@ -272,7 +285,7 @@ extern "C" void jl_generate_fptr(jl_function_t *f)
     jl_lambda_info_t *li = f->linfo;
     assert(li->functionObject);
     Function *llvmf = (Function*)li->functionObject;
-    if (li->fptr == &jl_trampoline) {
+    if (li->fptr == &jl_trampoline) {  // FIXME? lock
         JL_SIGATOMIC_BEGIN();
         li->fptr = (jl_fptr_t)jl_ExecutionEngine->getPointerToFunction(llvmf);
         if (li->cFunctionObject != NULL)
@@ -336,7 +349,7 @@ void *jl_function_ptr(jl_function_t *f, jl_value_t *rt, jl_value_t *argt)
                                   li->name->name);
                     }
                 }
-                return jl_ExecutionEngine->getPointerToFunction((Function*)ff->linfo->cFunctionObject);
+                return jl_ExecutionEngine->getPointerToFunction((Function*)ff->linfo->cFunctionObject); // FIXME? lock
             }
         }
     }
@@ -3374,7 +3387,7 @@ static void init_julia_llvm_env(Module *m)

 extern "C" void jl_init_codegen(void)
 {
-    
+    init_lock(codegen_lock);
     InitializeNativeTarget();
     jl_Module = new Module("julia", jl_LLVMContext);

tknopp · 2014-05-04T18:19:26Z

Thanks Tim. to_function seems to be the compiling function. And that you have a variable nested_compile seems to indicate that this function can call itself. With a normal lock around the function I get a deadlock ;-)

tknopp · 2014-05-04T19:59:56Z

Ok, locking to_function in the same way as Tim suggests helps and the julia implementation parapply_jl now seems to crash not anymore for the test functions I have implemented.

tknopp · 2014-05-05T09:26:07Z

The reason to use C++11 threads was that it should be cross platform. But I have just seen that libuv provides cross platform threads and mutexes so this seems to be a better solution.

StefanKarpinski · 2014-05-05T14:53:35Z

Vaguely related: https://groups.google.com/d/msg/julia-dev/GTzZlo4Rhrk/ezZhx2wlLnAJ. The approach here is to use a work thread that has no interaction with the Julia runtime to do C work. The tricky part is from Julia, generating a function that can get arguments from a buffer and call a C function to do the actual work, without invoking any GC or code compilation – these generated wrapper functions only use unsafe_load, pointer arithmetic, ccall, and unsafe_store!. Not really relevant to running arbitrary code in another thread, but it was an approach to doing work in a thread.

tknopp · 2014-05-05T15:12:57Z

Thanks Stefan, I seem to have missed that post. Maybe I should use the uv_queue_work function to exploit a thread pool here.

I was kind of surprised that my naiv approach worked and although it is still certainly broken I think that we should proceed here. The tricky part is to determine which functions need locks and which not. I currently don't see any crashes in my test program when using parapply whereas parapply sometimes gives wrong results.

It would also be great if someone has a function in mind in Base that would greatly benefit from mutlithreading. In my test I used a matrix-vector multiplication, which is however not so CPU bound. Cheated a little by using Int128 though :-)

tknopp · 2014-05-05T21:29:11Z

Ported this to libuv threads/mutex so that it does not rely on C++11. Should hopefully satisfy Travis now.

timholy · 2014-05-06T01:27:46Z

It would also be great if someone has a function in mind in Base that would greatly benefit from mutlithreading.

If nothing in Base comes immediately to mind, Images would be a great test case. imfilter, imfilter_gaussian, or restrict would all benefit a lot from multithreading.

tknopp · 2014-05-06T07:45:50Z

yes imfilter seems great. IMHO nd filtering is something belonging to Base but that is another issue (#5155 hint hint) ;-)

I actually am looking for 3 test cases.
a) One that is perfectly parallelizable (i.e. CPU bound) in order to check that the overhead of the internal locks is not too high. I can take something senseless but it would of course be nice to have something meaningful.
b) Something that makes the crowd say "yeah" (e.g. imfilter)
c) Something that still crashes...

I have made some progress with c) and have found a setting where parapply does not crash and parapply_jl still crashes. In this way I have fixed several race conditions but I am not through yet. Locking GC_PUSH/POP was the most painful as I had to lock the outermost pair. Still not sure if that could produce a deadlock. When I better understand how the jl_pgcstack pointer is used it might be a better solution to have one pointer per thread.

PS: c) is of course just meant as an intermediate test case :-)

ViralBShah · 2014-05-06T08:03:52Z

Cc: @stevengj

tknopp · 2014-05-07T05:05:13Z

The nested locks I had for to_function in a former version where useless. I now use the thread id so that one thread can proceed with all its nested calls. GC_PUSH/POP is similar now but I have to leave out the main thread as it would deadlock otherwise. If someone has an idea how I can implement multiple per thread jl_pgcstack pointers this would be of interest.

carnaval · 2014-05-07T07:45:40Z

I believe thread-local gcstack would be ok. You would have to modify the GC to walk every thread's stack instead of the current one only (see the use of gcstack in gc.c).
I don't have time to look at the code unfortunately, but you would have to be sure you are not running the gc from other threads than the main thread.
Making the allocation thread safe would be ok I guess, but the gc collection is highly non atomic for several reasons. First, the runtime isn't prepared to see jl_values with mark bit set. Even if it was, the collection must see a snapshot of the heap so the "points to" relation must not change while it is running.
Pausing every other thread would be a first step, but I'm not sure how to make sure you pause them in a consistent state wrt the GC, e.g., not in the middle of building the next gcstack or something like that.
I'll be reading about your journey please keep us posted :-)

tknopp · 2014-05-07T08:45:39Z

@carnaval: The code is quite messy in the current state because I have tried something here and there. I have not a lot experience with garabage collectors so I am not entirely sure if I get this done alone.

My idea for the thread-local gcstack would be to use a map of pointers:

map<long, jl_gcframe_t *> jl_pgcstacks

void register_thread(long id)
{
     jl_pgcstacks[id] = NULL;
}

#define JL_GC_PUSH1(arg1)                                                  \
void *__gc_stkf[] = {(void*)3, jl_pgcstack[uv_thread_self()], arg1};   \
jl_pgcstack[uv_thread_self()] = (jl_gcframe_t*)__gc_stkf;

#define JL_GC_POP() (jl_pgcstack[uv_thread_self()] = jl_pgcstack[uv_thread_self()]->prev)

and in gc_mark_task one would then run over all pointers in the map.

How to handle garbage collection is indeed an interesting issue. In the first round I would simply disable it.

JeffBezanson · 2014-05-07T15:06:01Z

Using a map for that is not going to work. Aside from this being very performance-sensitive, the code generator needs to be able to access this in generated code. There needs to be a pointer in the TLS to a per-thread struct that has the gc stack pointer at a known offset.

tknopp · 2014-05-07T15:20:11Z

Indeed thanks. Was not aware of TLS. Seems that libuv as support for it (https://github.com/joyent/libuv/blob/master/include/uv.h#L2202)
But I am not entirely sure how to access all the TLS pointers in the gc_mark_task routine if they are thread local. Will think about that.

tknopp · 2014-05-07T15:30:40Z

I just realize that when I disable the gc when the threads are running I could also just let JL_GC_PUSH/POP be noops.

tknopp · 2014-05-08T08:05:52Z

Some more findings / improvements. I added some missing locks around the alloc* functions in gc.c and made JL_GC_PUSH/POP noops in the threads. With this I don't get race conditions in the GC anymore.

Locking code generation seems to be quite tricky. I had a situation were one thread was in to_function whereas the other was in cache_method. One thing that seems to help and made parapply_jl run without crashes is to not only trigger code generation via jl_get_specialization before starting the threads but to actually call the function once before going into the parallel phase.

An alternative that is currently commited is to lock jl_apply_generic. Not entirely sure if the code really runs in parallel then though. Travis seems to run into a deadlock.

tknopp · 2014-05-24T06:06:33Z

I have been looking into the print issue but did not get that far. It seem there is some task switching going on when printing. I have seen a crash where the thread calls
julia_write and then a little deeper jl_f_yieldto and then alloc_big(sz=140730322856156)

So I really need to understand what all this task switching is about.

cbecker · 2014-05-24T07:00:52Z

Regardless, it is good that you mention it in the 'docs'.

Wouldn't it be good to have a critical section macro, for example, within a thread one can do

@critical println("hello")

to be converted to

lock(someMutex)
println("hello")
unlock(someMutex)

though there is an issue: where to initialize the mutex. Does julia have something like static variables?

Something probably equally or more important would be to have somethings like @atomic:

@atomic a += 2

I assume LLVM provides a way to indicate that.

I am thinking of the openMP constructs I frequently use that make a difference when working with multithreading.

timholy · 2014-05-24T11:09:42Z

Another approach I've used in C is to implement some kind of message queue, and do all printing from the main thread.

tknopp · 2014-05-24T11:42:13Z

I have already tried to lock println but this is not the issue here. Its a task switch that is absolutely not threadsafe.

I am quite new to coroutines so while I understand the concept it is not entirely clear when task switches can happen in Julia code.

So I am unsure which is the approach to go:

Prevent task switches in threaded code (is this feasible?)
Make the complete task machinery thread aware using TLS. This would mean that each thread has its own tasks. But I wonder if 2. is not too much overhead.
Maybe someone with more experience with tasks (@vtjnash) has an opinion about that.

Keno · 2014-05-24T12:10:24Z

You're also not allowed to do I/O from anything but the main thread because the uv interface is not thread safe.

stevengj · 2014-05-24T12:46:53Z

Note that uv_async_send is already thread-safe, and is basically the only thread-safe function in libuv as I understand it. This is the documented way to wake up tasks in the main thread from other threads.

tknopp · 2014-05-24T12:47:31Z

Ah thanks @loladiro. I had not yet tried to do I/O in a thread so have not seen that. This kind of restricts us to computational things which is probably ok. My dream was a little to also use threads when doing GUI programming where one usually also does I/O in threads to let the GUI still be responsive. @async kind of helps here but I still had problems with freezing UI when programming with Gtk.jl.

vtjnash · 2014-05-24T14:51:37Z

You might try turning off COPY_STACK in task.c

Since when libuv gained thread pools, i thought it had become (more) thread-aware

tknopp · 2014-05-24T15:15:02Z

task.c does not compile when I undef COPY_STACK

tknopp · 2014-05-24T19:25:52Z

Ok I got println in threads working, although I am not 100% sure if this is the best fix. What I have done is to prevent a task switch when yieldto is called from anything but the main thread.

I have further fixed some locks that should now make it possible to run (raw) threads interactively from the REPL (gc has to be disabled!). With that I was able to spawn a thread that was periodically incrementing an array value and see this in action from the REPL.

and not just an unspecific error message

tknopp · 2014-05-27T13:18:17Z

I improved a little the exception handling. Before, only an unspecific excpetion was rethrown in the main thread (i.e. a flag: there was an excption in a thread). Now the actual exception is thrown. TLS is very useful for this stuff.

cbecker · 2014-05-28T11:25:39Z

@tknopp Sounds great, I will give it a try later this week or next week, will keep you posted.

tknopp · 2014-08-07T19:17:24Z

I am closing this as the threads branch is various steps ahead of this one. @JeffBezanson: Is threads the main threading branch now? I have seen two other and am not entirely sure in which one the development takes place.

tknopp · 2014-08-07T19:17:48Z

ah wrong button

StefanKarpinski · 2014-08-07T19:20:12Z

Yes, threads is the branch now.

tknopp added 3 commits May 2, 2014 15:54

some threading experiments

5d205dd

various improvements

6884214

little improvement to thread type

c93180f

add some locks

3a4f58c

tknopp added 2 commits May 5, 2014 22:43

port to libuv threads and mutex

84c9703

remove stdlib=libc++ from Makefile

f65313d

introduce several new locks

c6bdc3a

fix some locks

87b9824

tknopp added 2 commits May 7, 2014 20:45

do not gc root parameters in threads (as gc is disabled)

c702245

more improvements

edbc1d7

crashing.jl does not crash anymore

541a08c

tknopp added 2 commits May 24, 2014 20:35

fix crashes when running threads from the REPL

7afa445

println in a thread now works.

221d6e3

tknopp added 2 commits May 24, 2014 22:13

use atomics in gc

9f4bbb5

cleanup and add support for atomic operations on windows

e1b102d

ViralBShah added the performance label May 25, 2014

ViralBShah added this to the 1.0 milestone May 25, 2014

tknopp added 3 commits May 25, 2014 22:28

one fix in pool alloc

676b7d6

Add some documentation to julia.h

7d0ba16

rethrow the actual exception in the main thread

f2bfb74

and not just an unspecific error message

amitmurthy mentioned this pull request May 29, 2014

isready() blocks #7006

Closed

tknopp mentioned this pull request Jun 13, 2014

RFC:Add generic Cholesky decomposition and make Cholesky parametric on matrix type #7236

Merged

tknopp closed this Aug 7, 2014

tknopp mentioned this pull request Nov 26, 2014

Parallel Computing Planning Issue #9167

Closed

12 tasks

tknopp mentioned this pull request Jan 7, 2017

Multithreading JuliaMath/NFFT.jl#19

Closed

tknopp mentioned this pull request Jul 25, 2019

draft of multithreading blog post JuliaLang/www.julialang.org#408

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Multithreading Experiment #6741

WIP: Multithreading Experiment #6741

tknopp commented May 4, 2014

JeffBezanson commented May 4, 2014

tknopp commented May 4, 2014

timholy commented May 4, 2014

tknopp commented May 4, 2014

tknopp commented May 4, 2014

tknopp commented May 5, 2014

StefanKarpinski commented May 5, 2014

tknopp commented May 5, 2014

tknopp commented May 5, 2014

timholy commented May 6, 2014

tknopp commented May 6, 2014

ViralBShah commented May 6, 2014

tknopp commented May 7, 2014

carnaval commented May 7, 2014

tknopp commented May 7, 2014

JeffBezanson commented May 7, 2014

tknopp commented May 7, 2014

tknopp commented May 7, 2014

tknopp commented May 8, 2014

tknopp commented May 24, 2014

cbecker commented May 24, 2014

timholy commented May 24, 2014

tknopp commented May 24, 2014

Keno commented May 24, 2014

stevengj commented May 24, 2014

tknopp commented May 24, 2014

vtjnash commented May 24, 2014

tknopp commented May 24, 2014

tknopp commented May 24, 2014

tknopp commented May 27, 2014

cbecker commented May 28, 2014

tknopp commented Aug 7, 2014

tknopp commented Aug 7, 2014

StefanKarpinski commented Aug 7, 2014

WIP: Multithreading Experiment #6741

WIP: Multithreading Experiment #6741

Conversation

tknopp commented May 4, 2014

JeffBezanson commented May 4, 2014

tknopp commented May 4, 2014

timholy commented May 4, 2014

tknopp commented May 4, 2014

tknopp commented May 4, 2014

tknopp commented May 5, 2014

StefanKarpinski commented May 5, 2014

tknopp commented May 5, 2014

tknopp commented May 5, 2014

timholy commented May 6, 2014

tknopp commented May 6, 2014

ViralBShah commented May 6, 2014

tknopp commented May 7, 2014

carnaval commented May 7, 2014

tknopp commented May 7, 2014

JeffBezanson commented May 7, 2014

tknopp commented May 7, 2014

tknopp commented May 7, 2014

tknopp commented May 8, 2014

tknopp commented May 24, 2014

cbecker commented May 24, 2014

timholy commented May 24, 2014

tknopp commented May 24, 2014

Keno commented May 24, 2014

stevengj commented May 24, 2014

tknopp commented May 24, 2014

vtjnash commented May 24, 2014

tknopp commented May 24, 2014

tknopp commented May 24, 2014

tknopp commented May 27, 2014

cbecker commented May 28, 2014

tknopp commented Aug 7, 2014

tknopp commented Aug 7, 2014

StefanKarpinski commented Aug 7, 2014