Use atomics in synaptic effect application #139

Merged
15 commits merged into master on Aug 13, 2018

Conversation

@denisalevi (Member) commented Jul 20, 2018

PR for #78.

First implementation seems fine. Tests are running. First speed tests on Brunel Hakim with heterogeneous delays look as expected.

We are trying to use CUDA atomic operations for synaptic effect application whenever possible, in order to parallelise in situations where we otherwise couldn't (because of possible race conditions when two different synapses write in parallel to the same variable, e.g. a postsynaptic one).
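
To make the race concrete, here is a minimal sketch, assuming a one-thread-per-synapse layout with hypothetical names (apply_on_pre, post_idx, w and v are illustrations, not the actual generated code):

```cuda
// Hypothetical effect-application kernel: one thread per synapse.
// Several synapses may share the same postsynaptic target, so a plain
// read-modify-write on v would race.
__global__ void apply_on_pre(int num_synapses, const int* post_idx,
                             const double* w, double* v)
{
    int syn = blockIdx.x * blockDim.x + threadIdx.x;
    if (syn >= num_synapses)
        return;

    // UNSAFE: two threads with the same post_idx[syn] can lose an update
    // v[post_idx[syn]] += w[syn];

    // SAFE: conflicting updates to the same address are serialised in hardware
    atomicAdd(&v[post_idx[syn]], w[syn]);
}
```

Note that a native atomicAdd for double only exists on devices of compute capability 6.0 and higher; older devices need a software fallback built on atomicCAS, like the ones discussed for the other in-place operators below.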


TODOs:


I have basically merged the NumpyCodeGenerator class with our CUDACodeGenerator class. For that, I have just used the algorithms from the NumpyCodeGenerator that test whether parallelisation is possible, without following all the details behind them, for time reasons.
I'm going to summarise my understanding of what's happening here, for later reference. And @mstimberg @thesamovar, could you maybe confirm that I'm not making some terrible mistakes here? :)

  1. When I have a synapses code block, e.g. on_pre = 'w_syn += 1; v_post += w_syn', this means that functionally we iterate over all synapses and apply per synapse the two effects in the order they appear in the code block. So if w_syn=0 at the start of the time step, we end up adding v_post += 1.

  2. For a code block like on_pre = 'v_post += 1; w_syn += v_post', the results would depend on the order in which we loop through the synapses (the value added to w_syn depends on how many synapses have already added 1 to v_post). This raises an OrderDependenceError (and creates a warning).

  3. I'm not sure about on_pre = 'u_post += 1; v_post += u_post'. It raises an OrderDependenceError, but it should be a false alarm, right? Even though the amount added from a specific synapse to v_post depends on the synapse order, the total amount added to v_post will be independent of the synapse order. So I guess this is just a situation that is difficult to catch? Or did I miss something? EDIT: Ah, or is this order-dependent because for stmt in statements: for syn in synapses is different from for syn in synapses: for stmt in statements (see the sketch after this list)? Does that mean I don't have to take care of what my simulation does in such a case? So point 1. is only true if looping first through the statements and then through the synapses gives the same result?

  4. For cases that raise an OrderDependenceError, do you still make sure that different code targets produce the same result? I suppose not, since the order in which synapses are executed is obviously different when parallelising the execution in some way, correct?

  5. Most importantly: When can we vectorise?

    1. When the code in the NumpyCodeGenerator can be vectorised (does not raise a VectorisationError), we can use atomic CUDA operations. Both numpy's ufunc.at and CUDA atomic operations executed from different threads have the same result: some value is added to all targets, and it also works if a target index appears multiple times. You raise a VectorisationError in two cases:

    2. I see two differences between numpy vectorisation and CUDA atomics:

      • One is that in numpy each ufunc.at blocks any subsequent operations (its execution is finished before the next code line is executed), while for CUDA atomics this is not the case. But this does not matter, since we don't have any situation where we would want one line in a code block to be executed for all synapses and then use the result in a subsequent line (this would contradict point 1. - functionally looping through all synapses and executing the entire code block for each synapse).
      • The other difference is that in numpy one can guarantee that all variables read before the ufunc are actually read before it. In CUDA this is not the case, since different CUDA blocks are executed at different times, and therefore one block might have finished all its code before another one even reads the first variables. But as far as I understand it, such a case would be caught by your vectorisation test, which checks that no variable that will be written to is also used in an expression of another statement, right?
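
On the EDIT in point 3: a minimal serial illustration of why the two loop nestings differ for on_pre = 'u_post += 1; v_post += u_post' (plain host code; the setup of N synapses all converging on one postsynaptic neuron, and the names u and v, are assumptions for the example):

```cuda
#include <cstdio>

int main()
{
    const int N = 3;  // synapses, all targeting the same postsynaptic neuron

    // Synapse-major ("for syn: for stmt"): the semantics described in point 1.
    double u = 0, v = 0;
    for (int syn = 0; syn < N; syn++) {
        u += 1;  // u_post += 1
        v += u;  // v_post += u_post -- sees the intermediate u: 1, 2, 3
    }
    printf("synapse-major:   v = %g\n", v);  // 1 + 2 + 3 = 6

    // Statement-major ("for stmt: for syn"): what numpy's ufunc.at effectively does.
    u = 0; v = 0;
    for (int syn = 0; syn < N; syn++) u += 1;  // all synapses update u first
    for (int syn = 0; syn < N; syn++) v += u;  // every synapse sees the final u = 3
    printf("statement-major: v = %g\n", v);    // 3 * 3 = 9

    return 0;
}
```

Within either nesting the total added to v is independent of the synapse order; it is only the two nestings themselves that disagree.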

And then I have a few more technical questions about the NumpyCodeGenerator:

  1. You are passing variables and variable_indices around as function arguments instead of using the class attributes (self.variables and self.variable_indices). I didn't see those dictionaries modified in the code (but maybe I missed it?), so is this just a leftover from some older changes, or does it have a reason I should be aware of?

  2. There is a created_vars dictionary in the NumpyCodeGenerator, but nothing like that in the CPPCodeGenerator. I didn't quite get what that's for; it's used here in conditional_write(). Is this something numpy-specific that I can ignore?

  3. For the standard in-place operations (+=, -=, /=, *=), I added software implementations using atomicCAS (see the sketch after this list). I just added all those operations for ['int', 'float', 'double'] types to the universal_support_code. But those are just more functions cluttering the generated code (besides the modulo functions). Is there a better place to put those functions, maybe adding them only when they are needed? I could use the determine_keywords() function to add only those atomics that were actually used, maybe?

  4. Here you choose to translate the expr in the vectorised code without substituting self.func_name_replacements (meaning you use that line instead of self.translate_expression(expr)). This breaks things for me if the expr includes e.g. clip, which should be translated to _brian_clip. Any reason not to use self.translate_expression here? Somehow for numpy the expression still gets correctly translated, even if I remove the word substitution in self.translate_expression entirely.
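
For reference, a sketch of one such atomicCAS-based helper (the standard pattern from the CUDA programming guide, shown here for *= on double; the actual helper names and type coverage in the PR may differ):

```cuda
// Software atomic multiply for double, built on 64-bit atomicCAS:
// retry until no other thread has modified the value between our read and swap.
__device__ double atomicMul(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val * __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```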

@mstimberg (Member)

For that, I have just used the algorithms from the NumpyCodeGenerator that test whether parallelisation is possible, without following all the details behind them, for time reasons.

@thesamovar has written most of this and it is certainly not trivial. We wouldn't have activated it without the extensive testing (including the "sanity checks" for our test cases). But it's been a while since we implemented all this, so my memory of the details is a bit shaky...

When I have a synapses code block, e.g. on_pre = 'w_syn += 1; v_post += w_syn', this means that functionally we iterate over all synapses and apply per synapse the two effects in the order they appear in the code block. So if w_syn=0 at the start of the time step, we end up adding v_post += 1.

Yes, the order within the code block always has to be preserved.

For a code block like on_pre = 'v_post += 1; w_syn += v_post', the results would depend on the order in which we loop through the synapses (the value added to w_syn depends on how many synapses have already added 1 to v_post). This raises an OrderDependenceError (and creates a warning).

Exactly.

I'm not sure about on_pre = 'u_post += 1; v_post += u_post'. It raises an OrderDependenceError, but it should be a false alarm, right? Even though the amount added from a specific synapse to v_post depends on the synapse order, the total amount added to v_post will be independent of the synapse order. So I guess this is just a situation that is difficult to catch? Or did I miss something?

I think this is a false positive. You can see it by putting it into the "bad examples" in test_synapses.py and running the "sanity checks". That these kinds of false positives still exist is the reason why OrderDependenceError is only a warning and not an error. If you can fix it without breaking other things, I'd be all ears ;)

EDIT: Ah, or is this order-dependent because for stmt in statements: for syn in synapses is different from for syn in synapses: for stmt in statements? Does that mean I don't have to take care of what my simulation does in such a case? So point 1. is only true if looping first through the statements and then through the synapses gives the same result?

Hmm, I see. I don't think this was intended, but maybe it would be a good criterion. Or we'd rather need two kinds of warnings/errors: one where results are undefined even when synaptic statement blocks are executed sequentially (but in undefined order over synapses) -- this is what our "sanity check" currently checks for. The second one would be about statements that can give different results when statement execution is interleaved across synapses (as in this example).

But honestly, I would not spend too much time on this. I'd say > 99% of synaptic models will only change one post-synaptic variable (and potentially several synaptic variables). For Brian2CUDA, I'd be completely ok with refusing to work with anything that raises an OrderDependenceError -- if anyone ever comes up with a reasonable use case where our check is a false positive, we can always reconsider.

You are passing variables and variable_indices around as function arguments instead of using the class attributes (self.variables and self.variable_indices). I didn't see those dictionaries modified in the code (but maybe I missed it?), so is this just a leftover from some older changes, or does it have a reason I should be aware of?

From a quick look, I don't see why (it always passes self.variables etc. anyway). Even if they were modified, passing them as arguments would not change anything, since we never copy them first.

There is a created_vars dictionary in the NumpyCodeGenerator, but nothing like that in the CPPCodeGenerator. I didn't quite get what that's for; it's used here in conditional_write(). Is this something numpy-specific that I can ignore?

I don't quite remember the details either but I think it is numpy-specific and you can just ignore it.

For the standard in-place operations (+=, -=, /=, *=), I added software implementations using atomicCAS. I just added all those operations for ['int', 'float', 'double'] types to the universal_support_code. But those are just more functions cluttering the generated code (besides the modulo functions). Is there a better place to put those functions, maybe adding them only when they are needed? I could use the determine_keywords() function to add only those atomics that were actually used, maybe?

I think the best solution would be for all this stuff to go into a separate header file (I think there's a comment in the code about it somewhere, and we also have brian-team/brian2#597). But then, I don't think this should be a high-priority item; it is just inconvenient when reading the source code.

Here you choose to translate the expr in the vectorised code without substituting self.func_name_replacements (meaning you use that line instead of self.translate_expression(expr)). This breaks things for me if the expr includes e.g. clip, which should be translated to _brian_clip. Any reason not to use self.translate_expression here? Somehow for numpy the expression still gets correctly translated, even if I remove the word substitution in self.translate_expression entirely.

No reason not to use it, but in numpy it does not matter since none of the standard functions use the name replacement mechanism.

@thesamovar (Member)

It's also been a while since I thought about this operation order stuff, so I'll do a refresh of my memory and get back to you soon. :)

@denisalevi (Member, Author)

Thanks for thinking about this stuff again :)

Hmm, I see. I don't think this was intended, but maybe it would be a good criterion. Or we'd rather need two kinds of warnings/errors: one where results are undefined even when synaptic statement blocks are executed sequentially (but in undefined order over synapses) -- this is what our "sanity check" currently checks for. The second one would be about statements that can give different results when statement execution is interleaved across synapses (as in this example).

But honestly, I would not spend too much time on this. I'd say > 99% of synaptic models will only change one post-synaptic variable (and potentially several synaptic variables). For Brian2CUDA, I'd be completely ok with refusing to work with anything that raises an OrderDependenceError -- if anyone ever comes up with a reasonable use case where our check is a false positive, we can always reconsider.

I would be totally happy going for the easy solution here. I'm not trying to fix the false-alarm cases right now :). So whenever there is an OrderDependenceError, we will just not use atomics and serialise effect application where needed, which should result in the same behaviour as in cpp_standalone (and produce a warning). And we already have a user preference for using atomics or not; we can just add another option 'force' to force using atomics even when an OrderDependenceError is raised (as you said, this is not important, but since I have to add something to deal with OrderDependenceErrors anyway, I might as well do this).

I'm just trying to make sure that there are no cases in which numpy vectorisation works fine but CUDA atomics parallelisation doesn't. And the OrderDependenceError does not check for different results if statement execution is interleaved across synapses. But I think that is tested for here (see also the discussion following this comment in brian2 PR 531): vectorisation only happens when no variable is written to that was used in an expression before. I modified test_ufunc_at_vectorisation for cuda_standalone and it's all passing. This seems fine now.

@thesamovar (Member)

OK, for the order-dependence stuff: I've reread the code (https://github.com/brian-team/brian2/blob/master/brian2/codegen/permutation_analysis.py) and, although it's been a while, I think I can explain it now.

We take a very conservative approach so that we don't accidentally make an error, but it has some false positives, so we only warn rather than raise an error. Our assumption is that you are doing essentially "for all synapses: for each statement: code", but that the order in which you consider synapses could be changed. So the logic behind the code is to look at when the order of synapses matters.

In principle, we'd like to say when the order matters for the final value at the end of the iteration, but I think that was too difficult to solve in general. So instead, we look at when the order matters for any intermediate step in the calculation. Of course, that goes too far the other way, because then even a simple "v += w" would be order-dependent.

So, what we do exactly is: we start by considering that any order-dependence is problematic (i.e. the overcompensating approach), but we explicitly allow "var += expr" once. That is, as long as all the variables in expr are not order-dependent, we allow you to write "var += expr". However, doing so marks "var" as order-dependent. (We also allow *=, -=, /=.) This is safe because we know the final value of var will not be order-dependent.
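
A loose sketch of that rule in host-side C++ (the struct layout, names, and the shared-variable flag are my reconstruction; the real, more nuanced check is the Python module linked above):

```cuda
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

struct Statement {
    std::string target;                 // variable written to
    bool target_shared;                 // several synapses can write the same
                                        // element (e.g. a postsynaptic variable)
    bool augmented;                     // written via +=, -=, *= or /= (vs. plain =)
    std::vector<std::string> read_vars; // variables read in the expression
};

void check_order_dependence(const std::vector<Statement>& statements)
{
    // Variables whose *intermediate* value depends on the synapse order.
    std::set<std::string> order_dependent;
    for (const Statement& stmt : statements) {
        // Reading an order-dependent variable makes this statement's
        // result depend on the synapse order.
        for (const std::string& v : stmt.read_vars)
            if (order_dependent.count(v))
                throw std::runtime_error("OrderDependenceError");
        if (stmt.target_shared) {
            // "var += expr" is allowed once on a shared variable: its final
            // value is order-independent, but its intermediate values are not.
            if (!stmt.augmented || order_dependent.count(stmt.target))
                throw std::runtime_error("OrderDependenceError");
            order_dependent.insert(stmt.target);
        }
    }
}
```

Traced on the examples above: 'w_syn += 1; v_post += w_syn' passes (w_syn is not shared), 'v_post += 1; w_syn += v_post' and 'u_post += 1; v_post += u_post' both throw, the latter being the false positive discussed.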

Your example with u += 1 and v += u would be OK, but it's very tricky to catch those situations without also bringing in false negatives, which we really want to avoid.

As Marcel says: if you have an idea that can allow us to reduce false positives without introducing any false negatives that would be great!

@thesamovar (Member)

And I'll look at the vectorisation stuff later!

@thesamovar (Member)

OK interesting. One thing I don't think we ever explicitly discussed (as far as I can tell) is that in numpy the loop is "for statements: for synapses: code" whereas for other targets it's "for synapses: for statements: code". Actually, I think that this doesn't make a difference to us because our very strict order dependence check would catch any situation where this changed things, but perhaps we should revisit that to make sure? Assuming that's correct (and what you said seems to suggest you agree, and your logic sounds correct to me), then in summary I agree with what you've said and think you can use the same logic as for numpy. In all cases, one could be less restrictive, but it appears not to be an issue in practice (since those early issues, I don't think anyone has ever reported a problem with OrderDependenceError).

@denisalevi (Member, Author) commented Aug 9, 2018

Thank you @thesamovar for looking into it again!

As Marcel says: if you have an idea that can allow us to reduce false positives without introducing any false negatives that would be great!

I didn't dive into the details of the order-dependence check but just took it as is. I don't have anything to offer here right now :)

One thing I don't think we ever explicitly discussed (as far as I can tell) is that in numpy the loop is "for statements: for synapses: code" whereas for other targets it's "for synapses: for statements: code". Actually, I think that this doesn't make a difference to us because our very strict order dependence check would catch any situation where this changed things, but perhaps we should revisit that to make sure?

Isn't this exactly what your comment here in brian2 PR 531 is referring to? So I guess the numpy vectorisation does check that for synapses: for statements: code gives the same result as for statements: for synapses: code.

So I guess the atomics implementation is good to go for us now. Thanks again for the reviews :)

@thesamovar (Member)

Well noticed! You're right.

@denisalevi (Member, Author)

I implemented parallelisation over spiking neurons for the homogeneous-delay case. I chose to use cudaMemcpy to get the number of spiking neurons every time step (instead of using dynamic parallelism). There are potential optimisations, for which I opened an issue (#146). Tests are passing. Merging.
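
For reference, a hedged host-side sketch of the cudaMemcpy approach (function and variable names are assumptions; in Brian2, the spikespace array stores the indices of the spiking neurons and, in its last slot, their count):

```cuda
#include <cuda_runtime.h>

// Hypothetical effect-application kernel, defined elsewhere.
__global__ void apply_effects(const int* spikespace, int num_spiking);

// dev_spikespace is assumed to hold N + 1 ints: the indices of the neurons
// that spiked this time step, followed by their count in the last slot.
void launch_effect_application(const int* dev_spikespace, int N,
                               int threads_per_block)
{
    int num_spiking;
    cudaMemcpy(&num_spiking, dev_spikespace + N, sizeof(int),
               cudaMemcpyDeviceToHost);
    if (num_spiking > 0) {
        // One block per spiking neuron; the threads of each block parallelise
        // over that neuron's synapses (homogeneous delays, so no spike queue).
        apply_effects<<<num_spiking, threads_per_block>>>(dev_spikespace,
                                                          num_spiking);
    }
}
```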

@denisalevi merged commit ce90c97 into master on Aug 13, 2018
@denisalevi deleted the atomicAdd_effect_application branch on August 13, 2018 at 13:12