Use atomics in synaptic effect application #139
Conversation
This uses the logic from the `NumpyCodeGenerator` (which does the correct testing for vectorisation that the CUDA implementation needs as well). Currently there is a bug which leads to multiple declarations.
@thesamovar has written most of this and it is certainly not trivial. We wouldn't have activated it without the extensive testing (including the "sanity checks" for our test cases). But it's been a while since we implemented all this, so my memory of the details is a bit shaky...
Yes, the order within the code block always has to be preserved.
Exactly.
I think this is a false positive. You can see it by putting it into the "bad examples" in
Hmm, I see. I don't think this was intended, but maybe it would be a good criterion. Or we'd rather need two kinds of warnings/errors: one where results are undefined even when synaptic statement blocks are executed sequentially (but in undefined order over synapses) -- this is what our "sanity check" currently checks for. The second one would be about statements that can give different results when statement execution is interleaved across synapses (as in this example). But honestly, I would not spend too much time on this. I'd say > 99% of synaptic models will only change one post-synaptic variable (and potentially several synaptic variables). For Brian2CUDA, I'd be completely OK with refusing to work with anything that raises an
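The two kinds of order dependence distinguished here can be made concrete with a toy Python sketch (an illustration, not Brian code): take `u_post += 1; v_post += u_post` with every synapse targeting the same neuron (`u`, `v` stand for `u_post`, `v_post`). The final `v` is the same for any synapse ordering as long as each synaptic block runs sequentially, but it changes once statement execution interleaves across synapses.

```python
def sequential_blocks(n_syn):
    # Each synapse executes its whole statement block before the next
    # synapse starts; the synapse order itself does not matter here.
    u = v = 0
    for _ in range(n_syn):
        u += 1
        v += u
    return v

def interleaved_statements(n_syn):
    # Statement execution interleaved across synapses: all synapses run
    # the first statement, then all synapses run the second one.
    u = v = 0
    for _ in range(n_syn):
        u += 1
    for _ in range(n_syn):
        v += u
    return v

sequential_blocks(3)       # v = 1 + 2 + 3 = 6
interleaved_statements(3)  # u is already 3, so v = 3 + 3 + 3 = 9
```

So the first kind of error concerns `sequential_blocks` giving different results under permuted synapse order, while the second kind only shows up when the two execution schemes disagree, as they do here.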
From a quick look, I don't see why (it always passes
I don't quite remember the details either but I think it is numpy-specific and you can just ignore it.
I think the best solution would be for all this stuff to go into a separate header file (I think there's a comment in the code about it somewhere, and we also have brian-team/brian2#597). But I don't think this should be a high-priority item; it is just inconvenient when reading the source code.
No reason not to use it, but in numpy it does not matter since none of the standard functions use the name replacement mechanism.
It's also been a while since I thought about this operation order stuff, so I'll do a refresh of my memory and get back to you soon. :)
Thanks for thinking about this stuff again :)
I would be totally happy going for the easy solution here. I'm not trying to fix the false-alarm cases right now :). So whenever there is an [...], I'm just trying to make sure that there are no cases in which numpy vectorisation works fine but CUDA atomics parallelisation doesn't. And the [...]
OK, for the order-dependence stuff I've reread the code (https://github.com/brian-team/brian2/blob/master/brian2/codegen/permutation_analysis.py) and although it's been a while, I think I can explain it now. We take a very conservative approach so that we don't accidentally make an error, but it has some false positives, so we only warn rather than raise an error.

Our assumption is that you are doing essentially "for all synapses: for each statement: code", but that the order in which you consider synapses could be changed. So the logic behind the code is to look at when the order of synapses matters. In principle, we'd like to say when the order matters for the final value at the end of the iteration, but I think that was too difficult to solve in general. So instead, we look at when the order matters for any intermediate step in the calculation. Of course, that goes too far the other way, because then even a simple "v += w" would be order-dependent.

So, what we do exactly is: we start by considering that any order-dependence is problematic (i.e. the overcompensating approach), but we explicitly allow "var += expr" once. That is, as long as all the variables in expr are not order-dependent, we allow you to write "var += expr". However, doing so marks "var" as order-dependent. (We also allow *=, -=, /=.) This is safe because we know the final value of var will not be order-dependent.

Your example with u += 1 and v += u would be OK, but it's very tricky to catch those situations without also bringing in false negatives, which we really want to avoid. As Marcel says: if you have an idea that can allow us to reduce false positives without introducing any false negatives, that would be great!
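A much-simplified sketch of that rule (a hypothetical data model for illustration; the real code in `brian2/codegen/permutation_analysis.py` works on abstract-code statements and variable indices, not on triples like these):

```python
INPLACE_OPS = {'+=', '-=', '*=', '/='}

def check_order_dependence(statements, shared):
    """Toy version of the conservative order-dependence rule.

    statements: list of (target, op, read_vars) triples
    shared:     variables that several synapses may write to
                (e.g. postsynaptic variables in this toy model)
    Returns the offending statement, or None if the code looks safe.
    """
    order_dependent = set()
    for stmt in statements:
        target, op, read_vars = stmt
        # Reading a variable whose intermediate value depends on how many
        # synapses were already processed is (conservatively) disallowed.
        if order_dependent & set(read_vars):
            return stmt
        if op in INPLACE_OPS and target in shared:
            # 'var op= expr' itself is allowed, but afterwards var's
            # intermediate value is order-dependent (its final value is not).
            order_dependent.add(target)
    return None

# 'w_syn += 1; v_post += w_syn' -> fine (w_syn is private to its synapse)
assert check_order_dependence(
    [('w_syn', '+=', []), ('v_post', '+=', ['w_syn'])],
    shared={'v_post'}) is None

# 'v_post += 1; w_syn += v_post' -> flagged (reads order-dependent v_post)
assert check_order_dependence(
    [('v_post', '+=', []), ('w_syn', '+=', ['v_post'])],
    shared={'v_post'}) is not None
```

The `u_post += 1; v_post += u_post` example from the thread is flagged by this rule as well, which is exactly the false-positive case discussed above.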
And I'll look at the vectorisation stuff later!
OK interesting. One thing I don't think we ever explicitly discussed (as far as I can tell) is that in numpy the loop is "for statements: for synapses: code" whereas for other targets it's "for synapses: for statements: code". Actually, I think that this doesn't make a difference to us because our very strict order dependence check would catch any situation where this changed things, but perhaps we should revisit that to make sure? Assuming that's correct (and what you said seems to suggest you agree, and your logic sounds correct to me), then in summary I agree with what you've said and think you can use the same logic as for numpy. In all cases, one could be less restrictive, but it appears not to be an issue in practice (since those early issues, I don't think anyone has ever reported a problem with OrderDependenceError).
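For illustration, with code that passes the check, e.g. `on_pre = 'w_syn += 1; v_post += w_syn'`, the two loop nestings do give the same final values, because each `w_syn` is private to its synapse (a toy Python sketch, not Brian code):

```python
def synapse_major(n_syn):
    # "for synapses: for statements" (C++/CUDA-style loop nesting)
    w = [0] * n_syn  # one w_syn per synapse
    v = 0            # shared postsynaptic variable
    for i in range(n_syn):
        w[i] += 1
        v += w[i]
    return w, v

def statement_major(n_syn):
    # "for statements: for synapses" (numpy-style loop nesting)
    w = [0] * n_syn
    v = 0
    for i in range(n_syn):
        w[i] += 1
    for i in range(n_syn):
        v += w[i]
    return w, v

# both give w = [1, 1, 1] and v = 3
```

Code that fails the strict check (like the `v_post += 1; w_syn += v_post` example earlier in the thread) is exactly the kind where the two nestings could disagree.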
This reverts commit 23f0aac. Alternative solution, see brian-team/brian2#982.
Thank you @thesamovar for looking into it again!
I didn't dive into the details of the order-dependence check but just took it as it is. I don't have anything to offer here right now :)
Isn't this exactly what your comment here in brian2 PR 531 is referring to? So I guess the numpy vectorisation does check to make sure that [...]. In that case, the atomics implementation is good to go now for us; thanks again for the reviews :)
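For later readers, the `ufunc.at` property being relied on here is why plain fancy-indexed `+=` is unsafe with repeated postsynaptic indices while `np.add.at` (analogous to a per-synapse `atomicAdd` on the GPU) is not:

```python
import numpy as np

v = np.zeros(3)
idx = np.array([0, 0, 1])  # postsynaptic targets; index 0 appears twice

# Fancy-indexed += buffers the reads, so the repeated index 0 only
# receives the update once.
v_plain = v.copy()
v_plain[idx] += 1.0        # -> [1., 1., 0.]

# np.add.at applies the operation unbuffered, once per index occurrence,
# which is the behaviour atomic additions from different threads give.
v_at = v.copy()
np.add.at(v_at, idx, 1.0)  # -> [2., 1., 0.]
```

So any code path that the numpy target can only vectorise via `ufunc.at` is exactly the path that needs atomics on the CUDA side.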
Well noticed! You're right.
I implemented parallelisation over spiking neurons for the homogeneous delay case. I chose to use
PR for #78.
First implementation seems fine. Tests are running. First speed tests on Brunel-Hakim with heterogeneous delays look as expected.
We are trying to use CUDA atomic operations for synaptic effect application whenever possible, in order to parallelise in situations where we otherwise couldn't (because of possible race conditions when writing in parallel to the same variable, e.g. a postsynaptic neuron variable, from two different synapses).
TODOs:

- `test_synapses:test_ufunc_at_vectorisation()` from brian
- `CodeObject` classes instead of passing a keyword argument (see "Make translation of abstract code codeobject-dependent in `CodeGenerator`", brian2#982)
- In `source` mode with postneuron variables which don't have a chance for race conditions, we should only use atomics for the source effects (currently not the case) (opened issue: "Specify which variables in synaptic effect application should use atomic operations", #141)
- Support all in-place operations (`+=`, `-=`, `*=`, `/=`). (easy) (opened issue: "Add atomics support for all in-place operations", #142)

I have basically merged the `NumpyCodeGenerator` class with our `CUDACodeGenerator` class. For that, I have just used the algorithms from the `NumpyCodeGenerator`, which test whether parallelisation is possible, without following all the details behind them, for time reasons.

I'm going to summarise my understanding of what's happening here, for later reference. And @mstimberg @thesamovar, could you maybe confirm that I'm not making some terrible mistakes here? :)
1. When I have a synapses code block, e.g. `on_pre = 'w_syn += 1; v_post += w_syn'`, this means that functionally we iterate over all synapses and apply, per synapse, the two effects in the order they appear in the code block. So if `w_syn = 0` at the start of the time step, we end up applying `v_post += 1`.

2. For a code block like `on_pre = 'v_post += 1; w_syn += v_post'`, the results would depend on the order in which we loop through the synapses (the value added to `w_syn` depends on how many synapses have already added `1` to `v_post`). This raises an `OrderDependenceError` (and creates a warning).

3. I'm not sure about `on_pre = 'u_post += 1; v_post += u_post'`. It raises an `OrderDependenceError`, but it should be a false alarm, right? Even though the amount added from a specific synapse to `v_post` depends on the synapse order, the total amount added to `v_post` will be independent of the synapse order. So I guess this is just a situation which is difficult to catch? Or did I miss something? EDIT: Ah, or is this order-dependent since `for stmt in statements: for syn in synapses` is different from `for syn in synapses: for stmt in statements`? Does that mean I don't have to take care of what my simulation does in such a case? So point 1 is only true if looping first through the statements and then through the synapses gives the same result?

4. For cases that raise an `OrderDependenceError`, do you still make sure that different code targets produce the same result? I suppose not, since the order in which synapses are executed is obviously different when parallelising the execution in some way, correct?

5. Most importantly: When can we vectorise?
When the code in the `NumpyCodeGenerator` can be vectorised (does not raise a `VectorisationError`), we can use atomic CUDA operations. Both numpy's `ufunc.at` and `atomic` operations in CUDA executed from different threads have the same result: add some value to all targets, and it also works if a target index appears multiple times. You raise a `VectorisationError` in two cases:

1. [...] `atomicCAS`, which can just be extended to any in-place operation.
2. [...] `expr` (see the discussion following this comment in brian2 PR 531).

As far as I see it, we can just use this condition to decide if we use atomics.

I see two differences between numpy vectorisation and CUDA atomics:

1. `ufunc.at` blocks any subsequent operations (which means its execution is finished before the next code line is executed), while for CUDA atomics this is not the case. But this does not matter, since we don't have any situation where we would want one line in a code block to be executed for all synapses and then use the result of that in a subsequent line (this would contradict point 1, i.e. functionally looping through all synapses and executing the entire code block for each synapse).
2. [...] `expr` of another statement, right?

And then I have a few more technical questions in the
`NumpyCodeGenerator`:

- You are passing `variables` and `variable_indices` around as function arguments instead of using the class attributes (`self.variables` and `self.variable_indices`). I didn't see those dictionaries modified in the code (but maybe I missed it?), so is this just a leftover from some older changes, or does it have a reason I should be aware of?
- There is a `created_vars` dictionary in the `NumpyCodeGenerator`, but nothing like that in the `CPPCodeGenerator`. I didn't quite get what that's for; it's used here in `conditional_write()`. Is this something numpy-specific that I can ignore?
- For the standard in-place operations (`+=`, `-=`, `/=`, `*=`), I added software implementations using `atomicCAS`. I just added all those operations for `['int', 'float', 'double']` types to the `universal_support_code`. But that's just another set of functions cluttering the generated code (besides the modulo functions). Is there a better place to put those functions, maybe adding them only when they are needed? I could use the `determine_keywords()` function to add only those atomics which were actually used, maybe?
- Here you choose to translate the `expr` in the vectorised code without substituting `self.func_name_replacements` (meaning you use that line instead of `self.translate_expression(expr)`). This breaks things for me if the expr includes e.g. `clip`, which should be translated to `_brian_clip`. Any reason not to use `self.translate_expression`? Somehow for numpy the expression still gets correctly translated, even if I remove the word substitution in `self.translate_expression` entirely.
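Regarding the `atomicCAS`-based in-place operations: the pattern behind them is the standard compare-and-swap retry loop. Here is a toy Python model of that loop for reference (the lock only stands in for the atomicity that hardware CAS provides; this is an illustration, not the brian2cuda CUDA code):

```python
import threading

class Cell:
    """Toy stand-in for a memory word that offers compare-and-swap."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()  # models the atomicity of hardware CAS

    def load(self):
        with self._lock:
            return self.value

    def compare_and_swap(self, expected, new):
        # Atomically: if value == expected, store new; always return the
        # value that was found.  (Real CUDA code compares floats/doubles
        # by their bit pattern rather than with ==.)
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

def atomic_inplace(cell, op):
    # The standard CAS retry loop used to build software atomics: read,
    # compute, and retry if another thread changed the value in between.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, op(old)) == old:
            return

# A software atomic '*=': eight threads each multiply by 2.0.
cell = Cell(1.0)
threads = [threading.Thread(target=atomic_inplace, args=(cell, lambda x: x * 2.0))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert cell.value == 256.0  # 2.0 ** 8, independent of thread interleaving
```

The same loop works for `+=`, `-=` and `/=` by swapping the `op`, which is why a single CAS primitive is enough to cover all the in-place operations mentioned above.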