
Add cuda implementation for TimedArray #181

Merged
denisalevi merged 23 commits into master on Sep 3, 2020

Conversation

denisalevi
Member

PR for #47

We copied the CPPCodeGenerator class instead of inheriting from it because
of function implementation issues (see PR #39). This commit just ports over
some recent changes from CPPCodeGenerator.
See brian-team/brian2@ffde29e
Function namespaces are dealt with in CPPStandaloneDevice instead of
CPPCodeGenerator. We inherit that change from CPPStandaloneDevice, but
didn't port it to CUDACodeGenerator (which does not inherit from
CPPCodeGenerator, see PR #39).
- Only works for TimedArray in device code
- Bug: timedarray seems to not always use the codeobject owner's clock,
  needs fix
@denisalevi denisalevi changed the base branch from KERNEL_VARIABLES_cleanup to master August 26, 2020 18:19
@denisalevi denisalevi changed the base branch from master to KERNEL_VARIABLES_cleanup August 26, 2020 18:19
Clocks only run on the host. It seems my previous attempt at copying time
variables to the host when needed did not work in all cases.
Base automatically changed from KERNEL_VARIABLES_cleanup to master August 26, 2020 19:02
- Original commit that removed clock sync: 658925b
- With the commit before this one, I implemented a better (and working)
  way to pass time variables by value to kernels
With 51b467c, the rand/n function names changed and were not caught
anymore in the CUDAStandaloneDevice.
Comment on lines 589 to 591
#support_code.append('__device__ {0} _namespace{1};'.format(type_str,
#                                                           ns_key))
#pointers.append('_namespace{0} = {1};'.format(ns_key, ns_key))
Member Author

@denisalevi denisalevi Aug 26, 2020


While implementing TimedArray, I came across some code that was updated in Brian some 5 years ago but never made it into Brian2CUDA 😅. So I've been updating some things and came across this section here on function namespaces (I guess this changed in brian-team/brian2#380). But I don't think I actually understand the use case where I need these namespaces. @mstimberg Could you shed some light here? :)

This function is copied from CPPCodeGenerator to CUDACodeGenerator. I have deleted the support_code line here because our variables are defined globally in objects.cu anyway, and since we have both host and device pointers, it depends on the template which variable should be used (our synapse generator template runs on the host, all other codeobjects run on the device). But I didn't quite get why cpp needs this file-global variable in the first place; aren't the variables in objects.cpp global as well?
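
For context, here is a minimal self-contained CUDA sketch of what I mean by host/device pointer pairs (the array names are made up for illustration, not the actual generated ones):

#include <cuda_runtime.h>

// Sketch with hypothetical names: mirroring the idea in objects.cu, each
// array exists twice -- a plain host pointer for host-side templates (e.g.
// the synapse generator) and a __device__ global pointer for kernel code.
double* _array_neurongroup_v;               // host data
__device__ double* d_array_neurongroup_v;   // device data, visible in kernels

int main()
{
    const int n = 100;
    _array_neurongroup_v = new double[n];    // host allocation

    double* dev_ptr;
    cudaMalloc(&dev_ptr, n * sizeof(double));
    // Point the device-global symbol at the device allocation.
    cudaMemcpyToSymbol(d_array_neurongroup_v, &dev_ptr, sizeof(double*));

    // ... kernels can now read/write d_array_neurongroup_v[i] directly ...

    cudaFree(dev_ptr);
    delete[] _array_neurongroup_v;
    return 0;
}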

And we don't use the pointers list; we do all that in brian2cuda/device.py.

The question is: what situations am I breaking here? For the TimedArray implementation, I need the device pointer d_timedarray_values, which is global anyway, so there is no need for the _namespace... version.

Final tests are running right now, but I don't think I saw any test failing because of a namespace issue.

Member Author

@denisalevi denisalevi Aug 27, 2020


Just came across a documentation section on user-defined namespaces; I guess that is it. I also found the Brian2 tests, which are not marked as standalone-compatible and are therefore not run. I'll adapt them. OK, I have a starting point now; I'll look into it.

Member


Yes, if I remember correctly this is mostly so that the values from the namespace are accessible under a known name in user-written code. In particular, the name is the same for runtime and standalone mode, so in principle you could share the C++ implementation, which was relevant when we still supported weave.

Member Author


I played around with it and I still don't understand it. Here is an example I used with C++ standalone:

from brian2 import *
import numpy as np
from numpy.testing import assert_allclose

set_device('cpp_standalone')

my_var = 0.1 * np.arange(5)

@implementation('cpp', '''
float foo(int i)
{
    using namespace brian;
    return my_var[i];
}''', namespace={'my_var': my_var})
@check_units(i=1, result=1)
def foo(i):
    return my_var[i]

G = NeuronGroup(5, 'v : 1')
G.run_regularly('v = foo(i)')
net = Network(G)
net.run(defaultclock.dt)

assert_allclose(G.v_[:], my_var)

This generates the following code in neurongroup_run_regularly_codeobject.cpp:

// support code
static double* _namespacemy_var;
float foo(int i)
{
    using namespace brian;
    return my_var[i];
}

// pointer_lines (uses namespace brian)
_namespacemy_var = my_var;

The file-global _namespacemy_var variable is not used at all; instead, the global my_var variable is used. And the user still has to write using namespace brian; in their function implementation, otherwise it doesn't work.

If the goal was to make my_var accessible without having to add the brian:: namespace, then the cpp function should use _namespacemy_var, I guess. But it doesn't.

And the only place in the brian2 code that I found where this _namespace{var} mechanism is actually used is in the TimedArray implementation, where the return value is hardcoded to be called _namespace{name} (see here).

I think I'm still missing something here?

Member


Your function code gets included verbatim and does not get rewritten automatically (maybe it should?). The correct way to write your example if you follow the instructions in the documentation is to write:

@implementation('cpp', '''
float foo(int i)
{
    return _namespacemy_var[i];
}''', namespace={'my_var': my_var})

It's not really relevant anymore, but back in the day with weave this would have worked for both runtime and standalone mode, whereas your variant hardcodes knowledge about standalone mode.

Member Author


Ahh, I see. Got it now. I didn't see that part in the documentation. Alright, if it is in the docs, it should work with _namespace{var}. I implemented this now for Brian2CUDA as well. It turns out the mechanism is even useful for Brian2CUDA, since this way the same function can be used verbatim in both host and device code, and I can just let the _namespace{var} pointer point to either host or device data, depending on the codeobject, without the user having to care about it. Perfect.
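
To make this concrete, here is a small self-contained CUDA sketch of the device-side half of the mechanism (all names are made up for illustration; in the host-side compilation, the same foo body would simply be paired with a plain host pointer instead):

#include <cstdio>
#include <cuda_runtime.h>

// Device-side namespace pointer; set to device data before kernel launch.
__device__ double* _namespacemy_var;

// The user-supplied function body only refers to the _namespace name, so
// the same text can be pasted verbatim into host or device support code.
__device__ float foo(int i)
{
    return (float)_namespacemy_var[i];
}

__global__ void kernel(float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = foo(i);
}

int main()
{
    const int n = 5;
    double h_vals[n] = {0.0, 0.1, 0.2, 0.3, 0.4};
    double* d_vals;
    cudaMalloc(&d_vals, n * sizeof(double));
    cudaMemcpy(d_vals, h_vals, n * sizeof(double), cudaMemcpyHostToDevice);
    // Redirect the namespace pointer to the device copy of the data.
    cudaMemcpyToSymbol(_namespacemy_var, &d_vals, sizeof(double*));

    float* d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    kernel<<<1, n>>>(d_out, n);

    float h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("foo(%d) = %f\n", i, h_out[i]);

    cudaFree(d_vals);
    cudaFree(d_out);
    return 0;
}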

Thanks again for your help!

This is an attempt at a cleaner solution that removes the unnecessary
value -> pointer -> value copying of Clock variables we had before. But
it breaks the USES_VARIABLES mechanism. I needed to redefine _clock_t/dt
to not use the [0] index.
@denisalevi
Member Author

Hi @mstimberg, I have another implementation choice question here.

All our Clocks are running on the host. We used to copy time variables to the device every time step, but that created too much overhead (could be implemented efficiently I guess, but we didn't). Instead, we are now passing Clock variables as kernel parameters (by value) only when they are needed. But when Clock variables are used in scalar_code or vector_code, they are expected to be pointers.

My solution here was to get the address of the Clock variables in %KERNEL_VARIABLES% (which is the cuda_standalone version of {{pointer_lines}}). Works fine and looks something like this:

__global__ void kernel_codeobject(
    double _value_neurongroup_clock_t  // Clock variable passed by value
)
{
    ...
    // KERNEL_VARIABLES lines, getting the pointer
    double* _ptr_array_neurongroup_clock_t = &_value_neurongroup_clock_t;
    ...
    // scalar or vector code (unchanged from what cpp_standalone produces)
    double __source_value_neurongroup_t = _ptr_array_neurongroup_clock_t[0];
    // some array that uses t, e.g. timedarray
    const double _lio_1 = _timedarray(__source_value_neurongroup_t);
    ...
}
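
As a sanity check, here is a runnable, self-contained version of this pattern (with made-up names and a dummy _timedarray). Taking the address of a by-value kernel parameter is valid, since the parameter lives in the thread's local scope:

#include <cstdio>
#include <cuda_runtime.h>

// Dummy stand-in for a timed array lookup.
__device__ double _timedarray(double t) { return 2.0 * t; }

__global__ void kernel_codeobject(double _value_neurongroup_clock_t,
                                  double* out)
{
    // KERNEL_VARIABLES line: recover a pointer so that unmodified
    // cpp_standalone code can keep indexing the clock variable with [0].
    double* _ptr_array_neurongroup_clock_t = &_value_neurongroup_clock_t;

    const double _source_t = _ptr_array_neurongroup_clock_t[0];
    *out = _timedarray(_source_t);
}

int main()
{
    double* d_out;
    cudaMalloc(&d_out, sizeof(double));
    kernel_codeobject<<<1, 1>>>(0.1, d_out);  // pass t = 0.1 by value

    double h_out;
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out);  // expected: 0.200000

    cudaFree(d_out);
    return 0;
}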

But I just realized that this is another hacky solution because I didn't have all the Brian internals in mind well enough ... So instead of extracting the pointer in KERNEL_VARIABLES, I should maybe just change how time variables are read in CUDACodeGenerator? I did that just now in 02c29d8. Now the code from above would look like this:

__global__ void kernel_codeobject(
    double _value_neurongroup_clock_t  // Clock variable passed by value
)
{
    ...
    // nothing in KERNEL_VARIABLES lines
    ...
    // scalar or vector code (removed the [0] in the t variable read)
    double __source_value_neurongroup_t = _value_neurongroup_clock_t;
    // some array that uses t, e.g. timedarray
    const double _lio_1 = _timedarray(__source_value_neurongroup_t);
    ...
}

This seems to be the cleaner solution, and it removes the unnecessary value -> pointer -> value copies. The only problem I have now is that the USES_VARIABLES mechanism does not work anymore. I did change the names of Clock variables to _value{name} (instead of _ptr{name}) in CUDACodeGenerator.get_array_name() (see 02c29d8), so the variable names in the templates are correct when using USES_VARIABLES. But the indexing is wrong: it uses _value{name}[0] instead of _value{name}.

For the scalar_code and vector_code, I changed CUDACodeGenerator.translate_to_read_arrays() (see 02c29d8), but for the USES_VARIABLES mechanism, this seems to happen here in Device.code_object() and I didn't quite know how to change that. So for now I just redefine the _clock_t and _clock_dt variables in the templates (see e.g. ratemonitor.cu in 02c29d8). This works, but again: not such a clean solution.

And since this becomes a bit more complicated, I thought maybe the easiest way would be to just change the Clock variables themselves?

So, sorry for the endless post; I hope it is not too unclear. What is your take: how would you implement this? I realize I am adding some hacky, not-Brian-intended solutions everywhere, so I wanted to start asking for the intended approach when possible :P

@mstimberg
Member

I'm afraid I did not completely understand. The USES_VARIABLES mechanism only decides what is passed in the variables dictionary to the code generation process, so I'm not quite sure how this affects your code. Or is this about generated code that you are re-using from C++ standalone?

@denisalevi
Member Author

denisalevi commented Aug 27, 2020

EDIT: I found some more issues that I would need to take care of even if I fixed the problem described below. So I will just stick with the solution I had that worked. Thanks for answering and sorry for the mess 😆


Sorry for all the details. The question evolved while I was writing it...

I'm afraid I did not completely understand. The USES_VARIABLES mechanism only decides what is passed in the variables dictionary to the code generation process, so I'm not quite sure how this affects your code. Or is this about generated code that you are re-using from C++ standalone?

Yes, this is almost exclusively copied C++ standalone code (with minor modifications).

The USES_VARIABLES mechanism adds items to the variables dictionary, but it also adds the correct array name to template_kwds, such that templates can use e.g. {{t}} (this happens here in Device.code_object()). The array name is determined through device.get_array_name(), which I define in CUDACodeGenerator (just as CPPCodeGenerator does). The only thing I can't control is that little [0] index added in line 304 🙃.

Ok, this seems to be quite a special case. Maybe just not worth the effort. But since I already sank so much time into understanding this, I'll suggest something ^^. We could:

  1. Change that line 304 in Device.code_object() to

     pointer_name += generator.get_scalar_var_suffix(var)

    And define a new class method for CodeObject that all generators inherit but that I could override:

    def get_scalar_var_suffix(self, var):
        return '[0]'

    EDIT: If I ever want to implement this, I will also have to take care of the template that the variables are used in, since host and device code would differ. So it would need another argument, e.g. get_scalar_var_suffix(var, template_name), to decide what to do depending on the template / codeobject. Or it would just need Clock variables as non-pointers on the host too.

  2. Or just forget about it and I'll use the other solution that works for me :)

@denisalevi denisalevi merged commit 7c79915 into master Sep 3, 2020
@denisalevi
Member Author

Tests are passing. Merging.

denisalevi added a commit that referenced this pull request Jun 10, 2021
…ARIABLES"

This reverts commit 02c29d8.
Decided against this implementation, see my comments in PR #181.