Concerns about ggml_graph_compute's threading #324

Closed
CCLDArjun opened this issue Jun 29, 2023 · 2 comments

Comments

@CCLDArjun
Contributor

Hi, I'm new to ggml and I've been looking at ggml_graph_compute, specifically the function it calls, ggml_graph_compute_thread. I think some threads are simultaneously computing the same result.

Here is a scenario I traced through ggml_graph_compute_thread:

  1. 4 threads: a, b, c and d and 4 nodes on the graph.
  2. each thread's node_n begins at -1 and shared->n_active is 4.
  3. a,b,c reduce shared->n_active to 1
  4. d starts computing the first node while a, b and c spin.
  5. d sets shared->node_n = 1 because cgraph->nodes[1]->n_tasks > 1.
  6. since shared->node_n is updated, a,b,c stop spinning and set their own node_n = 1.
  7. doesn't this mean that a,b,c,d compute the same thing simultaneously?

Here's the sample program that I'm using:

#include "ggml.h"
#include <stdio.h>
#include <unistd.h>

int main() {
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };

    printf("pid: %d\n", getpid());
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    ggml_set_param(ctx, x);

    struct ggml_tensor * a  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
    struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);
    struct ggml_cgraph gf = ggml_build_forward(f);
    gf.n_threads = 1;

    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute(ctx, &gf);

    printf("f = %f\n", ggml_get_f32_1d(f, 0));

    // release the context and return an explicit exit status
    ggml_free(ctx);
    return 0;
}

I used gdb to count the number of calls to ggml_compute_forward. When n_threads is set to 1, there are 12 calls and when it's set to 4, there are 21:

1 thread
(gdb) b ggml_compute_forward
Breakpoint 1 at 0x10001ec50: file /Users/ccldarjun/Python/ggml/src/ggml.c, line 15370.
(gdb) ignore 1 100000
Will ignore next 100000 crossings of breakpoint 1.
(gdb) r
Starting program: /Users/ccldarjun/Python/ggml/testing/a.out 
[New Thread 0x1603 of process 22165]
^C[New Thread 0x2003 of process 22165]
warning: unhandled dyld version (17)
pid: 22165
ggml_init: GELU, Quick GELU, SILU and EXP tables initialized in 5.782000 ms
ggml_init: g_state initialized in 0.049000 ms
ggml_init: found unused context 0
ggml_init: context initialized
ggml_build_forward_impl: visited 4 new nodes
ggml_graph_compute_thread: 0/4 pthread id: 1143273024 n_tasks: 1 temp: 1
ggml_graph_compute_thread: 1/4 pthread id: 1143273024 n_tasks: 1 temp: 1
ggml_graph_compute_thread: 2/4 pthread id: 1143273024 n_tasks: 1 temp: 1
ggml_graph_compute_thread: 3/4 pthread id: 1143273024 n_tasks: 1 temp: 1
ggml_graph_compute: perf (1) - cpu = 0.000 / 0.000 ms, wall = 0.000 / 0.000 ms
f = 16.000000
[Inferior 1 (process 22165) exited normally]
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000010001ec50 in ggml_compute_forward at /Users/ccldarjun/Python/ggml/src/ggml.c:15370
	breakpoint already hit 12 times
	ignore next 99988 hits

4 threads
(gdb) b ggml_compute_forward
Breakpoint 1 at 0x10001ec50: file /Users/ccldarjun/Python/ggml/src/ggml.c, line 15370.
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000010001ec50 in ggml_compute_forward at /Users/ccldarjun/Python/ggml/src/ggml.c:15370
(gdb) ignore 
Argument required (a breakpoint number).
(gdb) ignore 1 100000
Will ignore next 100000 crossings of breakpoint 1.
(gdb) r
Starting program: /Users/ccldarjun/Python/ggml/testing/a.out 
[New Thread 0x1203 of process 22148]
[New Thread 0x2003 of process 22148]
warning: unhandled dyld version (17)
pid: 22148
ggml_init: GELU, Quick GELU, SILU and EXP tables initialized in 3.603000 ms
ggml_init: g_state initialized in 0.031000 ms
ggml_init: found unused context 0
ggml_init: context initialized
ggml_build_forward_impl: visited 4 new nodes
[New Thread 0x1307 of process 22148]
[New Thread 0x2103 of process 22148]
[New Thread 0x2903 of process 22148]
ggml_graph_compute_thread: 0/4 pthread id: 63651840 n_tasks: 1 temp: 1
ggml_graph_compute_thread: 1/4 pthread id: 63651840 n_tasks: 4 temp: 1
ggml_graph_compute_thread: 2/4 pthread id: 1143273024 n_tasks: 4 temp: 1
ggml_graph_compute_thread: 3/4 pthread id: 63651840 n_tasks: 4 temp: 1
ggml_graph_compute: perf (1) - cpu = 0.000 / 0.000 ms, wall = 0.000 / 0.000 ms
f = 16.000000
[Inferior 1 (process 22148) exited normally]
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000010001ec50 in ggml_compute_forward at /Users/ccldarjun/Python/ggml/src/ggml.c:15370
	breakpoint already hit 21 times
	ignore next 99979 hits

I think these two things are connected.

@slaren
Collaborator

slaren commented Jun 29, 2023

There is one call to ggml_compute_forward for GGML_TASK_INIT, another for GGML_TASK_FINALIZE, and for parallelizable tasks, as many GGML_TASK_COMPUTE calls as there are threads. You have 3 ops here, all parallelizable, so with n_threads=1 you should see 9 calls to ggml_compute_forward (3 init, 3 finalize, 3 compute). With n_threads=4, you should see 18 calls (3 init, 3 finalize, 4*3 compute).

That's what I see in my tests. Are you seeing something different?

@CCLDArjun
Contributor Author

Ohh ok, I see: computing one op by itself is parallelizable.
I was thinking it was supposed to compute multiple nodes at once, so I was confused.

> With n_threads=4, you should see 18 calls (3 init, 3 finalize, 4*3 compute).
>
> That's what I see in my tests. Are you seeing something different?

I'm getting 21 calls for 4 threads and 12 calls for 1 thread (there's a nop at the start, so it takes up 3 calls).
