ggml_allocr_alloc_graph allocated overlapping tensor memory #700

Closed
bssrdf opened this issue Jan 18, 2024 · 4 comments

bssrdf (Contributor) commented Jan 18, 2024

Hi, I have encountered a strange issue using ggml_allocr_alloc_graph to allocate tensor data. When building the graph, I used a no_alloc context and later used ggml_allocr_alloc_graph to allocate all of the tensors' data. However, I noticed that two particular tensors have exactly the same memory address in their data member. Is this a bug?

You can replicate the issue using my branch here. After building ggml, run ./bin/test-alloc-graph.

The graph is a simple one:
[attached image: test-alloc-graph-forward.dot]

slaren (Collaborator) commented Jan 18, 2024

This is not a bug; it is actually the main function of ggml-alloc. The memory of tensors holding intermediate results is reused as soon as they are no longer needed, to reduce the size of the compute buffers. If you want every tensor to have a different address, you can use a context without no_alloc, or ggml_backend_alloc_ctx_tensors.
If you only want to inspect the results of intermediate computations, you can also compute the graph one node at a time, such as:

    // compute the graph one node at a time; after each call, the result of
    // node i is available in t1 and can be inspected before its memory is reused
    for (int i = 0; i < g1->n_nodes; i++) {
        struct ggml_tensor * t1 = g1->nodes[i];
        struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
        ggml_backend_graph_compute(backend, &g1v);
    }

There was also a callback added to ggml_backend_sched for this purpose in ggerganov/llama.cpp#4935.
If you want to keep some of the intermediate results, the recommended approach would be to pre-allocate some tensors in a different buffer and use ggml_cpy to copy the result there (sketched below). Technically it is also possible to add a dependency at the end of the graph with a no-op such as ggml_scale(ctx, a, 1), but I wouldn't recommend that.
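
A minimal sketch of the ggml_cpy pattern (not from the thread; it assumes ctx is the no_alloc graph context used to build the graph, gf is the compute graph, result is the intermediate tensor to preserve, and keep_ctx is a hypothetical second context created with no_alloc == false so its tensors own their memory):

    // create a separately allocated tensor with the same type and shape as the result
    struct ggml_tensor * keep = ggml_dup_tensor(keep_ctx, result);
    // append a copy node so the intermediate value is written into `keep`
    // during graph computation, before ggml-alloc reuses its memory
    ggml_build_forward_expand(gf, ggml_cpy(ctx, result, keep));

The same idea should also work with a tensor placed in a dedicated backend buffer; the key point is that the destination tensor is not managed by the graph allocator.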

bssrdf (Contributor, Author) commented Jan 18, 2024

Thanks for the quick response.

Sorry, I am new to ggml. I understand this memory overwrite is fine for inference (i.e., forward compute). How about backward compute? Won't this memory overwrite defeat the purpose of backpropagation for training? I noticed this behavior when training a VAE.

slaren (Collaborator) commented Jan 18, 2024

I don't know much about training, but I believe that the way the training examples in llama.cpp handle this is by adding dependencies at the end of the graph with ggml_scale(ctx, a, 1), which may be the best way to do this at the moment if you need to keep a lot of the intermediate results.
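
For reference, a hedged sketch of that dependency trick (assuming ctx is the graph context, gf is the compute graph, and a is an intermediate tensor whose result should be kept):

    // appending a no-op scale makes `a` a dependency of a node at the very end
    // of the graph, so ggml-alloc will not reuse its memory for earlier nodes
    ggml_build_forward_expand(gf, ggml_scale(ctx, a, 1.0f));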

bssrdf (Contributor, Author) commented Jan 18, 2024

Thanks for the suggestions.
