Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First rough draft of recoverable errors feature. #1325

Closed

Conversation

KerfuffleV2
Copy link
Collaborator

@KerfuffleV2 KerfuffleV2 commented May 4, 2023

@ggerganov

Context: ggerganov/ggml#123

Unfortunately (like always) this turned out to be more complicated than I initially expected.

At first I was planning to make it possible to acknowledge and clear an error, but there are some cases where recovery seems impossible like my map functions for example: Because they have to create another tensor and there's no way to free a tensor clearing the error would still leave that tensor allocated.

Probably need to scrap that plan and say "Once an error occurs, you need to free that ggml_context and start over". I think that's fine. (Right now those specific functions will abort if tensor creation fails after the first step, but assuming the context is just considered dead then returning NULL will be okay.)

Also, the way I implemented this is if you ignore an error and just keep trying to use the context, then it will abort.

Right now, this just adds the basic scaffolding. I also had to add NULL propagation to basically all functions that make a tensor. This could be simplified with a define instead of having to have a bunch of if (blah == NULL) return NULL; statements but for this sketch I'm keeping it simple.

This converts the asserts in ggml_add_impl and ggml_new_tensor_impl to be recoverable. There's also code cleanup, etc that would need to be done before this is actually ready to become a real pull.

Right now GGML_RECOVERABLE_ERRORS is just #defineed for testing. I have verified that it compiles (on Linux at least) and can run a model. I haven't tested triggering the asserts yet.

Before I continue with the rest, I want to make sure that I'm generally on the right track. What do you think?

@KerfuffleV2
Copy link
Collaborator Author

There are two possible alternative approaches instead of just always marking the context as dead if an error occurs:

  1. Have a special permanent flag that can't be cleared and allow most errors to be recovered from. You could just get unlucky and have something like ggml_map_unary_f32 succeed grabbing the addr tensor then fail with the map tensor itself.
  2. There are only a couple places where this is a problem. Maybe there's a way to pre-check all the conditions initially before proceeding to make sure the operation fails before it does anything that can't be reversed.

Those are technically possible (and I'd be willing to work on it if you have strong feelings) but personally it's probably not worth making things excessively complicated. I don't like the first option because it's basically random if you get screwed over and there needs to be special handling to deal with that case.

do { \
if (!(x)) { \
printf(__VA_ARGS__); \
fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could these be defined as e.g.

Suggested change
fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
fprintf(stderr, "GGML_ASSERT: %s:%d: %s: %s\n", __FILE__, __LINE__, __func__, #x); \

? That way, asserts wouldn’t have to handle the function name boilerplate.

@KerfuffleV2
Copy link
Collaborator Author

I just had an idea for an approach that's probably better (and also likely useful for other stuff too). Add a function to do the size calculation part without actually adding a tensor. ggml_new_tensor_impl could also just use that without having to duplicate code. I think there's also a clever way to make it possible to check if multiple tensors fit before committing to anything.

Something like:

bool ggml_tensor * ggml_ensure_tensor_memory(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    n_dims,
        const int64_t* ne,
        int *ctx_required,
        int *scratch_required) {
        // etc
        *ctx_required += calculated_ctx_required;
        *scratch_required += calculated_scratch_required;
        return *ctx_required <= available_ctx && *scratch_required <= available_scratch;
}

That way it would be possible to call the function repeatedly as long as it returned true and it would just update the required memory arguments (which would be set to 0 initially).

I actually like this idea and I think it would be pretty easy to make those few functions that are currently hard to deal with safely retryable.

ggml.h Outdated Show resolved Hide resolved
ggml.c Outdated Show resolved Hide resolved
@@ -11741,6 +11816,9 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
}

void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) {
#ifdef GGML_RECOVERABLE_ERRORS
GGML_ASSERT(ctx->last_error_code == GGML_ERRCODE_SUCCESS);
#endif
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is needed

Copy link
Collaborator Author

@KerfuffleV2 KerfuffleV2 May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it works currently is setting an error status in the context when an error occurs. When the error occurs, that operation didn't complete successfully.

Imagine this scenario:

  1. Create tensor 1 (succeeds)
  2. Create tensor 2 (fails)
  3. Run the graph with ggml_graph_compute

At step 3, we didn't actually build the graph successfully. We can't compute it, because tensor 2 is missing or in an invalid state. Right?

Now, if it had been tensor 1 that failed, trying to create tensor 2 would have run into the assert and crashed the process: step 3 would never be reached so that's okay.


I will make the other changes you suggested so the API stays the same. Can I take this response as indicating you don't have a problem with the general approach I'm using and I should continue to develop it?

edit: Also, do you have an opinion on the clearing error conditions stuff and approach to take with that?

Change return from ggml_last_error_msg to be const.

Cast returning error msg buffer to avoid a compiler warning.
@KerfuffleV2
Copy link
Collaborator Author

Closing this, at least for now as it doesn't seem like there's interest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants