Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dont catch ocaml runtime exceptions #964

Merged
merged 8 commits into from
Aug 4, 2023

Conversation

raphael-proust
Copy link
Collaborator

Exceptions such as Out-of-memory and Stack-overflow shouldn't be caught by Lwt.

  • Or should they? Please let me know: this PR changes the inner workings of Lwt in an opinionated way.
  • Any other exceptions should be let through?

@raphael-proust raphael-proust force-pushed the dont-catch-ocaml-runtime-exceptions branch from 7755e44 to a19f764 Compare October 12, 2022 12:31
@hannesm
Copy link
Contributor

hannesm commented Oct 12, 2022

I'm in favour of this change.

@Julow
Copy link

Julow commented Oct 17, 2022

When the unipi unikernel runs out of memory while fetching its data, the out of memory exception is handled and somehow turned into an empty result (the unikernel finally crashes because of that).
I couldn't reproduce the issue with this PR. The unikernel crashes because of an out of memory exception, as expected.

@raphael-proust raphael-proust force-pushed the dont-catch-ocaml-runtime-exceptions branch from a19f764 to e914f3b Compare October 27, 2022 07:50
@raphael-proust
Copy link
Collaborator Author

I added a couple of commits. One avoids catching OOM/SO in Lwt_seq. This had been forgotten because of the different way exceptions are caught (match … with | exception … -> instead of try-with). This is clearly in line with the rest and not a worry at all.

The other commit avoids catching OOM/SO in Lwt_main.run. By avoiding catching those, it leaves the main-loop in an inconsistent state. As a result, it is not possible to restart the main loop after a runtime exception escapes (calling Lwt_main.run again raises an exception immediately).

I'm not entirely sure about this second change. It might be reasonable for someone to call Lwt_main.run again after SO (maybe not so much after OOM though). Opinions? Recommendations? Ideas?

(Also I rebased on master.)

@vouillon
Copy link
Member

I'm concerned that these exceptions can put Lwt in an inconsistent state. You mention the main loop, but the change in Lwt.with_value is probably an issue as well.

Also, this means that it is not longer possible to contain these exceptions to a particular thread. Instead, they might be caught at some completely unrelated place.

If we want these exceptions to stop the program, maybe we should call exit right away rather than ignoring them. This is the default behavior of Lwt.async, by the way.

@raphael-proust
Copy link
Collaborator Author

I'm concerned that these exceptions can put Lwt in an inconsistent state. You mention the main loop, but the change in Lwt.with_value is probably an issue as well.

The change with with_value would be an issue if the exceptions were not consistently left to escape. But in the current implementation, when an OOM happens, then it propagates all the way to the outside of the main loop which cannot be re-entered. Because it cannot be re-entered, none of the paused promises and none of the promises waiting on I/O will be resolved. For this reason, I don't know that the inconsistent state that with_value is left in is observable. Or have I missed something?

I guess inconsistent with_value state could be observable if a user were to store a wakener in a global mutable data-structure, and then, after an OOM exception had escaped Lwt_main.run call, they used the wakener to trigger the registered callbacks. I can make the wording stronger on the documentation of Lwt_main.run to indicate that this shouldn't be done.

Also, this means that it is not longer possible to contain these exceptions to a particular thread. Instead, they might be caught at some completely unrelated place.

The aim is to make it only possible to catch outside of Lwt. Specifically, outside of the Lwt_main.run. I do think this is a very opinionated thing to do. But I disagree that the place the exception would be caught is unrelated: the exception was raised sometime during the execution of Lwt_main.run and it is catchable by the caller of Lwt_main.run.

In addition, it is correct that it is no longer possible to catch these exceptions using the tools provided by Lwt. However, it is possible to catch them from within a non-Lwt section of the code. E.g.,

let mk n =
  let* () = Lwt.pause () in
  let a = try Array.make n 0 with Out_of_memory -> [||] in
  let* () = (* iterate over the indexes of the array here *) in
  a

Basically, these exceptions can be handled at the very local scale (sub-promise scale) or very globally (above run) but not in-between.

If we want these exceptions to stop the program, maybe we should call exit right away rather than ignoring them. This is the default behavior of Lwt.async, by the way.

One issue with calling exit is that we have to chose the exit code. Different applications have different meaning for different exit codes.

Maybe we could register a global handler for these exceptions and make it an option so that it can be set to be ignored. I'm not a big fan of registering global handlers through mutable state. But if it is important for some application to be able to handle runtime exceptions maybe it is worth adding that feature.


Is there a way that with_value's inconsistency is observable by casual users?

Should we had a flag to capture/not the exceptions? Should we add a global mutable ocaml-runtime-exception handler?
What's the use-case for that?

@gadmm
Copy link

gadmm commented Oct 28, 2022

As a general rule, you want to propagate the exception to let the user program do some clean-up; and as a general rule also the programmer should only definitely catch the exception if they know that no broken state escapes. Typically, they let the program terminate. Without looking at the details of the implementation, the change sounds good in principle (modulo the caveats from my reply on discuss).

@vouillon
Copy link
Member

When the unipi unikernel runs out of memory while fetching its data, the out of memory exception is handled and somehow turned into an empty result (the unikernel finally crashes because of that). I couldn't reproduce the issue with this PR. The unikernel crashes because of an out of memory exception, as expected.

Does it make any difference if the out of memory exception is not ignored by the error handler (https://github.com/roburio/unipi/blob/fbaaa27cd641ab17b1f840ece441f8bdacf60b2e/unikernel.ml#L184)?

  let ignore_error _ ?request:_ _ _ = ()

@vouillon
Copy link
Member

Also, this means that it is not longer possible to contain these exceptions to a particular thread. Instead, they might be caught at some completely unrelated place.

The aim is to make it only possible to catch outside of Lwt. Specifically, outside of the Lwt_main.run. I do think this is a very opinionated thing to do. But I disagree that the place the exception would be caught is unrelated: the exception was raised sometime during the execution of Lwt_main.run and it is catchable by the caller of Lwt_main.run.

Suppose you have this piece of code:

let x = Lwt_main.run (Cohttp_lwt_unix.Client.get ...)

What I mean is that Lwt_main.run can fail with an exception which has nothing to do with the HTTP request, but comes from some unrelated code that happens to run concurrently. For instance, from a background thread started in a different module:

let rec background_thread () = let%lwt () = ... in background_thread ()
let () = Lwt.async background_thread

So, indeed, you can catch the exception outside Lwt_main.run. But for what? You have no idea where it comes from. So, for instance, you don't know whether you need to restart this background thread or another thread. You don't know whether any resource (file descriptors, ...) needs to be released either.

At the moment, you have a lot more control. The default behavior of Lwt is to stop the program if there is a stack overflow (or any uncaught exception) in the background thread. You can make the thread more robust by catching exceptions locally. You can use Lwt.finally to make sure that resources are released appropriately.

For a more concrete example, do you see how to update Cohttp so that a Web server can continue running without leaking file descriptors when the request handler fails with a stack overflow? Or do you consider that the Web server should just stop when this happens?

One issue with calling exit is that we have to chose the exit code. Different applications have different meaning for different exit codes.

OCaml already calls exit for uncaught exceptions, and there is no way to choose the exit code it uses.

@gadmm
Copy link

gadmm commented Oct 31, 2022

I am less familiar with Lwt but with fibers in multicore, every fiber would need to be discontinued in reaction to this exception, causing each fiber to be restarted with an exception, and subsequently finally handlers being run. Here you need something similar to release resources owned by every thread (one way to think about it is that the fact that the threads can own resources make themselves become resources that require clean-up).

I am on the fence regarding the treatment of Stack_overflow: if it is a request consuming too much resources you want to be able to catch it normally, but if it is a bug in the application you want to fail fast.

@raphael-proust
Copy link
Collaborator Author

Suppose you have this piece of code:

let x = Lwt_main.run (Cohttp_lwt_unix.Client.get ...)

What I mean is that Lwt_main.run can fail with an exception which has nothing to do with the HTTP request, but comes from some unrelated code that happens to run concurrently. For instance, from a background thread started in a different module:

let rec background_thread () = let%lwt () = ... in background_thread ()
let () = Lwt.async background_thread

I wouldn't recommend writing this code. Doing so you may give the impression that the background thread is executing concurrently and somewhat independently from the main.run call, but in fact the thread is stalled unless the main loop is running. (It will stop at the first pause or at the first I/O because the main loop isn't there to wake it up afterwards.)

Essentially, in that code, there is a hidden bit of control-flow where the call to Lwt_main.run causes the background thread to progress. (Which it stops doing if the main loop finishes.)

In that sense the exception happens in the context of the Lwt_main.run anyway. It cannot happen until the Lwt_main.run is running.


Still, I understand that some patterns of code similar to this one may need to have a more fine-grained error-management mechanism. Especially re: Stack_overflow. (For Out_of_memory I don't think it is as useful (maybe exit is the right way to go) and I don't know that we can offer the right tools anyway because we would need those tools to never allocate to not raise their own Out_of_memory in response.)

So I'm thinking of ways to allow more flexibility for users that might need to catch those exceptions specifically, and may need to resume work and/or clean-up resources. I'll push some more commits to try to accommodate those needs.

@hannesm
Copy link
Contributor

hannesm commented Nov 7, 2022

@Julow wrote

When the unipi unikernel runs out of memory while fetching its data, the out of memory exception is handled and somehow turned into an empty result (the unikernel finally crashes because of that). I couldn't reproduce the issue with this PR. The unikernel crashes because of an out of memory exception, as expected.

@vouillon replied

Does it make any difference if the out of memory exception is not ignored by the error handler (https://github.com/roburio/unipi/blob/fbaaa27cd641ab17b1f840ece441f8bdacf60b2e/unikernel.ml#L184)?

let ignore_error _ ?request:_ _ _ = ()

This is a great observation, thanks, and indeed fixed in robur-coop/unipi@1d2f5e0.

But, this is the error_handler used for the HTTP service, which is only started after the git clone was successful. The unikernel is executed after OCaml runtime setup by executing the start function, which does a git clone as its first operation (there's no error_handler being installed).

@raphael-proust
Copy link
Collaborator Author

Thinking about this a bit more, we could do the following as an intermediate solution:
Replace the is_not_ocaml_runtime_exception function by a allow_catching_of_exc reference; the initial (default) value for this reference is fun _ -> true to allow catching all exceptions (the currently released behaviour). Export two functions: allow_catching_all_exceptions: unit -> unit and prevent_catching_non_runtime_exceptions: unit -> unit to switch between the currently released behaviour and the behaviour proposed in this PR.

Later on, we can decide whether to change the default. We can make this decision based on feedback and actual uses.

@raphael-proust raphael-proust force-pushed the dont-catch-ocaml-runtime-exceptions branch from e914f3b to 2f1edec Compare March 3, 2023 10:13
@raphael-proust
Copy link
Collaborator Author

@vouillon @Julow @hannesm Are you all ok with the form I just pushed? TL;DR:

type exception_filter
val catch_all_filter : exception_filter
val catch_not_runtime_filter : exception_filter
val set_exception_filter : exception_filter -> unit

@hannesm
Copy link
Contributor

hannesm commented Mar 3, 2023

looks fine, but I've not much knowledge about the code and internals:

  • does it impose any performance regressions?
  • is it safe/unsafe to not catch Out_of_memory / Stack_overflow? Regarding to comments above, it may be bad for Lwt.set_value?
  • is "catch_not_runtime_filter" a good name (sorry, I've not come up with a better one)?
  • I'm not entirely convinced that pushing the decision to the API user is a good path forward: I prefer sensible defaults and a small API -- while here, the API is only extended, and the default (catching all exceptions) is not modified.

But under the line, I understand that this is a good compromise, and for me it is better than the earlier behaviour (I'll initialize Lwt to not catch the OOM and SO exceptions -- I also don't use the Lwt.*value functions).

@raphael-proust
Copy link
Collaborator Author

I also dislike the names I picked in this commit. I need to think about it a bit more to find something good. In the meantime suggestions are welcome.

Regarding performance regression, I think that it shouldn't affect performance of code that does not raise exceptions. That is because the diff for this PR is essentially replacing try … with exn -> … with try … with exn when !exception_filter exn -> …: the non-exception path is unaffected. Exception handling is going to be more expensive: one dereference, and one function application. It's not much: the function is simple and the reference is global. But I guess it might slow-down code which relies heavily on exceptions.

Re Lwt.with_value I first want to note that these functions are in the “deprecated” section of the documentation. They are not marked as [@deprecated] though so they are likely still used more than they should. As for potential issues with it: yes, it can leave the state of global tables of key-values in an inconsistent state. However, note that it is not possible to re-enter the main-loop of Lwt after an unhandled exception. So it is possible to check Lwt.get and get some random garbage, but it is not possible to start any meaningful Lwt computation where such a call would most likely be located.

@raphael-proust
Copy link
Collaborator Author

@vouillon What do you think about the approach taken by this PR?

@raphael-proust raphael-proust force-pushed the dont-catch-ocaml-runtime-exceptions branch from 64dc203 to f064ff1 Compare July 17, 2023 10:22
@raphael-proust
Copy link
Collaborator Author

This PR now has a shape I like enough to consider merging.

@hannesm do you like this interface enough?

@vouillon are you ok with the optional filtering system?

I'd like to do a release of Lwt soonish because it's been a while. Let me know opinions and such if you have any.

@hannesm
Copy link
Contributor

hannesm commented Jul 31, 2023

Dear Raphael,

thanks for your work. I appreciate the revised API (although it introduces mutable state).

src/core/lwt.mli Outdated Show resolved Hide resolved
@vouillon
Copy link
Member

vouillon commented Aug 2, 2023

This looks good to me, except for the small comment I made.

@raphael-proust raphael-proust force-pushed the dont-catch-ocaml-runtime-exceptions branch from aa72d56 to 032b120 Compare August 4, 2023 07:46
@raphael-proust raphael-proust merged commit 2eee2a1 into master Aug 4, 2023
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants