
Improve debuggability of warnings/errors "Triggered internally at" #128064

Open
RuRo opened this issue Jun 5, 2024 · 7 comments
Assignees
soulitzer
Labels
actionable · module: autograd (Related to torch.autograd, and the autograd engine in general) · module: error checking (Bugs related to incorrect/lacking error checking) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@RuRo

RuRo commented Jun 5, 2024

🚀 The feature, motivation and pitch

PyTorch has a horrible habit of obfuscating warnings/errors by making them lazy/deferred and emitting them from the C++ code base. Even the usual Python trick of making warnings raise exceptions (warnings.filterwarnings("error")) doesn't help in this case. Too often, the user is confronted with something along the lines of:

.../site-packages/torch/autograd/graph.py:744: UserWarning: Somebody added this warning, presumably to
warn you about something going wrong in your code. This problem might be making your training slow or
inaccurate. Or it might just be a false warning, lol. Oh and here is some contextual information that
might make sense in a C++ context, but is almost useless for somebody who is using torch from python.
(Triggered internally at torch/good_luck.cpp:9001.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
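
For reference, the trick mentioned above is the standard warnings-filter mechanism; a minimal sketch:

    import warnings

    # The usual Python trick: escalate warnings to exceptions so they surface
    # with a Python traceback. As described in this issue, it is of little help
    # for warnings that are emitted lazily from the C++ engine.
    warnings.filterwarnings("error")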

See this issue search query for an approximate list of users running into such issues. This problem has almost become normalized to the point where people are conditioned to ignore the warnings emitted by PyTorch (and I honestly can't blame them too much).

It seems that there are at least some mechanisms for determining and printing the actual traceback of the problem (like the "Traceback of forward call that caused the error:" message in anomaly detection mode). However, this only works in specific cases where the traceback was intentionally included by the C++ developer adding that particular warning. Ideally, I'd like to see something similar for all/most invocations of TORCH_WARN, possibly only enabled when some config option or environment variable is set.
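
For concreteness, anomaly detection mode is enabled via torch.autograd.detect_anomaly / torch.autograd.set_detect_anomaly; a minimal sketch of its use (note that it annotates errors raised during the backward pass, not arbitrary TORCH_WARN warnings):

    import torch

    # With anomaly detection enabled, an error (or a NaN gradient) detected while
    # running backward() is reported together with the traceback of the forward
    # call that created the offending node.
    with torch.autograd.detect_anomaly():
        x = torch.randn(4, requires_grad=True)
        loss = (x * 2).sum()
        loss.backward()  # if this failed, the forward traceback would be printed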

Alternatives

Accept the status quo (annoying, virtually undebuggable warnings with potential performance and correctness implications). It is still theoretically possible to debug these issues by essentially doing a binary search over the user's code (delete/stub out half of the code, run it and hope that the warning disappears, etc.), but this is an insanely annoying process, especially if the warning is caused by some non-trivial combination of multiple pieces of code/features (amp, ddp, jit, etc.) or by code in external libraries (timm, lightning, fairscale, etc.).

Additional context

No response

cc @ezyang @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @malfet

@janeyx99 added the module: autograd, module: error checking, and triaged labels on Jun 6, 2024
@soulitzer removed the module: autograd label on Jun 6, 2024
@ezyang added the module: autograd label on Jun 9, 2024
@ezyang
Contributor

ezyang commented Jun 9, 2024

Anomaly mode is supposed to always work for errors that are triggered from Variable._execution_engine.run_backward. But actually, it seems like the problem here is warnings, rather than errors, and it does seem likely to me that we are not annotating warnings with the traceback of the forward that caused them. This is compounded by the fact that we typically don't set stacklevel correctly when we warn, so the single-line warning printout doesn't even say what the relevant user code is.

It feels like it should be possible to install a temporary warning handler when we run backward which augments warnings with user stacks as well. But we... probably don't want to print the full stacks? So we need some way of abbreviating it to one filename:lineno by default?? Not trivial. If anyone wants to try their hand at it I'd be happy to review.

@albanD wdyt
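
A rough user-level sketch of the "temporary warning handler" idea (a hypothetical helper, not a PyTorch API; it can only attach the Python-side stack at the backward() call, not the forward stack, which is the hard part):

    import contextlib
    import traceback
    import warnings

    @contextlib.contextmanager
    def backward_warnings_with_stack():
        # Hypothetical helper: record warnings raised while backward() runs and
        # re-emit them afterwards with the current Python stack appended.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            yield
        stack = "".join(traceback.format_stack())
        for w in caught:
            warnings.warn(
                f"{w.message}\n(Python stack at the backward() call:)\n{stack}",
                category=w.category,
                stacklevel=2,
            )

    # Usage:
    #     with backward_warnings_with_stack():
    #         loss.backward()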

@ezyang
Contributor

ezyang commented Jun 9, 2024

It might also be a good time, in the age of PT2 supremacy, to consider turning anomaly mode error tracking on by default.
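
Opting in globally is already possible today; a sketch (the check_nan keyword is available in recent PyTorch releases and, to my understanding, keeps the forward-traceback recording while skipping the extra NaN checks during backward):

    import torch

    # Enable anomaly-mode error tracking for the whole process. check_nan=False
    # (recent PyTorch versions) skips the per-node NaN checks while still
    # recording forward tracebacks for error reporting.
    torch.autograd.set_detect_anomaly(True, check_nan=False)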

@RuRo
Author

RuRo commented Jun 10, 2024

Anomaly mode is supposed to always work for errors that are triggered from Variable._execution_engine.run_backward. But actually, it seems like the problem here is warnings, rather than errors

To be clear, I am not talking about anomaly mode, but about warnings (that are printed from C++) in general. I actually meant that anomaly mode is an example of relatively GOOD warning messages. What I'd like to see is for all other C++ warnings in PyTorch to have the same debuggability.

But we... probably don't want to print the full stacks? So we need some way of abbreviating it to one filename:lineno by default?? Not trivial.

Actually, I think that full tracebacks are required for debugging. Rather than trying to guess the correct stack level, I'd prefer to have a special "verbose warnings" mode where the full tracebacks are printed.

I suspect that the hard part would be to record/identify the correct tracebacks during the forward pass. Without this, it doesn't matter if you are able to guess the stack level correctly; all of the warnings will just point to the my_loss.backward() line in the user's code (which isn't that much of an improvement compared to return Variable._execution_engine.run_backward).
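
A user-level sketch of such a "verbose warnings" mode (hypothetical, not an existing PyTorch feature): override warnings.showwarning to also print the full Python traceback. For warnings emitted from C++ during backward, this still only reaches the loss.backward() call site; recording the forward-pass traceback is the part that needs support inside PyTorch.

    import traceback
    import warnings

    _original_showwarning = warnings.showwarning

    def _verbose_showwarning(message, category, filename, lineno, file=None, line=None):
        # Print the warning as usual, then the full Python stack at the point
        # where it surfaced.
        _original_showwarning(message, category, filename, lineno, file, line)
        traceback.print_stack()

    warnings.showwarning = _verbose_showwarning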

@ezyang
Contributor

ezyang commented Jun 10, 2024

Oh, I am reminded of #72948 which we eventually decided not to do because passing C++ log messages to Python was just... not a great idea. @kurtamohler did we ever take a closer look at the warning only piece of the puzzle?

@kurtamohler
Collaborator

I don't think I ever looked into improving the traceability of warnings.

@soulitzer
Contributor

Actionable: augment warnings during backward with user stacks.
Prior to executing each node during backward, we already enter a warning recording context; when anomaly mode is enabled, we should be able to include more information there. See #66235.

@soulitzer self-assigned this on Jun 18, 2024
@soulitzer
Contributor

Reserving this as an internal onboarding task
