
Improve debuggability of warnings/errors "Triggered internally at" #128064

Open
RuRo opened this issue Jun 5, 2024 · 7 comments
Assignees
soulitzer
Labels
actionable · module: autograd (Related to torch.autograd, and the autograd engine in general) · module: error checking (Bugs related to incorrect/lacking error checking) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@RuRo

RuRo commented Jun 5, 2024

🚀 The feature, motivation and pitch

PyTorch has a horrible habit of obfuscating warnings/errors by making them lazy/deferred and emitting them from the C++ code base. Even the usual Python trick of making warnings raise exceptions (warnings.filterwarnings("error")) doesn't help in this case. Too often, the user is confronted with something along the lines of:

.../site-packages/torch/autograd/graph.py:744: UserWarning: Somebody added this warning, presumably to
warn you about something going wrong in your code. This problem might be making your training slow or
inaccurate. Or it might just be a false warning, lol. Oh and here is some contextual information that
might make sense in a C++ context, but is almost useless for somebody who is using torch from python.
(Triggered internally at torch/good_luck.cpp:9001.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
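
For reference, the trick mentioned above is the standard warnings-filter mechanism; a minimal sketch:

    import warnings

    # The usual Python trick: escalate warnings to exceptions so they surface
    # with a Python traceback. As described in this issue, it is of little help
    # for warnings that are emitted lazily from the C++ engine.
    warnings.filterwarnings("error")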

See this issue search query for an approximate list of users running into such issues. This problem has almost become normalized to the point where people are conditioned to ignore the warnings emitted by PyTorch (and I honestly can't blame them too much).

It seems that there are at least some mechanisms for determining and printing the actual traceback of the problem (like the "Traceback of forward call that caused the error:" message in anomaly detection mode). However, this only works in specific cases where the traceback was intentionally included by the C++ developer adding that particular warning. Ideally, I'd like to see something similar for all/most invocations of TORCH_WARN, possibly only enabled when some config option or environment variable is set.
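
For concreteness, anomaly detection mode is enabled via torch.autograd.detect_anomaly / torch.autograd.set_detect_anomaly; a minimal sketch of its use (note that it annotates errors raised during the backward pass, not arbitrary TORCH_WARN warnings):

    import torch

    # With anomaly detection enabled, an error (or a NaN gradient) detected while
    # running backward() is reported together with the traceback of the forward
    # call that created the offending node.
    with torch.autograd.detect_anomaly():
        x = torch.randn(4, requires_grad=True)
        loss = (x * 2).sum()
        loss.backward()  # if this failed, the forward traceback would be printed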

Alternatives

Accept the status quo (annoying, virtually undebuggable warnings with potential performance and correctness implications). It is still theoretically possible to debug these issues by essentially doing a binary search over the user's code (delete/stub out half of the code, run it and hope that the warning disappears, etc.), but this is an insanely annoying process, especially if the warning is caused by some non-trivial combination of multiple pieces of code/features (amp, ddp, jit, etc.) or by code in external libraries (timm, lightning, fairscale, etc.).

Additional context

No response

cc @ezyang @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @malfet

@janeyx99 added the module: autograd, module: error checking, and triaged labels on Jun 6, 2024
@soulitzer removed the module: autograd label on Jun 6, 2024
@ezyang added the module: autograd label on Jun 9, 2024
@ezyang
Contributor

ezyang commented Jun 9, 2024

Anomaly mode is supposed to always work for errors that are triggered from Variable._execution_engine.run_backward. But actually, it seems like the problem here is warnings, rather than errors, and it does seem likely to me that we are not annotating warnings with the traceback of the forward that caused them. This is compounded by the fact that we typically don't set stacklevel correctly when we warn, so the single-line warning printout doesn't even say what the relevant user code is.

It feels like it should be possible to install a temporary warning handler when we run backward which augments warnings with user stacks as well. But we... probably don't want to print the full stacks? So we need some way of abbreviating it to one filename:lineno by default?? Not trivial. If anyone wants to try their hand at it I'd be happy to review.

@albanD wdyt
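
A rough user-level sketch of the "temporary warning handler" idea (a hypothetical helper, not a PyTorch API; it can only attach the Python-side stack at the backward() call, not the forward stack, which is the hard part):

    import contextlib
    import traceback
    import warnings

    @contextlib.contextmanager
    def backward_warnings_with_stack():
        # Hypothetical helper: record warnings raised while backward() runs and
        # re-emit them afterwards with the current Python stack appended.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            yield
        stack = "".join(traceback.format_stack())
        for w in caught:
            warnings.warn(
                f"{w.message}\n(Python stack at the backward() call:)\n{stack}",
                category=w.category,
                stacklevel=2,
            )

    # Usage:
    #     with backward_warnings_with_stack():
    #         loss.backward()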

@ezyang
Contributor

ezyang commented Jun 9, 2024

It might also be a good time, in the age of PT2 supremacy, to consider turning anomaly mode error tracking on by default.
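
Opting in globally is already possible today; a sketch (the check_nan keyword is available in recent PyTorch releases and, to my understanding, keeps the forward-traceback recording while skipping the extra NaN checks during backward):

    import torch

    # Enable anomaly-mode error tracking for the whole process. check_nan=False
    # (recent PyTorch versions) skips the per-node NaN checks while still
    # recording forward tracebacks for error reporting.
    torch.autograd.set_detect_anomaly(True, check_nan=False)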

@RuRo
Author

RuRo commented Jun 10, 2024

Anomaly mode is supposed to always work for errors that are triggered from Variable._execution_engine.run_backward. But actually, it seems like the problem here is warnings, rather than errors

To be clear, I am not talking about anomaly mode, but about warnings (that are printed from C++) in general. I actually meant that anomaly mode is an example of relatively GOOD warning messages. What I'd like to see is for all other C++ warnings in PyTorch to have the same debuggability.

But we... probably don't want to print the full stacks? So we need some way of abbreviating it to one filename:lineno by default?? Not trivial.

Actually, I think that full tracebacks are required for debugging. Rather than trying to guess the correct stack level, I'd prefer to have a special "verbose warnings" mode where the full tracebacks are printed.

I suspect that the hard part would be to record/identify the correct tracebacks during the forward pass. Without this, it doesn't matter if you are able to guess the stack level correctly; all of the warnings will just point to the my_loss.backward() line in the user's code (which isn't that much of an improvement compared to return Variable._execution_engine.run_backward).
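
A user-level sketch of such a "verbose warnings" mode (hypothetical, not an existing PyTorch feature): override warnings.showwarning to also print the full Python traceback. For warnings emitted from C++ during backward, this still only reaches the loss.backward() call site; recording the forward-pass traceback is the part that needs support inside PyTorch.

    import traceback
    import warnings

    _original_showwarning = warnings.showwarning

    def _verbose_showwarning(message, category, filename, lineno, file=None, line=None):
        # Print the warning as usual, then the full Python stack at the point
        # where it surfaced.
        _original_showwarning(message, category, filename, lineno, file, line)
        traceback.print_stack()

    warnings.showwarning = _verbose_showwarning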

@ezyang
Contributor

ezyang commented Jun 10, 2024

Oh, I am reminded of #72948 which we eventually decided not to do because passing C++ log messages to Python was just... not a great idea. @kurtamohler did we ever take a closer look at the warning only piece of the puzzle?

@kurtamohler
Collaborator

I don't think I ever looked into improving the traceability of warnings.

@soulitzer
Contributor

Actionable: augment warnings during backward with user stacks.
Prior to executing each node during backward, we already enter a warning recording context; when anomaly mode is enabled, we should be able to include more information there. See #66235.

@soulitzer self-assigned this on Jun 18, 2024
@soulitzer
Contributor

Reserving this as an internal onboarding task
