Improve debugability of warnings/errors "Triggered internally at" #128064
Comments
Anomaly mode is supposed to always work for errors that are triggered from Variable._execution_engine.run_backward. But actually, it seems like the problem here is warnings, rather than errors, and it does seem likely to me that we are not annotating warnings with the traceback of the forward that caused them. This is compounded by the fact that we are typically not setting stacklevel correctly when we warn, so the single-line warning printout doesn't even say what the relevant user code is. It feels like it should be possible to install a temporary warning handler when we run backwards which augments warnings with user stacks as well. But we... probably don't want to print the full stacks? So we need some way of abbreviating it to one filename:lineno by default?? Not trivial. If anyone wants to try their hand at it I'd be happy to review. @albanD wdyt
It might also be a good time, in the age of PT2 supremacy, to consider turning anomaly mode error tracking on by default.
To be clear, I am not talking about anomaly mode, but about warnings (that are printed from C++) in general. I actually meant that anomaly mode is an example of relatively GOOD warning messages. What I'd like to see is for all other C++ warnings in PyTorch to have the same debuggability.
Actually, I think that full tracebacks are required for debugging. Rather than trying to guess the correct stack level, I'd prefer to have a special "verbose warnings" mode where the full tracebacks are printed. I suspect that the hard part would be to record/identify the correct tracebacks during the forward pass. Without this, it doesn't matter if you are able to guess the stack level correctly, all of the warnings will just point to the
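A rough sketch of what such a "verbose warnings" mode could look like, in plain Python. Everything here is hypothetical: `Node` is a toy stand-in for an autograd node, and `HYPOTHETICAL_VERBOSE_WARNINGS` is an invented environment variable, not an existing PyTorch option. The point is only that the stack has to be recorded at forward time so it can be attached when a warning fires later.

```python
import os
import traceback
import warnings

# Hypothetical switch; not an existing PyTorch environment variable.
VERBOSE_ENV = "HYPOTHETICAL_VERBOSE_WARNINGS"


class Node:
    """Toy stand-in for an autograd node that remembers where it was created."""

    def __init__(self, name):
        self.name = name
        # Record the forward-time stack, dropping this __init__ frame so the
        # innermost entry is the user code that created the node.
        self.forward_stack = traceback.extract_stack()[:-1]

    def backward(self):
        msg = f"{self.name}: something suspicious happened in backward"
        if os.environ.get(VERBOSE_ENV) == "1":
            # Verbose mode: attach the full traceback recorded during forward.
            msg += "\nTraceback of forward call that caused the warning:\n"
            msg += "".join(traceback.format_list(self.forward_stack))
        warnings.warn(msg, RuntimeWarning, stacklevel=2)


def user_model():
    # The "user code" whose location we want to see in the warning.
    return Node("mul")
```

Recording a stack for every node has a cost, which is presumably why this would be opt-in rather than the default.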
Oh, I am reminded of #72948, which we eventually decided not to do because passing C++ log messages to Python was just... not a great idea. @kurtamohler did we ever take a closer look at the warning-only piece of the puzzle?
I don't think I ever looked into improving the traceability of warnings.
Actionable to augment warnings during backward with user stacks.
Reserving this as an internal onboarding task.
🚀 The feature, motivation and pitch
PyTorch has a horrible habit of obfuscating warnings/errors by making them lazy, deferring them and implementing them in the C++ code base. Even the usual Python trick of making warnings raise exceptions (`warnings.filterwarnings("error")`) doesn't help in this case. Too often, the user is confronted with something along the lines of:

See this issue search query for an approximate list of users running into such issues. This problem has almost become normalized to the point where people are conditioned to ignore the warnings emitted by PyTorch (and I honestly can't blame them too much).
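For reference, this is what the `warnings.filterwarnings("error")` trick mentioned above does for an ordinary Python-level warning. The sketch uses only the standard library, and `noisy` and `functions_in_traceback` are made-up names. The warning becomes an exception whose traceback passes through the function that warned, which is precisely the information that is lost when the warning is deferred and raised from C++.

```python
import traceback
import warnings


def noisy():
    # An ordinary Python-level warning, emitted directly from user code.
    warnings.warn("something looks off", UserWarning)


def functions_in_traceback():
    """Turn warnings into errors and list the functions in the traceback."""
    with warnings.catch_warnings():
        warnings.filterwarnings("error")  # every warning now raises
        try:
            noisy()
        except UserWarning as exc:
            return [f.name for f in traceback.extract_tb(exc.__traceback__)]
    return []
```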
It seems that there are at least some mechanisms for determining and printing the actual traceback of the problem (like the `Traceback of forward call that caused the error:` message in anomaly detection mode). However, this only works in specific cases, where the traceback was intentionally included by the C++ developer adding this particular warning. Ideally, I'd like to see something similar for all/most invocations of `TORCH_WARN`, possibly only enabled when some config option or environment variable is set.
Alternatives
Accept the status quo (annoying, virtually undebuggable warnings with potential performance and correctness implications). It is still theoretically possible to debug these issues by essentially doing a binary search of the user's code (delete/stub out half of the code, run the code and hope that the warning disappears, etc.), but this is an insanely annoying process, especially if the warning is caused by some non-trivial combination of multiple pieces of code/features (amp, ddp, jit, etc.) or by code in external libraries (timm, lightning, fairscale, etc.).
Additional context
No response
cc @ezyang @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @malfet