
Erratic crash apparently related to the Unpkg dialog #259

Closed

Lithopsian opened this issue Feb 5, 2024 · 12 comments

Comments

@Lithopsian
Contributor

This one happens, then doesn't for ages even when I try, then happens again. So far, the way I can provoke it is by starting a package query, then starting an Unpkg query before it has completed. It might happen the first time, it might take many attempts. When I've caught it in the debugger, the crash has happened at several different points, usually processing the Unpkg settings, but also once clearing the DirTree after the Unpkg dialog returns and once in TreemapTile where the root orig pointer was clearly stale or corrupt (the magic number was negative and large).

Here are a couple of crashes that I caught, but the corruption is likely to have happened quite a bit earlier. Logging seems to end abruptly around the time that the read jobs are being killed and the file list cache is being destroyed, but apparently some time before the actual crash. I did get the Unpkg settings dumped once and they looked intact, but usually the logging stops earlier than that.
[Screenshots attached: Screenshot_2024-02-05_15-19-55, Screenshot_2024-02-05_16-08-58]

@shundhammer
Owner

shundhammer commented Feb 5, 2024

If you are trying to force crashes, chances are that they will happen.

Just have a little patience and don't nervously open a dialog to trigger a read again while the previous one hasn't finished yet. This is a clear case of "Doctor, it hurts when I do that!" - "Then don't do that."

I don't think that this has ever been a problem in real life for users. And it's not as if any precious data that the user spent any work on would get lost; at worst, the program needs to be restarted. So what: Impatiently clicking wildly in the menus to stop one program mode and start a completely different one means that the user has no concept of what he actually wants to achieve.

I find myself spending way too much time on such artificially created scenarios, and the code complexity keeps growing.

@Lithopsian
Contributor Author

It happened. Normally. While I was doing other things. More than once. I only tried "forcing" things to make it happen repeatedly for debugging (and failed, to be honest; it is still hard to reproduce on demand). If it happened for me, then it has happened for someone else, just 99% of them will pass on by without reporting it. Don't fix it if you don't want to, but blaming a crash on the user seems like a pretty poor attitude to me.

@shundhammer
Owner

Here is the deal: Adding more and more code to handle very exotic cases does not make the code any better; to the contrary, it makes it considerably worse.

Every additional if (...) ... else ... doubles the number of code paths. And a great number of them are error handling of some kind. The problem is that error handling code is mostly dead code - because it's the code path taken only in very exceptional cases. Like 0.001% or less.

Error handling code is the most bug-ridden code in pretty much every piece of software, because it's hard to test; even worse if it's a long if (...) cascade in a long call chain. I've been in projects where we were proud of our test coverage, and when we could finally convince the boss to shell out money for some code coverage software, it turned out that 80% (!!) of all code paths were never taken and thus completely untested, despite the test coverage that we thought we had.

And repeatedly, we received customer bug reports about crashes which mostly turned out to be in some of that error handling code. More often than not, the error handling for harmless problems which would have caused only a minor annoyance caused a hard crash; and thus a showstopper that the user could not work around.

Dead code is typically bad quality code; because it is almost never used, and if it is, it very often leads to nasty surprises. It degrades the overall code quality, it leads to technical debt, it makes the software less maintainable.

It's a very fine line between having a reasonable amount of error handling code and going completely overboard with error handling and obscuring the normal code path.

@shundhammer
Owner

As for "it happened in real life", I really wonder about your usage pattern. Starting a pkg query, and while it is still running (which takes - what - 20 seconds?), immediately starting an unpkg query? That's a completely different use case.

I use the normal dir tree mode, pkg mode, unpkg mode, file age view a lot. But almost always I use them separately, not starting one after the other.

In some cases it's useful to simply start multiple instances of QDirStat; for example when comparing a directory tree against the packages that make up most of it. It never even once occurred to me to nervously reload one or the other repeatedly.

@shundhammer
Owner

shundhammer commented Feb 6, 2024

I am very skeptical how many users use any of the other modes to begin with: The pkg mode, the unpkg mode, the file age view, let alone the histogram view.

There are those who keep popping up in user forums who want an exact copy of WinDirStat's file type view, complete with the color legend taking up a large part of the main window; and those users can't even be bothered to check the menus to find QDirStat's file type view. That's the amount of attention they give.

Bloggers keep writing articles where they don't even grasp what that colorful graphic in the bottom half of the main window is; but they know how to copy and paste their "How to install XY on Ubuntu", "How to install XY on Arch Linux", "How to install XY on Fedora" text. Not a single one of those people ever discovered advanced features like the Pkg view, the Unpkg view, the file age view. But they all know how to ask for donations for that "work" that they are doing.

We live in an age of ultra-short attention spans. Nobody reads documentation anymore. Nobody even bothers to check out the menus to see what's available. If a piece of software doesn't do within the first minute what a user has in his mind, toss it and try the next one. If a Linux distro doesn't do what a user wants it to do (without ever reading any documentation, of course), toss it and distro-hop to the next one. Users have unrealistic expectations these days, zero patience and hardly any willingness to learn.

This is the context in which you expect me to spend more time on the intricacies of switching from one mode to another, of impatiently interrupting an ongoing thing like scanning a directory tree or querying the package manager.

Sorry, no. I don't see any return on investment for this.

@shundhammer
Owner

shundhammer commented Feb 6, 2024

Worse, I don't see any tangible benefit for most users.

Take the recent GitHub issues since the 1.9 release, for example. You might have noticed that, unlike before, I did not add a change log entry to the README.md file for each of them. Why? Because most of those scenarios are so specialized that it would be a struggle to summarize them in a way that normal users can make sense of.

For some it may be possible, for others it can't be more than a link to the issue itself where a casually interested user will very likely get lost in too many technical details.

But all those things consume an enormous amount of time; time that could also be spent doing innovative things, not dwelling on exotic scenarios, on problems that almost no user has in real life.

I have limited time that I can and want to dedicate to QDirStat. I want to make that time count.

@Lithopsian
Contributor Author

So, the culprit is ... BusyPopup. No, not really, but it is part of the problem. The Unpkg stuff was a bit of a red herring, it just happened to be what I was doing the first time this happened. It happens any time a packaged or unpackaged read is started during a very short window just as a treemap is being built. Not so complicated really but it is quite a short window.

So, here's a fairly easy way to crash:

  1. Start a package read. It is convenient because it is relatively easy to see the progress and the treemap build is moderately slow, but the crash can happen with any sort of read.
  2. Open the readPkg dialog, ready to go.
  3. Just as the package read is finishing, start the next package read. An Unpackaged read will also cause it. Timing is tricky: you're trying to hit the spot where the treemap is getting built but before it is painted.

What happens is: the treemap gets built, but not yet painted. readPkg() clears the tree, but hasn't yet gone into "busyDisplay" mode, so the treemap is still live. There is a startingReading() signal that triggers busyDisplay(), which destroys any treemap and prevents another one from being built (usually, although it turns out you can hit F9 in the middle of a tree read and get a partial treemap, which isn't great). With a normal openDir() directory read, the signal is synchronous and the treemap is destroyed before it can get painted. BusyPopup appears to interrupt this and spin the event loop: the treemap paint() gets called, but there is no tree. It may crash immediately, or sometimes it just causes massive corruption and crashes somewhere unrelated.

So, do you want to fix the bug or keep making excuses? Trivial fixes like explicitly disabling the treemap before the BusyPopup appear to prevent the crash (I can cause the crash maybe one time out of three attempts, but with this change I've never got it to crash), but you can probably come up with a more elegant and general solution.
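For illustration, a minimal sketch of what that "disable the treemap first" workaround could look like. The names here (readPkg(), busyDisplay(), _treemapView->disable(), _dirTree) are assumptions based on the description above, not the actual QDirStat code:

```cpp
// Sketch only -- names and call order are assumptions based on the
// description above, not the real QDirStat implementation.

void MainWindow::readPkg( const PkgFilter & pkgFilter )
{
    _treemapView->disable();    // empty the scene and ignore rebuild/paint requests
    busyDisplay();              // switch the UI to "reading" mode synchronously

    // Only now open the BusyPopup, which spins the event loop while the
    // package manager is queried: any paint event delivered from here on
    // finds a disabled, empty treemap instead of one that still points
    // into the tree we are about to clear.
    BusyPopup popup( tr( "Reading package database..." ), this );

    _dirTree->clear();
    _dirTree->readPkg( pkgFilter );
}
```

The key point is simply the ordering: the treemap is emptied synchronously before anything gets a chance to spin the event loop.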

@shundhammer
Owner

That BusyPopup was a fairly recent addition to show some feedback while data from the package manager are read.

Sigh. Here we go again.

@shundhammer reopened this Feb 6, 2024
@shundhammer
Owner

With the default settings, I find it impossible to reproduce the problem. Only when I change the timeout of the DelayedRebuilder from 200 milliseconds to 5000 milliseconds (5 seconds) can I force it sometimes.

@shundhammer
Owner

Please try again with this new version.

AFAICS the delayed rebuilding of the treemap used a FileInfo * that had become invalid when clearing the DirTree, and that member variable wasn't reset to 0 in TreemapView::clear(), which is connected to the DirTree::clearing() signal (as it should be).

But to trigger that problem, you really have to do it within that minuscule timeout of 200 milliseconds between getting the signal that the treemap should be rebuilt and the actual rebuilding. That's the same as the timeout between the first and the second click of a mouse double click.
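For readers following along, the fix boils down to something like this. A sketch only: _newRoot is the member variable described above (named explicitly later in this thread), _rootTile is a hypothetical member, and the real TreemapView::clear() does more than this:

```cpp
// Abbreviated sketch of the kind of fix described above.

void TreemapView::clear()
{
    if ( scene() )
        scene()->clear();       // drop all existing treemap tiles

    _rootTile = nullptr;        // hypothetical member for the current root tile
    _newRoot  = nullptr;        // the stale FileInfo * this issue was about
}
```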

@Lithopsian
Contributor Author

LGTM! I really hammered on overlapping rebuilds of different types and sizes, but no crash.

Do you think it's worth having some CHECK_MAGIC in the treemap? Too late for this bug, but it could potentially trap this type of error before it trashes memory and then crashes somewhere else. Currently the treemap build and the painting just assume that any FileInfo pointer they are given is valid until the treemap is destroyed (which it should be, but possibly isn't in some strange cases).
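A sketch of what such a check could look like at the start of tile construction. FileInfo evidently carries a magic number (that's how the corruption was spotted above), but the macro name, the constructor signature, and the failure action here are only illustrative:

```cpp
// Illustrative only -- the macro name, failure action, and constructor
// signature are assumptions, not the actual QDirStat code.

#define CHECK_TILE_MAGIC( fileInfo )                                        \
    do {                                                                    \
        if ( !(fileInfo) || !(fileInfo)->checkMagicNumber() )               \
            qFatal( "TreemapTile: stale or invalid FileInfo pointer" );     \
    } while ( false )

TreemapTile::TreemapTile( TreemapView  * parentView,
                          TreemapTile  * parentTile,
                          FileInfo     * orig,
                          const QRectF & rect )
{
    CHECK_TILE_MAGIC( orig );   // fail fast rather than build geometry from a bad pointer
    // ... normal tile construction using parentView, parentTile, rect ...
}
```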

@shundhammer
Owner

shundhammer commented Feb 9, 2024

All views should actually work on the DirTree and store as few lone FileInfo pointers as possible; or connect to the DirTree's signals to react to changes.

In this particular example, the _newRoot member variable held one for the delayed treemap build; delayed because of the usual Qt signals and events that need to be processed first. For example, the ResizeEvent needs to be processed to know the exact final size of the widget, so the GraphicsScene can be scaled accordingly to fit exactly into that screen space. The joys of event-driven graphics programming...

The treemap does connect to the DirTree signals, and it clears itself when the tree is cleared. But that _newRoot was missing in Treemap::clear(). Now it's there.

Other parts of QDirStat use different methods, such as a Subtree, which is designed to hold information about a FileInfo or DirInfo without storing a pointer to it, because in some situations (cleanups!) that pointer might have become invalid in the meantime: a Subtree stores a path / a URL instead, at the price of needing to locate it again in the (remaining) DirTree when needed, and it might still return null if the URL doesn't exist in the DirTree anymore.
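Condensed, the Subtree idea looks roughly like this; method names are approximate and the real class has more to it:

```cpp
// Rough sketch of the Subtree idea -- store a URL, not a pointer, and
// look the node up again on demand. Method names are approximate.

#include <QString>

class Subtree
{
public:
    void set( FileInfo * fileInfo )
    {
        _tree = fileInfo ? fileInfo->tree() : nullptr;
        _url  = fileInfo ? fileInfo->url()  : QString();
    }

    // Re-locate the node in the (remaining) tree; may return nullptr
    // if it no longer exists, so callers have to check.
    FileInfo * subtree() const
    {
        return _tree ? _tree->locate( _url ) : nullptr;
    }

private:
    DirTree * _tree = nullptr;
    QString   _url;
};
```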

The Subtree came much later, long after the treemap, so the treemap uses a FileInfo pointer.

I don't know. Maybe introduce a smart pointer for certain things that become null when the corresponding FileInfo node is destroyed. Or use a Subtree more often.
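One possible shape for such a smart pointer, purely as an illustration: addGuard() and removeGuard() are hypothetical hooks that ~FileInfo() would have to call for every registered guard, and nothing like this exists in QDirStat today:

```cpp
// Purely illustrative -- addGuard() / removeGuard() are hypothetical
// hooks; ~FileInfo() would call clear() on every registered guard.

class FileInfoGuard
{
public:
    explicit FileInfoGuard( FileInfo * node = nullptr ) { reset( node ); }
    ~FileInfoGuard() { reset( nullptr ); }

    void reset( FileInfo * node )
    {
        if ( _node )
            _node->removeGuard( this );
        _node = node;
        if ( _node )
            _node->addGuard( this );
    }

    void clear() { _node = nullptr; }           // called from ~FileInfo()

    FileInfo * get() const { return _node; }
    operator bool() const  { return _node != nullptr; }

private:
    FileInfo * _node = nullptr;
};
```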

This particular problem is now fixed; and there really shouldn't be many lone FileInfo pointers around elsewhere. I hope.
