-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cachex.fetch hangs and breaks my app #319
Comments
Next time rather than toggle it off, I suggest you check several things about the state of the cache:
Rather than build a toggle to turn it on and off, I would suggest that you enable some sort of logging so we can see exactly where things are getting stuck. It would also be helpful to know the version you are using.
Maybe, but I was given no reproduction case in that issue and I still have no reproduction case in this issue, not entirely sure how I'm expected to magically fix these things!
There was no real need for this comment... 🙂 I understand you're probably frustrated because your app isn't working, but filing this for help in this tone isn't particularly helpful. Cachex is proven in production use in hundreds of systems (including several government systems). It's either something specific to your implementation, or a bug which can be fixed. I would recommend against building your own implementation until you understand exactly what is happening, because you'll likely just hit the same problem. |
Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.
The key is just the ID taken from the page URL, just a simple string.
I don't know the exact size of the caches, but the entire app is using around 300mb total.
I think there's usually less than a hundred keys.
I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Are there any other things I can log to debug this?
…On December 7, 2023 4:45:35 PM EST, Isaac Whitfield ***@***.***> wrote:
> We've got a basic Phoenix app. In our controllers we use Cachex.fetch for a basic read-through scenario to load data from the database with ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again.
Next time rather than toggle it off, I suggest you check several things about the state of the cache:
* The key you're trying to read/write
* The size of the cache (the keyspace)
* The memory consumption of the cache
* The memory consumption of your application
Rather than build a toggle to turn it on and off, I would suggest that you enable some sort of logging so we can see exactly where things are getting stuck. It would also be helpful to know the version you are using.
> I suspect I'm running into the unfixed scenario mentioned here: #287 (comment)
Maybe, but I was given no reproduction case in that issue and I still have no reproduction case in this issue, not entirely sure how I'm expected to magically fix these things!
> Since it seems like Cachex is unreliable, we're going to have to rip it out and build our own simple caching with ETS.
There was no real need for this comment... 🙂 I understand you're probably frustrated because your app isn't working, but filing this for help in this tone isn't particularly helpful.
Cachex is proven in production use in hundreds of systems (including several government systems). It's either something specific to your implementation, or a bug which can be fixed. I would recommend against building your own implementation until you understand exactly what is happening, because you'll likely just hit the same problem.
--
Reply to this email directly or view it on GitHub:
#319 (comment)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
|
I edited the description to remove my offensive comment. Sorry again. |
No problem, totally get it - hopefully I can do at least something here to help!
I mean more to see if there's a trend in the keys triggering the issue; there shouldn't be but I'm really curious about the "after about a day" part. This implies it's something specific to set it off, so starting with the keys (or maybe the data behind them) could be a good start.
Yeah, makes perfect sense! Definitely would have done the same.
Yeah this is the most interesting thing here; it seems like nothing crashes it just gets stuck. The only thing that can do this is the background process handling
I think if you log before you call What version of Cachex are you using? And is this totally reproducible for you? If we e.g. found a fix, would you be able to tell for sure if it helped? I'm also more than happy to try your exact code flow; if you don't want to expose this on GitHub, I'm reachable on the Elixir Slack or via the email on my GitHub profile. In the other issue you linked, I spent some time staring at the Edit: even if I can't get a reproduction case, I might try clean up the Courier further. There's definitely a few weird cases of behaviour in there I can make more robust - it should effectively be the same, but in case of some edge cases it might help clear them up. |
We're using cachex 3.6.0. I added the logging you asked for. I built a way to clear a specific cache at runtime into my app. I'll see if clearing a cache helps anything. I'm going to try to build a simpler test case, but it might take me a little while. |
@ericlathrop got it, so my changes in v3.6 didn't help - I'll spend some more time. I'm flying a lot this week, but maybe I can get some time on those to take a long look! No rush, whenever you get chance just ping me back here! |
@ericlathrop I'm pretty sure I've located the cause here - somehow the child processes are being killed, and because the I wrote a test case for this, and it hangs. I made some changes (which will probably be in Just to set expectation properly; this should stop the hanging issue. It could be that fixing this will give you back an error that still needs follow up investigation on my part (depending on why the child process is dying). This is why I'm not releasing these changes just yet, in case they need further work based on what we uncover. Let me know whenever you get some time, happy holidays! |
@whitfin thanks for investigating. I spent a day last week creating a sample project trying to reproduce. I wasn't able to reproduce, but I got some weird pauses when I got to like 40 worker processes. I'll post the repo later when I'm at my computer |
I do suspect the problem was related to crashes because we get periodic ecto crashes related to running out of postgres connections. I didn't get time to test that in my example repo. |
@ericlathrop awesome, no rush :) If it is due to crashes, and you have some other crashes happening, then yeah this is likely the cause. In that case if you can see if the errors that are raised are actually useful, I'd appreciate it (it'll probably just bubble up the Ecto error, but it would be nice to verify). If we can confirm this, I'm happy to get it published ASAP! |
(reopening just due to GitHub auto close, as we still need to verify) |
Just following up @ericlathrop - did you need to check anything on your side, or are we happy that this is solved in |
@whitfin Hey, sorry for the delay. My sample project where I tried to reproduce the error (but wasn't able to) is here: https://gitlab.com/ericlathrop/debug_cachex. It basically spins up a bunch of genservers that try to read random stuff from the database through Sorry I wasn't able to reproduce the issue, and sorry, but I wasn't able to try out the new version. I switched our app to a different caching library, and it works for our usage, and it would be problematic to switch back and deploy it to see if it hangs again. Feel free to close this issue if you want. |
We've got a basic Phoenix app. In our controllers we use
Cachex.fetch
for a basic read-through scenario to load data from the database with ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again.There's no errors in my logs.
I suspect I'm running into the unfixed scenario mentioned here: #287 (comment)
Sorry I don't have any more specific details on how to reproduce.
The text was updated successfully, but these errors were encountered: