Cachex.fetch hangs and breaks my app #319

Closed
ericlathrop opened this issue Dec 7, 2023 · 13 comments · Fixed by #320

ericlathrop commented Dec 7, 2023

We've got a basic Phoenix app. In our controllers we use Cachex.fetch for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex and hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages loaded again.
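For reference, the usage is roughly along these lines (a simplified sketch; the cache name, schema, and controller are placeholders, not our actual code):

```elixir
defmodule MyAppWeb.PostController do
  use MyAppWeb, :controller

  # Simplified sketch of the read-through call; :page_cache, MyApp.Post and
  # MyApp.Repo are placeholders rather than our real module names.
  def show(conn, %{"id" => id}) do
    {_status, post} =
      Cachex.fetch(:page_cache, id, fn key ->
        case MyApp.Repo.get(MyApp.Post, key) do
          nil -> {:ignore, nil}
          post -> {:commit, post}
        end
      end)

    render(conn, :show, post: post)
  end
end
```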

There are no errors in my logs.

I suspect I'm running into the unfixed scenario mentioned here: #287 (comment)

Sorry, I don't have any more specific details on how to reproduce this.

whitfin (Owner) commented Dec 7, 2023

> We've got a basic Phoenix app. In our controllers we use Cachex.fetch for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex and hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages loaded again.

Next time, rather than toggling it off, I suggest you check several things about the state of the cache (see the sketch after this list for one way to pull these):

- The key you're trying to read/write
- The size of the cache (the keyspace)
- The memory consumption of the cache
- The memory consumption of your application

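Something like this should surface those (a rough sketch; `:page_cache` and `"some-id"` are placeholders for your own cache and key, and the `inspect` options may vary slightly by version):

```elixir
# Rough sketch of the checks above; :page_cache and "some-id" are placeholders.
{:ok, entry} = Cachex.get(:page_cache, "some-id")              # the key in question
{:ok, keys}  = Cachex.size(:page_cache)                        # size of the keyspace
{:ok, bytes} = Cachex.inspect(:page_cache, {:memory, :bytes})  # memory used by the cache
total_bytes  = :erlang.memory(:total)                          # memory used by the whole VM

IO.inspect({entry, keys, bytes, total_bytes}, label: "cache state")
```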
Rather than build a toggle to turn it on and off, I would suggest that you enable some sort of logging so we can see exactly where things are getting stuck. It would also be helpful to know the version you are using.

> I suspect I'm running into the unfixed scenario mentioned here: #287 (comment)

Maybe, but I was given no reproduction case in that issue, and I still have no reproduction case in this one; I'm not entirely sure how I'm expected to magically fix these things!

> Since it seems like Cachex is unreliable, we're going to have to rip it out and build our own simple caching with ETS.

There was no real need for this comment... 🙂 I understand you're probably frustrated because your app isn't working, but filing a request for help in this tone isn't particularly productive.

Cachex is proven in production use in hundreds of systems (including several government systems). It's either something specific to your implementation, or a bug which can be fixed. I would recommend against building your own implementation until you understand exactly what is happening, because you'll likely just hit the same problem.

ericlathrop (Author) commented Dec 7, 2023 via email

@ericlathrop (Author)

I edited the description to remove my offensive comment. Sorry again.

whitfin (Owner) commented Dec 7, 2023

> Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.

No problem, totally get it - hopefully I can do at least something here to help!

> The key is just the ID taken from the page URL, just a simple string.

I meant more to see whether there's a trend in the keys triggering the issue; there shouldn't be, but I'm really curious about the "after about a day" part. That implies something specific is setting it off, so the keys (or maybe the data behind them) could be a good place to start.

> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.

Yeah, makes perfect sense! Definitely would have done the same.

> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.

Yeah, this is the most interesting thing here; it seems like nothing crashes, it just gets stuck. The only thing that can do this is the background process handling fetch calls, but I'm kinda at a loss as to why that would happen.

> Are there any other things I can log to debug this?

I think if you log before you call fetch, log inside your fetch function, and log afterwards, it might give me somewhere to start. It'll at least tell me where it's getting stuck, which is likely going to be after the first two logs but before the third.
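Something along these lines (a rough sketch; the cache, schema, and Repo names are placeholders for your own modules):

```elixir
defmodule MyApp.CachedPosts do
  # Rough sketch of the suggested logging; :page_cache, MyApp.Repo and
  # MyApp.Post are placeholders for the app's own modules.
  require Logger

  def fetch_with_logging(key) do
    Logger.info("before Cachex.fetch key=#{inspect(key)}")

    result =
      Cachex.fetch(:page_cache, key, fn fetch_key ->
        Logger.info("inside fetch fallback key=#{inspect(fetch_key)}")
        {:commit, MyApp.Repo.get(MyApp.Post, fetch_key)}
      end)

    Logger.info("after Cachex.fetch result=#{inspect(result)}")
    result
  end
end
```

If things are stuck inside Cachex, you should see the first log (and possibly the second) but never the third.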

What version of Cachex are you using? And is this totally reproducible for you? If we e.g. found a fix, would you be able to tell for sure if it helped? I'm also more than happy to try your exact code flow; if you don't want to expose this on GitHub, I'm reachable on the Elixir Slack or via the email on my GitHub profile.

In the other issue you linked, I spent some time staring at the Courier service inside Cachex, which is where this is most likely coming from - but I couldn't find anything specific. I rewrote it a bit to hopefully clean it up, but it seems there may still be a problem. I'm definitely willing to put the time in to figure this out, I'm just lacking an environment where I can reproduce the issue!

Edit: even if I can't get a reproduction case, I might try to clean up the Courier further. There are definitely a few weird cases of behaviour in there I can make more robust - it should behave effectively the same, but it might help clear up some edge cases.

@ericlathrop (Author)

We're using cachex 3.6.0.

I added the logging you asked for.

I built a way into my app to clear a specific cache at runtime. I'll see if clearing a cache helps anything.
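That's presumably just a call to Cachex.clear/1 under the hood (the cache name below is a placeholder):

```elixir
# Removes every entry from the named cache.
Cachex.clear(:page_cache)   # => {:ok, number_of_entries_removed}
```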

I'm going to try to build a simpler test case, but it might take me a little while.

whitfin (Owner) commented Dec 11, 2023

@ericlathrop got it, so my changes in v3.6 didn't help - I'll spend some more time on it. I'm flying a lot this week, but maybe I can use some of that time to take a longer look!

No rush, whenever you get chance just ping me back here!

whitfin added the bug label Dec 11, 2023
whitfin self-assigned this Dec 11, 2023
whitfin added this to the v3.7.0 milestone Dec 24, 2023
whitfin (Owner) commented Dec 24, 2023

@ericlathrop I'm pretty sure I've located the cause here - somehow the child processes are being killed, and because the spawn is not linked to the Courier, it's not cleaned up properly. I couldn't tell you why it's dying; probably something specific to your app (a network issue to the database, maybe?).
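To illustrate the failure mode in isolation (a generic sketch, not the actual Courier code): a caller waiting on a reply from a worker that was spawned without a link or monitor will block forever once that worker is killed.

```elixir
# Generic illustration of the hang, not Cachex's actual Courier code.
# The parent waits for a reply from an unlinked, unmonitored worker; when
# that worker is killed, no reply and no :DOWN/:EXIT signal ever arrive.
parent = self()

worker =
  spawn(fn ->
    Process.sleep(1_000)             # stand-in for a slow database call
    send(parent, {:loaded, :value})
  end)

Process.exit(worker, :kill)          # simulate the child dying mid-load

receive do
  {:loaded, value} -> value          # blocks forever once the worker is dead
end
```

Spawning with spawn_monitor/1 (or linking) instead gives the waiting process a :DOWN/:EXIT signal it can turn into an error rather than hanging.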

I wrote a test case for this, and it hangs. I made some changes (which will probably be in main by the time you read this), and it no longer hangs. I'm not sure it's fully cleaned up yet, but you should be able to test it out and see if you can still reproduce the issue. For reference, the PR is #320. I'm also considering adding logging throughout, so you can toggle on logs inside Cachex itself to help debug stuff like this in future.

Just to set expectations properly: this should stop the hanging issue. It could be that fixing this gives you back an error that still needs follow-up investigation on my part (depending on why the child process is dying). This is why I'm not releasing these changes just yet, in case they need further work based on what we uncover.

Let me know whenever you get some time, happy holidays!

@ericlathrop (Author)

@whitfin thanks for investigating. I spent a day last week creating a sample project to try to reproduce this. I wasn't able to, but I did see some weird pauses once I got up to around 40 worker processes. I'll post the repo later when I'm at my computer.

@ericlathrop (Author)

I do suspect the problem was related to crashes, because we get periodic Ecto crashes from running out of Postgres connections. I didn't get time to test that in my example repo.
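One thing I could try is guarding the fallback so a database error comes back as a value instead of killing the fallback process mid-fetch (a sketch; the cache, key, schema, and Repo names are placeholders):

```elixir
require Logger

# Sketch only; :page_cache, "some-id", MyApp.Repo and MyApp.Post are placeholders.
# If the Ecto call raises (e.g. a DBConnection timeout from an exhausted pool),
# nothing is cached and an error value comes back to the caller instead of the
# fallback process crashing.
Cachex.fetch(:page_cache, "some-id", fn key ->
  try do
    {:commit, MyApp.Repo.get(MyApp.Post, key)}
  rescue
    error ->
      Logger.error("fetch fallback failed: #{Exception.message(error)}")
      {:ignore, {:error, :db_unavailable}}
  end
end)
```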

whitfin (Owner) commented Dec 24, 2023

@ericlathrop awesome, no rush :) If it is due to crashes, and you have some other crashes happening, then yeah this is likely the cause.

In that case, if you can check whether the errors that are raised are actually useful, I'd appreciate it (it'll probably just bubble up the Ecto error, but it would be nice to verify). If we can confirm this, I'm happy to get it published ASAP!

whitfin reopened this Dec 24, 2023
whitfin (Owner) commented Dec 24, 2023

(reopening just due to GitHub auto close, as we still need to verify)

whitfin (Owner) commented Jan 16, 2024

Just following up @ericlathrop - did you need to check anything on your side, or are we happy that this is solved in main? If the latter, I'll close this out and it'll be included in the 3.7 release.

@ericlathrop (Author)

@whitfin Hey, sorry for the delay. My sample project where I tried to reproduce the error (but wasn't able to) is here: https://gitlab.com/ericlathrop/debug_cachex. It basically spins up a bunch of GenServers that read random records from the database through Cachex.fetch, with random waits in between. If you're familiar with docker-compose, you can spin it up with a database and everything pretty easily.
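The gist of each worker is something like this (simplified from memory; the exact module and schema names in the repo differ):

```elixir
defmodule DebugCachex.Worker do
  @moduledoc "Simplified from memory; the real repo differs in the details."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:read, state) do
    id = Enum.random(1..1_000)

    # Read-through fetch against a random key, mimicking controller traffic.
    Cachex.fetch(:debug_cache, id, fn key ->
      {:commit, DebugCachex.Repo.get(DebugCachex.Item, key)}
    end)

    schedule()
    {:noreply, state}
  end

  # Random wait between reads to spread out the load.
  defp schedule, do: Process.send_after(self(), :read, Enum.random(100..2_000))
end
```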

Sorry I wasn't able to reproduce the issue, and sorry that I wasn't able to try out the new version either. I switched our app to a different caching library, which works for our usage, and it would be problematic to switch back and redeploy just to see if it hangs again. Feel free to close this issue if you want.

whitfin closed this as completed Jan 25, 2024
whitfin modified the milestones: v3.7.0, v4.0.0 Mar 23, 2024