Improve dist diagnostics #5013

Merged
merged 1 commit into from
Mar 24, 2024
@nickva nickva commented Mar 23, 2024

  • Monitor node membership in mem3_distribution and log unexpected nodedown reasons.

  • Enhance the mem3_distribution gen_server to keep track of the last few events for each node. This can help detect nodes that are flapping or disconnecting often.

  • Expose the last few events from mem3_distribution in the .../_system endpoint, alongside the existing dist packet stats metrics.

  • Add ping_nodes(...) debug function. Use it to ping all connected nodes. This can help detect a still connected but slow network connection.

  • Add dead_nodes(...) debug function. Use it to detect a partitioned cluster, where some nodes may not be connected to each other.
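
To make the idea behind the two debug functions concrete, here is a minimal sketch and not the code merged in this PR. The module name `dist_debug_sketch`, the function arities, and the return shapes are all assumptions made for illustration; only the standard OTP calls `nodes/0`, `net_adm:ping/1` and `timer:tc/1` are real.

```erlang
-module(dist_debug_sketch).
-export([ping_nodes/0, ping_nodes/1, dead_nodes/2]).

%% Hypothetical sketch, NOT the mem3 implementation.
%% Ping every currently connected node and time each round trip.
ping_nodes() ->
    ping_nodes(nodes()).

%% Returns [{Node, MicroSecs, pong | pang}], slowest responses first,
%% so a connected-but-slow link shows up at the head of the list.
ping_nodes(Nodes) when is_list(Nodes) ->
    Results = [ping(Node) || Node <- Nodes],
    lists:reverse(lists:keysort(2, Results)).

ping(Node) ->
    %% timer:tc/1 returns {ElapsedMicroSecs, Value}.
    {MicroSecs, Res} = timer:tc(fun() -> net_adm:ping(Node) end),
    {Node, MicroSecs, Res}.

%% Expected :: [node()], the nodes that should all see each other.
%% Views    :: [{node(), [node()] | {badrpc, term()}}], what each node
%%             reported from nodes(), however that was collected.
%% Returns the sorted, de-duplicated nodes missing from at least one view,
%% plus any node whose view could not be fetched at all.
dead_nodes(Expected, Views) ->
    Missing = lists:flatmap(fun(View) -> missing(Expected, View) end, Views),
    lists:usort(Missing).

missing(_Expected, {Node, {badrpc, _Reason}}) ->
    [Node];
missing(Expected, {Node, Connected}) when is_list(Connected) ->
    [E || E <- Expected, E =/= Node, not lists:member(E, Connected)].
```

With this sketch, `dist_debug_sketch:dead_nodes([a@h, b@h, c@h], [{a@h, [b@h]}, {b@h, [a@h, c@h]}])` evaluates to `[c@h]`: node `a@h` cannot see `c@h` even though `b@h` can, which is the partial-partition case the `dead_nodes(...)` bullet describes.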

@jaydoane jaydoane left a comment

Very nice improvement!

@nickva nickva merged commit e75b98f into main Mar 24, 2024
14 checks passed
@nickva nickva deleted the improve-dist-debugging branch March 24, 2024 02:18