[FLINK-1501] Add metrics library for monitoring TaskManagers #421
You are the man, Robert! |
Indeed :-) What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator. |
This looks great! ^^ |
This is much needed monitoring and really great! How about showing small indicators (load, % non-Flink-managed heap, GC interval) with a simple color coding (red for hot, blue for cool). This would help to find TMs which are more loaded than others. The detailed view could be opened by clicking on the TMs. But we do not need to get the perfect solution at once. |
Thanks everybody for the positive feedback!
On the OS load: https://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages I totally agree that the OS load is not a very good metric for our purposes. Another way to get the CPU load of the process would be parsing the output of
No, you cannot choose which three TMs.
I totally agree that we should do small incremental improvements. I have started working on the "per-job" monitoring and found that I have to change some details of this PR as well. |
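To make the difference between the OS load average and the CPU load of the process concrete, here is a minimal Java sketch. It is not the code from this PR. The class name `LoadProbe` and the helpers `loadPerCore` and `processCpuLoad` are hypothetical, and the per-process value relies on the `com.sun.management` extension, which is assumed to be available (it is on HotSpot/OpenJDK, but it is not part of the standard `java.lang.management` API).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper class, for illustration only.
public class LoadProbe {

    /** Normalizes a load average by core count; a negative result means "unavailable". */
    static double loadPerCore(double loadAverage, int cores) {
        if (loadAverage < 0 || cores <= 0) {
            return -1.0;
        }
        return loadAverage / cores;
    }

    /** CPU load of this JVM process in [0, 1], or a negative value if it is not exposed. */
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: the JVM ships the com.sun.management extension bean.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        return -1.0;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // System-wide 1-minute load average; negative where unsupported (e.g. Windows).
        double load = os.getSystemLoadAverage();
        int cores = os.getAvailableProcessors();
        System.out.println("OS load average (1 min): " + load);
        System.out.println("Load per core: " + loadPerCore(load, cores));
        System.out.println("Process CPU load: " + processCpuLoad());
    }
}
```

The load average counts runnable processes across the whole machine, so it only becomes comparable between TaskManagers after dividing by the core count, whereas the process CPU load refers to the JVM itself.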
Looks really nice and informative 👍 Some suggestions:
|
Thanks for the detailed response. |
@fhueske The user could do so by selecting "Show all task managers" and then identify the struggling task manager. For large cluster setups, it makes sense to sample just from a few task managers. |
I am not sure about that. Would you like to scroll through, say, 100 detailed charts where only 5 fit on a screen to check whether there is one or more misbehaving nodes? Having a small overview with the possibility for detailed analysis is the way to go in the long run, IMO. |
I didn't fully understand that you wanted to have one chart containing all task managers' load. That's a good thing. If it is only one chart, the overhead to update it should not be as high as creating a chart for every task manager (like it is now). |
Hey, I am one of the IIT Mandi students contributing to Flink. I would like to pick up a task pertaining to the monitoring enhancements. Robert, what is the status on the per job monitoring task? Are there any further changes to this PR? |
Hey @bhatsachin, I've started working on the per-job monitoring, but it's currently a work in progress and I have not found time to finish it yet. If you are interested in working on the topic, I would actually suggest enhancing the monitoring I've added in this pull request (the TaskManager monitoring). If nobody has any objections, I would like to merge this change in the next 24 hours. |
Hey Robert, could you please list the further monitoring enhancements required for this pull request? And are there any objections to the merge? |
No, there are no objections. I think the next steps are:
|
I rebased the code to the current master in the branch "flink1501-rebased" in my github (https://github.com/rmetzger/flink/tree/flink1501-rebased). |
Hey @bhatsachin, If you want, we can do a quick hangout or skype call to discuss potential contributions from your side. |
I've filed a JIRA for the changes requested here: https://issues.apache.org/jira/browse/FLINK-1792 |
Thanks a lot Robert, let us have a hangout session tomorrow (Tuesday). Please suggest a time of your convenience. My other friends from IIT Mandi will also join in. |
Cool. |
Great. 17:00 India Time Zone (UTC+05:30) would be perfect. |
Confirmed ;) Looking forward to talking to you tomorrow. My google hangout id is [email protected]. |
@rmetzger how can I view these metrics? Do I need JMX to be enabled for viewing these metrics? As of now I see only a few numbers in the metrics tab for the TaskManager |
Yes, by default metrics are published via JMX. But you can also On Tue, Jul 12, 2016 at 12:02 AM, Sumit Chawla [email protected]
|
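For readers wondering what "published via JMX" looks like from the reading side, here is a minimal Java sketch that queries the platform MBean server of the current JVM. It reads the standard `java.lang:type=Memory` bean only; it does not show Flink's own metric names, which are not given in this thread. The class name `JmxHeapReader` and the method `usedHeapBytes` are hypothetical.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical helper class, for illustration only.
public class JmxHeapReader {

    /** Reads the used heap (in bytes) of the current JVM from the platform MBean server. */
    static long usedHeapBytes() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Standard JVM memory bean; not a Flink-specific metric.
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            CompositeData usage =
                    (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (Exception e) {
            // Wrap checked JMX exceptions so callers do not have to declare them.
            throw new IllegalStateException("Could not read heap usage via JMX", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Used heap (bytes): " + usedHeapBytes());
    }
}
```

A tool such as JConsole shows the same beans interactively; reading them from another machine would additionally need remote JMX to be enabled on the target JVM.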
@sumitchawla: These charts have been removed from the TaskManager view due to licensing issues. |
Hey,
I've spent some time exploring the metrics library for improving the performance monitoring in Flink.
This pull request is a first step in that direction. The primary objective is a clean integration of the JVM monitoring into our system.
I spent probably 80% of the time making the JavaScript frontend work. For that, I've used Rickshaw, a library also used by projects like Apache Ambari for creating nice graphs.
Still, the visualization is not perfect and I would like to see incremental improvements there.
The next step for me will be metrics for individual jobs.