[FLINK-1501] Add metrics library for monitoring TaskManagers #421

rmetzger · 2015-02-19T14:27:25Z

Hey,
I've spent some time exploring the metrics library to improve performance monitoring in Flink.

This pull request is a first step in that direction. The primary objective is a clean integration of the JVM monitoring into our system.

I spent probably 80% of the time making the JavaScript frontend work. For that, I used Rickshaw, a library also used by projects like Apache Ambari for creating nice graphs.
Still, the visualization is not perfect and I would like to see incremental improvements there.

The next step for me will be metrics for individual jobs.

hsaputra · 2015-02-19T17:03:19Z

You are the man, Robert!

tillrohrmann · 2015-02-20T09:20:54Z

Indeed :-)

What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

vasia · 2015-02-20T10:32:40Z

This looks great! ^^

fhueske · 2015-02-20T20:27:59Z

This is much needed monitoring and really great!
What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

How about showing small indicators (load, % non-Flink-managed heap, GC interval) with a simple color coding (red for hot, blue for cool)? This would help to find TMs which are more loaded than others. The detailed view could be opened by clicking on a TM.

But we do not need to get the perfect solution at once.
How about we open a document, sketch the design of the monitoring, and create smaller PRs to get there step by step?
This PR is definitely a huge step in the right direction!

rmetzger · 2015-02-23T14:06:02Z

Thanks everybody for the positive feedback!

> What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

On the OS load: https://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

I totally agree that the OS load is not a very good metric for our purposes.
The reason why I didn't try to get better metrics for this is that I didn't want to play "ugly tricks" to get them.
My code gets the metrics only via the management beans. The standard OperatingSystemMXBean exposes only the system load average and the number of processor cores:
https://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()
There is an extended version of the OperatingSystemMXBean (https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html) which also exposes methods like getProcessCpuLoad().
But the availability of this management bean depends on the JVM version in use, etc.

Another way to get the CPU load of the process would be parsing the output of ps or top. But that also falls into the category of "ugly tricks".
I think we should aim for getting those metrics into the system as well. Adding them is a matter of registering another Gauge in the TaskManager's metrics registry and visualizing the JSON output.
I hope that these kinds of refinements are done by external contributors.
Once this PR has been merged, I'll file a JIRA to improve the CPU monitoring.
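
For illustration, here is a minimal sketch of how both values could be read through the management beans without a compile-time dependency on the com.sun class. CpuLoadProbe and its method names are hypothetical and not code from this PR; the extended bean is looked up reflectively, and -1.0 is returned when it is missing. Registering the values as a Gauge in the metrics registry is left out.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;

// Hypothetical helper for illustration only; not part of this pull request.
final class CpuLoadProbe {

    private static final OperatingSystemMXBean OS_BEAN =
            ManagementFactory.getOperatingSystemMXBean();

    private CpuLoadProbe() {}

    /** System load average for the last minute; negative if the platform does not provide it. */
    static double systemLoadAverage() {
        return OS_BEAN.getSystemLoadAverage();
    }

    /** Number of processors available to the JVM. */
    static int availableProcessors() {
        return OS_BEAN.getAvailableProcessors();
    }

    /**
     * CPU load of this JVM process in [0.0, 1.0], or -1.0 if the extended
     * com.sun.management bean is missing or reports no usable value.
     */
    static double processCpuLoad() {
        try {
            // Resolve the extended interface by name so this class still loads on JVMs without it.
            Class<?> extended = Class.forName("com.sun.management.OperatingSystemMXBean");
            if (!extended.isInstance(OS_BEAN)) {
                return -1.0;
            }
            Method getter = extended.getMethod("getProcessCpuLoad");
            double value = (Double) getter.invoke(OS_BEAN);
            // The bean reports a negative value when the load is not available yet.
            return (value >= 0.0 && value <= 1.0) ? value : -1.0;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1.0;
        }
    }

    public static void main(String[] args) {
        System.out.println("processors:       " + availableProcessors());
        System.out.println("system load avg:  " + systemLoadAverage());
        System.out.println("process CPU load: " + processCpuLoad());
    }
}
```

The reflective lookup is a design choice for the sketch: it keeps the code within the "management beans only" approach while degrading to -1.0 on JVMs that do not ship the extended bean.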

> What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

No, you cannot choose which three TMs.
I added these buttons because on a large Flink cluster (50+ nodes), updating all the charts puts quite some load on the browser. Usually it's sufficient to monitor the load of only a few TMs, because (ideally) they are all doing mostly the same work.
But I agree that there is room for improvement.

> How about we open a document, sketch the design of the monitoring, and create smaller PRs to get there step by step?

I totally agree that we should do small incremental improvements.
As I said in the PR description, the primary purpose of this PR is to get the basic monitoring infrastructure in place; how we present the data in the end is subject to further PRs.

I have started working on the "per-job" monitoring and found that I have to change some details of this PR as well.
Depending on my progress on the "per-job" monitoring, I might contribute the changes here together with the "per-job" metrics. If I don't have enough time this week to open a PR for the per-job metrics, I'll merge this change to master.

mxm · 2015-02-23T14:50:58Z

Looks really nice and informative 👍

Some suggestions:

It would be great if one could specify the number of task managers to see. If the number of task managers shown is smaller than the total number of task managers, there should be a "shuffle" button to show a random selection of task managers.
Add some information about the different memory statistics. The labels might not be intuitive for the average user.
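
The "shuffle" suggestion above could be sketched roughly as follows. This is only an illustration of the selection logic, written in Java for brevity; the actual frontend in this PR is JavaScript, and TaskManagerSampler is a hypothetical name that does not exist in Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of this pull request.
final class TaskManagerSampler {

    private TaskManagerSampler() {}

    /**
     * Returns up to {@code count} distinct elements of {@code all} in random order.
     * The input list is left untouched; pressing "shuffle" would call this again.
     */
    static <T> List<T> sample(List<T> all, int count, Random random) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        List<T> copy = new ArrayList<>(all);   // work on a copy so the caller's order is preserved
        Collections.shuffle(copy, random);     // uniform random permutation
        return new ArrayList<>(copy.subList(0, Math.min(count, copy.size())));
    }

    public static void main(String[] args) {
        List<String> taskManagers = Arrays.asList("tm-1", "tm-2", "tm-3", "tm-4", "tm-5");
        System.out.println(sample(taskManagers, 3, new Random()));
    }
}
```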

fhueske · 2015-02-23T15:25:24Z

Thanks for the detailed response.
I am not sure how helpful it is to show three random TMs (incl. a shuffle button to show other random ones). I think it is not uncommon that one or a few nodes are struggling (data skew, hardware problems, ...), and it would IMO be very cool if a user could quickly identify such a node and get the detailed stats.

mxm · 2015-02-23T15:39:04Z

@fhueske The user could do so by selecting "Show all task managers" and then identify the struggling task manager. For large cluster setups, it makes sense to sample just from a few task managers.

fhueske · 2015-02-23T16:38:50Z

I am not sure about that. Would you like to scroll through, say, 100 detailed charts where only 5 fit on a screen to check whether there are one or more misbehaving nodes?
Three random nodes don't tell you a lot. There might always be another misbehaving one that you would have to look for whenever you suspect something is not working right.

Having a small overview with the possibility for detailed analysis is the way to go in the long run, IMO.

mxm · 2015-02-23T18:26:26Z

I didn't fully understand that you wanted to have one chart containing all task managers' load. That's a good thing. If it is only one chart, the overhead to update it should not be as high as creating a chart for every task manager (like it is now).

bhatsachin · 2015-03-21T08:40:10Z

Hey, I am one of the IIT Mandi students contributing to Flink. I would like to pick up a task pertaining to the monitoring enhancements. Robert, what is the status on the per job monitoring task? Are there any further changes to this PR?

rmetzger · 2015-03-23T10:26:19Z

Hey @bhatsachin, I've started working on the per-job monitoring, but it's currently a work in progress and I haven't found time to finish it yet.

If you are interested in working on the topic, I would actually suggest enhancing the monitoring I've added in this pull request (the TaskManager monitoring).

If nobody has any objections, I would like to merge this change in the next 24 hours.

bhatsachin · 2015-03-26T09:41:39Z

Hey Robert, could you please list the further monitoring enhancements required for this pull request? And are there any objections to the merge?

rmetzger · 2015-03-26T10:21:19Z

No, there are no objections.
I'll rebase this pull request to the current master and merge it later today.

I think the next steps are:

Get the CPU utilization in % from each TaskManager process
Remove the metrics graph from the overview and only show the current stats as numbers (CPU load, heap utilization), and add a button to enable the detailed graph.

rmetzger · 2015-03-26T13:40:43Z

I rebased the code to the current master in the branch "flink1501-rebased" in my github (https://github.com/rmetzger/flink/tree/flink1501-rebased).
As soon as the tests pass, I'll merge it to master: https://travis-ci.org/rmetzger/flink/builds/55949997

rmetzger · 2015-03-27T10:41:53Z

Hey @bhatsachin,
I've merged the change to master.

If you want, we can do a quick Hangout or Skype call to discuss potential contributions from your side.

rmetzger · 2015-03-27T10:45:04Z

I've filed a JIRA for the changes requested here: https://issues.apache.org/jira/browse/FLINK-1792

bhatsachin · 2015-03-30T18:07:35Z

Thanks a lot Robert, let us have a hangout session tomorrow (Tuesday). Please suggest a time of your convenience. My other friends from IIT Mandi will also join in.

rmetzger · 2015-03-30T18:27:29Z

Cool.
I'm available between 16:30 - 20:30 India Time Zone (UTC+05:30). Is that possible for you?

bhatsachin · 2015-03-30T19:15:46Z

Great. 17:00 India Time Zone (UTC+05:30) would be perfect.

rmetzger · 2015-03-30T19:20:33Z

Confirmed ;)

Looking forward to talking to you tomorrow. My Google Hangouts ID is [email protected].

This closes apache#421

sumitchawla · 2016-07-11T22:02:36Z

@rmetzger How can I view these metrics? Do I need JMX enabled to view them? As of now I see only a few numbers in the metrics tab for the TaskManager.

tillrohrmann · 2016-07-13T07:35:48Z

Yes, by default metrics are published via JMX. But you can also configure a Ganglia, Graphite or StatsD reporter.

rmetzger · 2016-07-13T07:40:41Z

@sumitchawla: These charts have been removed from the TaskManager view due to licensing issues.

[FLINK-1501] Add metrics library for monitoring TaskManagers

13d1715

asfgit closed this in 2d1f8b0 Mar 27, 2015

bhatsachin pushed a commit to bhatsachin/flink that referenced this pull request May 5, 2015

[FLINK-1501] Add metrics library for monitoring TaskManagers

b94a108

This closes apache#421

marthavk pushed a commit to marthavk/flink that referenced this pull request Jun 9, 2015

[FLINK-1501] Add metrics library for monitoring TaskManagers

39bbd85

This closes apache#421

rmetzger added the component=Runtime/Coordination label Mar 14, 2019
