[FLINK-1501] Add metrics library for monitoring TaskManagers #421

Closed
wants to merge 1 commit into from


rmetzger
Contributor

Hey,
I've spent some time exploring the metrics library for improving the performance monitoring in Flink.

This pull request is a first step in that direction. The primary objective is a clean integration of the JVM monitoring into our system.

I spent probably 80% of the time making the JavaScript frontend work. For that, I've used rickshaw, a library also used by projects like Apache Ambari for creating nice graphs.
Still, the visualization is not perfect and I would like to see incremental improvements there.

The next step for me will be metrics for individual jobs.

[Screenshot: newmonitoring]

@hsaputra
Contributor

You are the man, Robert!

@tillrohrmann
Contributor

Indeed :-)

What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

@vasia
Contributor

vasia commented Feb 20, 2015

This looks great! ^^

@fhueske
Contributor

fhueske commented Feb 20, 2015

This is much needed monitoring and really great!
What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

How about showing small indicators (load, % non-Flink-managed heap, GC interval) with a simple color coding (red for hot, blue for cool)? This would help to find TMs which are more loaded than others. The detailed view could be opened by clicking on a TM.

But we do not need to get the perfect solution at once.
How about we open a document, sketch the design of the monitoring, and create smaller PRs to get there step by step?
This PR is definitely a huge step in the right direction!

@rmetzger
Contributor Author

Thanks everybody for the positive feedback!

> What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

On the OS load: https://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

I totally agree that the OS load is not a very good metric for our purposes.
The reason why I didn't try to get better metrics for this is that I didn't want to play "ugly tricks" to get them.
My code is getting the metrics only via the management beans. The OperatingSystemMXBean is only exposing the load and the number of processor cores:
https://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()
There is another implementation of the OperatingSystemMXBean (https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html) which is also exposing stuff like getProcessCpuLoad().
But the availability of this management bean depends on the used JVM version etc.
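
To make the difference between the two beans concrete, here is a minimal sketch (not the code in this PR). The class `JvmLoadProbe` and its method names are made up for illustration. The standard bean is read directly, while the `com.sun.management` extension is looked up reflectively so the code still runs, and reports -1.0, on a JVM that does not ship it. Dividing the load average by the core count is just one common way to normalize it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;

/** Hypothetical helper: reads OS metrics through the management beans only. */
final class JvmLoadProbe {

    /** Number of processors reported by the standard OperatingSystemMXBean. */
    static int availableProcessors() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    /** One-minute system load average divided by the core count, or -1.0 if unavailable. */
    static double loadPerCore() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform cannot report it
        return load < 0 ? -1.0 : load / os.getAvailableProcessors();
    }

    /**
     * CPU load of this JVM process, or -1.0 when the com.sun.management
     * extension is missing or cannot determine the value.
     */
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        try {
            // Looked up by name so there is no compile-time dependency on the extension.
            Class<?> ext = Class.forName("com.sun.management.OperatingSystemMXBean");
            if (ext.isInstance(os)) {
                Method m = ext.getMethod("getProcessCpuLoad");
                double value = (Double) m.invoke(os);
                return value < 0 ? -1.0 : value;
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Extension not present or not accessible on this JVM: fall through.
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("cores         = " + availableProcessors());
        System.out.println("load per core = " + loadPerCore());
        System.out.println("process CPU   = " + processCpuLoad());
    }
}
```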

Another way to get the CPU load of the process would be parsing the output of ps or top. But that also falls into the category of "ugly tricks".
I think we should aim for getting those metrics into the system as well. Adding them is a matter of registering another Gauge in the TaskManager's metrics registry and visualizing the JSON output.
I hope that these kinds of refinements are done by external contributors.
Once this PR has been merged, I'll file a JIRA to improve the CPU monitoring.
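
As a rough illustration of what "registering another Gauge" and exposing it as JSON amounts to, here is a self-contained sketch. It deliberately does not use the metrics library's real API: `GaugeRegistrySketch`, `register`, and `toJson` are stand-ins invented for this example, and the gauge names are arbitrary.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a metrics registry that holds named gauges. */
final class GaugeRegistrySketch {

    // Gauge name -> value supplier. TreeMap keeps the JSON output in a stable order.
    private final Map<String, Supplier<Number>> gauges = new TreeMap<>();

    /** Registers a gauge; the supplier is evaluated each time the registry is serialized. */
    void register(String name, Supplier<Number> gauge) {
        if (gauges.putIfAbsent(name, gauge) != null) {
            throw new IllegalArgumentException("Gauge already registered: " + name);
        }
    }

    /**
     * Renders the current gauge values as a flat JSON object.
     * Simplification: names are assumed to need no escaping and values to be finite.
     */
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        String separator = "";
        for (Map.Entry<String, Supplier<Number>> entry : gauges.entrySet()) {
            sb.append(separator)
              .append('"').append(entry.getKey()).append("\":")
              .append(entry.getValue().get());
            separator = ",";
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        GaugeRegistrySketch registry = new GaugeRegistrySketch();
        registry.register("cpu.cores",
                () -> ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors());
        registry.register("memory.heap.used",
                () -> ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
        System.out.println(registry.toJson());
    }
}
```

A new CPU metric would then be one more `register` call, and the frontend only has to read one more key from the JSON.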

> What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

No, you cannot choose which three TMs.
I added these buttons because with a large Flink cluster (50+ nodes), updating all the charts puts quite some load on the browser. Usually it's sufficient to monitor the load of only a few TMs, because they are (ideally) doing mostly the same work.
But I agree that there is room for improvement.

> How about we open a document and sketch the design of the monitoring and create smaller PRs to get there step-by-step.

I totally agree that we should do small incremental improvements.
As I said in the PR description, the primary purpose of this PR is to get the basic monitoring infrastructure in place. How we present the data in the end is subject to further PRs.

I have started working on the "per-job" monitoring and found that I have to change some details of this PR as well.
Depending on my progress on the "per-job" monitoring, I might contribute the changes here together with the "per-job" metrics. If I don't have enough time this week to open a PR for the per-job metrics, I'll merge this change to master.

@mxm
Contributor

mxm commented Feb 23, 2015

Looks really nice and informative 👍

Some suggestions:

  • It would be great if one could specify the number of task managers to see. If the number of task managers shown is smaller than the total number of task managers, there should be a "shuffle" button to show a random selection of task managers.
  • Add some information about the different memory statistics. The labels might not be intuitive for the average user.

@fhueske
Contributor

fhueske commented Feb 23, 2015

Thanks for the detailed response.
I am not sure how helpful it is to show three random TMs (incl. a shuffle button to show other random ones). I think it is not uncommon that a single node or a few nodes are struggling (data skew, hw problems, ...), and it would IMO be very cool if a user could quickly identify such a node and get the detailed stats.

@mxm
Contributor

mxm commented Feb 23, 2015

@fhueske The user could do so by selecting "Show all task managers" and then identify the struggling task manager. For large cluster setups, it makes sense to sample just from a few task managers.

@fhueske
Contributor

fhueske commented Feb 23, 2015

I am not sure about that. Would you like to scroll through, say, 100 detailed charts where only 5 fit on a screen to check whether there are one or more misbehaving nodes?
Three random nodes don't tell you a lot. There might always be another misbehaving one that you have to check for if you think something is not working right.

Having a small overview with the possibility for detailed analysis is the way to go in the long run, IMO.

@mxm
Contributor

mxm commented Feb 23, 2015

I didn't fully understand that you wanted to have one chart containing all task managers' load. That's a good thing. If it is only one chart, the overhead to update it should not be as high as creating a chart for every task manager (like it is now).

@bhatsachin
Contributor

Hey, I am one of the IIT Mandi students contributing to Flink. I would like to pick up a task pertaining to the monitoring enhancements. Robert, what is the status on the per job monitoring task? Are there any further changes to this PR?

@rmetzger
Contributor Author

Hey @bhatsachin, I've started working on the per-job monitoring, but it's currently a work in progress and I have not found time to finish it yet.

If you are interested in working on the topic, I would actually suggest enhancing the monitoring I've added in this pull request (the TaskManager monitoring).

If nobody has any objections, I would like to merge this change in the next 24 hours.

@bhatsachin
Contributor

Hey Robert, could you please list the further monitoring enhancements required for this pull request? And are there any objections to the merge?

@rmetzger
Contributor Author

No, there are no objections.
I'll rebase this pull request to the current master and merge it later today.

I think the next steps are:

  • Get the CPU utilization in % from each TaskManager process
  • Remove the metrics graph from the overview and only show the current stats as numbers (cpu load, heap utilization) and add a button to enable the detailed graph.

@rmetzger
Contributor Author

I rebased the code onto the current master in the branch "flink1501-rebased" in my GitHub fork (https://github.com/rmetzger/flink/tree/flink1501-rebased).
As soon as the tests pass, I'll merge it to master: https://travis-ci.org/rmetzger/flink/builds/55949997

@rmetzger
Contributor Author

Hey @bhatsachin,
I've merged the change to master.

If you want, we can do a quick Hangout or Skype call to discuss potential contributions from your side.

@rmetzger
Contributor Author

I've filed a JIRA for the changes requested here: https://issues.apache.org/jira/browse/FLINK-1792

@asfgit asfgit closed this in 2d1f8b0 Mar 27, 2015
@bhatsachin
Contributor

Thanks a lot Robert, let us have a hangout session tomorrow (Tuesday). Please suggest a time of your convenience. My other friends from IIT Mandi will also join in.

@rmetzger
Contributor Author

Cool.
I'm available between 16:30 - 20:30 India Time Zone (UTC+05:30). Is that possible for you?

@bhatsachin
Contributor

Great. 17:00 India Time Zone (UTC+05:30) would be perfect.

@rmetzger
Contributor Author

Confirmed ;)

Looking forward to talking to you tomorrow. My Google Hangout id is [email protected].

bhatsachin pushed a commit to bhatsachin/flink that referenced this pull request May 5, 2015
marthavk pushed a commit to marthavk/flink that referenced this pull request Jun 9, 2015
@sumitchawla
Contributor

@rmetzger How can I view these metrics? Do I need JMX to be enabled to view them? As of now I see only a few numbers in the metrics tab for the TaskManager.

@tillrohrmann
Contributor

Yes, by default metrics are published via JMX. But you can also configure a Ganglia, Graphite or StatsD reporter.

On Tue, Jul 12, 2016 at 12:02 AM, Sumit Chawla [email protected] wrote:

> @rmetzger how can i view these metrics? Do i need JMX to be enabled for viewing these metrics? As of now i see only a few numbers in metrics tab for TaskManager



@rmetzger
Contributor Author

@sumitchawla: These charts have been removed from the TaskManager view due to licensing issues.

8 participants