[BEAM-1672] Accumulable MetricsContainers. #2649

aviemzur · 2017-04-22T16:05:05Z

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

Make sure the PR title is formatted like:
[BEAM-<Jira issue #>] Description of pull request
Make sure tests pass via mvn clean verify. (Even better, enable
Travis-CI on your fork and ensure the whole test matrix passes).
Replace <Jira issue #> in the title with the actual Jira issue
number, if there is one.
If this contribution is large, please file an Apache
Individual Contributor License Agreement.

aviemzur · 2017-04-23T05:15:41Z

R: @bjchambers for Java SDK and Direct runner changes.
R: @aljoscha for Flink runner changes.
R: @staslev for Spark runner changes.

aljoscha · 2017-04-23T10:47:51Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

- DistributionData update = metricUpdate.getUpdate();
-
- // update flink metric
- FlinkDistributionGauge gauge = flinkDistributionGaugeCache.get(flinkMetricName);


It looks like this removes the support for exporting Beam Metrics to Flink Metrics. (The same holds for updateCounters() and updateGauge().)

Unfortunately this is not tested right now since this is somewhat hard to do. I created a Jira issue for this: https://issues.apache.org/jira/browse/BEAM-2056

Yeah, missed this part. I'll try to implement a similar solution to what is done with Spark's metrics sink.

We use Spark's implementation of MetricResults and report these metrics to Spark's metrics sink.
https://github.com/apache/beam/blob/v0.6.0/runners/spark/src/main/java/org/apache/beam/runners/spark/metrics/WithMetricsSupport.java
https://github.com/apache/beam/blob/v0.6.0/runners/spark/src/main/java/org/apache/beam/runners/spark/metrics/SparkBeamMetric.java

In this PR this is now done with AccumulatedAttemptedMetricResults. I'll try to use this to report these to Flink metrics similarly to how we report these to Spark's metrics sink.

bjchambers · 2017-04-24T16:33:40Z

runners/direct-java/src/main/java/org/apache/beam/runners/direct/TransformExecutor.java

- // Report the physical metrics from the end of this step.
- context.getMetrics().commitPhysical(inputBundle, metricsContainer.getCumulative());
-
+ synchronized (context) {


This shouldn't need to synchronize.

Synchronizing when there are multiple threads executing will be expensive.

Each processing thread should have a dedicated evaluation context and a dedicated metrics container, and only the processing thread should be writing to it.

The metrics container (I believe) uses Atomic values, so that reading will receive a reasonable value.

The reason for synchronization is exactly what you remarked on, the evaluation context is shared between several threads.

The existing metrics implementation synchronizes for this reason as well; see DirectMetrics.java.

I agree that each thread having its own evaluation context would be better but I don't think this should be changed as part of this PR.

I will see if there is a different point in the flow in which the containers can be merged without synchronization.

Ah, I forgot this was in the DirectRunner. That sounds reasonable, although @tgroh may have an idea if there is somewhere better to merge containers.
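To illustrate the single-writer argument made earlier in this thread, here is a minimal sketch. It assumes a hypothetical SketchCounterCell backed by an AtomicLong; it is not Beam's CounterCell, and the names are illustrative only. Only the owning thread writes, while any other thread may read a tentative value without locking.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a counter cell; not Beam's CounterCell. */
final class SketchCounterCell {
  private final AtomicLong value = new AtomicLong();

  /** Intended to be called only by the owning processing thread. */
  void inc(long n) {
    value.addAndGet(n);
  }

  /** Safe to call from any thread; returns the most recently published value. */
  long getCumulative() {
    return value.get();
  }
}

final class SingleWriterDemo {
  /** Runs one writer thread and reads from the calling thread without any lock. */
  static long run() {
    SketchCounterCell cell = new SketchCounterCell();
    Thread writer = new Thread(() -> {
      for (int i = 0; i < 1000; i++) {
        cell.inc(1);
      }
    });
    writer.start();
    // Tentative read while the writer may still be running: some value in [0, 1000].
    long tentative = cell.getCumulative();
    try {
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    long finalValue = cell.getCumulative();
    if (tentative > finalValue) {
      throw new IllegalStateException("counter went backwards");
    }
    return finalValue;
  }
}
```

This only covers reads of a single cell. Merging a whole container into shared state, as the synchronized block in this diff does, is a separate question.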

bjchambers · 2017-04-24T16:33:54Z

runners/direct-java/src/main/java/org/apache/beam/runners/direct/TransformExecutor.java

+ finishBundle(evaluator, enforcements);
+
+ synchronized (context) {
+ context.getCommittedMetrics().update(stepName, metricsContainer);


Why does this need to be synchronized?

Same answer as below.

bjchambers · 2017-04-24T16:34:45Z

runners/direct-java/src/main/java/org/apache/beam/runners/direct/TransformExecutor.java

- // Report the physical metrics after each element
- MetricUpdates deltas = metricsContainer.getUpdates();
- if (deltas != null) {
- context.getMetrics().updatePhysical(inputBundle, deltas);


Reporting attempted counters during processing allows a tentative value to be read during processing. Is there a reason we removed that functionality?

To do this with the accumulable metrics containers would mean synchronization after each element is processed (as was done in the existing direct metrics implementation) as well as creating a separate metrics container instance for each element.

If reporting attempted counters during processing is a crucial feature for Direct runner we can return the previous implementation and not use the accumulable version. I think, however, that the reuse of code and reduction of LOC >> this fairly niche feature in the direct runner.

Ideas for other options that preserve functionality and code reduction:

1. Have the DirectRunner report tentative metrics every Nth element (reduces the amount of synchronization).

2. Allow copying the MetricsContainer. Then, the DirectRunner can create a copy and put it on a queue for a separate thread that actually does the update aggregation. This will avoid synchronization.

3. Extending (2): I believe it should be safe for multiple threads to read a MetricsContainer even if it is being modified by one thread, since it uses Atomic values. Given that, we could just pass a reference to the MetricsContainer to the queue, and have the aggregating thread maintain control of the aggregated MetricsContainers.

Options 2 or 3 would also be closest to how this would happen within a distributed runner, where the actual aggregation happens outside of the processing thread.
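A rough sketch of the copy-and-enqueue idea (option 2), under stated assumptions: metrics are reduced to a map of counter deltas, and QueueAggregationSketch, report and drain are hypothetical names, not Beam APIs. Processing threads hand off copies through a concurrent queue; a single aggregating thread folds them into the total.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch; counter deltas stand in for a full MetricsContainer. */
final class QueueAggregationSketch {
  /** Snapshots handed off by processing threads. */
  private final ConcurrentLinkedQueue<Map<String, Long>> pending =
      new ConcurrentLinkedQueue<>();

  /** Touched only by the aggregating thread, so it needs no lock. */
  private final Map<String, Long> aggregated = new HashMap<>();

  /** Processing thread: enqueue a copy so later local changes are not observed. */
  void report(Map<String, Long> deltas) {
    pending.add(new HashMap<>(deltas));
  }

  /** Aggregating thread: drain the queue and sum the deltas per counter name. */
  Map<String, Long> drain() {
    Map<String, Long> snapshot;
    while ((snapshot = pending.poll()) != null) {
      snapshot.forEach((name, delta) -> aggregated.merge(name, delta, Long::sum));
    }
    return new HashMap<>(aggregated);
  }
}
```

The processing thread pays for one map copy per report instead of contending on a shared lock. Option 3 would enqueue a reference rather than a copy, which is only safe if the container's cells can be read concurrently.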

Could this be a separate ticket and PR, or does it block merging this change? Master has already changed and this PR must be rebased again.

This could be done in a way similar to how we update the state of the WatermarkManager: we toss something on a concurrent queue, then try to acquire a lock and update the visible state, but give up if the lock is already held (the pending updates are eventually applied) (e.g. https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/WatermarkManager.java#L912)
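The queue-plus-tryLock pattern described above can be sketched as follows. This is not the WatermarkManager code; TryLockAggregatorSketch and its methods are hypothetical, and a single long total stands in for the visible metrics state.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of "enqueue, then try to apply without blocking". */
final class TryLockAggregatorSketch {
  private final ConcurrentLinkedQueue<Long> pending = new ConcurrentLinkedQueue<>();
  private final ReentrantLock lock = new ReentrantLock();
  private long visibleTotal; // guarded by lock

  /** Processing thread: never blocks on the lock. */
  void report(long delta) {
    pending.add(delta);
    if (lock.tryLock()) {
      try {
        applyPending();
      } finally {
        lock.unlock();
      }
    }
    // If the lock was held, the delta stays queued and is applied by a later
    // report() or read() call.
  }

  /** Reader: blocks for the lock and applies anything still queued first. */
  long read() {
    lock.lock();
    try {
      applyPending();
      return visibleTotal;
    } finally {
      lock.unlock();
    }
  }

  /** Must be called with the lock held. */
  private void applyPending() {
    Long delta;
    while ((delta = pending.poll()) != null) {
      visibleTotal += delta;
    }
  }
}
```

Because read() drains the queue before returning, a reader never misses a delta that was enqueued before it acquired the lock, even if the reporting thread gave up on tryLock.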

That sounds fine, but as we're approaching 2 weeks on this PR's review and the code on master has diverged yet again, I suggest we open this as a future ticket. We can give code references, suggest how this change should be implemented, and perhaps tag it with starter to encourage contributions.

@tgroh WDYT? Can you open such a ticket?

I guess this could be OK, although it is concerning that we are regressing existing behavior as part of a refactoring...

Alternatively we could not change direct runner to use accumulable metrics containers and open a task for someone to do that in a future PR.
I will go with this option.

Sounds good -- created https://issues.apache.org/jira/browse/BEAM-2186 to track direct runner updates.

bjchambers · 2017-04-24T17:16:32Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

-
- String flinkMetricName = getFlinkMetricNameString(COUNTER_PREFIX, metricUpdate.getKey());
- Long update = metricUpdate.getUpdate();
+ Long update = metricResult.attempted();


Q (for my understanding): Are flink accumulators aggregated across all attempts at processing, or only across successfully checkpointed processing?

Actually neither: aggregator values are lost in case of failure so it's "aggregated since last failure".

bjchambers · 2017-04-24T17:18:40Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/AccumulatedAttemptedMetricResults.java

+/**
+ * Implementation of attempted {@link MetricResults} using accumulated {@link MetricsContainers}.
+ */
+public class AccumulatedAttemptedMetricResults extends AccumulatedMetricResults {


What makes this special to attempted metrics? Couldn't it be used for committed metrics as well? Is it just that when given an AttemptedAndCommitted it extracts the Attempted? If yes, would it make sense to parameterize it on that part, so it could be used for either?

bjchambers · 2017-04-24T18:38:02Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/CounterCell.java

- @Override
- public void inc() {
- add(1);
+ public void update(Long n) {


Could also solve this using the approach taken in Sum Long CombineFn. Specifically, instead of MetricCell<Counter, Long> make this MetricCell<Counter, long[]>, where the array is expected to be one element. This avoids the boxing/unboxing.

Alternatively, the boxing/unboxing here may be OK as long as we aren't using a Long as the accumulator/value holder.
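For reference, a small sketch of the one-element-array idea next to the boxed version. LongHolderSketch and its methods are hypothetical names used only for illustration.

```java
/** Hypothetical comparison of a boxed accumulator and a one-element long[] holder. */
final class LongHolderSketch {
  /** Boxed: acc is unboxed, added to n, and the result is boxed into a Long again. */
  static Long addBoxed(Long acc, long n) {
    return acc + n;
  }

  /** Array holder: the single slot is mutated in place, with no boxing. */
  static long[] addInPlace(long[] acc, long n) {
    acc[0] += n;
    return acc;
  }
}
```

The array version trades a slightly awkward type for a mutable primitive slot, which matters mostly when the update runs once per element.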

bjchambers · 2017-04-24T18:39:00Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricCell.java

+ /**
+ * Update value of this cell.
+ */
+ void update(DataT data);


Does this do a set or a merge with the given data? If it does a set, then the method below should be called merge or something else to distinguish the behaviors. Either way, doc should indicate.

It does a set, which is why I chose update over merge, void return type and stated in the Javadoc that it updates the value of the cell.
Is this unclear? Should I use a different method name?

I'd be tempted to call this set or setValue then, or at least update the Javadoc to say that it sets the value of this cell to {@code data} or something like that?

The thing is, it isn't a set: it combines the input with the existing value, which is why I opted for update as the name.
We could change this to combineAndSet or mergeAndSet if that is clearer.
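To make the naming question concrete, a minimal sketch of the two behaviors side by side. UpdateSemanticsSketch is hypothetical and uses a counter-style sum as the combine step; it is not the MetricCell interface from this PR.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical cell contrasting "set" with "combine with the existing value". */
final class UpdateSemanticsSketch {
  private final AtomicLong value = new AtomicLong();

  /** Overwrites the current value with data. */
  void setValue(long data) {
    value.set(data);
  }

  /** Combines data with the current value (here: addition) and stores the result. */
  void mergeAndSet(long data) {
    value.addAndGet(data);
  }

  long get() {
    return value.get();
  }
}
```

With both on one type the difference is visible at the call site, which is the ambiguity the Javadoc for update would need to resolve.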

bjchambers · 2017-04-24T18:40:29Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricsContainers.java

+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * Metrics containers by step.


May be worth calling this something like "MetricsContainerStepMap" or something, to make it clear the behavior this provides?

bjchambers · 2017-04-24T18:42:27Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricsContainers.java

+ }
+
+ public void updateAll(MetricsContainers other) {
+ for (Map.Entry<String, MetricsContainer> container : other.metricsContainers.entrySet()) {


Does this need to do any locking? Either way, should extend the documentation to describe how this should be used by the runner, and what requirements/guarantees it provides in terms of thread safety.

It depends on how the runner uses it. It isn't thread-safe. I'll add Javadoc to reflect this.
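A sketch of the kind of Javadoc and behavior being discussed, assuming a hypothetical StepCountersSketch that keeps one counter per step. It is not the class under review.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical stand-in: one counter per step name.
 *
 * <p>Individual entries are updated atomically, but {@link #updateAll} is a sequence
 * of per-entry merges. A caller that needs a consistent view across steps must
 * provide its own synchronization around updateAll and reads.
 */
final class StepCountersSketch {
  private final Map<String, AtomicLong> countersByStep = new ConcurrentHashMap<>();

  /** Adds n to the counter for the given step, creating it if absent. */
  void inc(String step, long n) {
    countersByStep.computeIfAbsent(step, s -> new AtomicLong()).addAndGet(n);
  }

  /** Merges every step counter of other into this instance, entry by entry. */
  void updateAll(StepCountersSketch other) {
    for (Map.Entry<String, AtomicLong> entry : other.countersByStep.entrySet()) {
      inc(entry.getKey(), entry.getValue().get());
    }
  }

  /** Returns the current value for step, or 0 if the step is unknown. */
  long get(String step) {
    AtomicLong counter = countersByStep.get(step);
    return counter == null ? 0L : counter.get();
  }
}
```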

bjchambers · 2017-04-24T18:54:53Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/AccumulatedMetricResults.java

+/**
+ * Implementation of {@link MetricResults} by means of accumulated {@link MetricsContainers}.
+ */
+public class AccumulatedMetricResults extends MetricResults {


I think this class should have more documentation and should have some tests written to verify/document its behavior.

coveralls · 2017-04-30T05:40:26Z

Coverage increased (+0.2%) to 69.986% when pulling b87a860 on aviemzur:accumulable-metricscontainer into 14d60b2 on apache:master.

aviemzur · 2017-04-30T07:14:39Z

Rebased on top of master
Addressed comments in the review.
PTAL: @bjchambers @aljoscha @staslev

staslev

As far as my experience with (Spark) accumulators goes, things might slip when moving from test setups (i.e. a local cluster) to real-world distributed clusters.

Since I understand this has already been tested on an actual cluster, I have only a few minor style comments.

staslev · 2017-04-30T07:03:58Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricsContainerStepMap.java

+ * {@link MetricsContainerStepMap}.
+ */
+ public Map<String, MetricsContainer> asMap() {
+ return new HashMap<>(metricsContainers);


Why is a new HashMap instance needed here?
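For context on the question, a sketch of the two usual options, assuming a hypothetical AsMapSketch with a plain HashMap as backing state: a defensive copy gives callers an independent snapshot, while an unmodifiable view avoids the copy but stays tied to the backing map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder contrasting a defensive copy with a read-only view. */
final class AsMapSketch {
  private final Map<String, Long> backing = new HashMap<>();

  void put(String step, long value) {
    backing.put(step, value);
  }

  /** Snapshot: later changes to the backing map are not visible to the caller. */
  Map<String, Long> asMapCopy() {
    return new HashMap<>(backing);
  }

  /** Live read-only view: no copy, reflects later changes, rejects writes. */
  Map<String, Long> asMapView() {
    return Collections.unmodifiableMap(backing);
  }
}
```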

staslev · 2017-04-30T07:33:49Z

runners/spark/src/main/java/org/apache/beam/runners/spark/metrics/MetricsAccumulatorParam.java

 import org.apache.spark.AccumulatorParam;


 /**
 * Metrics accumulator param.
 */
-class MetricsAccumulatorParam implements AccumulatorParam<SparkMetricsContainer> {
+class MetricsAccumulatorParam implements AccumulatorParam<MetricsContainerStepMap> {


Perhaps this could be made an inner static class of MetricsContainerStepMap so that usages would look like so: new MetricsContainerStepMap.AccumulatorParam().

The problem with this is AccumulatorParam is a Spark interface.

Why is this an issue?

It is instantiated here, so instead of new MetricsAccumulatorParam() it could be new MetricsContainerStepMap.AccumulatorParam().

Am I missing something?

The problem is MetricsContainerStepMap is in sdks-java-core which is not dependent on Spark.

staslev · 2017-04-30T08:42:47Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/AccumulatedMetricResults.java

+ * Implementation of {@link MetricResults} using
+ * {@link MetricsContainerStepMap MetricsContainerStepMaps}.
+ */
+public class AccumulatedMetricResults extends MetricResults {


I'm wondering if a better name could be given here.
Perhaps PipelineMetricResults to indicate that these MetricResults pertain to all steps in a given pipeline.

staslev · 2017-04-30T08:53:51Z

sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/AccumulatedMetricResults.java

+ *
+ * <p>This constructor is intended for runners which support both attempted and committed metrics.
+ */
+ public AccumulatedMetricResults(


I have a personal preference towards static factory methods here to avoid leaking the concrete type out, i.e.,
AccumulatedMetricResults.of(...), or PipelineMetricResults.of(...) if you adopt the naming suggested above.

This is however, a personal preference (by definition) so it extends MatterOfTaste.
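A minimal sketch of the static-factory style, with hypothetical names (PipelineResultsSketch, MapBacked) and a simplified counter-only result type. Callers see only the abstract type; the concrete class stays private.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical results type exposing a factory instead of a public constructor. */
abstract class PipelineResultsSketch {
  /** Returns the attempted value for the named counter, or 0 if it is unknown. */
  abstract long attempted(String counterName);

  /** Factory method: the concrete implementation is not part of the API. */
  static PipelineResultsSketch of(Map<String, Long> attemptedCounters) {
    return new MapBacked(attemptedCounters);
  }

  private static final class MapBacked extends PipelineResultsSketch {
    private final Map<String, Long> counters;

    MapBacked(Map<String, Long> counters) {
      // Copy so that later changes by the caller do not affect the results.
      this.counters = new HashMap<>(counters);
    }

    @Override
    long attempted(String counterName) {
      return counters.getOrDefault(counterName, 0L);
    }
  }
}
```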

coveralls · 2017-04-30T10:35:44Z

Coverage increased (+0.2%) to 69.978% when pulling b87a860 on aviemzur:accumulable-metricscontainer into 14d60b2 on apache:master.

coveralls · 2017-04-30T15:39:41Z

Coverage increased (+0.2%) to 69.993% when pulling aeb3000 on aviemzur:accumulable-metricscontainer into 14d60b2 on apache:master.

coveralls · 2017-04-30T17:02:50Z

Coverage increased (+0.2%) to 69.993% when pulling aeb3000 on aviemzur:accumulable-metricscontainer into 14d60b2 on apache:master.

aviemzur · 2017-05-01T10:11:09Z

@staslev Addressed your comments in my latest commit.

staslev · 2017-05-03T10:14:31Z

I'd import the static methods and use asMetricResults, asAttemptedOnlyMetricResults etc. directly, rather than in a qualified manner like MetricsContainerStepMap.asAttemptedOnlyMetricResults, which is long-ish.

Other than that the parts I have looked at LGTM.

aljoscha · 2017-05-05T09:13:07Z

I looked over the Flink parts again. This LGTM, now that Flink Metrics support is retained. 👍

aviemzur · 2017-05-05T20:26:17Z

Rebased on top of master.
Removed all code changes from direct runner.
Will merge when tests pass.

coveralls · 2017-05-05T22:10:33Z

Coverage increased (+0.2%) to 70.646% when pulling f943e18 on aviemzur:accumulable-metricscontainer into a629f73 on apache:master.

aviemzur · 2017-05-06T05:28:38Z

Two test failures of flaky tests, verified they passed locally.

aviemzur force-pushed the accumulable-metricscontainer branch 3 times, most recently from 55bcc26 to 46e2ecf Compare April 23, 2017 05:00

aljoscha reviewed Apr 23, 2017

aviemzur force-pushed the accumulable-metricscontainer branch 2 times, most recently from 16dcc79 to ebbb72a Compare April 24, 2017 06:31

bjchambers requested changes Apr 24, 2017

aviemzur mentioned this pull request Apr 25, 2017

[BEAM-1672] Extract interface MetricData #2522

Closed

4 tasks

aviemzur force-pushed the accumulable-metricscontainer branch from ac9e386 to b87a860 Compare April 30, 2017 04:40

staslev reviewed Apr 30, 2017

aviemzur added 3 commits May 5, 2017 23:16

[BEAM-1672] Make MetricsContainers accumulable.

e0ca7a7

[BEAM-1672] Use Accumulable MetricsContainers in Spark runner.

71203d1

[BEAM-1672] Use Accumulable MetricsContainers in Flink runner.

f943e18

aviemzur force-pushed the accumulable-metricscontainer branch from aeb3000 to f943e18 Compare May 5, 2017 20:24

bjchambers approved these changes May 5, 2017

asfgit closed this in 019d300 May 6, 2017

aviemzur added a commit to aviemzur/beam that referenced this pull request May 6, 2017

Cherry-pick pull request apache#2649 into release-2.0.0 branch

bd5e347

asfgit pushed a commit that referenced this pull request May 9, 2017

Cherry-pick pull request #2649 into release-2.0.0 branch

3a4ffd2
