
[FLINK-9489] Checkpoint timers as part of managed keyed state instead of raw keyed state #6333

Closed

Conversation

@StefanRRichter (Contributor) commented Jul 14, 2018

What is the purpose of the change

This PR integrates priority queue state (timers) with the snapshotting of Flink's state backends and also already includes backwards compatibility (FLINK-9490). The core idea is to have a common abstraction for how state is registered in the state backend and how snapshots operate on such state (StateSnapshotRestore, RegisteredStateMetaInfoBase). With this, the new state integrates more or less seamlessly with the existing snapshot logic. The notable exception is the current lack of integration of the RocksDB state backend with heap-based priority queue state. That combination can still use the old snapshot code without causing any regression, via a temporary path (see AbstractStreamOperator.snapshotState(...)). As a result, after this PR Flink supports asynchronous snapshots for (heap kv / heap queue), (rocks kv / rocks queue) (full and incremental), and (rocks kv / heap queue) (only full), and still uses synchronous snapshots for (rocks kv / heap queue) (only incremental).
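The core idea described above can be sketched with toy types: both key/value state and priority-queue (timer) state implement one common snapshot interface, so the backend's snapshot loop no longer has to distinguish between them. The interface name mirrors the `StateSnapshotRestore` abstraction mentioned in the description, but the bodies below are illustrative stand-ins, not Flink's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Common handle through which the backend snapshots any registered state. */
interface StateSnapshotRestore {
    /** Returns an immutable dump of the state's current contents. */
    List<String> stateSnapshot();
}

/** Toy key/value state registered under a name. */
class ToyKvState implements StateSnapshotRestore {
    final List<String> entries = new ArrayList<>();
    public List<String> stateSnapshot() { return new ArrayList<>(entries); }
}

/** Toy priority-queue (timer) state: exposes the very same snapshot interface. */
class ToyQueueState implements StateSnapshotRestore {
    final List<String> timers = new ArrayList<>();
    public List<String> stateSnapshot() { return new ArrayList<>(timers); }
}

/** Backend keeps one registry; the snapshot loop is agnostic to the state kind. */
class ToyBackend {
    final Map<String, StateSnapshotRestore> registeredStates = new HashMap<>();

    Map<String, List<String>> snapshotAll() {
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, StateSnapshotRestore> e : registeredStates.entrySet()) {
            result.put(e.getKey(), e.getValue().stateSnapshot());
        }
        return result;
    }
}
```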

DISCLAIMER: This work was created in a bit of a rush to make it into the 1.6 release. It still has some known rough edges, and there could be some bugs left to fix up in the test phase. Here is a list of things that already come to mind:

  • Integrate heap-based timers with incremental RocksDB snapshots, then kick out some code.
  • Check proper integration with serializer upgrade story (!!)
  • After that, we can also remove the key-partitioning in the set structure from HeapPriorityQueueSet.
  • Improve integration of the batch wrapper.
  • Improve general state registration logic in the backends, there is potential to remove duplicated code, and generally still improve the integration of the queue state a bit.
  • Improve performance of RocksDB based timers, e.g. a byte[] based cache, seeking directly to the next potential timer instead of to the key-group start, bulkPoll.
  • Improve some class/interface/method names.
  • Defensive checks against attempts to register a different state type under an existing name.
  • Add tests, e.g. bulkPoll etc.

Verifying this change

This change is already covered by existing tests, but I would add some more eventually. You can activate RocksDB based timers by using the RocksDB backend and setting RocksDBBackendOptions.PRIORITY_QUEUE_STATE_TYPE to ROCKS.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (yes)
  • The runtime per-record code paths (performance sensitive): (yes)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs only for now)

@StefanRRichter (Contributor, Author):

CC @tillrohrmann


/**
* General algorithm to read key-grouped state that was written from a {@link PartitioningResult}
* @param <T>
Contributor:

description for T is missing.

if (o1.equals(o2)) {
return 0;
}
// // we catch this case before moving to more expensive tie breaks.
Contributor:

For what reason do we need this comment?

Contributor:

I think this is some commented out code which should be removed.
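The excerpt being discussed returns 0 early when the two elements are equal, before falling back to more expensive tie-breaking. A minimal standalone sketch of that pattern; the `Timer` type and its fields are hypothetical stand-ins, not the PR's actual classes:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Objects;

// Hypothetical timer element: ordered by timestamp, ties broken by a
// (potentially expensive) comparison of the serialized key bytes.
final class Timer {
    final long timestamp;
    final byte[] serializedKey;

    Timer(long timestamp, byte[] serializedKey) {
        this.timestamp = timestamp;
        this.serializedKey = serializedKey;
    }

    @Override public boolean equals(Object o) {
        return o instanceof Timer
            && ((Timer) o).timestamp == timestamp
            && Arrays.equals(((Timer) o).serializedKey, serializedKey);
    }

    @Override public int hashCode() {
        return Objects.hash(timestamp, Arrays.hashCode(serializedKey));
    }
}

final class TimerComparator implements Comparator<Timer> {
    @Override public int compare(Timer o1, Timer o2) {
        // Cheap equality check first: we catch this case before moving to the
        // more expensive tie breaks below.
        if (o1.equals(o2)) {
            return 0;
        }
        int byTime = Long.compare(o1.timestamp, o2.timestamp);
        if (byTime != 0) {
            return byTime;
        }
        // Expensive tie break: lexicographic comparison of the serialized key.
        byte[] a = o1.serializedKey;
        byte[] b = o2.serializedKey;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Byte.compare(a[i], b[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```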

@@ -305,6 +351,6 @@ private void checkRefillCacheFromStore() {
* after usage.
*/
@Nonnull
CloseableIterator<E> orderedIterator();
CloseableIterator<E> orderedIterator();;
Contributor:

a duplicated ;

if (precomputedSnapshot == null) {
precomputedSnapshot = precomputeSnapshot();
}
return precomputedSnapshot;
Contributor:

What if the serializers are not all immutable? Should we add an immutable flag field, and only return the precomputed snapshot when it is true?

Contributor:

As an easy fix, we could remove the precomputedSnapshot field and keep it as it was before, where the snapshot was computed on every snapshot call.
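The pattern under discussion lazily caches the snapshot in a field. A standalone sketch of that memoization (the meta-info type and its string snapshot are toy stand-ins); as the reviewer notes above, caching like this is only safe when everything reachable from the snapshot is immutable:

```java
final class RegisteredStateMetaInfo {
    private final String name;

    // Lazily computed; safe to cache only if the snapshot (and everything it
    // references, e.g. serializers) is immutable -- the reviewer's concern.
    private String precomputedSnapshot; // toy stand-in for StateMetaInfoSnapshot

    RegisteredStateMetaInfo(String name) {
        this.name = name;
    }

    String snapshot() {
        if (precomputedSnapshot == null) {
            precomputedSnapshot = precomputeSnapshot();
        }
        return precomputedSnapshot;
    }

    private String precomputeSnapshot() {
        // Stand-in for the real (more expensive) snapshot construction.
        return "snapshot(" + name + ")";
    }
}
```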

return new HeapPriorityQueueStateSnapshot<>(
queueDump,
keyExtractorFunction,
metaInfo,
Contributor:

We only dump the queued elements here; should we also take a snapshot of the metaInfo, in case parts of it are not immutable?

@@ -446,8 +485,10 @@ public String toString() {
@Override
public int numStateEntries() {
int sum = 0;
for (StateTable<K, ?, ?> stateTable : stateTables.values()) {
sum += stateTable.size();
for (StateSnapshotRestore stateTable : registeredStates.values()) {
Contributor:

nit: the name stateTable is a bit confusing, since it is now the registered state (which might not be a StateTable)...


/**
*
* @param <T>
Contributor:

Description for T is missing

@tillrohrmann (Contributor) left a comment:

First half of minor comments. Will continue reviewing the second half.

public Object extractKeyFromElement(@Nonnull Keyed<?> element) {
return element.getKey();
}
};
Contributor:

Could we move this extractor into its own KeyedKeyExtractorFunction singleton?
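The suggestion above, moving the anonymous extractor into its own singleton, is commonly done with an enum in Java. A sketch with toy `Keyed` and `KeyExtractorFunction` interfaces standing in for Flink's:

```java
// Toy stand-ins for the Flink interfaces referenced in the review comment.
interface Keyed<K> {
    K getKey();
}

interface KeyExtractorFunction<T> {
    Object extractKeyFromElement(T element);
}

// Enum singleton: one shared, stateless extractor instance instead of an
// anonymous class allocated at each use site.
enum KeyedKeyExtractorFunction implements KeyExtractorFunction<Keyed<?>> {
    INSTANCE;

    @Override
    public Object extractKeyFromElement(Keyed<?> element) {
        return element.getKey();
    }
}
```

The enum form gives serialization-safe singleton semantics for free, which is why it is a common choice for stateless function objects like this.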

@@ -264,6 +265,42 @@ public void writeMappingsInKeyGroup(@Nonnull DataOutputView dov, int keyGroupId)
}
}

public static <T> StateSnapshotKeyGroupReader createKeyGroupPartitionReader(
@Nonnull ElementReaderFunction<T> readerFunction,
@Nonnull KeyGroupElementsConsumer<T> elementConsumer) {
Contributor:

Indenting these parameters one more level would help to distinguish the body from the parameter list.

import javax.annotation.Nonnull;

/**
*
Contributor:

JavaDocs missing

final TypeSerializer<V> valueSerializer) {
/** The precomputed immutable snapshot of this state */
@Nullable
private StateMetaInfoSnapshot precomputedSnapshot;
Contributor:

nit: Maybe rename to precomputedStateMetaInfoSnapshot

if (precomputedSnapshot == null) {
precomputedSnapshot = precomputeSnapshot();
}
return precomputedSnapshot;
Contributor:

As an easy fix, we could remove the precomputedSnapshot field and keep it as it was before, where the snapshot was computed on every snapshot call.

if (o1.equals(o2)) {
return 0;
}
// // we catch this case before moving to more expensive tie breaks.
Contributor:

I think this is some commented out code which should be removed.

}
}
} catch (Exception e) {
throw new FlinkRuntimeException("Exception while bulk polling store.", e);
Contributor:

I would prefer throwing a checked exception here.

Contributor (Author):

Why would you prefer it? I think there is no better way for the caller to handle problems in this call than failing the job (RocksDB problems)?

Contributor:

Because it makes it more explicit that there are things which can go wrong. With checked exceptions you still have the chance to let the program fail, but without them the caller needs to know that there are unchecked exceptions in order to do any recovery operation.

Moreover, I'm not sure whether we should decide at this level how recovery is or is not done. For example, maybe the caller can fetch the latest checkpoint data again and replay all in-between elements in order to recompute the state. This is something the priority queue should not need to bother about.
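The alternative the reviewer argues for, declaring a checked exception so callers must decide how to recover, looks like this in reduced form. The exception type, class, and failure mode below are hypothetical, not the PR's actual code:

```java
// Hypothetical checked exception surfacing store-access failures to the caller.
class StateAccessException extends Exception {
    StateAccessException(String message, Throwable cause) {
        super(message, cause);
    }
}

class CachingQueueStore {
    private final boolean storeBroken;

    CachingQueueStore(boolean storeBroken) {
        this.storeBroken = storeBroken;
    }

    // Checked signature: the caller is forced to handle or propagate, instead
    // of having to know about an undeclared RuntimeException.
    String bulkPoll() throws StateAccessException {
        try {
            if (storeBroken) {
                throw new java.io.IOException("backing store unavailable");
            }
            return "element";
        } catch (Exception e) {
            throw new StateAccessException("Exception while bulk polling store.", e);
        }
    }
}
```

A caller that just wants to fail the job can still rethrow; a caller that wants to recover (e.g. by re-reading checkpoint data, as suggested above) now has an explicit hook to do so.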

stateTables.put(restoredMetaInfo.getName(), stateTable);
snapshotRestore = snapshotStrategy.newStateTable(registeredKeyedBackendStateMetaInfo);
registeredStates.put(restoredMetaInfo.getName(), snapshotRestore);
} else {
Contributor:

Maybe check that restoredMetaInfo.getBackendStateType() == PRIORITY_QUEUE.

for (StateSnapshotRestore stateTable : registeredStates.values()) {
if (stateTable instanceof StateTable) {
sum += ((StateTable<?, ?, ?>) stateTable).size();
}
Contributor:

Why don't the timers count toward the total number of state entries?

Contributor (Author):

This method is only used for some tests, and to be on the safe side it is probably expected to count only the keyed state and not the timers.
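The counting logic under discussion sums only the entries that are key/value state tables and deliberately skips timer queues, via an instanceof filter over the common registry. A standalone sketch with toy types (names are illustrative, not Flink's):

```java
import java.util.HashMap;
import java.util.Map;

// Common marker for anything in the registry, mirroring StateSnapshotRestore.
interface SnapshotRestoreHandle {}

final class ToyStateTable implements SnapshotRestoreHandle {
    final int size;
    ToyStateTable(int size) { this.size = size; }
}

final class ToyTimerQueue implements SnapshotRestoreHandle {
    final int size;
    ToyTimerQueue(int size) { this.size = size; }
}

final class Registry {
    final Map<String, SnapshotRestoreHandle> registeredStates = new HashMap<>();

    // Counts only key/value state tables; timer queues are skipped, matching
    // the test-only semantics discussed above.
    int numStateEntries() {
        int sum = 0;
        for (SnapshotRestoreHandle handle : registeredStates.values()) {
            if (handle instanceof ToyStateTable) {
                sum += ((ToyStateTable) handle).size;
            }
        }
        return sum;
    }
}
```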

@@ -60,36 +61,35 @@

/** Result of partitioning the snapshot by key-group. */
@Nullable
private KeyGroupPartitionedSnapshot partitionedSnapshot;
private StateKeyGroupWriter partitionedSnapshot;
Contributor:

nit: rename field

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jul 16, 2018
… of raw keyed state

Optimization for relaxed bulk polls

Deactivate optimization for now because it still contains a bug

This closes apache#6333.
@tillrohrmann (Contributor) left a comment:

The changes look good to me. Thanks a lot for your work @StefanRRichter!

One thing we should add as a follow-up is an end-to-end test which verifies that timers are now scalable. Moreover, I think we should also support configuring the state backend as we create it, similar to incremental checkpointing. Otherwise it won't be possible to run jobs with different timer service implementations on the same cluster.

Merging this PR once Travis gives green light.

@asfgit asfgit closed this in dbddf00 Jul 16, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018
… of raw keyed state

Optimization for relaxed bulk polls

Deactivate optimization for now because it still contains a bug

This closes apache#6333.