[FLINK-10020] [kinesis] Support recoverable exceptions in listShards. #6482
Conversation
		listShardsBaseBackoffMillis, listShardsMaxBackoffMillis, listShardsExpConstant, attemptCount++);
	LOG.warn("Got SdkClientException when listing shards from stream {}. Backing off for {} millis.",
		streamName, backoffMillis);
	Thread.sleep(backoffMillis);
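For context, the full-jitter backoff helper the snippet calls could look roughly like this. This is a minimal sketch only: the method and parameter names are taken from the snippet above, but the body is an assumption based on the standard full-jitter pattern, not the connector's actual implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    // Sketch of an exponential backoff with full jitter: the raw delay grows as
    // base * power^attempt, is capped at maxMillis, and the actual sleep is a
    // uniformly random value in [0, cappedDelay].
    static long fullJitterBackoff(long baseMillis, long maxMillis, double power, int attempt) {
        long capped = Math.min(maxMillis, (long) (baseMillis * Math.pow(power, attempt)));
        return (long) (ThreadLocalRandom.current().nextDouble() * capped);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            long backoffMillis = fullJitterBackoff(1000L, 5000L, 1.5, attempt);
            System.out.println("attempt " + attempt + " -> sleep " + backoffMillis + " ms");
        }
    }
}
```

The jitter spreads out retries from many parallel subtasks so they do not hammer the ListShards API in lockstep after a shared transient failure.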
I'm wondering what kinds of SdkClientExceptions there are. Do we really need a backoff here before retrying?
Please see the JIRA for an example of such an exception. These are really the same type of exceptions that we don't want getRecords to fail on, and I believe we should be consistent with the backoff. Since listShards isn't latency sensitive, it won't hurt to err on the conservative side.
@@ -409,7 +416,7 @@ private ListShardsResult listShards(String streamName, @Nullable String startSha
 	int attemptCount = 0;
 	// List Shards returns just the first 1000 shard entries. Make sure that all entries
 	// are taken up.
-	while (listShardsResults == null) { // retry until we get a result
+	while (attemptCount <= listShardsMaxAttempts && listShardsResults == null) { // retry until we get a result
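The bounded-retry shape under discussion can be sketched in isolation. All names here (`callWithRetry`, `fetch`) are illustrative stand-ins, not the connector's API; the point is the loop condition that now caps attempts instead of retrying forever.

```java
import java.util.function.Supplier;

public class BoundedRetry {
    // Retry a flaky call until it yields a result or the attempt budget is
    // exhausted, backing off between attempts. Mirrors the PR's loop shape:
    // while (attemptCount <= maxAttempts && result == null).
    static String callWithRetry(Supplier<String> fetch, int maxAttempts, long backoffMillis)
            throws InterruptedException {
        String result = null;
        int attemptCount = 0;
        while (attemptCount <= maxAttempts && result == null) {
            try {
                result = fetch.get();
            } catch (RuntimeException e) {
                // Treat as recoverable: count the attempt and back off before retrying.
                attemptCount++;
                Thread.sleep(backoffMillis);
            }
        }
        if (result == null) {
            throw new RuntimeException("Exceeded max retries (" + maxAttempts + ") without a result.");
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Fails twice, then succeeds on the third call.
        String r = callWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new RuntimeException("transient");
            }
            return "shards";
        }, 5, 1L);
        System.out.println(r); // prints "shards"
    }
}
```

With `maxAttempts` set very high this behaves like the previous unbounded loop, which is why the review below notes that the old contract can still be approximated via configuration.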
The earlier contract was to wait until we get a result. https://issues.apache.org/jira/browse/FLINK-10020 does not talk about breaking this contract. I personally believe a maxAttemptCount is better, since listShards runs in a periodic thread and we are bound to try again after 'X' seconds anyway. Just wanted to point this out. I like this approach better.
I too think that this is better, since it gives the user more flexibility. Setting the retry count to the maximum practically achieves the previous behavior. Perhaps we should raise the default retry count?
@@ -151,6 +190,45 @@ public void testGetShardList() throws Exception {
 			expectedStreamShard.toArray(new StreamShardHandle[actualShardList.size()])));
 	}

+	@Test
+	public void testGetShardListRetry() throws Exception {
Would it make sense to also have a test where we exceed the number of configured retries? In that case we should not get any result.
Good point, expanded the test to cover this.
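The retry-exhaustion case that the expanded test covers can be sketched like this. All names below are illustrative, not the PR's actual test code; the sketch just demonstrates the expected behavior: with a mocked client that always fails transiently, the loop gives up after the configured attempts and yields no result.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ListShardsRetryTestSketch {
    static final int MAX_ATTEMPTS = 3;

    // Stand-in for a Kinesis client whose listShards always fails transiently.
    static String flakyListShards(AtomicInteger callCount) {
        callCount.incrementAndGet();
        throw new RuntimeException("simulated SdkClientException");
    }

    // Bounded retry loop: stops once attemptCount exceeds MAX_ATTEMPTS.
    static String listShardsWithRetry(AtomicInteger callCount) {
        String result = null;
        int attemptCount = 0;
        while (attemptCount <= MAX_ATTEMPTS && result == null) {
            try {
                result = flakyListShards(callCount);
            } catch (RuntimeException e) {
                attemptCount++;
            }
        }
        return result; // null means retries were exhausted without a result
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        String result = listShardsWithRetry(calls);
        // The call is attempted MAX_ATTEMPTS + 1 times, then the loop gives up.
        System.out.println("attempts=" + calls.get() + ", result=" + result);
    }
}
```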
@tzulitai PTAL
Force-pushed from bf2e212 to e8fd071.
👍
Changes LGTM, +1, thanks @tweise. |
This change fixes the retry behavior of listShards to match what getRecords already supports. Importantly, this prevents the subtask from failing on transient listShards errors that we can identify based on well-known exceptions. These are recoverable and should not lead to unnecessary recovery cycles that cause downtime.
R: @glaksh100 @jgrier @tzulitai