Fixing failing compaction/parallel index jobs during upgrade due to new actions being available on the overlord. #15430

Merged
merged 5 commits into from
Nov 25, 2023

Conversation

cryptoe

@cryptoe cryptoe commented Nov 24, 2023

#15039 introduced a new task action on the overlord, RetrieveSegmentsToReplaceAction. During an upgrade, this action might not yet be available on the overlord, so compaction and reindex tasks fail until the overlord is upgraded.

I fixed this by adding an undocumented runtime property, enableConcurrentAppendAndReplace, which cluster admins can set if they want to try out concurrentAppendAndReplace.
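A minimal sketch of how such a flag could gate the choice of action. This is not the code from the PR: the class and method names are hypothetical, the property is read by its bare name because the full key is not given here, and the fallback to `segmentListUsed` is an assumption based on that type id appearing in the list an older overlord already knows.

```java
import java.util.Properties;

public class ActionSelectionSketch {
    // Bare flag name as mentioned in the PR description; the real, fully
    // qualified runtime property key is not stated there (assumption).
    static final String FLAG = "enableConcurrentAppendAndReplace";

    // Hypothetical helper: returns the type id of the task action to submit.
    // With the flag unset, tasks keep using an action that an older overlord
    // already understands, so they do not fail mid-upgrade.
    static String chooseActionType(Properties runtimeProperties) {
        boolean enabled = Boolean.parseBoolean(
            runtimeProperties.getProperty(FLAG, "false"));
        return enabled ? "retrieveSegmentsToReplace" : "segmentListUsed";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(chooseActionType(props)); // flag unset
        props.setProperty(FLAG, "true");
        System.out.println(chooseActionType(props)); // flag set by an admin
    }
}
```

Defaulting the flag to off means the new action is only sent once an admin has opted in, presumably after the overlord has been upgraded.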

This PR also changes RetrieveSegmentsToReplaceAction to take a List<Interval> as input, since that is required for MSQ jobs, as discussed here: #15284 (comment)


@kfaraz kfaraz left a comment

Thanks a lot for catching this, @cryptoe ! The changes look good to me.

@@ -67,16 +68,16 @@ public class RetrieveSegmentsToReplaceAction implements TaskAction<Collection<Da
private final String dataSource;

@JsonIgnore

I think I had missed this earlier when this class was created. Not sure why we need @JsonIgnore here.

@@ -84,19 +84,19 @@ default Collection<DataSegment> retrieveUsedSegmentsForInterval(
/**
*
* Retrieve all published segments which are marked as used and the created_date of these segments belonging to the
* given data source and interval from the metadata store.
* given data source and List<Interval> from the metadata store.

Nit: This doesn't render correctly in javadoc. Use intervals or List of intervals instead of this.

// Do not need an interval condition if the interval is ETERNITY
if (!Intervals.isEternity(interval)) {
intervals.add(interval);
boolean intervalsAreEternity = false;

Nit:

Suggested change
- boolean intervalsAreEternity = false;
+ boolean hasEternityInterval = false;
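For context, a rough sketch of the kind of check this flag supports once the method takes a list of intervals: if any requested interval is ETERNITY, no interval condition is needed at all. The `Interval` class and `isEternity` helper below are simplified stand-ins, not Druid's `Intervals` utilities, and the method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EternityCheckSketch {
    // Simplified stand-in for a time interval (assumption, not the Joda type).
    static final class Interval {
        final long startMillis;
        final long endMillis;

        Interval(long startMillis, long endMillis) {
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    static final Interval ETERNITY = new Interval(Long.MIN_VALUE, Long.MAX_VALUE);

    static boolean isEternity(Interval interval) {
        return interval.startMillis == ETERNITY.startMillis
            && interval.endMillis == ETERNITY.endMillis;
    }

    // Returns the intervals that need an explicit condition in the query.
    // An empty result means "no interval filter": one ETERNITY entry already
    // covers everything, so the remaining entries add nothing.
    static List<Interval> intervalsNeedingCondition(List<Interval> requested) {
        boolean hasEternityInterval = false;
        for (Interval interval : requested) {
            if (isEternity(interval)) {
                hasEternityInterval = true;
                break;
            }
        }
        return hasEternityInterval ? new ArrayList<>() : new ArrayList<>(requested);
    }

    public static void main(String[] args) {
        List<Interval> mixed = Arrays.asList(new Interval(0L, 1000L), ETERNITY);
        System.out.println(intervalsNeedingCondition(mixed).size());
    }
}
```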

@cryptoe

cryptoe commented Nov 25, 2023

I just pushed a patch that addresses the review comments. cc @kfaraz

@cryptoe cryptoe added this to the 28.0 milestone Nov 25, 2023
@cryptoe cryptoe merged commit a018819 into apache:master Nov 25, 2023
83 checks passed
@cryptoe

cryptoe commented Nov 28, 2023

Without this patch, the error message is:

2023-11-27T16:44:44,180 ERROR [[coordinator-issued_compact_kttm_transformed_cipcheah_2023-11-27T16:44:44.023Z]-threading-task-runner-executor-3] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: org.apache.druid.java.util.common.IOE: Error with status[400 Bad Request] and message[{"error":"Please make sure to load all the necessary extensions and jars with type 'retrieveSegmentsToReplace' on 'druid/coordinator' service. Could not resolve type id 'retrieveSegmentsToReplace' as a subtype of `org.apache.druid.indexing.common.actions.TaskAction` known type ids = [checkPointDataSourceMetadata, lockAcquire, lockList, lockRelease, lockTryAcquire, markSegmentsAsUnused, resetDataSourceMetadata, segmentAllocate, segmentInsertion, segmentListUnused, segmentListUsed, segmentLockAcquire, segmentLockTryAcquire, segmentMetadataUpdate, segmentNuke, segmentTransactionalInsert, surrogateAction, updateLocation, updateStatus] (for POJO property 'action')\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 2090] (through reference chain: org.apache.druid.indexing.common.actions.TaskActionHolder[\"action\"])"}]. Check overlord logs for details.
	at org.apache.druid.indexing.input.DruidInputSource.getTimelineForInterval(DruidInputSource.java:561) ~[classes/:?]
	at org.apache.druid.indexing.input.DruidInputSource.createTimeline(DruidInputSource.java:368) ~[classes/:?]
	at org.apache.druid.indexing.input.DruidInputSource.fixedFormatReader(DruidInputSource.java:296) ~[classes/:?]
	at org.apache.druid.data.input.AbstractInputSource.reader(AbstractInputSource.java:48) ~[classes/:?]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.inputSourceReader(AbstractBatchIndexTask.java:215) ~[classes/:?]
	at org.apache.druid.indexing.common.task.InputSourceProcessor.process(InputSourceProcessor.java:82) ~[classes/:?]
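The failure mode in this log can be illustrated with a small stand-in for type-id resolution: a task built from newer code sends a type id that the older overlord has never registered. This sketch does not use Jackson or Druid classes; the registry, the method name, and the shortened id list are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class TypeIdRegistrySketch {
    // A shortened subset of the type ids listed in the error message above.
    static final Set<String> OLD_OVERLORD_TYPE_IDS = new TreeSet<>(
        Arrays.asList("lockAcquire", "segmentListUsed", "segmentTransactionalInsert"));

    // Hypothetical resolver: rejects any type id the service does not know,
    // which is what surfaces as a 400 Bad Request for the task.
    static String resolve(Set<String> knownTypeIds, String typeId) {
        if (!knownTypeIds.contains(typeId)) {
            throw new IllegalArgumentException(
                "Could not resolve type id '" + typeId
                    + "'; known type ids = " + knownTypeIds);
        }
        return typeId;
    }

    public static void main(String[] args) {
        System.out.println(resolve(OLD_OVERLORD_TYPE_IDS, "segmentListUsed"));
        try {
            resolve(OLD_OVERLORD_TYPE_IDS, "retrieveSegmentsToReplace");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```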

cryptoe added a commit to cryptoe/druid that referenced this pull request Nov 29, 2023
…ew actions being available on the overlord. (apache#15430)

* Fixing failing compaction/parallel index jobs during upgrade due to new actions not available on the overlord.

* Fixing build

* Removing extra space.

* Fixing json getter.

* Review comments.

(cherry picked from commit a018819)
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…ew actions being available on the overlord. (apache#15430)

* Fixing failing compaction/parallel index jobs during upgrade due to new actions not available on the overlord.

* Fixing build

* Removing extra space.

* Fixing json getter.

* Review comments.
cryptoe added a commit that referenced this pull request Dec 1, 2023
…ew actions being available on the overlord. (#15430) (#15450)

* Fixing failing compaction/parallel index jobs during upgrade due to new actions not available on the overlord.
(cherry picked from commit a018819)
@LakshSingla LakshSingla modified the milestones: 28.0, 28.0.1 Dec 4, 2023