-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix repeated migration of colocate tablets #53099
base: main
Are you sure you want to change the base?
Conversation
if (!olapTbl.needSchedule(isLocalBalance)) { | ||
continue; | ||
} | ||
|
||
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) { | ||
partitionChecked++; | ||
if (partitionChecked % partitionBatchNum == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
The changes are adding calls to olapTable.needSchedule()
and olapTbl.needSchedule()
, leading to skipped iterations when scheduling is not needed. If needSchedule
logic is incorrect, or if skipping these parts without proper handling causes data imbalance or system performance issues, it may introduce a risk of improperly balanced clusters or missed scheduling opportunities.
You can modify the code like this:
// Assess if the conditions under which 'needSchedule' returns false might lead to critical operations being skipped inadvertently,
// and ensure that either strategic logging or alternative operations handle such cases to preserve cluster health.
private List<TabletSchedCtx> balanceClusterDisk(ClusterLoadStatistic clusterStat) {
for (/* iteration variables */) {
if (!olapTable.needSchedule(false)) {
// Add logging or checks to confirm skipping doesn't cause issues
continue;
}
if (isDestBackendLocationMismatch(olapTable, hBackend.getId(), lBackend.getId(),
physicalPartition.getParentId(), tabletId)) {
continue;
}
// rest of the function...
}
}
private void balanceBackendDisk(TStorageMedium medium, double avgUsedPercent) {
for (/* iteration variables */) {
if (olapTable == null) {
continue;
}
if (!olapTable.needSchedule(true)) {
// Add logging or checks to assess impact
continue;
}
if (isTabletUnhealthy(tabletMeta.getDbId(), olapTable, tabletId, tabletMeta, aliveBeIds)) {
continue;
}
// rest of the function...
}
}
private Map<Pair<Long, Long>, PartitionStat> getPartitionStats(TStorageMedium medium, boolean isLocalBalance) {
for (/* iteration variables */) {
if (!olapTbl.needSchedule(isLocalBalance)) {
// Consider implementing fallback checks here
continue;
}
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) {
partitionChecked++;
if (partitionChecked % partitionBatchNum == 0) {
// additional logic...
}
}
// rest of the function...
}
}
Ensure you validate whether operations that are bypassed should still partially execute or trigger alerts/logs for evaluation.
5173378
to
b8619d0
Compare
Signed-off-by: Jiao Mingye <[email protected]>
Quality Gate passedIssues Measures |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]❌ fail : 3 / 6 (50.00%) file detail
|
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
Co-authored-by: miomiocat [email protected]
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: