Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: suppress compaction/gc errors while stopping #5670

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Oct 26, 2023

Problem

Tenant deletions would sometimes be accompanied by compaction stack traces, because shutdown() puts the tenant into stopping state before it joins background tasks.

Summary of changes

Treat GC+Compaction as no-ops on a Stopping tenant.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Oct 26, 2023
@jcsp jcsp requested a review from a team as a code owner October 26, 2023 09:11
@jcsp jcsp requested review from problame and removed request for a team October 26, 2023 09:11
@github-actions
Copy link

2346 tests run: 2229 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_timeline_init_break_before_checkpoint: debug

Code coverage (full report)

  • functions: 53.6% (8576 of 16010 functions)
  • lines: 80.9% (49768 of 61484 lines)

The comment gets automatically updated with the latest test results
f5b5444 at 2023-10-26T09:52:48.261Z :recycle:

@jcsp jcsp enabled auto-merge (squash) October 26, 2023 09:54
@jcsp jcsp merged commit e0ebdfc into main Oct 26, 2023
42 checks passed
@jcsp jcsp deleted the jcsp/log-hygiene-compaction branch October 26, 2023 09:59
Comment on lines +1827 to 1834
// Don't start doing work during shutdown
if let TenantState::Stopping { .. } = self.current_state() {
return Ok(GcResult::default());
}

// there is a global allowed_error for this
anyhow::ensure!(
self.is_active(),
Copy link
Member

@koivunej koivunej Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there now repetition with the ensure in both (compaction, gc) cases?

We do not start the background loops before we are active, next possible steps are broken and stopping.

I am also surprised how did you get to merge this without conflicts since I did make the CompactionError visible in the #4938.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repetition is a pre-existing thing: the gc_loop and compaction_loop code paths are very similar -- here I'm aiming to make the behavioral change rather than refactor to reduce duplication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant repeatition between:

if let TenantState::Stopping { .. } = self.current_state()

and

anyhow::ensure!(
    self.is_active(),

And how the stopping OR broken can be the only next state.

Perhaps the suggestion I was working towards is should we instead have:

-if let TenantState::Stopping { .. } = self.current_state() {
+if !self.is_active() {
     return Ok(GcResult::default());
 }
 
-// there is a global allowed_error for this
-anyhow::ensure!(
-    self.is_active()
-);

(did not check how the ensure! block would correctly end).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. The reason I didn't do that is because this function is called from two places: in the compaction loop I would always want to drop out with Ok(()) if we have no work, but when called via the API (e.g. in a test), I would prefer to give the caller an error if they have tried to compact prematurely during activation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today I found out that the Tenant::compaction_iteration is not called from tests. I see that GC of course is.. I'll do what I proposed in #5881 for that so I can use the better error type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to give the caller an error if they have tried to compact prematurely during activation

Related: there is a PR I intend to merge from external contributor I can never find the number of, but it would add waiting until tenants have finished loading by default to all tests so this reason would go away for gc as well.

It never existed for compaction though, as Tenant::compaction_iteration was never exposed like Tenant::gc is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants