Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.x porting - add remonitor code to DOWN message (#3144) #3250

Merged
merged 2 commits into from
Nov 7, 2020

Conversation

jiangphcn
Copy link
Contributor

@jiangphcn jiangphcn commented Nov 6, 2020

Overview

Smoosh monitors the compactor pid to determine when the compaction jobs
finishes, and uses this for its idea of concurrency. However, this isn't
accurate in the case where the compaction job has to re-spawn to catch up on
intervening changes since the same logical compaction job continues with
another pid and smoosh is not aware. In such cases, a smoosh channel with
concurrency one can start arbitrarily many additional database compaction jobs.

To solve this problem, we added a check to see if a compaction PID exists for
a db in start_compact. But wee need to add another check because this check
is only for shard that comes off the queue. So the following can still occur:

  1. Enqueue a bunch of stuff into channel with concurrency 1
  2. Begin highest priority job, Shard1, in channel
  3. Compaction finishes, discovers compaction file is behind main file
  4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job
  5. Smoosh receives the 'DOWN' message, begins the next highest priority job,
    Shard2
  6. Channel concurrency is now 2, not 1

This change adds another check into the 'DOWN' message so that it checks for
that specific shard. If the compaction PID exists then it means a new process
was spawned and we just monitor that one and add it back to the queue. The
length of the queue does not change and therefore we won’t spawn new
compaction jobs.

Testing recommendations

Related Issues or Pull Requests

Current PR is the porting of PR #3144 and #3150 to 3.x

Checklist

@jiangphcn jiangphcn force-pushed the 3.x-re-monitor-compaction-pid branch 2 times, most recently from 1949934 to eff0f2c Compare November 7, 2020 00:26
Smoosh monitors the compactor pid to determine when the compaction jobs
finishes, and uses this for its idea of concurrency. However, this isn't
accurate in the case where the compaction job has to re-spawn to catch up on
intervening changes since the same logical compaction job continues with
another pid and smoosh is not aware. In such cases, a smoosh channel with
concurrency one can start arbitrarily many additional database compaction jobs.

To solve this problem, we added a check to see if a compaction PID exists for
a db in `start_compact`. But wee need to add another check because this check
is only for shard that comes off the queue. So the following can still occur:

1. Enqueue a bunch of stuff into channel with concurrency 1
2. Begin highest priority job, Shard1, in channel
3. Compaction finishes, discovers compaction file is behind main file
4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job
5. Smoosh receives the 'DOWN' message, begins the next highest priority job,
Shard2
6. Channel concurrency is now 2, not 1

This change adds another check into the 'DOWN' message so that it checks for
that specific shard. If the compaction PID exists then it means a new process
was spawned and we just monitor that one and add it back to the queue. The
length of the queue does not change and therefore we won’t spawn new
compaction jobs.
This fixes a94e693 because a race
condition exisited where the 'DOWN' message could be received
before the compactor pid is spawned. Adding a synchronous call to
get the compactor pid guarantees that the couch_db_updater process
handling of finish_compaction has occurred.
@jiangphcn jiangphcn force-pushed the 3.x-re-monitor-compaction-pid branch from eff0f2c to ca9df69 Compare November 7, 2020 00:27
@jiangphcn jiangphcn merged commit c563243 into 3.x Nov 7, 2020
@jiangphcn jiangphcn deleted the 3.x-re-monitor-compaction-pid branch November 7, 2020 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants