"not_in_range" failure while resharding database #4624

Closed
jcoglan opened this issue May 25, 2023 · 3 comments · Fixed by #4626

Comments

@jcoglan
Contributor

jcoglan commented May 25, 2023

While attempting to shard-split a q=16 database on a 3-node cluster, we found that all reshard jobs failed, and GET /_reshard/jobs stopped responding to requests. The logs reveal a not_in_range failure in mem3_reshard_job.

Description

We are attempting to reshard a database from q=8 to q=32, using the following script: https://gist.github.com/jcoglan/ad2b631664bc436c48e4274718a0acd6. This worked for the first step, from q=8 to q=16, but failed on the second step, from q=16 to q=32.

/_reshard shows that all jobs failed:

{
  "state": "running",
  "state_reason": null,
  "completed": 0,
  "failed": 48,
  "running": 0,
  "stopped": 0,
  "total": 48
}

Also, /_reshard/jobs does not respond at all; the request hangs, with no activity visible in the logs.

We observed many messages like the following while the jobs were running:

[error] 2023-05-24T20:42:03.508280Z [email protected] <0.22158.5016>
-------- mem3_reshard_job worker error "#job{001-2ce218fc7c061f935757654236cbb4b25aceab927d7c433d377c72f0dee42bca
shards/c0000000-cfffffff/some-db.1566089635 /2 job_state:running split_state:initial_copy pid:<0.22158.5016>}"
{{badkey,not_in_range},[{erlang,map_get,[not_in_range,#{[3221225472,3355443199] =>
{target,{db,1,<<"shards/c0000000-c7ffffff/some-db.1566089635">>,
"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
{couch_bt_engine,{st,"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
<0.11973.5023>,#Ref<0.1898309387.670826504.2396>,undefined,
{db_header,8,2343426,0,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
{27708861596,320344,31829599},{27708882839,[],20290},nil,nil,27708887320,1000,
<<"47f4490fe284680faaa83e44186c9037">>,[{'node.example.com',0}],0,1000,undefined},true,
{btree,<0.11973.5023>,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,
fun couch_bt_engine:id_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708861596,320344,31829599},fun couch_bt_engine:seq_tree_split/1,
fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708882839,[],20290},fun couch_bt_engine:local_tree_split/1,
fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_tree_split/1,
fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_seq_tree_split/1,
fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},
<0.13348.5022>,nil,0,<<"1684957391682422">>,{user_ctx,null,[],undefined},
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],
undefined,nil,nil,undefined,[{default_security_object,
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},
{user_ctx,{user_ctx,null,[<<"_admin">>],undefined}},{props,[]}],undefined},undefined,
[{43468,<<"48794d4718e1452be1b90573fdbf8bd8">>,<<"-L-3nWkJHMoh8BCsOK31">>,
[{2,<<58,60,181,83,213,161,108,4,33,6,123,250,199,83,148,163>>}]},{43467,<<"48794d4718e1452be1b90573fdbf853e">>,
... followed by a lot of binary data
[error] 2023-05-24T20:42:03.483993Z [email protected] emulator
-------- Error in process <0.23943.5023> on node '[email protected]' with exit value:
{{badkey,not_in_range},[{erlang,map_get,[not_in_range,#{[3221225472,3355443199] =>
{target,{db,1,<<"shards/c0000000-c7ffffff/some-db.1566089635">>,
"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
{couch_bt_engine,{st,"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
<0.11973.5023>,#Ref<0.1898309387.670826504.2396>,undefined,
{db_header,8,2343426,0,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
{27708861596,320344,31829599},{27708882839,[],20290},nil,nil,27708887320,1000,
<<"47f4490fe284680faaa83e44186c9037">>,[{'[email protected]',0}],0,1000,undefined},true,
{btree,<0.11973.5023>,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,
fun couch_bt_engine:id_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708861596,320344,31829599},fun couch_bt_engine:seq_tree_split/1,
fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708882839,[],20290},fun couch_bt_engine:local_tree_split/1,
fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_tree_split/1,
fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_seq_tree_split/1,
fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},
<0.13348.5022>,nil,0,<<"1684957391682422">>,{user_ctx,null,[],undefined},
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],
undefined,nil,nil,undefined,[{default_security_object,
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},
{user_ctx,{user_ctx,null,[<<"_admin">>],undefined}},{props,[]}],undefined},undefined,
[{43468,<<"48794d4718e1452be1b90573fdbf8bd8">>
... followed by a lot of binary data

The database's current shards are as follows:

{
  "shards": {
    "00000000-0fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "10000000-1fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "20000000-2fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "30000000-3fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "40000000-4fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "50000000-5fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "60000000-6fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "70000000-7fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "80000000-8fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "90000000-9fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "a0000000-afffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "b0000000-bfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "c0000000-cfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "d0000000-dfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "e0000000-efffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "f0000000-ffffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ]
  }
}

And the database info looks like this:

{
  "instance_start_time": "1566089635",
  "db_name": "some-db",
  "purge_seq": "1032970-<redacted>",
  "update_seq": "37489526-<redacted>",
  "sizes": {
    "file": 890080785312,
    "external": 5343640921800,
    "active": 887177503725
  },
  "props": {},
  "doc_del_count": 405,
  "doc_count": 10255536,
  "disk_format_version": 8,
  "compact_running": false,
  "cluster": {
    "q": 16,
    "n": 3,
    "w": 2,
    "r": 2
  }
}

Your Environment

  • CouchDB version used: 3.3.1
  • Operating system and version: Debian 11.7
@nickva
Contributor

nickva commented May 25, 2023

Thanks for the detailed report @jcoglan. Would you be able to share a few more logs, starting a bit before the first error and including a bit more after the ones shown, in case they contain any stack frames with function names and line numbers?

Do you see the string not in any target ranges or not_in_target_ranges anywhere in the logs? That would indicate that the initial split didn't go well, and that during the second one we're just seeing the effects of that.

In remsh, what does config:get("mem3"). return?

If possible, share the full _dbs document (curl $URL/_node/_local/_dbs/some_db); it should contain the shard ranges and also the change records from the first split.

@nickva
Contributor

nickva commented May 25, 2023

I wonder if this happened: after the first split (to Q=16), we ran the internal replicator to top off changes from the source to the targets. When doing so we also pushed purges to the new targets; however, the internal replicator didn't pick them by the target range's hash function, but copied them as-is. So when splitting S->T1|T2, both T1 and T2 might have gotten a purge_info with DocID="a", even though hashing would have placed it only on T1.

pull_purges_multi(#acc{source = Source} = Acc0) ->
    #acc{batch_size = Count, seq = UpdateSeq, targets = Targets0} = Acc0,
    with_src_db(Acc0, fun(Db) ->
        Targets = maps:map(
            fun(_, #tgt{} = T) ->
                pull_purges(Db, Count, Source, T)
            end,
            reset_remaining(Targets0)
        ),
        Remaining = maps:fold(
            fun(_, #tgt{remaining = R}, Sum) ->
                Sum + R
            end,
            0,
            Targets
        ),
        if
            Remaining == 0 ->
                Acc0#acc{targets = Targets};
            true ->
                PurgeSeq = couch_db:get_purge_seq(Db),
                OldestPurgeSeq = couch_db:get_oldest_purge_seq(Db),
                PurgesToPush = PurgeSeq - OldestPurgeSeq,
                Changes = couch_db:count_changes_since(Db, UpdateSeq),
                Pending = Remaining + PurgesToPush + Changes,
                throw({finished, Pending})
        end
    end).

On the next split (16 -> 32) we start by doing the initial copy, whose purge handling is in purge_cb:

purge_cb({_PSeq, _UUID, Id, _Revs} = PI, #state{targets = Targets} = State) ->
    {Key, Target} = pick_target(Id, State, Targets),
    MaxBuffer = State#state.max_buffer_size,
    Target1 = acc_and_flush(PI, Target, MaxBuffer, fun commit_purge_infos/1),
    {ok, State#state{targets = Targets#{Key => Target1}}}.
Here, when we get to the purge infos, we do the right thing and pick a target according to the hash function; however, we now find that some purge infos do not belong to any of the targets. So, if T2 -> T21|T22 is the split configuration and the purge info for DocID="a" didn't belong on T2 to start with, we'd get the exception shown in the issue report above.
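For intuition, placement is range-based: a doc id hashes to a value in the 32-bit key space and belongs only on the target whose range covers that value. Below is a minimal sketch of that check, assuming erlang:crc32/1 purely as a stand-in for the database's real placement hash (which lives in mem3 and may differ) and using the two halves of the c0000000-cfffffff source shard from the logs above; pick_target_sketch/2 is a hypothetical helper, not CouchDB code:

%% Illustrative only: erlang:crc32/1 stands in for the real mem3 placement hash.
%% Targets are {Begin, End} pairs covering parts of the 32-bit key space.
pick_target_sketch(DocId, Targets) ->
    Key = erlang:crc32(DocId),
    case [{B, E} || {B, E} <- Targets, Key >= B, Key =< E] of
        [Target] -> {ok, Target};
        [] -> not_in_range   %% a purge info copied verbatim can land here
    end.

%% Example with the two targets of the c0000000-cfffffff split:
%% pick_target_sketch(<<"a">>, [{16#c0000000, 16#c7ffffff}, {16#c8000000, 16#cfffffff}]).

A purge info copied verbatim from a different source range can hash outside both target halves, which matches the not_in_range key seen in the map_get crash above.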

To confirm this theory, the first step would be to double-check that the actual document IDs are on the shards they are supposed to be on; that is, we'd assert that the Q=16 document placement looks sane. Then, do the same for each purge info, and there we should find that some purge infos do not belong on those shards.

If this is the cause of the issue we saw above, then we could delete the extra purge info copies on the target shards, then set [mem3] replicate_purges = false to prevent the internal replicator from replicating purges. We should then make sure the purges have been processed before shard splitting starts.
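For reference, the replicate_purges workaround mentioned above could be applied per node from remsh with config:set/3, or by adding the equivalent stanza to the node's local.ini; treat this as a sketch of the workaround, not a verified recovery procedure:

%% In remsh on each node; equivalent to putting
%%   [mem3]
%%   replicate_purges = false
%% into the node's local.ini.
config:set("mem3", "replicate_purges", "false").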

nickva added a commit that referenced this issue May 27, 2023
Previously, the internal replicator (mem3_rep) replicated purge infos to/from all
the target shards. Instead, it should push/pull purge infos only to the
appropriate ranges, i.e. only where they belong according to the database's hash
function.

Users experienced this error as a failure in a database which contains purges and
was split twice in a row. For example, if a Q=8 database is split to Q=16, then
split again from Q=16 to Q=32, the second split operation might fail with a
`split_state:initial_copy ...{{badkey,not_in_range}` error. The misplaced purge
infos would be noticed only during the second split, when the initial copy phase
would crash because some purge infos do not hash to either of the two target
ranges. Moreover, the crash would lead to repeated retries, which generated a
huge job history log.

The fix consists of three improvements:

  1) The internal replicator is updated to filter purge infos based on the db
     hash function.

  2) Account for the fact that some users' dbs might already contain misplaced
     purge infos. Since it's a known bug, we anticipate that error and ignore
     misplaced purge infos during the second shard split operation, with a
     warning emitted in the logs.

  3) Make similar range errors fatal, and emit a clear error in the logs and
     job history so any future range errors are immediately obvious.

Fixes #4624
@nickva
Contributor

nickva commented May 27, 2023

This PR should fix the issue #4626
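As a rough illustration of improvement (1) in the commit message above (not the actual #4626 diff), the replicator can keep only the purge infos whose doc id hashes into the target's range before pushing or pulling them. keep_in_range/2 and in_range/2 are hypothetical helpers, and erlang:crc32/1 again stands in for the database's configured hash function:

%% Hypothetical sketch: drop purge infos that do not belong to the target range.
keep_in_range(PurgeInfos, Range) ->
    [PI || {_PSeq, _UUID, Id, _Revs} = PI <- PurgeInfos, in_range(Id, Range)].

in_range(DocId, {Begin, End}) ->
    Key = erlang:crc32(DocId),   %% stand-in for the db's configured hash
    Key >= Begin andalso Key =< End.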

jcoglan pushed a commit to neighbourhoodie/couchdb that referenced this issue May 30, 2023
nickva added a commit that referenced this issue May 30, 2023
nickva added a commit that referenced this issue Nov 22, 2023
nickva added a commit that referenced this issue Nov 22, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 24, 2023
nickva added a commit that referenced this issue Nov 27, 2023