"not_in_range" failure while resharding database #4624

Closed
jcoglan opened this issue May 25, 2023 · 3 comments · Fixed by #4626

Comments

@jcoglan
Contributor

jcoglan commented May 25, 2023

While attempting to shard-split a q=16 database on a 3-node cluster, we found that all reshard jobs failed, and GET /_reshard/jobs stopped responding to requests. The logs reveal a not_in_range failure in mem3_reshard_job.

Description

We are attempting to reshard a database from q=8 to q=32, using the following script: https://gist.github.com/jcoglan/ad2b631664bc436c48e4274718a0acd6. This worked for the first step, from q=8 to q=16, but failed on the second step, from q=16 to q=32.

/_reshard shows that all jobs failed:

{
  "state": "running",
  "state_reason": null,
  "completed": 0,
  "failed": 48,
  "running": 0,
  "stopped": 0,
  "total": 48
}

Also, /_reshard/jobs does not respond at all; the request hangs, with no activity visible in the logs.

We observed many messages like the following while the jobs were running:

[error] 2023-05-24T20:42:03.508280Z [email protected] <0.22158.5016>
-------- mem3_reshard_job worker error "#job{001-2ce218fc7c061f935757654236cbb4b25aceab927d7c433d377c72f0dee42bca
shards/c0000000-cfffffff/some-db.1566089635 /2 job_state:running split_state:initial_copy pid:<0.22158.5016>}"
{{badkey,not_in_range},[{erlang,map_get,[not_in_range,#{[3221225472,3355443199] =>
{target,{db,1,<<"shards/c0000000-c7ffffff/some-db.1566089635">>,
"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
{couch_bt_engine,{st,"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
<0.11973.5023>,#Ref<0.1898309387.670826504.2396>,undefined,
{db_header,8,2343426,0,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
{27708861596,320344,31829599},{27708882839,[],20290},nil,nil,27708887320,1000,
<<"47f4490fe284680faaa83e44186c9037">>,[{'node.example.com',0}],0,1000,undefined},true,
{btree,<0.11973.5023>,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,
fun couch_bt_engine:id_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708861596,320344,31829599},fun couch_bt_engine:seq_tree_split/1,
fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708882839,[],20290},fun couch_bt_engine:local_tree_split/1,
fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_tree_split/1,
fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_seq_tree_split/1,
fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},
<0.13348.5022>,nil,0,<<"1684957391682422">>,{user_ctx,null,[],undefined},
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],
undefined,nil,nil,undefined,[{default_security_object,
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},
{user_ctx,{user_ctx,null,[<<"_admin">>],undefined}},{props,[]}],undefined},undefined,
[{43468,<<"48794d4718e1452be1b90573fdbf8bd8">>,<<"-L-3nWkJHMoh8BCsOK31">>,
[{2,<<58,60,181,83,213,161,108,4,33,6,123,250,199,83,148,163>>}]},{43467,<<"48794d4718e1452be1b90573fdbf853e">>,
... followed by a lot of binary data
[error] 2023-05-24T20:42:03.483993Z [email protected] emulator
-------- Error in process <0.23943.5023> on node '[email protected]' with exit value:
{{badkey,not_in_range},[{erlang,map_get,[not_in_range,#{[3221225472,3355443199] =>
{target,{db,1,<<"shards/c0000000-c7ffffff/some-db.1566089635">>,
"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
{couch_bt_engine,{st,"/mnt/couchdb/data/shards/c0000000-c7ffffff/some-db.1566089635.couch",
<0.11973.5023>,#Ref<0.1898309387.670826504.2396>,undefined,
{db_header,8,2343426,0,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
{27708861596,320344,31829599},{27708882839,[],20290},nil,nil,27708887320,1000,
<<"47f4490fe284680faaa83e44186c9037">>,[{'[email protected]',0}],0,1000,undefined},true,
{btree,<0.11973.5023>,{27701359082,{320334,10,{size_info,27636410613,166933365080}},34564353},
fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,
fun couch_bt_engine:id_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708861596,320344,31829599},fun couch_bt_engine:seq_tree_split/1,
fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},
{btree,<0.11973.5023>,{27708882839,[],20290},fun couch_bt_engine:local_tree_split/1,
fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_tree_split/1,
fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},
{btree,<0.11973.5023>,nil,fun couch_bt_engine:purge_seq_tree_split/1,
fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},
<0.13348.5022>,nil,0,<<"1684957391682422">>,{user_ctx,null,[],undefined},
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],
undefined,nil,nil,undefined,[{default_security_object,
[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},
{user_ctx,{user_ctx,null,[<<"_admin">>],undefined}},{props,[]}],undefined},undefined,
[{43468,<<"48794d4718e1452be1b90573fdbf8bd8">>
... followed by a lot of binary data

The database's current shards are as follows:

{
  "shards": {
    "00000000-0fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "10000000-1fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "20000000-2fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "30000000-3fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "40000000-4fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "50000000-5fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "60000000-6fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "70000000-7fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "80000000-8fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "90000000-9fffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "a0000000-afffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "b0000000-bfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "c0000000-cfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "d0000000-dfffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "e0000000-efffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ],
    "f0000000-ffffffff": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ]
  }
}

And the database info looks like this:

{
  "instance_start_time": "1566089635",
  "db_name": "some-db",
  "purge_seq": "1032970-<redacted>",
  "update_seq": "37489526-<redacted>",
  "sizes": {
    "file": 890080785312,
    "external": 5343640921800,
    "active": 887177503725
  },
  "props": {},
  "doc_del_count": 405,
  "doc_count": 10255536,
  "disk_format_version": 8,
  "compact_running": false,
  "cluster": {
    "q": 16,
    "n": 3,
    "w": 2,
    "r": 2
  }
}

Your Environment

  • CouchDB version used: 3.3.1
  • Operating system and version: Debian 11.7
@nickva
Contributor

nickva commented May 25, 2023

Thanks for the detailed report @jcoglan. Would you be able to share a few more logs, starting a bit before the first error and including a bit more after the ones shown, in case they contain any stack frames with function names and line numbers?

Do you see the string not in any target ranges or not_in_target_ranges anywhere in the logs? That would indicate that the initial split didn't go well, and that during the second one we're just seeing the effects of that.

In remsh, what does config:get("mem3"). return?

If possible, share the full _dbs document (curl $URL/_node/_local/_dbs/some_db); it should contain the shard ranges and also the change records from the first split.

@nickva
Contributor

nickva commented May 25, 2023

I wonder if this happened: after the first split (to Q=16), we ran the internal replicator to top off changes from the source to the targets. When doing so we also pushed purges to the new targets; however, the internal replicator didn't pick them by the target range's hash function, but copied them as-is. So when splitting S->T1|T2, both T1 and T2 might have gotten a purge_info with DocID="a", even though hashing would have placed it only on T1.

pull_purges_multi(#acc{source = Source} = Acc0) ->
    #acc{batch_size = Count, seq = UpdateSeq, targets = Targets0} = Acc0,
    with_src_db(Acc0, fun(Db) ->
        Targets = maps:map(
            fun(_, #tgt{} = T) ->
                pull_purges(Db, Count, Source, T)
            end,
            reset_remaining(Targets0)
        ),
        Remaining = maps:fold(
            fun(_, #tgt{remaining = R}, Sum) ->
                Sum + R
            end,
            0,
            Targets
        ),
        if
            Remaining == 0 ->
                Acc0#acc{targets = Targets};
            true ->
                PurgeSeq = couch_db:get_purge_seq(Db),
                OldestPurgeSeq = couch_db:get_oldest_purge_seq(Db),
                PurgesToPush = PurgeSeq - OldestPurgeSeq,
                Changes = couch_db:count_changes_since(Db, UpdateSeq),
                Pending = Remaining + PurgesToPush + Changes,
                throw({finished, Pending})
        end
    end).

On the next split (16 -> 32) we start by doing the initial copy, whose purge handling is in purge_cb:

purge_cb({_PSeq, _UUID, Id, _Revs} = PI, #state{targets = Targets} = State) ->
    {Key, Target} = pick_target(Id, State, Targets),
    MaxBuffer = State#state.max_buffer_size,
    Target1 = acc_and_flush(PI, Target, MaxBuffer, fun commit_purge_infos/1),
    {ok, State#state{targets = Targets#{Key => Target1}}}.
Here, when we get to the purge infos, we do the right thing and pick a target according to the hash function; however, we now find that some purge infos do not belong to any of the targets. So, if T2 -> T21|T22 is the split configuration and the purge info for DocID="a" didn't belong on T2 to start with, we'd get the exception shown in the issue report above.
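For intuition, placement is range-based: a doc id hashes to a value in the 32-bit key space and belongs only on the target whose range covers that value. Below is a minimal sketch of that check, assuming erlang:crc32/1 purely as a stand-in for the database's real placement hash (which lives in mem3 and may differ) and using the two halves of the c0000000-cfffffff source shard from the logs above; pick_target_sketch/2 is a hypothetical helper, not CouchDB code:

%% Illustrative only: erlang:crc32/1 stands in for the real mem3 placement hash.
%% Targets are {Begin, End} pairs covering parts of the 32-bit key space.
pick_target_sketch(DocId, Targets) ->
    Key = erlang:crc32(DocId),
    case [{B, E} || {B, E} <- Targets, Key >= B, Key =< E] of
        [Target] -> {ok, Target};
        [] -> not_in_range   %% a purge info copied verbatim can land here
    end.

%% Example with the two targets of the c0000000-cfffffff split:
%% pick_target_sketch(<<"a">>, [{16#c0000000, 16#c7ffffff}, {16#c8000000, 16#cfffffff}]).

A purge info copied verbatim from a different source range can hash outside both target halves, which matches the not_in_range key seen in the map_get crash above.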

To confirm this theory, the first step would be to double-check that the actual document IDs are on the shards they are supposed to be on; that is, we'd assert that the Q=16 document placement looks sane. Then, do the same for each purge info, and there we should find that some purge infos do not belong on those shards.

If this is the cause of the issue we saw above, then we could delete the extra purge info copies on the target shards, then set [mem3] replicate_purges = false to prevent the internal replicator from replicating purges. We should then make sure the purges have been processed before shard splitting starts.
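For reference, the replicate_purges workaround mentioned above could be applied per node from remsh with config:set/3, or by adding the equivalent stanza to the node's local.ini; treat this as a sketch of the workaround, not a verified recovery procedure:

%% In remsh on each node; equivalent to putting
%%   [mem3]
%%   replicate_purges = false
%% into the node's local.ini.
config:set("mem3", "replicate_purges", "false").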

nickva added a commit that referenced this issue May 27, 2023
Previously, the internal replicator (mem3_rep) replicated purge infos to/from all
the target shards. Instead, it should push/pull purge infos only to the
appropriate ranges, i.e. only where they belong according to the database's hash
function.

Users experienced this error as a failure in a database which contains purges and
was split twice in a row. For example, if a Q=8 database is split to Q=16, then
split again from Q=16 to Q=32, the second split operation might fail with a
`split_state:initial_copy ...{{badkey,not_in_range}` error. The misplaced purge
infos would be noticed only during the second split, when the initial copy phase
would crash because some purge infos do not hash to either of the two target
ranges. Moreover, the crash would lead to repeated retries, which generated a
huge job history log.

The fix consists of three improvements:

  1) The internal replicator is updated to filter purge infos based on the db
     hash function.

  2) Account for the fact that some users' dbs might already contain misplaced
     purge infos. Since it's a known bug, we anticipate that error and ignore
     misplaced purge infos during the second shard split operation, with a
     warning emitted in the logs.

  3) Make similar range errors fatal, and emit a clear error in the logs and
     job history so any future range errors are immediately obvious.

Fixes #4624
@nickva
Contributor

nickva commented May 27, 2023

This PR should fix the issue #4626
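As a rough illustration of improvement (1) in the commit message above (not the actual #4626 diff), the replicator can keep only the purge infos whose doc id hashes into the target's range before pushing or pulling them. keep_in_range/2 and in_range/2 are hypothetical helpers, and erlang:crc32/1 again stands in for the database's configured hash function:

%% Hypothetical sketch: drop purge infos that do not belong to the target range.
keep_in_range(PurgeInfos, Range) ->
    [PI || {_PSeq, _UUID, Id, _Revs} = PI <- PurgeInfos, in_range(Id, Range)].

in_range(DocId, {Begin, End}) ->
    Key = erlang:crc32(DocId),   %% stand-in for the db's configured hash
    Key >= Begin andalso Key =< End.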

jcoglan pushed a commit to neighbourhoodie/couchdb that referenced this issue May 30, 2023
nickva added a commit that referenced this issue May 30, 2023
nickva added a commit that referenced this issue Nov 22, 2023
nickva added a commit that referenced this issue Nov 22, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 23, 2023
nickva added a commit that referenced this issue Nov 24, 2023
nickva added a commit that referenced this issue Nov 27, 2023