Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_changes?since=X results are incorrectly ordered after two rounds of shard splitting #4640

Open
jcoglan opened this issue Jun 13, 2023 · 4 comments

Comments

@jcoglan
Copy link
Contributor

jcoglan commented Jun 13, 2023

Description

We have observed that if we capture a seq value from a _changes response, and then perform two rounds of shard splitting on a database, requesting _changes?since=X with X set to the captured seq value returns results out of order; events from the same shard are not presented in ascending order. This might affect the behaviour of replicators/indexers relying on seq values for checkpointing.

Steps to Reproduce

The following script reproduces the behaviour on my machine:

#!/bin/bash

CDB_AUTH='admin:admin'
CDB_HOST='127.0.0.1'
CDB_PORT=5984

cdb () {
  curl -gs -H 'Content-Type: application/json' "$(cdb-host)$1" "${@:2}"
}

cdb-host () {
  echo "http:https://${CDB_AUTH}@${CDB_HOST}:${CDB_PORT}"
}

stop-jobs () {
  cdb '/_reshard/jobs' | jq -r '.jobs[] | .id' | while read -r id ; do
    cdb "/_reshard/jobs/$id" -X DELETE
  done
}

split-shards () {
  local db="$1"

  cdb "/$db/_shards" | jq -r '.shards | keys[]' | while read -r range ; do
    cdb '/_reshard/jobs' -X POST \
        -d '{ "type": "split", "db": "'$db'", "range": "'$range'" }'
  done
}

wait-for-q () {
  local db="$1"
  local q="$2"

  while [[ $(cdb "/$db" | jq -r '.cluster.q') != $q ]] ; do
    sleep 1
  done
}

db='test-db'

cdb "/$db" -X DELETE
cdb "/$db?q=2" -X PUT

for n in {1..10} ; do
  cdb "/$db/doc-$n" -X PUT -d '{ "n": '$n' }'
done

last_seq_q2="$(cdb "/$db/_changes" | jq -r '.last_seq')"

stop-jobs
split-shards "$db"
wait-for-q "$db" 4

last_seq_q4="$(cdb "/$db/_changes" | jq -r '.last_seq')"

cdb "/$db/_changes?since=$last_seq_q2" | jq

stop-jobs
split-shards "$db"
wait-for-q "$db" 8

last_seq_q8="$(cdb "/$db/_changes" | jq -r '.last_seq')"

cdb "/$db/_changes?since=$last_seq_q4" | jq
cdb "/$db/_changes?since=$last_seq_q2" | jq

The first two _changes requests return an empty results array as expected, showing that a seq value continues to work after one round of splitting. The final one, where a seq captured while q=2 is used when q=8 returns the entire history, with events out of order.

For example, on one run of this script the _changes response contained:

    {
      "seq": "30-g1AAAAKVeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPcykDKYUBtbigpzMErCZpslJZmbmBqSa6QAyMx5qJgPYJAPLRAsjcwtSTUoAmVRPVdflsQBJhgYgBTR2PshcNjRzzUxTTSzMU8gydwHE3P0InxsZJ5obmpuSZdoBiGn3qe3KBxBz_5NobhYAREXOrQ",
      "id": "doc-7",
      "changes": [
        {
          "rev": "1-f996a2834eb8c362514d171f99324abd"
        }
      ]
    },
    {
      "seq": "23-g1AAAAKBeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPdQkBohJQGMMTA1INckBZFI8ikkGlokWRuYWpJqUADKpHmoSI9ikREtTU_OkJBJNymMBkgwNQApo2HyQaWwZTCkMrMUFOZklYHPNTFNNLMxTyDJ3AcTc_Qj_Ghknmhuam5Jl2gGIafep7coHEHP_k2huFgBjZMkL",
      "id": "doc-2",
      "changes": [
        {
          "rev": "1-43a36d04d31d38efc5b9245c53e00627"
        }
      ]
    }

The descending value of the numeric part of the seq value looked curious to me so I ran these values through a parser tool and obtained:

  original: '30-g1AAAAKVeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPcykDKYUBtbigpzMErCZpslJZmbmBqSa6QAyMx5qJgPYJAPLRAsjcwtSTUoAmVRPVdflsQBJhgYgBTR2PshcNjRzzUxTTSzMU8gydwHE3P0InxsZJ5obmpuSZdoBiGn3qe3KBxBz_5NobhYAREXOrQ',
  parsed: [
    [ 'couchdb@localhost-00000000-1fffffff', 4 ],
    [ 'couchdb@localhost-20000000-3fffffff', 4 ],
    [ 'couchdb@localhost-40000000-5fffffff', 0 ],
    [ 'couchdb@localhost-60000000-7fffffff', 4 ],
    [ 'couchdb@localhost-80000000-9fffffff', 6 ],
    [ 'couchdb@localhost-a0000000-bfffffff', 0 ],
    [ 'couchdb@localhost-c0000000-dfffffff', 6 ],
    [ 'couchdb@localhost-e0000000-ffffffff', 6 ]
  ]

  original: '23-g1AAAAKBeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPdQkBohJQGMMTA1INckBZFI8ikkGlokWRuYWpJqUADKpHmoSI9ikREtTU_OkJBJNymMBkgwNQApo2HyQaWwZTCkMrMUFOZklYHPNTFNNLMxTyDJ3AcTc_Qj_Ghknmhuam5Jl2gGIafep7coHEHP_k2huFgBjZMkL',
  parsed: [
    [ 'couchdb@localhost-00000000-1fffffff', 4 ],
    [ 'couchdb@localhost-20000000-3fffffff', 0 ],
    [ 'couchdb@localhost-40000000-5fffffff', 0 ],
    [ 'couchdb@localhost-60000000-7fffffff', 1 ],
    [ 'couchdb@localhost-80000000-9fffffff', 6 ],
    [ 'couchdb@localhost-a0000000-bfffffff', 0 ],
    [ 'couchdb@localhost-c0000000-dfffffff', 6 ],
    [ 'couchdb@localhost-e0000000-ffffffff', 6 ]
  ]

The second event has a lower sequence number for shards 20-3f and 60-7f, showing events from some shards are being returned out of order.

Using the seq values in this response also returned events out of order, for example requesting /test-db/_changes?since=30-g1AAAAKVeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPcykDKYUBtbigpzMErCZpslJZmbmBqSa6QAyMx5qJgPYJAPLRAsjcwtSTUoAmVRPVdflsQBJhgYgBTR2PshcNjRzzUxTTSzMU8gydwHE3P0InxsZJ5obmpuSZdoBiGn3qe3KBxBz_5NobhYAREXOrQ returns some out of order events:

    {
      "seq": "27-g1AAAAKBeJyd0UsOgjAQANAG8LP1BHoEQNrCSm6inYKppIoJuNab6E30JnqTWiiGxBiTsplJpunLfCRCaCrcDM14eeIig1SWnElRVrXUTw5DMFdKFcJl3l4XJjnwaJsnvz78YWChI6w-knAyNKqOcle3JuZACPVtzbQx152JWslPWBzS2FbaNNK5k9xWYgnGFMBSOng6ootOGrv2WyMEAp_Z9mW0m9Hu_ZThktGA4kHaw2jPRht_3YHgPIppNsh9GVdZusUb18nIzw",
      "id": "doc-8",
      "changes": [
        {
          "rev": "1-817027c9296c2d2d688c568237e89624"
        }
      ]
    },
    {
      "seq": "27-g1AAAAJ3eJyd0VEOgjAMANAFEP31BHoEQLbBl9xE1w0zyRQT8FtvojfRm-hNcAwMiSEm8NMma_rSdgohNJO2QHOen7kUkKicMyXzolS6ZDEEi6qqMmkz56AfpinwcJfGfQ1_GFjqCOuvJC2BJsVJ7UtjYg6EUG-omdTmpjWRkbyYRQGNhkrbWrq0km0kFmNMAQZKR0dHdNVJY7fuaoSA77GhczXavdEe3ZbBilGf4lHas9FeteYajfs7xkkwSns3mrmb-_OrBKdhREWfkH0AQvjGHw",
      "id": "doc-10",
      "changes": [
        {
          "rev": "1-6829757f6498d33021e0a764f0a492af"
        }
      ]
    },
    {
      "seq": "22-g1AAAAJteJyd0F0OgjAMAOAJ-PPqCfQIMGSDJ7mJrgMyyRQT8FlvojfRm-hNcAwML8RkvLRJm35pKxFCC2EnaMmLCxcJxLLgTIqirKRqWQzBqq7rXNjMOarCPAW-ydJoaOAPA2sVYfuThJWgaXmWh0qbAQdCqGtqxo2560ykJTdiIaahqbRvpGsn2VpiURBQAEPp5KiIbiop7N5_jRDwXGa6V6s9Wu3ZX4l9Rj0ajNJerfZutJnWuJcxTvAo7dNq-m8TrYU-9lI8eGn-BZDCwuE",
      "id": "doc-1",
      "changes": [
        {
          "rev": "1-731bef401491606a3b246ed178e697c1"
        }
      ]
    },

Expected Behaviour

Results in the _changes response should be ordered such that events from the same shard are presented in ascending order. Ideally, the since value would continue to limit the results returned, instead of returning the entire history.

Your Environment

  • CouchDB version used: 3.3.2
  • Browser name and version: n/a
  • Operating system and version: macOS Ventura 13.3.1

CouchDB info:

{
  "couchdb": "Welcome",
  "version": "3.3.2",
  "git_sha": "11a234070",
  "uuid": "0b5ad401f2d593405eaeb94cc1210f9b",
  "features": [
    "access-ready",
    "partitioned",
    "pluggable-storage-engines",
    "reshard",
    "scheduler"
  ],
  "vendor": {
    "name": "The Apache Software Foundation"
  }
}
@rnewson
Copy link
Member

rnewson commented Jun 13, 2023

decoded the last three;

[{couchdb@localhost,[0,536870911],                                                                                                                                                                                                         {4,<<"ebc4fe9">>,couchdb@localhost}},                                                                                                                                                               {couchdb@localhost,[536870912,1073741823],
                     {4,{split,<<"5cb6670">>},couchdb@localhost}},
  {couchdb@localhost,[1073741824,1610612735],
                     {0,<<"09a8278">>,couchdb@localhost}},
  {couchdb@localhost,[1610612736,2147483647],
                     {3,<<"a9557bb">>,couchdb@localhost}},
  {couchdb@localhost,[2147483648,2684354559],
                     {4,<<"66b10a8">>,couchdb@localhost}},
  {couchdb@localhost,[2684354560,3221225471],
                     {0,<<"23a7175">>,couchdb@localhost}},
  {couchdb@localhost,[3221225472,3758096383],
                     {6,{split,<<"65e487d">>},couchdb@localhost}},
  {couchdb@localhost,[3758096384,4294967295],
                     {6,{split,<<"65e487d">>},couchdb@localhost}}],


{couchdb@localhost,[0,536870911],
                     {4,<<"ebc4fe9">>,couchdb@localhost}},
  {couchdb@localhost,[536870912,1073741823],
                     {4,{split,<<"5cb6670">>},couchdb@localhost}},
  {couchdb@localhost,[1073741824,1610612735],
                     {0,<<"09a8278">>,couchdb@localhost}},
  {couchdb@localhost,[1610612736,2147483647],
                     {3,<<"a9557bb">>,couchdb@localhost}},
  {couchdb@localhost,[2147483648,2684354559],
                     {4,<<"66b10a8">>,couchdb@localhost}},
  {couchdb@localhost,[2684354560,3221225471],
                     {0,<<"23a7175">>,couchdb@localhost}},
  {couchdb@localhost,[3221225472,3758096383],
                     {6,<<"c1fac62">>,couchdb@localhost}},
  {couchdb@localhost,[3758096384,4294967295],
                     {6,{split,<<"65e487d">>},couchdb@localhost}}],


{couchdb@localhost,[0,536870911],
                     {4,<<"ebc4fe9">>,couchdb@localhost}},
  {couchdb@localhost,[536870912,1073741823],
                     {4,{split,<<"5cb6670">>},couchdb@localhost}},
  {couchdb@localhost,[1073741824,1610612735],
                     {0,<<"09a8278">>,couchdb@localhost}},
  {couchdb@localhost,[1610612736,2147483647],
                     {3,<<"a9557bb">>,couchdb@localhost}},
  {couchdb@localhost,[2147483648,2684354559],
                     {4,<<"66b10a8">>,couchdb@localhost}},
  {couchdb@localhost,[2684354560,3221225471],
                     {0,<<"23a7175">>,couchdb@localhost}},
  {couchdb@localhost,[3221225472,3758096383],
                     {6,<<"c1fac62">>,couchdb@localhost}},
  {couchdb@localhost,[3758096384,4294967295],
                     {1,<<"8321e28">>,couchdb@localhost}}]

@nickva
Copy link
Contributor

nickva commented Jun 15, 2023

Thanks for creating the issue @jcoglan !

to the captured seq value returns results out of order

How did we determine they are out of order? Is that based on the N-... first part of the sequence number?

The descending value of the numeric part of the seq value looked curious to me so I ran these values through a parser tool and obtained:

original: '30-g1AAAAKVeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPcykDKYUBtbigpzMErCZpslJZmbmBqSa6QAyMx5qJgPYJAPLRAsjcwtSTUoAmVRPVdflsQBJhgYgBTR2PshcNjRzzUxTTSzMU8gydwHE3P0InxsZJ5obmpuSZdoBiGn3qe3KBxBz_5NobhYAREXOrQ',
  parsed: [
    [ 'couchdb@localhost-00000000-1fffffff', 4 ],
    [ 'couchdb@localhost-20000000-3fffffff', 4 ],
    [ 'couchdb@localhost-40000000-5fffffff', 0 ],
    [ 'couchdb@localhost-60000000-7fffffff', 4 ],
    [ 'couchdb@localhost-80000000-9fffffff', 6 ],
    [ 'couchdb@localhost-a0000000-bfffffff', 0 ],
    [ 'couchdb@localhost-c0000000-dfffffff', 6 ],
    [ 'couchdb@localhost-e0000000-ffffffff', 6 ]
  ]
  original: '23-g1AAAAKBeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPdQkBohJQGMMTA1INckBZFI8ikkGlokWRuYWpJqUADKpHmoSI9ikREtTU_OkJBJNymMBkgwNQApo2HyQaWwZTCkMrMUFOZklYHPNTFNNLMxTyDJ3AcTc_Qj_Ghknmhuam5Jl2gGIafep7coHEHP_k2huFgBjZMkL',
  parsed: [
    [ 'couchdb@localhost-00000000-1fffffff', 4 ],
    [ 'couchdb@localhost-20000000-3fffffff', 0 ],
    [ 'couchdb@localhost-40000000-5fffffff', 0 ],
    [ 'couchdb@localhost-60000000-7fffffff', 1 ],
    [ 'couchdb@localhost-80000000-9fffffff', 6 ],
    [ 'couchdb@localhost-a0000000-bfffffff', 0 ],
    [ 'couchdb@localhost-c0000000-dfffffff', 6 ],
    [ 'couchdb@localhost-e0000000-ffffffff', 6 ]
  ]

I think the parser tool is ignoring an important split marker and just shows the sequence number. Let's see what it's hiding:

fabric_view_changes:decode_seq(<<"30-g1AAAAKVeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rgzmRJRcowJ6alGySlmqJTQMeY5IUgGSSPcykDKYUBtbigpzMErCZpslJZmbmBqSa6QAyMx5qJgPYJAPLRAsjcwtSTUoAmVRPVdflsQBJhgYgBTR2PshcNjRzzUxTTSzMU8gydwHE3P0InxsZJ5obmpuSZdoBiGn3qe3KBxBz_5NobhYAREXOrQ">>).
[{couchdb@localhost,[0,536870911],
                    {4,<<"ebc4fe9">>,couchdb@localhost}},
 {couchdb@localhost,[536870912,1073741823],
                    {4,{split,<<"5cb6670">>},couchdb@localhost}},
 {couchdb@localhost,[1073741824,1610612735],
                    {0,<<"09a8278">>,couchdb@localhost}},
 {couchdb@localhost,[1610612736,2147483647],
                    {4,{split,<<"5cb6670">>},couchdb@localhost}},
 {couchdb@localhost,[2147483648,2684354559],
                    {6,{split,<<"65e487d">>},couchdb@localhost}},
 {couchdb@localhost,[2684354560,3221225471],
                    {0,<<"23a7175">>,couchdb@localhost}},
 {couchdb@localhost,[3221225472,3758096383],
                    {6,{split,<<"65e487d">>},couchdb@localhost}},
 {couchdb@localhost,[3758096384,4294967295],
                    {6,{split,<<"65e487d">>},couchdb@localhost}}]

We notice how {4,{split,<<"5cb6670">>},couchdb@localhost}} looks a bit different than {0,<<"09a8278">>,couchdb@localhost}}. Specifically the {split,<<"5cb6670">>} part. That part shows a split "marker", that it looks that particular range was likely the result of a split from the original sequence with (Q=2) with original since seq node uuid for the now missing (split) range. The 4 part shows the original sequence from the Q=2 sequence.

In general looking at the N-.... is not too helpful. To make the sequences look "nicer" we could avoid adding the sequence to the total sum when computing N if the sequence is a split marker. That would make the N- value increment "nicer". But that value is effectively meaningless and it is thrown away when we parse the sequence.

It would still be a nice minor ergonomic fix but so far nothing here violates correctness, we're not throwing changes away or returning them out of order.

@nickva
Copy link
Contributor

nickva commented Jun 15, 2023

The first two _changes requests return an empty results array as expected, showing that a seq value continues to work after one round of splitting.

Ah, that's good to know that it works after one round of splitting.

The final one, where a seq captured while q=2 is used when q=8 returns the entire history, with events out of order.

So rewinding after two rounds of splitting is a more interesting issue. We may want make two separate issues: 1) make the N-... look a prettier after a split 2) "try to avoid rewinds after two rounds of splitting". The second one seems more pressing and harder solve.

nickva added a commit that referenced this issue Jun 16, 2023
 When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be full
rewind for that ramge, and use 0 for that range. This is a purely cosmetic
thing, when we decode the sequence that prefix gets thrown away anyway.

Fixes a part of issue: #4640
@nickva
Copy link
Contributor

nickva commented Jun 16, 2023

This should fix the apparent incorrect order #4643

nickva added a commit that referenced this issue Jun 16, 2023
When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be a full
rewind for that range, and use 0 instead. This is a purely cosmetic thing, when
we decode the sequence that prefix gets thrown away anyway.

To emphasize that the sequence prefix is mostly a visual aid, rename the seq/1
function to fake_packed_seq/1 with a comment about what it does and a note
about the special split case.

Fixes a part of issue: #4640
nickva added a commit that referenced this issue Jun 16, 2023
When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be a full
rewind for that range, and use 0 instead. This is a purely cosmetic thing, when
we decode the sequence that prefix gets thrown away anyway.

To emphasize that the sequence prefix is mostly a visual aid, rename the seq/1
function to fake_packed_seq/1 with a comment about what it does and a note
about the special split case.

Fixes a part of issue: #4640
nickva added a commit that referenced this issue Jun 16, 2023
When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be a full
rewind for that range, and use 0 instead. This is a purely cosmetic thing, when
we decode the sequence that prefix gets thrown away anyway.

To emphasize that the sequence prefix is mostly a visual aid, rename the seq/1
function to fake_packed_seq/1 with a comment about what it does and a note
about the special split case.

Fixes a part of issue: #4640
nickva added a commit that referenced this issue Jun 16, 2023
When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be a full
rewind for that range, and use 0 instead. This is a purely cosmetic thing, when
we decode the sequence that prefix gets thrown away anyway.

To emphasize that the sequence prefix is mostly a visual aid, rename the seq/1
function to fake_packed_seq/1 with a comment about what it does and a note
about the special split case.

Fixes a part of issue: #4640
nickva added a commit that referenced this issue Nov 22, 2023
When we get sequence before the split, we'll fill in the missing (now split)
ranges with a special `{split, OldNodeUUid}` marker. However, when sequences
are emitted in the changes API, that will make the N- prefix (SeqSum) bounce
around from higher to lower numbers, while users expect those to be mostly
incrementing. So take a conservative approach and assume it will be a full
rewind for that range, and use 0 instead. This is a purely cosmetic thing, when
we decode the sequence that prefix gets thrown away anyway.

To emphasize that the sequence prefix is mostly a visual aid, rename the seq/1
function to fake_packed_seq/1 with a comment about what it does and a note
about the special split case.

Fixes a part of issue: #4640
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants