Custom replication DBs never get /_scheduler/docs entries #506

wohali · 2017-05-03T22:00:35Z

Steps to recreate

Create database a and put a few documents in it. Ensure a database b does not exist.
Create a _replicator document of the form:

{ "_id": "foo_error_rep", "source": "https://127.0.0.1:15984/a", "target": "https://127.0.0.1:15984/b" }

Wait a bit and check _replicator/foo_error_rep. No state has been added. I would have expected one of crashing, running or pending.
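
The steps above can be sketched as a small script. This is a minimal sketch rather than the test harness used in the report: the helper names (`replication_doc`, `put_json`, `reproduce`) are hypothetical, and it assumes the node accepts unauthenticated requests and presents a certificate the client trusts.

```python
import json
import urllib.request

BASE_URL = "https://127.0.0.1:15984"  # node address taken from the report


def replication_doc(doc_id, source_db, target_db, base_url=BASE_URL):
    """Build the _replicator document body used in the report."""
    return {
        "_id": doc_id,
        "source": f"{base_url}/{source_db}",
        "target": f"{base_url}/{target_db}",
    }


def put_json(url, body):
    """PUT a JSON body and return the decoded JSON response (needs a live node)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def reproduce(base_url=BASE_URL):
    """Store the replication doc; database a must exist and b must not."""
    doc = replication_doc("foo_error_rep", "a", "b", base_url)
    return put_json(f"{base_url}/_replicator/{doc['_id']}", doc)
```

Only `reproduce()` talks to the server; `replication_doc()` can be checked without a running node.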

Logfile excerpt

Note the throw of db_not_found.

[notice] 2017-05-03T21:49:42.495501Z [email protected] <0.309.0> 86f37313e7 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_ikklhwd%2F_replicator/foo_error_rep 200 ok 50
[notice] 2017-05-03T21:49:42.497481Z [email protected] <0.354.0> -------- starting new replication `b1137c827da5adb4376166374dcf79eb` at <0.1144.0> (`https://127.0.0.1:15984/test_suite_db_mxbhpygg/` -> `https://127.0.0.1:15984/nonexistent_test_db/`)
[notice] 2017-05-03T21:49:42.497981Z [email protected] <0.355.0> -------- couch_replicator_scheduler: Job {"b1137c827da5adb4376166374dcf79eb",[]} started as <0.1144.0>
[notice] 2017-05-03T21:49:42.547996Z [email protected] <0.309.0> 5ae0234b3c 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_ikklhwd%2F_replicator/foo_error_rep 200 ok 2
[notice] 2017-05-03T21:49:42.656934Z [email protected] <0.961.0> 7dd6682461 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_mxbhpygg/ 200 ok 147
[notice] 2017-05-03T21:49:42.657324Z [email protected] <0.309.0> 94c75a3e9a 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_ikklhwd%2F_replicator/foo_error_rep 200 ok 58
[notice] 2017-05-03T21:49:42.659778Z [email protected] <0.961.0> 286570172d 127.0.0.1:15984 127.0.0.1 undefined GET /nonexistent_test_db/ 404 ok 1
[error] 2017-05-03T21:49:42.660836Z [email protected] <0.1144.0> -------- throw:{db_not_found,<<"could not open https://127.0.0.1:15984/nonexistent_test_db/">>}: Replication failed to start for args {rep,{"b1137c827da5adb4376166374dcf79eb",[]},{httpdb,"https://127.0.0.1:15984/test_suite_db_mxbhpygg/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},{httpdb,"https://127.0.0.1:15984/nonexistent_test_db/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},[{checkpoint_interval,30000},{connection_timeout,30000},{http_connections,20},{retries,10},{socket_options,[{keepalive,true},{nodelay,false}]},{use_checkpoints,true},{worker_batch_size,500},{worker_processes,4}],{user_ctx,null,[],undefined},db,nil,<<"foo_error_rep">>,<<"shards/00000000-1fffffff/test_suite_db_ikklhwd/_replicator.1493848173">>,{1493,848182,496186}}: [{couch_replicator_api_wrap,db_open,3,[{file,"src/couch_replicator_api_wrap.erl"},{line,109}]},{couch_replicator_scheduler_job,init_state,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,568}]},{couch_replicator_scheduler_job,do_init,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,127}]},{couch_replicator_scheduler_job,handle_info,2,[{file,"src/couch_replicator_scheduler_job.erl"},{line,357}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,599}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]
[notice] 2017-05-03T21:49:42.709448Z [email protected] <0.309.0> 5a9498eacf 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_ikklhwd%2F_replicator/foo_error_rep 200 ok 2

/cc @nickva I think this is a new bug related to the scheduling replicator work.

nickva · 2017-05-03T22:15:09Z

@wohali

Thanks for taking a look!

By default, the scheduling replicator does not update replication documents with transient states such as "triggered", "crashing", or "error". This is one of the improvements, and it is what allows it to run more replication jobs (since they don't write to disk as much).

To observe the state of the replications created by that document, try one of these things (or both):

* Query the https://127.0.0.1:15984/_scheduler/docs endpoint. You'd get all the replication documents there. Or query the specific document: https://127.0.0.1:15984/_scheduler/docs/foo_error_rep
* Enable compatibility mode with [replicator] update_docs = true. The replicator will then behave very similarly to the old one, and you should see error or triggered there. You can still query the document as before.
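
For reading the result of the first option, here is a minimal sketch that turns a decoded `_scheduler/docs` response into a map of document ID to state. The function name `states_by_doc` is hypothetical, and the `doc_id` and `state` field names are assumptions about the response shape, not something this thread confirms.

```python
def states_by_doc(scheduler_docs_response):
    """Map each replication doc ID to its scheduler state.

    Assumes every entry under "docs" carries "doc_id" and (optionally)
    "state" keys; entries without a state map to None.
    """
    return {
        entry["doc_id"]: entry.get("state")
        for entry in scheduler_docs_response.get("docs", [])
    }
```

An empty listing yields an empty map, which is a quick way to notice that the scheduler is not reporting a document at all.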

wohali · 2017-05-03T23:25:09Z

OK, I've rebuilt from master, and I've shifted over to using /_scheduler/docs/foo_error_rep which always shows up as a 404. I'm still getting a failure:

[notice] 2017-05-03T22:57:34.454264Z [email protected] <0.309.0> cc86958549 127.0.0.1:15984 127.0.0.1 undefined PUT /test_suite_db_hhyyrpqf%2F_replicator/foo_error_rep 201 ok 99
[notice] 2017-05-03T22:57:34.486167Z [email protected] <0.309.0> 79abcb1691 127.0.0.1:15984 127.0.0.1 undefined GET /_scheduler/docs/foo_error_rep 404 ok 30
...
[notice] 2017-05-03T22:57:38.881115Z [email protected] <0.343.0> -------- couch_replicator_clustering : publish cluster `stable` event
[notice] 2017-05-03T22:57:38.881746Z [email protected] <0.398.0> -------- Started replicator db changes listener <0.769.0>
[notice] 2017-05-03T22:57:38.916464Z [email protected] <0.348.0> -------- starting new replication `6e12fe920e76120ef871df8e18d208b0` at <0.811.0> (`https://127.0.0.1:15984/test_suite_db_dhrvtir/` -> `https://127.0.0.1:15984/nonexistent_test_db/`)
[notice] 2017-05-03T22:57:38.916701Z [email protected] <0.349.0> -------- couch_replicator_scheduler: Job {"6e12fe920e76120ef871df8e18d208b0",[]} started as <0.811.0>
[notice] 2017-05-03T22:57:38.918651Z [email protected] <0.309.0> 9e5593616e 127.0.0.1:15984 127.0.0.1 undefined GET /_scheduler/docs/foo_error_rep 404 ok 1
[notice] 2017-05-03T22:57:38.970437Z [email protected] <0.309.0> bdd084312b 127.0.0.1:15984 127.0.0.1 undefined GET /_scheduler/docs/foo_error_rep 404 ok 1
[notice] 2017-05-03T22:57:39.013987Z [email protected] <0.712.0> 0bab9a6ac6 127.0.0.1:15984 127.0.0.1 undefined GET /test_suite_db_dhrvtir/ 200 ok 81
[notice] 2017-05-03T22:57:39.016448Z [email protected] <0.712.0> 116c67f1f5 127.0.0.1:15984 127.0.0.1 undefined GET /nonexistent_test_db/ 404 ok 1
[error] 2017-05-03T22:57:39.018030Z [email protected] <0.811.0> -------- throw:{db_not_found,<<"could not open https://127.0.0.1:15984/nonexistent_test_db/">>}: Replication failed to start for args {rep,{"6e12fe920e76120ef871df8e18d208b0",[]},{httpdb,"https://127.0.0.1:15984/test_suite_db_dhrvtir/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},{httpdb,"https://127.0.0.1:15984/nonexistent_test_db/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},[{checkpoint_interval,30000},{connection_timeout,30000},{http_connections,20},{retries,10},{socket_options,[{keepalive,true},{nodelay,false}]},{use_checkpoints,true},{worker_batch_size,500},{worker_processes,4}],{user_ctx,null,[],undefined},db,nil,<<"foo_error_rep">>,<<"shards/00000000-1fffffff/test_suite_db_hhyyrpqf/_replicator.1493852250">>,{1493,852258,915293}}: [{couch_replicator_api_wrap,db_open,3,[{file,"src/couch_replicator_api_wrap.erl"},{line,109}]},{couch_replicator_scheduler_job,init_state,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,568}]},{couch_replicator_scheduler_job,do_init,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,127}]},{couch_replicator_scheduler_job,handle_info,2,[{file,"src/couch_replicator_scheduler_job.erl"},{line,357}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,599}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]
[notice] 2017-05-03T22:57:39.022584Z [email protected] <0.309.0> 46149cd223 127.0.0.1:15984 127.0.0.1 undefined GET /_scheduler/docs/foo_error_rep 404 ok 1

wohali · 2017-05-04T00:04:02Z

After investigation and chatting with you on IRC, I think this is because the standard JS harness is creating a custom replicator DB at test_suite_db_<letters>%2F_replicator and the scheduling replicator doesn't know what to do about that when a job fails.

After PUTting a doc into that database, waiting the requisite time, and getting a failure, I did a GET /_scheduler/docs and got back: {"total_rows":0,"offset":0,"docs":[]} so I think my hunch is correct.

It also occurs to me that you're going to have to de-duplicate doc IDs from different _replicator databases so perhaps the right thing to do is to prepend the replicator DB's name to the doc ID if it's not the default _replicator. In this case that means I'd have to GET /_scheduler/docs/test_suite_db_quzroaxh%2F_replicator/foo_error_rep which is acceptable, if ugly.
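
A minimal sketch of that proposal, assuming the database name is percent-encoded as a single path segment and left out for the default `_replicator` database. The function name `scheduler_doc_path` is hypothetical, and this shows the scheme as proposed in this comment, not necessarily the one that was merged.

```python
from urllib.parse import quote

DEFAULT_REPLICATOR_DB = "_replicator"


def scheduler_doc_path(doc_id, replicator_db=DEFAULT_REPLICATOR_DB):
    """Build a /_scheduler/docs path, prefixing non-default replicator DBs."""
    encoded_doc = quote(doc_id, safe="")
    if replicator_db == DEFAULT_REPLICATOR_DB:
        return f"/_scheduler/docs/{encoded_doc}"
    # safe="" makes quote() encode "/" as %2F so the db stays one segment.
    encoded_db = quote(replicator_db, safe="")
    return f"/_scheduler/docs/{encoded_db}/{encoded_doc}"
```

Calling `scheduler_doc_path("foo_error_rep", "test_suite_db_quzroaxh/_replicator")` produces the path quoted above.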

Edited log:

[notice] 2017-05-03T23:58:22.747195Z [email protected] <0.309.0> 9bde503c0a 127.0.0.1:15984 127.0.0.1 undefined PUT /test_suite_db_quzroaxh%2F_replicator/foo_error_rep 201 ok 136
[notice] 2017-05-03T23:58:27.179010Z [email protected] <0.348.0> -------- starting new replication `8742b3ea46dfe82f96a36485f46c6738` at <0.639.0> (`https://127.0.0.1:15984/test_suite_db_euuinovc/` -> `https://127.0.0.1:15984/nonexistent_test_db/`)
[notice] 2017-05-03T23:58:27.179239Z [email protected] <0.349.0> -------- couch_replicator_scheduler: Job {"8742b3ea46dfe82f96a36485f46c6738",[]} started as <0.639.0>
[error] 2017-05-03T23:58:27.303437Z [email protected] <0.639.0> -------- throw:{db_not_found,<<"could not open https://127.0.0.1:15984/nonexistent_test_db/">>}: Replication failed to start for args {rep,{"8742b3ea46dfe82f96a36485f46c6738",[]},{httpdb,"https://127.0.0.1:15984/test_suite_db_euuinovc/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},{httpdb,"https://127.0.0.1:15984/nonexistent_test_db/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/2.1.0-5cad2a4"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20,nil,undefined},[{checkpoint_interval,30000},{connection_timeout,30000},{http_connections,20},{retries,10},{socket_options,[{keepalive,true},{nodelay,false}]},{use_checkpoints,true},{worker_batch_size,500},{worker_processes,4}],{user_ctx,null,[],undefined},db,nil,<<"foo_error_rep">>,<<"shards/00000000-1fffffff/test_suite_db_quzroaxh/_replicator.1493855893">>,{1493,855907,177746}}: [{couch_replicator_api_wrap,db_open,3,[{file,"src/couch_replicator_api_wrap.erl"},{line,109}]},{couch_replicator_scheduler_job,init_state,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,568}]},{couch_replicator_scheduler_job,do_init,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,127}]},{couch_replicator_scheduler_job,handle_info,2,[{file,"src/couch_replicator_scheduler_job.erl"},{line,357}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,599}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]
[notice] 2017-05-03T23:58:27.752731Z [email protected] <0.309.0> 6e697a45b0 127.0.0.1:15984 127.0.0.1 undefined GET /_scheduler/docs 200 ok 4

sagelywizard · 2017-05-05T20:53:40Z

The most sensible solution to me is to make two changes:

1. Add a GET /_scheduler/docs/:db/:doc endpoint. :db would need to be URL-encoded for slashes. Should be a pretty straightforward change.
2. Make the GET /_scheduler/docs endpoint use a couch_replicator_fabric:docs-type function to make requests to shards of all known replicator DBs rather than just _replicator.

nickva · 2017-05-08T15:38:45Z

1. Yap. Can be done relatively simply. Also it looks like we could even allow an un-escaped dbname and then rebuild the db name from the path. The reason this is possible is that _replicator is not a valid document ID.
2. This will be harder to implement. Currently _scheduler/docs mimics (and maybe even shares some code with) _all_docs. Doing a multi-db _all_docs kind of thing might be tricky, especially with respect to handling limit, offset, total_rows.
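
To illustrate why the multi-db listing is the harder part, here is a sketch of what a combined page has to do: skip and limit apply to the merged stream rather than to each database, and `total_rows` is the sum across databases. Everything here is hypothetical (the `merged_page` name, the in-memory input, plain string ordering of doc IDs); it is not how couch_replicator or fabric does it.

```python
import heapq
from itertools import islice


def merged_page(rows_by_db, skip=0, limit=None):
    """Merge per-db doc ID lists (each already sorted) into one page."""
    # Tag each doc ID with its db; every stream stays sorted by doc ID.
    streams = [
        [(doc_id, db) for doc_id in doc_ids]
        for db, doc_ids in rows_by_db.items()
    ]
    merged = heapq.merge(*streams)
    stop = None if limit is None else skip + limit
    page = list(islice(merged, skip, stop))
    total_rows = sum(len(doc_ids) for doc_ids in rows_by_db.values())
    return {"total_rows": total_rows, "offset": skip, "docs": page}
```

Applying skip and limit inside each database first would give a different page, which is the bookkeeping the comment calls tricky.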

Previously _scheduler/docs assumed only the default _replicator db. Now these kinds of paths are accepted after `_scheduler/docs`:

* `/` : all docs from default _replicator db
* `/_replicator` : all docs from default replicator db
* `/docid` : a specific doc from the default replicator db
* `/other%2f_replicator` : non-default replicator db, urlencoded
* `/other/_replicator` : non-default replicator db, unencoded
* `/other%2f_replicator/docid` : doc from a non-default db, urlencoded
* `/other/_replicator/docid` : doc from a non-default db, db is unencoded

Because `_replicator` is not a valid document ID, it's possible to unambiguously parse unescaped db paths.

Issue: apache#506

Previously _scheduler/docs assumed only the default _replicator db. To provide consistency and to allow disambiguation between a db named 'db/_replicator' and the document named 'db/_replicator' in the default replicator db, access to the single document API is changed to always require the replicator db. That is, `/docid` should now be `/_replicator/docid`. Now these kinds of paths are accepted after `_scheduler/docs`:

* `/` : all docs from default _replicator db
* `/_replicator` : all docs from default replicator db
* `/other%2f_replicator` : non-default replicator db, urlencoded
* `/other/_replicator` : non-default replicator db, unencoded
* `/other%2f_replicator/docid` : doc from a non-default db, urlencoded
* `/other/_replicator/docid` : doc from a non-default db, db is unencoded

Because `_replicator` is not a valid document ID, it's possible to unambiguously parse unescaped db paths.

Issue: apache#506
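
The path rules in this commit message can be sketched as a small parser. This is an illustration in Python, not the Erlang code from the fix: the function name `parse_scheduler_docs_path` is hypothetical, it treats the first segment that is `_replicator` or ends in `/_replicator` as the end of the database name, and rejecting a bare `/docid` is this sketch's reading of "always require the replicator db".

```python
from urllib.parse import unquote

DEFAULT_DB = "_replicator"


def parse_scheduler_docs_path(path):
    """Return (replicator_db, doc_id or None) for the part after _scheduler/docs."""
    # Decode each segment so "other%2f_replicator" becomes "other/_replicator".
    segments = [unquote(part) for part in path.split("/") if part]
    if not segments:
        return DEFAULT_DB, None
    for index, segment in enumerate(segments):
        if segment == DEFAULT_DB or segment.endswith("/" + DEFAULT_DB):
            db = "/".join(segments[: index + 1])
            rest = segments[index + 1:]
            doc_id = "/".join(rest) if rest else None
            return db, doc_id
    # No replicator db found, e.g. a bare "/docid".
    raise ValueError(f"no replicator db in path: {path!r}")
```

Both the encoded and unencoded forms of a non-default database resolve to the same pair, for example `("other/_replicator", "docid")`.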

wohali · 2017-05-09T20:39:51Z

Fixed by #509.

Document these endpoints:

* _scheduler/docs/{replicator_db}
* _scheduler/docs/{replicator_db}/{docid}

Update replicator example due to the API change in how single replicator doc info is retrieved.

Issue apache/couchdb#506

wohali added bug replication labels May 3, 2017

wohali added this to the 2.1.0 milestone May 3, 2017

wohali added the api label May 4, 2017

wohali changed the title ~~replication to non-existent target db never updates _replicator doc with state~~ Custom replication DBs never get /_scheduler/docs entries May 4, 2017

nickva mentioned this issue May 8, 2017

Handle non-default _replicator dbs in _scheduler/docs endpoint #509

Merged

3 tasks

wohali closed this as completed May 9, 2017

nickva mentioned this issue May 10, 2017

Update docs related to _scheduler/docs/{db}/{docid} endpoint apache/couchdb-documentation#131

Merged
