
add active_tasks for view builds using version stamps #3003

Merged: tonysun83 merged 3 commits into prototype/fdb-layer from add_active_tasks_fdb on Jul 24, 2020

Conversation

@tonysun83 (Contributor) commented Jul 14, 2020

Overview

Active Tasks requires TotalChanges and ChangesDone to show the progress
of long-running tasks. This requires count_changes_since/2 to be
implemented, which unfortunately is not easily done with
FoundationDB. This commit replaces TotalChanges with the
versionstamp plus the number of docs as a progress indicator. This may
break existing API consumers that rely on TotalChanges. ChangesDone
still exists, but instead of being derived from the current changes seq
it simply reflects how many documents were written by the
updater process.

The new responses look like this:

[
    {
        "node": "[email protected]",
        "pid": "<0.622.0>",
        "changes_done": 199,
        "current_version_stamp": "8131141649532-0-198",
        "database": "testdb",
        "db_version_stamp": "8131141649532-0-999",
        "design_document": "_design/example",
        "started_on": 1594703583,
        "type": "indexer",
        "updated_on": 1594703586
    }
]

[
    {
        "node": "[email protected]",
        "pid": "<0.1030.0>",
        "changes_done": 1000,
        "current_version_stamp": "8131194932916-0-999",
        "database": "testdb",
        "db_version_stamp": "8131194932916-0-999",
        "design_document": "_design/example",
        "started_on": 1594703636,
        "type": "indexer",
        "updated_on": 1594703665
    }
]

Note that this only shows one process. IIUC, we can have multiple workers indexing in parallel. I think this approach handles that case as well, since ChangesDone only reflects how many documents have been indexed for that worker's range.
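For anyone consuming the new output, here is a minimal sketch of how the two stamps could be compared to tell whether the indexer has caught up with the database (this helper is hypothetical and not part of the PR; it assumes the "CommitVersion-Batch-Counter" stamp format shown in the examples above):

caught_up(#{<<"current_version_stamp">> := CurVS,
            <<"db_version_stamp">> := DbVS}) ->
    % Versionstamps increase monotonically, so once the indexer's stamp
    % reaches the database's stamp the view is up to date.
    parse_stamp(CurVS) >= parse_stamp(DbVS).

% Turn "CommitVersion-Batch-Counter" into a tuple so Erlang term ordering
% compares the components left to right.
parse_stamp(Stamp) ->
    list_to_tuple([binary_to_integer(P)
        || P <- binary:split(Stamp, <<"-">>, [global])]).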

Testing recommendations

Tests need to be added. If this approach is accepted, I will add more precise tests. For now I have only run manual tests.

Related Issues or Pull Requests

Checklist

@tonysun83 requested a review from davisp on July 14, 2020 at 05:25
@garrensmith (Member) left a comment

A good start. I'm not familiar with how active_tasks works but I'm not sure this is the best approach. All the information you need for view progress is in the job info. So when a query for _active_tasks is received I think you could just read the job queue and get all the information you need and then respond. Or is there more to this?

src/couch_views/src/couch_views_indexer.erl (3 resolved review threads)
{type, indexer},
{database, Mrst#mrst.db_name},
{design_document, Mrst#mrst.idx_name},
{changes_done, 0},
Member

Will changes always be 0? What happens if an index is half-built and the indexer is starting up again?

@tonysun83 (Contributor, Author) Jul 14, 2020

I didn't have a good solution to this, mainly because the indexer process has to die first, which means the active_task tuple is automatically removed from memory. I was thinking of writing a counter to fdb, but that opened up a new can of worms. The idea here is that each process will just own the portion it has indexed and not care about what's been done previously. In other words, if you have an index that built 900 documents out of a 1000-document db and it had to restart, it'll start from 0 and just get up to 100 for ChangesDone and finish.

Member

Historically, changes_done refers to how many changes have been processed by this view update. It's for judging how much work is left before the view is up to date. One way to think of this: if you had 10,000,000 docs processed for a view, and then went to process another 1,000,000 docs to bring the view up to date, you wouldn't want to start at 99% finished, as that'd be misleading to operators.

src/couch_views/src/couch_views_indexer.erl (resolved)
@nickva (Contributor) commented Jul 14, 2020

@tonysun83

Agree with @garrensmith. For reporting the jobs on all the nodes, it might be easier to think of something like this:

  • Add an <<"active_task_status">> section to each job's data section, already converted to JSON format.

  • In the http handler we would query all the active jobs and just emit those JSON objects. All active jobs are conveniently located under a single ?ACTIVITY subspace and can be queried with get_active_since/3 (it was done that way on purpose, to make it simple to query all active jobs and make reporting _active_tasks easier 😉).

  • To get all the possible job types, we can use get_types.

  • Since we'd be calling those from outside of couch_jobs, we could add a new function to couch_jobs, something like get_active_jobs perhaps; a rough sketch follows below.
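A minimal sketch of what such a get_active_jobs helper could look like, built from the get_types and get_active_since/3 calls mentioned above (the arities, argument order, and return shapes here are assumptions for illustration, not the final couch_jobs API):

get_active_jobs(JTx) ->
    % Assumed: get_types/1 returns the list of known job types.
    Types = couch_jobs:get_types(JTx),
    lists:flatmap(fun(Type) ->
        % Assumed: get_active_since/3 reads the ?ACTIVITY subspace for one
        % type, and a zero versionstamp means "everything currently active".
        couch_jobs:get_active_since(JTx, Type, 0)
    end, Types).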

@davisp (Member) left a comment

This looks fine to me. I'm not sure whether we should try and compare the version stamps to give a rough approximation of the progress completed. I think for now it would be fine, and if/when fdb gets a count_rows_between_keys we can replace it with something more accurate. I'm not sure if I want to require that or merely suggest it, so I will wait for feedback on my feedback before approving.

src/couch_views/src/couch_views_indexer.erl (resolved)
@garrensmith (Member) left a comment

That is looking much better. Nice work. Could you check if we have any Elixir active tasks tests and make sure they are run with make check? It might be worth adding a few eunit tests around the changes_doc etc.

src/chttpd/src/chttpd_misc.erl (resolved)


active_tasks_info(ChangesDone, DbName, DDocId, LastSeq, DBSeq) ->
VS = case LastSeq of
Member

I don't like storing all this active task info in the job. All of this can be generated when CouchDB receives an _active_tasks request. If we keep adding more and more information to the job info, we run the risk of exceeding the FDB value limit. Rather just add the current_version_stamp and db_version_stamp to the job options. Is indexer_pid actually useful? What would a user do with that?

@nickva (Contributor) Jul 20, 2020

The reason I had thought of it is that for replication tasks, for example, we'd have to know to introspect a few other places specific to the replicator to add "source", "target", etc. So active tasks would have to start knowing how to extract fields from various applications. Perhaps we'd want a simple registration like we have for indices in fabric2_index: https://github.com/apache/couchdb/blob/prototype/fdb-layer/src/fabric/src/fabric2_index.erl#L51-L53 - modules (applications) register a callback for the _active_tasks handler, which calls it when building the response? Or perhaps we can do that as round 2 if we see the values getting close to the 100KB size limit.
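A rough sketch of that registration idea, loosely modeled on the fabric2_index registration linked above (every name below is hypothetical, and the thread later settles on the simpler job-data approach instead):

% Hypothetical: applications register a module that knows how to describe
% its own tasks; the application environment is used here only for brevity.
register_task_module(Module) ->
    Modules = application:get_env(fabric, active_task_modules, []),
    application:set_env(fabric, active_task_modules, [Module | Modules]).

% Hypothetical: the _active_tasks handler calls this and emits the result,
% so it never hard-codes knowledge of the indexing or replication apps.
all_task_infos() ->
    Modules = application:get_env(fabric, active_task_modules, []),
    lists:flatmap(fun(Mod) -> Mod:active_tasks() end, Modules).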

@tonysun83 (Contributor, Author) Jul 21, 2020

@garrensmith @nickva how about leaving just changes_done in the job_data and then computing the rest on the fly? It seems we should add an entirely new fabric_active_tasks module so that the extraction logic can live there.

Member

That sounds good.

src/couch_views/src/couch_views_indexer.erl (resolved)
@nickva (Contributor) commented Jul 22, 2020

@tonysun83 looks better! Good improvement. I made a few comments inline.

One part I am not sure about is whether it is worth starting two transactions per task item just to compute version stamps and maybe a few other fields. That seems quite expensive if, say, someone is polling status on a cluster with hundreds of tasks, given that jobs already have the dbs open and all that data available to start with. The reason we decided to do that was to ensure we stay below 100KB for JobData, I think?

We'd have to make a convincing case that our active_jobs status section in each JobData would have a chance of pushing it over the 100KB value limit. I tried encoding a large-ish object similar to the one we are adding for the task status:

S = #{
    <<"database">> => << <<"x">> || _ <- lists:seq(1,256) >>,
    <<"changes_done">> => 999999999999999999999999999999,
    <<"design_document">> => << <<"d">> || _ <- lists:seq(1,512) >>,
    <<"current_version_stamp">> => <<"1111111111111111111111111111111111111111">>,
    <<"db_version_stamp">> => <<"222222222222222222222222222222222222222222222">>
}.
>> size(iolist_to_binary(jiffy:encode(S))).
984

So if we save the status in the job data, we'd be adding at most another 1KB (in the worst case) toward the 100KB limit. That doesn't seem bad, and it would simplify the design quite a bit.

@davisp (Member) commented Jul 22, 2020

I'm actually counting three possible transactions: one for opening the database if it's not in cache, one for reading the design doc, then one for the other two bits of info.

Can someone point me to the discussion where we decided to move away from storing active task info in the job state? I'm not seeing how this is a better approach or why the job state approach is unworkable.

@tonysun83 (Contributor, Author) commented Jul 22, 2020

@davisp: Here was @garrensmith's concern: #3003 (comment). But I think I found a compromise and pushed (now the populate_tasks function is way simpler, without the reads). We really just need to add two new items to the JobData info: <<"changes_done">> and <<"db_seq">>.

@davisp (Member) commented Jul 22, 2020

@tonysun83 The history of this PR is super confusing to me. The last update appears to have reverted to storing all of the info back in the job state, which means that this new populate_active_tasks is just "format_job_as_active_task". Which is fine, I suppose. Though it seems like it'd be a lot more straightforward for the jobs to just store what they want their active_task output to look like in the first place?

Adding a whole registration/callback system for this seems much more complex than the original plan to just spit out the job data as JSON.

@tonysun83 (Contributor, Author)

@davisp: agreed that it's gotten way more complex. I think @nickva thought that the registration callback would help with the sizing issue and, as a side benefit, add flexibility for future active_task types (I know replication is next). I'm leaning toward the simpler method like we did previously, without the callback, but would like @nickva to chime in here since he would know what we'd need for replication.

@nickva (Contributor) commented Jul 22, 2020

@tonysun83: Agree with @davisp, it's not worth having the registration system just for this.

We could still hide the <<"active_task_info">> field name with a simple library function like

fabric2_active_tasks:update_active_task_info(JobData, ActiveTaskInfo) ->
     JobData#{?ACTIVE_TASK_INFO => ActiveTaskInfo}.

And use the original design #3003 (comment)

> I think @nickva thought that the registration callback would help with the sizing issue

The registration would have helped with encapsulation, if we had a size issue and had to compute the section on the fly, to avoid the http handler hard-coding knowledge about the replication and indexing apps. But I think we determined computing on the fly was not as optimal, and it wasn't adding that much more to the job data size. If we ever do need to compute it on the fly, we can come back and add the complexity later.
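For completeness, the read side of that helper might look something like this (a sketch only; the accessor name and the empty-map default are assumptions, not necessarily what was merged):

get_active_task_info(JobData) ->
    % Returns the stored task info, or an empty map if none was set yet.
    maps:get(?ACTIVE_TASK_INFO, JobData, #{}).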

@tonysun83 (Contributor, Author)

@nickva addressed your latest comments and added a test. Will add more tests later and then rebase with a proper commit history.


-spec get_types(jtx()) -> [job_type()] | {error, any()}.
get_types(Tx) ->
couch_jobs_fdb:tx(couch_jobs_fdb:get_jtx(Tx), fun(JTx) ->
Contributor

Minor nit: I think it's a 5 space indent instead of 4

@nickva (Contributor) commented Jul 23, 2020

@tonysun83 looks great! Awesome work. Just a few minor nits: whitespace, indents, and such.

@garrensmith (Member) left a comment

This is looking really good. I've left a note about the extra active task info. I'm still not sure why we need to have all of that in the job data.

[{[{node,Node} | Task]} || Task <- Tasks]
end, Replies),
send_json(Req, lists:sort(Response));
ActiveTasks = fabric2_active_tasks:get_active_tasks(),
Member

Awesome. This is great.

src/couch_views/src/couch_views_util.erl (resolved)
@garrensmith (Member) left a comment

Nice work. This looks great.



update_active_task_info(JobData, ActiveTaskInfo) ->
JobData#{?ACTIVE_TASK_INFO => ActiveTaskInfo}.
Contributor

Tiny nit: 5-space indent instead of 4; also add a newline at the end of the file, just so the GH diff doesn't show the red warning :-P

Contributor Author

oh, I thought I changed this. weird. thx

@nickva (Contributor) left a comment

+1 Nice work, Tony!

Made a small comment about an indent. Run the new files through emilio and make sure it's happy, in case we missed some formatting issues.

Then don't forget to squash some commits, maybe group them into 2 or 3, something like below or whatever looks better to you:

  1. couch_jobs API additions
  2. fabric2 + chttpd part
  3. couch_views bits

We expose get_types in couch_jobs and also add get_active_jobs_ids
to get the active job ids given a certain type.
Instead of relying on couch_task_status, we use fabric2_active_tasks
to construct active_task info via couch_jobs.
Active Tasks requires TotalChanges and ChangesDone to show the progress
of long-running tasks. This requires count_changes_since to be
implemented. Unfortunately, that is not easily done with
FoundationDB. This commit replaces TotalChanges with the
versionstamp + the number of docs as a progress indicator. This can
possibly break existing APIs that rely on TotalChanges. ChangesDone
will still exist, but instead of relying on the current changes seq
it is simply a reflection of how many documents were written by the
updater process.
@tonysun83 merged commit 0c7c77e into prototype/fdb-layer on Jul 24, 2020
@tonysun83 deleted the add_active_tasks_fdb branch on July 24, 2020 at 17:01
4 participants