
Prototype/fdb layer replace couch rate #3127

Merged (3 commits) on Sep 15, 2020

Conversation

davisp (Member) commented Sep 3, 2020

Overview

Replace the couch_rate rate limiting algorithm with couch_views_batch to dynamically optimize the indexer batch size during view updates.

While working on optimizing couch_views I ran into couch_rate behaving oddly. Reading through the implementation, it was nearly impossible to predict how changing various parameters would affect the behavior under load. The rate limiting algorithms were also somewhat misapplied: we don't really want a rate "limiter" so much as a rate "maximizer".

After failing to comprehend couch_rate, and realizing that it was missing some fairly important signals (specifically, the approximate transaction size), I decided to take a whack at simplifying things. The new approach is quite a bit simpler, successfully indexes large views (ranges of 1,000,000 documents were tested), and behaves more simply and predictably.

(Figure: batch_size_search, a graph of batch size over time while indexing.)

The graph above is representative of the behavior. Batch sizes start out small when a view indexer starts. The batch size is ramped up quickly to search for the threshold, and then adjusted in much smaller increments to track the optimal batch size.

Matching the old behavior of couch_rate to attempt to maintain a consistent transaction time limit can be accomplished trivially by adjusting the couch_views.batch_tx_max_time parameter.
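The ramp-then-track behavior described above can be sketched as a small state update. This is an illustrative Python sketch, not the actual Erlang implementation; the function and field names are hypothetical, and the increments mirror the defaults discussed later in the thread:

```python
def next_batch_size(state, last_batch_failed, search_incr=500, sense_incr=100):
    """Pick the next batch size based on whether the last batch succeeded.

    Search phase: grow quickly until a batch hits a transaction limit.
    Sense phase: track the threshold with small adjustments.
    """
    if last_batch_failed:
        state["searching"] = False                  # threshold located
        state["size"] = max(1, state["size"] - sense_incr)
    elif state["searching"]:
        state["size"] += search_incr                # ramp up quickly
    else:
        state["size"] += sense_incr                 # creep back up slowly
    return state["size"]

state = {"size": 100, "searching": True}
print(next_batch_size(state, False))  # 600: still searching
print(next_batch_size(state, False))  # 1100
print(next_batch_size(state, True))   # 1000: limit hit, back off
print(next_batch_size(state, False))  # 1100: small upward probe
```

The key property is that a failed batch permanently ends the fast search phase, so the steady state oscillates near the threshold in small steps rather than repeatedly overshooting it.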

Testing recommendations

make check

Checklist

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from 5c477f3 to 2adc534 (September 3, 2020 20:25)
davisp (Member, Author) commented Sep 3, 2020

I should mention that my hack to generate the batch_size graphs had a bug where it would double count if a transaction was retried. That spike to ~8500 just happens to sit at the transition in this graph. There's always a conflict near the start of view indexing, caused by contention on the couch_jobs data, that I have yet to track down.

davisp requested review from iilyak, garrensmith, nickva and rnewson, and removed the request for iilyak (September 3, 2020 20:27)
garrensmith (Member) left a comment

In general, this looks really good and is a lot simpler than the previous implementation.
I'm not convinced about having the default implementation of batch_module in the same file as couch_views_batch. I initially found it confusing trying to figure out where it starts and where couch_views_batch is. I would prefer we have a couch_batch module similar to couch_rate with the default module and couch_views_batch.

Review threads (resolved):

  • src/couch_views/src/couch_views_batch.erl
  • src/couch_views/src/couch_views_indexer.erl
  • src/couch_views/README.md
  • rel/overlay/etc/default.ini
iilyak (Contributor) commented Sep 8, 2020

I'm not convinced about having the default implementation of batch_module in the same file as couch_views_batch. I initially found it confusing trying to figure out where it starts and where couch_views_batch is. I would prefer we have a couch_batch module similar to couch_rate with the default module and couch_views_batch.

I agree with Garren. This separation would also hide the #batch_st{} record in couch_views_batch module.

iilyak (Contributor) commented Sep 8, 2020

Just wanted to mention that the scope of couch_rate was bigger. The idea behind couch_rate was to provide a rate limiter/batch size estimator which could be used anywhere we have a chance of a transaction failure from FDB and could be better off adjusting the transaction size (one possible use case is to unify with https://github.com/apache/couchdb/blob/prototype/fdb-layer/src/couch_replicator/src/couch_replicator_rate_limiter.erl). However, since we seem to be reaching consensus that we should introduce limits, we might not need a generic solution; a couch_views-specific one might suffice.

iilyak (Contributor) commented Sep 8, 2020

Regarding the "Move error reporting test to EUnit" commit, you mentioned:

This test doesn't fail correctly any longer. Rather than attempting to
create a new pathological view case I've just moved it to eunit where we can use meck to throw errors directly.

meck can be used from ExUnit as well:

  defp with_fdb_error(_ctx) do
    :ok = :meck.new(:couch_views_batch, [:passthrough])

    :meck.expect(:couch_views_batch, :start, fn ->
      :erlang.error({:erlfdb_error, 2101})
    end)

    on_exit(fn ->
      :meck.unload()
    end)
  end

  describe "something something" do
    setup [..., :with_fdb_error]

    test "something something", ctx do
      ...
    end
  end

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch 2 times, most recently from e04b8ef to 309d388 (September 8, 2020 16:30)
davisp (Member, Author) commented Sep 8, 2020

@garrensmith I've split the behavior/implementation between couch_views_batch (behavior) and couch_views_batch_impl (implementation). I don't believe this behavior is nearly abstract enough to cover anything besides couch_views so it doesn't make a lot of sense to me to create a new application to hold a new module that only couch_views will use.

@iilyak For the scope, I agree that this is a lot more narrow. I started by trying to figure out the AIMD/congestion algorithms in both couch_rate and couch_replicator, but after thinking a while I realized that an AIMD-based algorithm isn't appropriate for batch sizes, since we want to avoid failing batches as much as possible: every failed batch is wasted time when building a view. I tried a number of different approaches to measuring optimal batch sizes and behaviors. An earlier version of this work actually tried to optimize the rate of indexing within the bounds of transaction sizes. That led to the realization that just maximizing the work done in a single transaction was most beneficial overall for indexing throughput. That's how it ended up so simple and narrow in the end.

For the meck-via-Elixir, that would have worked if this test were running in the same VM but those tests are run separately against an instance of dev/run.

Review threads (resolved):

  • src/couch_views/src/couch_views_indexer.erl
  • src/couch_views/src/couch_views_batch.erl
iilyak (Contributor) commented Sep 8, 2020

For the meck-via-Elixir, that would have worked if this test were running in the same VM but those tests are run separately against an instance of dev/run.

We have two kinds of Elixir tests:

  • integration tests, which live in test/elixir/test
  • unit tests, which live in src/<app>/test/exunit/

Unit tests run in the same VM.

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from 309d388 to 46f355b (September 8, 2020 18:18)
davisp (Member, Author) commented Sep 8, 2020

@nickva I've addressed your three main suggestions. For the metrics I don't have a good answer, other than maybe exposing the timing logs I've currently somewhat hacked into the views branch. Let me know if you think some sort of combined metric would be useful there and I can add it.

nickva (Contributor) commented Sep 8, 2020

@davisp Looks good, let's skip the metrics for now; that could be a future enhancement.

I am getting a consistent timeout in the couch_views_indexer_test:39: indexer_test_ (indexed_empty_db)... test, while it passes on prototype/fdb-layer. Looking into it a bit.

davisp (Member, Author) commented Sep 8, 2020

@nickva I sometimes see one, but it seems to happen only occasionally on the first run after compiling and then not again. Let me know if you find anything, though. Flaky tests are never good.

nickva (Contributor) commented Sep 9, 2020

@davisp Found the issue: couch_views builds were fast enough to trigger a race condition in the couch_jobs monitoring code. The job was completing before the type monitor for couch_views jobs even started.

Made a PR to fix it:
#3135

Review thread (resolved): src/couch_views/README.md
nickva (Contributor) left a comment

+1 Nice work!

And good comments from Ilya and Garren about re-structuring it as two modules.

Left a few style and param-validation comments; up to you if you feel they are too nitpicky.

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from 46f355b to c3511dd (September 10, 2020 21:14)
davisp (Member, Author) commented Sep 10, 2020

@nickva I've addressed all of your suggestions. Let me know if you have anything else.

@iilyak @garrensmith I'm pretty sure I've addressed all of your concerns, let me know if there's anything left to handle.

nickva (Contributor) commented Sep 10, 2020

@davisp lgtm +1 to merge, also good job adding extra tests!

garrensmith (Member) left a comment

One small change from me and then it looks good.

Review thread: src/couch_views/README.md
davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from c3511dd to df3b74a (September 14, 2020 18:19)
; batch_initial_size = 100 ; Initial batch size in number of documents
; batch_search_increment = 500 ; Size change when searching for the threshold
; batch_sense_increment = 100 ; Size change increment after hitting a threshold
; batch_max_tx_size = 9000000 ; Maximum transaction size in bytes
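For illustration, here is a quick simulation of how batch sizes might evolve under the defaults above. This is a hedged Python sketch: the real Erlang module also bounds batches by the byte and time limits, which are simplified here into a single hypothetical `fits` predicate:

```python
INITIAL, SEARCH_INCR, SENSE_INCR = 100, 500, 100  # mirror the defaults above

def simulate(fits, steps=8):
    """Yield successive batch sizes; fits(size) stands in for "a batch of
    this size stayed within the transaction limits"."""
    size, searching = INITIAL, True
    for _ in range(steps):
        yield size
        if not fits(size):
            searching = False                # threshold found: stop searching
            size = max(1, size - SENSE_INCR)
        elif searching:
            size += SEARCH_INCR              # fast ramp while searching
        else:
            size += SENSE_INCR               # small adjustments afterwards

print(list(simulate(lambda s: s <= 1200)))
# [100, 600, 1100, 1600, 1500, 1400, 1300, 1200]
```

The fast +500 search steps find the rough threshold; afterwards the size moves only in +/-100 sense steps, which matches the shape of the batch_size_search graph in the PR description.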
A contributor commented:

Should we name it batch_max_tx_size_bytes to make it clear that units are different compared to batch_initial_size?

iilyak (Contributor) left a comment

+1 after the batch_max_tx_size -> batch_max_tx_size_bytes rename.

garrensmith (Member) left a comment

+1

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from df3b74a to 77f9c8b (September 15, 2020 16:06)
davisp (Member, Author) commented Sep 15, 2020

Updated the max size parameter to include bytes. Will merge when CI comes back green.

davisp force-pushed the prototype/fdb-layer-replace-couch-rate branch from 77f9c8b to 718b7b0 (September 15, 2020 17:06)
This implementation was difficult to understand and had behavior that was too difficult to predict. It would break if view behavior changed in significant ways from what was originally expected.

The couch_views_batch module is responsible for sensing the largest batch sizes that can be successfully processed by a given indexer process. It works by initially searching for the maximum number of documents that can be included in a batch. Once this threshold is found, it slowly increases the batch size and decreases it when the threshold is found again.

This approach maximises batch sizes while reacting when a larger batch would cross the FoundationDB transaction limits, which causes the entire batch to be aborted and retried, wasting time during view builds.

This test doesn't fail correctly any longer. Rather than attempting to create a new pathological view case I've just moved it to eunit where we can use meck to throw errors directly.
davisp merged commit 8cd1792 into prototype/fdb-layer on Sep 15, 2020
davisp deleted the prototype/fdb-layer-replace-couch-rate branch on September 15, 2020 17:49