Couch Scanner #5014

Merged
merged 1 commit into from
Apr 12, 2024

nickva
Contributor

@nickva nickva commented Mar 24, 2024

An application to scan the cluster with a plugin system to report various things about databases and documents. The initial idea was to have something like this to scan all the JavaScript design docs to check for compatibility with the new QuickJS engine. It has since been split off from the QuickJS branch into a separate pull request.

The current implementation includes two plugins:

  • couch_scanner_plugin_find : scan for regexes in doc bodies
  • couch_scanner_plugin_ddoc_features : report various design doc features

A more detailed description is in the README.md file. The plugin API is defined in the couch_scanner_plugin module. There are additional details in the comments in the included Erlang modules. What follows is a summary of some of the implementation details and features.

Plugins are managed as individual processes by the couch_scanner_server with the start_link/1 and stop/1 functions. After a plugin runner process is spawned, couch_scanner_server waits for it to exit. If the process exits with an error, it is penalized with an exponential back-off. It may also exit with a special {shutdown, {reschedule, TSec}} value, in which case it is rescheduled to run again on or after the TSec time.
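
As a rough illustration of the exit handling described above, here is a minimal Python sketch. It is not the couch_scanner_server code: the function name, the tuple encoding of the Erlang exit value, the base and maximum penalty, and the choice to clear the penalty after a clean reschedule are all assumptions made for the example.

```python
# Hypothetical constants: the PR does not state the actual back-off parameters.
BASE_PENALTY_SEC = 30
MAX_PENALTY_SEC = 8 * 3600


def next_start_time(exit_reason, error_count, now_sec):
    """Return (next_start_sec, new_error_count) for a plugin that just exited.

    exit_reason mimics the Erlang exit values as Python tuples:
      ("shutdown", ("reschedule", t_sec)) -> run again on or after t_sec
      anything else                        -> treated as an error exit
    """
    is_reschedule = (
        isinstance(exit_reason, tuple)
        and len(exit_reason) == 2
        and exit_reason[0] == "shutdown"
        and isinstance(exit_reason[1], tuple)
        and len(exit_reason[1]) == 2
        and exit_reason[1][0] == "reschedule"
    )
    if is_reschedule:
        t_sec = exit_reason[1][1]
        # Assumption: a clean reschedule clears any accumulated penalty.
        return max(t_sec, now_sec), 0
    # Error exit: double the penalty for each consecutive failure, up to a cap.
    error_count += 1
    penalty = min(BASE_PENALTY_SEC * 2 ** (error_count - 1), MAX_PENALTY_SEC)
    return now_sec + penalty, error_count
```

With these assumed constants, a first error delays the next start by 30 seconds, a second consecutive error by 60, and so on until the cap is reached.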

After the plugin process starts, it loads and validates its plugin module. Then it starts scanning all the dbs and docs on the local node. Each shard range is scanned on only one of the cluster nodes to avoid duplicating work. For instance, if there are 2 shard ranges, 0-7 and 8-f, with copies on nodes n1, n2 and n3, then 0-7 might be scanned on n1 only and 8-f on n3.
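
One way to get "each range scanned on exactly one node" without coordination is for every node to compute the same deterministic owner for each range. The Python sketch below shows that idea only; the hash-based selection and the function names are assumptions, and the PR does not say how the scanner actually picks the node.

```python
import hashlib


def scanning_node(db_name, shard_range, nodes):
    """Pick the single node responsible for scanning one shard range.

    Every node evaluates this with the same inputs, so they all agree on the
    owner. The hash-based choice is an assumption made for this sketch.
    """
    ordered = sorted(nodes)  # make the result independent of input order
    key = f"{db_name}/{shard_range}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def ranges_for_node(db_name, shard_ranges, nodes, local_node):
    """Return the shard ranges that local_node should scan."""
    return [
        r for r in shard_ranges
        if scanning_node(db_name, r, nodes) == local_node
    ]
```

Each node would call `ranges_for_node` with its own name and scan only the ranges returned, so every range is covered once across the cluster.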

The plugin process calls into the plugin module on various events: on startup, when resuming from a checkpoint, when checkpointing, when processing a new db, a design doc or a document, and when completing a scan. The plugin may accumulate reporting data, or may indicate that some parts of the scan should be skipped, or that the scanning session should be reset.
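
The callback flow above can be pictured with a toy Python model. The real plugin API is a set of Erlang callbacks defined in couch_scanner_plugin; the class, the method names (`on_db`, `on_doc`, `on_complete`) and the `"ok"` / `"skip"` return values here are hypothetical and only show how a plugin can accumulate a report and ask the scanner to skip part of the scan.

```python
# Hypothetical return values; the real callbacks are Erlang functions.
OK, SKIP = "ok", "skip"


class CountDesignDocs:
    """Toy plugin: counts design docs per db and skips dbs named test_*."""

    def __init__(self):
        self.report = {}

    def on_db(self, db_name):
        if db_name.startswith("test_"):
            return SKIP  # tell the scanner not to descend into this db
        self.report[db_name] = 0
        return OK

    def on_doc(self, db_name, doc):
        if doc["_id"].startswith("_design/"):
            self.report[db_name] += 1
        return OK

    def on_complete(self):
        return self.report


def scan(plugin, dbs):
    """Drive the plugin over {db_name: [doc, ...]} and return its report."""
    for db_name, docs in dbs.items():
        if plugin.on_db(db_name) == SKIP:
            continue
        for doc in docs:
            plugin.on_doc(db_name, doc)
    return plugin.on_complete()
```

Checkpointing, resuming and resetting are left out of this sketch to keep it short.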

By default all plugins are disabled. Plugins are enabled and managed via the config system. To enable a plugin, add a $plugin = true entry in the [couch_scanner_plugins] section. For example:

[couch_scanner_plugins]
couch_scanner_plugin_ddoc_features = true

Plugins can be configured to run on or after a particular date and time or to run periodically. That can be configured via [$plugin] after = ... and [$plugin] repeat = ... settings. For instance, to run after 2024-03-20T15:00 and then run every Monday:

[couch_scanner_plugin_ddoc_features]
after = 2024-03-20T15:00
repeat = monday

The default value for both after and repeat is restart, meaning the plugin runs once after the node starts up.
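
To make the defaults concrete, the following Python sketch reads the settings shown above from ini text and fills in `restart` where a value is missing. It is only an illustration: CouchDB uses its own config system rather than Python's `configparser`, the `plugin_schedule` helper is hypothetical, and only the `YYYY-MM-DDTHH:MM` form of `after` from the example is handled.

```python
import configparser
from datetime import datetime


def plugin_schedule(ini_text, plugin):
    """Return the enabled flag and schedule settings for one plugin."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    # Plugins are disabled unless listed as true in [couch_scanner_plugins].
    enabled = cfg.getboolean("couch_scanner_plugins", plugin, fallback=False)
    # Both settings default to "restart" (run once after the node starts).
    after = cfg.get(plugin, "after", fallback="restart")
    repeat = cfg.get(plugin, "repeat", fallback="restart")
    if after != "restart":
        # Assumption: only the timestamp format from the example is parsed.
        after = datetime.strptime(after, "%Y-%m-%dT%H:%M")
    return {"enabled": enabled, "after": after, "repeat": repeat}


EXAMPLE_INI = """\
[couch_scanner_plugins]
couch_scanner_plugin_ddoc_features = true

[couch_scanner_plugin_ddoc_features]
after = 2024-03-20T15:00
repeat = monday
"""
```

Calling `plugin_schedule(EXAMPLE_INI, "couch_scanner_plugin_ddoc_features")` yields an enabled plugin with a parsed `after` time and `repeat` set to `monday`, while a plugin that is not mentioned in the config comes back disabled with both settings at `restart`.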

To prevent the plugins from consuming too many resources, there is a simple rate limiter which limits how many databases, shards and documents are processed by all the plugins. Rate limits are configurable:

[couch_scanner]
db_rate_limit = 25
shard_rate_limit = 50
doc_rate_limit = 1000
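
For readers unfamiliar with rate limiting, here is a small token bucket sketch in Python. It is one common way to build such a limiter and is not taken from the couch_scanner code. It assumes the configured numbers are per-second rates, which the description above does not state, and the `RateLimiter` class and `LIMITS` table are hypothetical.

```python
import time


class RateLimiter:
    """Token bucket shared by all plugins for one kind of work item."""

    def __init__(self, rate_per_sec, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.clock = clock
        self.tokens = self.rate  # start with one second's worth of budget
        self.updated = clock()

    def try_acquire(self):
        """Return 0.0 if the caller may proceed, else the seconds to wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above one second's worth.
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate


# One limiter per item type, using the values from the config above.
LIMITS = {
    "db": RateLimiter(25),
    "shard": RateLimiter(50),
    "doc": RateLimiter(1000),
}
```

A scanner loop would call `LIMITS["doc"].try_acquire()` before each document and sleep for the returned number of seconds when it is non-zero.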

@nickva nickva force-pushed the scanner branch 3 times, most recently from 53394e2 to 09fbeca Compare April 1, 2024 23:38
@nickva nickva force-pushed the scanner branch 2 times, most recently from 314c404 to b68869d Compare April 5, 2024 05:36
@nickva nickva marked this pull request as ready for review April 5, 2024 05:38
@nickva nickva changed the title Implement a background db / ddoc / doc scanner Couch Scanner Apr 5, 2024
@nickva nickva requested a review from jaydoane April 5, 2024 21:36
Contributor

@jaydoane jaydoane left a comment


Adding a few more comments.

% scan starts from the beginning (first db, first shard, ...), and resume/2 is
% called when the scanning hasn't finished and has to continue.
%
% If start/2 or resume/2 returns `reset` then the checkpoint will be reset and

This is a nice feature!

Contributor

@jaydoane jaydoane left a comment


I wrote a plugin for use with this scanner framework, and the examples were super helpful, and easy to adapt for a different use case.

This should be useful for anyone needing detailed insight into a large amount of CouchDB data.

Great work, Nick!

@nickva nickva force-pushed the scanner branch 2 times, most recently from c1ca2af to ca67980 Compare April 12, 2024 16:37
An application to scan the cluster with a plugin system to report various
things about databases and documents. The initial idea was to have something
like this to scan all the JavaScript design docs to check for compatibility
with the new QuickJS engine. It has since been split off from the QuickJS
branch into a separate pull request.

The current implementation includes two plugins:
  * couch_scanner_plugin_find : scan for regexes in doc bodies
  * couch_scanner_plugin_ddoc_features : report various design doc features

A more detailed description is in the README.md file. The plugin API is defined
in the `couch_scanner_plugin` module. There are additional details in the
comments in the included Erlang modules. What follows is a summary of some of
the implementation details and features.

Plugins are managed as individual processes by the `couch_scanner_server` with
the `start_link/1` and `stop/1` functions. After a plugin runner process is
spawned, `couch_scanner_server` waits for it to exit. If the process exits with
an error, it is penalized with an exponential back-off. It may also exit with a
special `{shutdown, {reschedule, TSec}}` value, in which case it is rescheduled
to run again on or after the `TSec` time.

After the plugin process starts, it loads and validates its plugin module.
Then it starts scanning all the dbs and docs on the local node. Each shard
range is scanned on only one of the cluster nodes to avoid duplicating work.
For instance, if there are 2 shard ranges, `0-7` and `8-f`, with copies on
nodes `n1`, `n2` and `n3`, then `0-7` might be scanned on `n1` only and `8-f`
on `n3`.

The plugin process calls into the plugin module on various events: on startup,
when resuming from a checkpoint, when checkpointing, when processing a new db,
a design doc or a document, and when completing a scan. The plugin may
accumulate reporting data, or may indicate that some parts of the scan should
be skipped, or that the scanning session should be reset.

By default all plugins are disabled. Plugins are enabled and managed via the
config system. To enable a plugin, add a `$plugin = true` entry in the
`[couch_scanner_plugins]` section. For example:
```
[couch_scanner_plugins]
couch_scanner_plugin_ddoc_features = true
```

Plugins can be configured to run on or after a particular date and time or to
run periodically. That can be configured via `[$plugin] after = ...` and
`[$plugin] repeat = ...` settings. For instance, to run after 2024-03-20T15:00
and then run every Monday:

```
[couch_scanner_plugin_ddoc_features]
after = 2024-03-20T15:00
repeat = monday
```

The default value for both `after` and `repeat` is `restart`, meaning the
plugin runs once after the node starts up.

To prevent the plugins from consuming too many resources, there is a simple
rate limiter which limits how many databases, shards and documents are
processed by all the plugins. Rate limits are configurable:
```
[couch_scanner]
db_rate_limit = 50
shard_rate_limit = 50
doc_rate_limit = 500
```
@nickva nickva merged commit ceb2277 into main Apr 12, 2024
15 checks passed
@nickva nickva deleted the scanner branch April 12, 2024 18:32