Couch Scanner #5014
Adding a few more comments.
% scan starts from the beginning (first db, first shard, ...), and resume/2 is
% called when the scanning hasn't finished and has to continue.
%
% If start/2 or resume/2 returns `reset` then the checkpoint will be reset and
This is a nice feature!
I wrote a plugin for use with this scanner framework, and the examples were super helpful, and easy to adapt for a different use case.
This should be useful for anyone needing detailed insight into a large amount of CouchDB data.
Great work, Nick!
An application to scan the cluster with a plugin system to report various things about databases and documents. The initial idea was to have something like this to scan all the JavaScript design docs to check for compatibility with the new QuickJS engine. It has since been split from the QuickJS branch and made into a separate pull request.

The current implementation includes two plugins:

* `couch_scanner_plugin_find`: scan for regexes in doc bodies
* `couch_scanner_plugin_ddoc_features`: report various design doc features

A more detailed description is in the README.md file. The plugin API is defined in the `couch_scanner_plugin` module. There are additional details in the comments in the included Erlang modules. What follows is a summary of some of the implementation details and features.

Plugins are managed as individual processes by `couch_scanner_server` with the `start_link/1` and `stop/1` functions. After a plugin runner process is spawned, `couch_scanner_server` waits for it to exit. A process may exit with an error, in which case it is penalized with an exponential back-off, or it may exit with a special `{shutdown, {reschedule, TSec}}` value, in which case it is rescheduled to run again on or after the `TSec` time.

After the plugin process starts, it loads and validates its plugin module. Then it starts scanning all the dbs and docs on the local node. Each shard range is scanned on only one of the cluster nodes to avoid duplicating work. For instance, suppose there are 2 shard ranges, `0-7` and `8-f`, with copies on nodes `n1`, `n2`, and `n3`. Then `0-7` might be scanned on `n1` only, and `8-f` on `n3` only.

The plugin process calls into the plugin module on various events: on startup, when resuming from a checkpoint, when checkpointing, when processing a new db, a design doc, or a document, and when completing a scan. The plugin may accumulate reporting data, indicate that some parts of the scan should be skipped, or indicate that the scanning session should be reset.

By default all plugins are disabled. Plugins are enabled and managed via the config system. To enable a plugin, add a `$plugin = true` entry in the `[couch_scanner_plugins]` section. For example:

```
[couch_scanner_plugins]
couch_scanner_plugin_ddoc_features = true
```

Plugins can be configured to run on or after a particular date and time, or to run periodically, via the `[$plugin] after = ...` and `[$plugin] repeat = ...` settings. For instance, to run after 2024-03-20T15:00 and then run every Monday:

```
[couch_scanner_plugin_ddoc_features]
after = 2024-03-20T15:00
repeat = monday
```

The default value for both `after` and `repeat` is `restart`, meaning the plugin runs once after the node starts up.

To prevent the plugins from consuming too many resources, there is a simple rate limiter which limits how many databases, shards, and documents are processed by all the plugins combined. Rate limits are configurable:

```
[couch_scanner]
db_rate_limit = 50
shard_rate_limit = 50
doc_rate_limit = 500
```
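To illustrate how a limiter driven by the three rate settings above could behave, here is a minimal token-bucket sketch in Python. It is not the scanner's Erlang implementation: the class and function names are hypothetical, the token-bucket technique is swapped in for illustration, and it assumes each limit means operations per second, which the description above does not state.

```python
import time


class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `rate`.

    Hypothetical illustration only; not the couch_scanner rate limiter.
    """

    def __init__(self, rate, clock=time.monotonic):
        self.rate = float(rate)
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(rate)   # start with a full bucket
        self.updated = clock()

    def _refill(self):
        # Add tokens in proportion to elapsed time, capped at the bucket size.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate


# One bucket per limit, mirroring the three config keys shown above.
# Assumption: the values are per-second rates.
LIMITS = {"db": 50, "shard": 50, "doc": 500}


def make_limiters(limits=LIMITS, clock=time.monotonic):
    """Build one shared limiter per resource kind."""
    return {kind: TokenBucket(rate, clock) for kind, rate in limits.items()}
```

A caller would check `limiters["doc"].try_acquire()` before processing each document and sleep for `wait_time()` when it returns `False`. Sharing one set of buckets among all plugins matches the description of a limit applied to all plugins together.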