-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import weatherreport #3312
Import weatherreport #3312
Conversation
Fetches some basic Riak data.
1f2d638
to
5f51de7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking real good, Jay. Thanks for helping get this into CouchDB!
I think we need to do a bit more work to get this integrated with the CouchDB build and test process - ideally that happens on weatherreport
as a feature branch before merging with 3.x
.
@@ -50,6 +51,9 @@ call(Fd, Msg, Metadata) -> | |||
queued_call(Fd, Msg, Priority) | |||
end. | |||
|
|||
get_queue_lengths() -> | |||
gen_server:call(?MODULE, get_queue_lengths). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, I'd love to see this in _stats
or _system
output...I don't believe we have this today, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I had to modify this IOQ lib here because weatherreport
previously got something like this from the old Cloudant IOQ library, and this seemed like the closest analogy to that metric. You can see later in this file that there's a compatibility clause that depends on the result of erlang:function_exported(ioq, get_queue_lengths, 0)
. But this is basically a newly exposed stat for this IOQ lib, and I have no idea how useful it might end up being in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we report on all the message queues in _system
, we should maybe expose this for people just in case it tells them anything useful. Happy to put that into another PR though.
@kocolosk since you took the IOQ2 work over the finish line for CouchDB 3.x, what do you think?
For a first pass at documentation, I ported the README. |
@wohali is there anything else needed here? |
@jaydoane : should we merge this so that it's available in the upcoming |
I have +1'ed this for 3.2.0... is there any reason it shouldn't land? |
I need to squash it somewhat first, but also not merging it makes Day Job ™️ somewhat easier in the near term. It may be time to bite that bullet though, if it's really needed in 3.2. |
I don't understand... if you can let me know what you mean on IRC or email I'd appreciate it? Not merging something is really confusing. Yes I'd like this in 3.2. |
It might be nice to squash the 250+ commits to just a few or even one, just from an ergonomic point of view. Most of the commits I don't see us needing to ever check out individually, like say caring that rebar was upgraded in 2011 or handling riak specific checks and such. |
Yeah, that's a really good idea. We certainly don't need to keep track of all those early Basho commits! |
I think I'm done with shell scripts for now. Port to Perl. Some cleaning up of the Perl code. Add check for ring_creation_size != num_partitions. Check if the Riak node is running. Prefer riak over riaksearch when looking for commands. Make it a proper hash. Fetch number of nodes in the ring. Look for crash dumps and emfile errors. Add checks for number of partitions vs. nodes and for not being part of the ring. Align status output for easier parsing. Check number of connected nodes vs. ring members. Use a better ratio for partitions vs. nodes. Add some checks if the node is running to prevent errors in checks. Start porting Riaknostic to Erlang. Remove io:format that's broken anyway. Remove shell script version. Fetch and start printing data from the Riak instance. Add Riak installation detection. Find Riak logs. Less code is good. Just tail-recurse it. Change run code to use a config dict. Add module to test for ring membership. Add more check modules. Add module to check for connected nodes. Fixed ping riak to work more consistantly. General code cleanup. Moved log directories into the riaknostic.app. Removed unneccassary filter code. Updated rebar to the current version Scriptized. Name and cookie can be passed in as parameters. Added a README Added type specs. Added more output to nodes connected. Added node to Config dict. Added initial version of disk check. Added noatime check for all mounted disks. Removed flag to specify vm name because it's not needed. Exrcised riaknostic node from connected node list. Removed perl script. Added dizzy's bitcask large value check. Better output from nodes connected Got rid of unnecessary sup. More readable output. Fixed incorrect application start callback return. Memory use stats Improved organization Added util library Added ability to output warnings and errors from riaknostic modules. Added a gen_server for logging. Improved logging with more logical strucutre Using list:keyfind for OTP release per Sean's comment Added conversion from binary to float Fixed issue with higher memory usage check vm.args can now be parsed in A bit of cleanup Integrated lager Improved code organization Moved from dicts to basic prop lists Added all ebin directories in riak lib to path Riaknostics are discovered via their run/1 methods Added lots of command line configuration Added sibling and vclock options to large value check Added a guard to bitcask_threshold_check function Key vals are binary_to_termed before printing. Inserted tabs in readme - usage Fixed broken lager:warning call. Add license headers to all source files. Closes #3. Upgrade rebar. Add a Makefile, copied from lager. Add check-module behavior according to plan. Starting refactor of check modules. Refactor memory use, add TODOs. Refactor ring membership. Added noatime check. Getting the DataDir from riaknostic_config won't work, but if DataDir is set correctly, it is a valid noatime check. Refactor nodes connected and fix some compilation bugs. Refactor ring size check. Fix typo/syntax error. Rename disk check module. Add ability to identify modules that are checks. Update TODOs. Refactor Joe's disk check module. Add a little documentation to the private functions. Add getopt, cleanup unused or antiquated modules. Switching to use a global notion of the config, probably app env but TBD. Do a little line-wrapping. WIP riaknostic_config accessors. Add top-level script with getopt and check descriptions. Remove high-impact bitcask check. Implement a huge swath of the runner, disk check works! application:get_env/2 returns {ok, Value}. Absolutize data directories. Expose base_dir/0 and etc_dir/0. Adjust crash dump detector to use base_dir(). Recognize -sname switch and distinguish between short and long names. Added riaknostic_node module for interacting with the local/cluster nodes. Fix a few bugs and enhance debugging information. * Messages are properly sorted now. * Match output of ps command properly. * Add debug logging of node-connection logic. * Improve detection of node connectivity. Fix docs target, ignore generated docs. Add edoc overview, initial stylesheet. Finish up some styles and documentation, more detail on behaviour needed. Add a more verbose description of the behaviour. Make clear that this stylesheet is for edoc. Initial version of the landing page. Don't need to link to edoc, reflow some of those paragraphs. Ignore parts of the gh-pages branch. WIP make pages. Add forkme ribbon. Make sure to ignore root PNG files and add the new image to the pages. Remove useless memsup info. Added Dr. Basho. Thanks @jgnewman! Closes #13. Build package tarball. Closes #12. Update the README. Minor wording correction, add missing docs to riaknostic_node. Add a word of caution. Solaris ps doesn't understand -o command and we don't use it anyway. Forgot to stage this line. Use -nocookie to prevent usage of the .erlang.cookie file. Closes #16. Fedora installs Riak libraries to /usr/lib64. Closes #18 Setup dialyzer. Fix dialyzer warnings. Check for: * Ring sizes not a multiple of 2 * Deployments where vnodes/node < 3% of ring size * Deployments where vnodes/node > 70% of ring size Change ring size inappropriate check from multiple of 2 to power of 2. Leave out the ring size/vnode messages until we have a better understanding of the relationship and can give better advice. Fix a few mistakes. Fix cluster_command 1 Added check for ring preflists satisfying n_val Check whether search is enabled on all nodes v1.0.1 Update lager dependency Changed regex split to string tokens. Fixes failure in Ubuntu Fix xargs argument for Linux eaccess -> eacces to catch the error correctly Add can_connect_all to check if all nodes are available. The reason for this, is that for search we're checking if search is enabled or disabled on all nodes. If a node is down, this is not a valid test, and errors out otherwise. Check if connected first before running all connected Travis CI config Ignore .eunit folder Add meck as a dependency Initial eunit test for riaknostic_check_ring using meck v1.0.2 Add Travis CI Build Status to README.md some work on the docs re: 26 & 29 add lines for the autosaves of the one true editor Added some reassuring output. Just a few lines so that the runner of the command knows that riaknostic is running and that it exited without error. Update README.md SmartOS has different paths from standard Solaris, using pkg_add package with current version 1.2 add basic sysctl checking Removed freeBSD stuff end of the day temp commit, code still kind of broken move stuff to zip rather than os:cmd added multiple platform support. a couple of bugs/features: - we also need to be able to just grab a copy of a file - we need a list of tests for each platform - need cases for sunos and freebsd - fold in regular diagnostic messages (once I land the fix for #14). - there is a bug in shelling out, only some of the output is actually recorded. another broken checkin, so I can work on something else added the ability to copy out named files changed the where the files were stored before cleanup to CWD. update to flesh out the export command a bit more. still needs much testing, especially on smartos clean up, fix some bugs, add directory-grabbing added a (bad) first pass at machine-readable output midstream checking to get back to work on export added a (bad) first pass at machine-readable output midstream checking to get back to work on export Changed getopt version, one other fix Fix default output Added some comments and TODO's Fix lager dependency version now that lager was updated Fixate lager dependency on 1.2.1 Change dep on lager to 1.2.2 to match the rest of riak Roll version riaknostic 1.1.0 Clarify that Riak 1.3 already has Riaknostic installed removed misplaced parathesis Add OpenBSD bits Update lager dep to 2.0.0rc2 Lager to 2.0.0 final Un-escriptize riaknostic and modify for lager 2.0 compatability Add an extra log line for clarity when running non-existent checks newline fix Restore riaknostic output to console When riaknostic became part of Riak instead of a separate app, its output (through lager) ended up in the node's console.log instead of being output by 'riak-admin diag'. Among other things, this broke the riaknostic_rt riak test. This adds a layer on top of lager, so messages can be directed to the console again, simply by using io:format. This way, messages are sent to the group_leader instead of the user process, which is what the lager backend does. When riaknostic is invoked through RPC by riak-admin, the caller becomes the group leader and picks up those messages. I wish there was a cleaner way to do this leveraging something in lager, but I couldn't find any. Roll riaknostic version 1.2.0 Pin meck dependency to a specific tag Remove sysctl checks Sysctl checks are now handled by the riak_kv_env module. Standardize meck dep Standardize on a rebar.config dep format to reduce conflicts pull app.config and vm.args from init:get_arguments added extra -vm_args to CONFIG_ARGS for easy access by erlang vm Roll riaknostic 1.2.1 to pull in lager 2.0.1 Fix rebar.config url to stay consistent Bump lager dep to 2.0.2 Bump lager dep to 2.0.3 - Adds a check for strong consistency configuration -- warning when strong consistency is enabled and AAE is disabled (defect basho/riak_kv#959)
Part 1/8 This commit removes parts of riaknostic that we don't want to port over. This work forks the develop branch of riaknostic at: - https://github.com/basho/riaknostic/tree/develop The base commit is: - basho/riaknostic@7f02e64
Part 2/8 This commit renames the riaknostic files which are to be ported to weatherreport and adds a modification note to the notices at the top of each file. This is groundwork for the following commits where the functionality of each file will be modified to work with Apache CouchDB. It should be noted that this commit completely breaks the build.
Part 3/8 Remove various Makefile targets that we don't want to port over.
Part 4/8 Remove lager and add the following dependencies: - twig (logging application for CouchDB) - config (config application for CouchDB)
Part 5/8 This commit updates the renamed files in src/ to perform functinality that it appropriate for CouchDB. As well as the obvious substitutions of weatherreport for riaknostic, this commit also makes the following changes: src/weatherreport.erl - Removed unused options and re-escriptize src/weatherreport_check_membership.erl - Updated to use mem3 instead of riak_kv_status to get the list of member nodes src/weatherreport_check_nodes_connected.erl - Updated to use mem3 instead of riak_kv_status to get the list of member nodes src/weatherreport_config.erl - Removed app_env functions (this is handled by the config app instead) - Updated data_directories/0 to return the data and view directories for CouchDB - Removed unused base_dir/0 function - Updated user/0 to get the current user via `whoami` (this makes an assumption that the script is run under the same user that is running CouchDB) - Replaced lager with twig - Updated load_app_config/0 to start the config app src/weatherreport_node.erl - Updated cluster_command/4 to use mem3 instead of riak_kv_status to get the list of member nodes - Made nodename/0 a public function so it can be used by checks - Removed stats functions src/weatherreport_util.erl - Made should_log/1 public so it can be used in weatherreport.erl - Replaced lager with twig
Part 8/8 Make docs and comments refer to WeatherReport and CouchDB instead of Riaknostic and Riak.
Add a markdown document which takes potential contributers through the process of adding a check.
Add a diagnostic check which verifies there is a registered mem3_sync process running on the cluster node.
This commit generalises the logic that checks for process attributes and moves it to weatherreport_utils. This is groundwork for the following commit which will add a check for processes by memory usage by re-using the generalised code. We also take the opportunity to rename fold_processes to something that is more descriptive and also correct. BugzID: 32875
This commit adds a diagnostic check that checks for processes where memory usage exceeds a set threshold (100MB). BugzID: 32875
The initial design of Riaknostic (and hence WeatherReport) was that checks ran in the script and made RPC calls to the local cluster node when needed. Due to a requirement to allow checks to run on multiple cluster nodes without guarantees of SSH connectivity, the design has changed such that each check is now run entirely on the cluster node. This means it is no longer necessary to use weatherreport_node:local_command/3 to interact with the cluster, the calls can just be made directly. This commit removes the redundant calls to weatherreport_node:local_command/3 which makes the intent of the code a bit clearer. BugzID: 34016
Unless it is explicitly required for a check, avoid returning the node name in diagnostic messages. The node name is added by the logging code so including it in the messages is not necessary.
Custodian will output a message if there are one or more conflicting revisions for a shard map. Previously weatherreport_check_custodian failed to match that message which caused the check to fail. This commit adds new function clauses to report_to_message/1 and format/1 so that the conflict messages are output to the console at the warning log level. BugzID: 34157
This commit adds a check for the number of pending internal replication jobs on a node. A large number of pending internal replication jobs indicates that internal replication is falling behind. A warning message is returned if the number of jobs exceeds a hard-coded threshold, otherwise an info message is returned. BugzID: 32872
Check the absolute statistics obtained by recon:node_stats/4 over a one second period. The values are sampled ten times and the mean is returned. For run_queue and process_count the mean is compared to hard-coded thresholds which determine whether a warning or info message is returned. For all other statistics an info message is always returned. BugzID: 32877
This commit downgrades messages about process call counts to notice level. The previous level of warning was inappropriate for many processes as it really just indicated normal operation of the system. Until this check is made a bit smarter, with custom thresholds per MFA, we will log at nothing more severe than notice so that we don't confuse operators into thinking something is wrong. BugzID: 37593
BugzID: 45855
Port fork of custom `recon` functions for checking process calls.
weatherreport previously relied on Cloudant's IOQ implementation. This adds support for the default IOQ so that it works with either.
In default CouchDB, search is disabled by default, so a failure to connect to clouseau should only be a warning.
Overview
Import Cloudant's
weatherreport
diagnostic app, previously forked from Basho'sriaknostic
.Testing recommendations
Related Issues or Pull Requests
Documentation PR
Checklist
rel/overlay/etc/default.ini