Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import weatherreport #3312

Merged
merged 62 commits into from
Apr 20, 2021
Merged

Import weatherreport #3312

merged 62 commits into from
Apr 20, 2021

Conversation

jaydoane
Copy link
Contributor

@jaydoane jaydoane commented Dec 25, 2020

Overview

Import Cloudant's weatherreport diagnostic app, previously forked from Basho's riaknostic.

Testing recommendations

$ make weatherreport-test

Related Issues or Pull Requests

Documentation PR

Checklist

Fetches some basic Riak data.
@jaydoane jaydoane marked this pull request as ready for review January 8, 2021 04:19
Copy link
Member

@wohali wohali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking real good, Jay. Thanks for helping get this into CouchDB!

I think we need to do a bit more work to get this integrated with the CouchDB build and test process - ideally that happens on weatherreport as a feature branch before merging with 3.x.

@@ -50,6 +51,9 @@ call(Fd, Msg, Metadata) ->
queued_call(Fd, Msg, Priority)
end.

get_queue_lengths() ->
gen_server:call(?MODULE, get_queue_lengths).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, I'd love to see this in _stats or _system output...I don't believe we have this today, do we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I had to modify this IOQ lib here because weatherreport previously got something like this from the old Cloudant IOQ library, and this seemed like the closest analogy to that metric. You can see later in this file that there's a compatibility clause that depends on the result of erlang:function_exported(ioq, get_queue_lengths, 0). But this is basically a newly exposed stat for this IOQ lib, and I have no idea how useful it might end up being in practice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we report on all the message queues in _system, we should maybe expose this for people just in case it tells them anything useful. Happy to put that into another PR though.

@kocolosk since you took the IOQ2 work over the finish line for CouchDB 3.x, what do you think?

src/weatherreport/.travis.yml Outdated Show resolved Hide resolved
src/weatherreport/DEVELOPMENT.md Outdated Show resolved Hide resolved
src/weatherreport/DEVELOPMENT.md Outdated Show resolved Hide resolved
src/weatherreport/Makefile Outdated Show resolved Hide resolved
src/weatherreport/src/weatherreport_check.erl Outdated Show resolved Hide resolved
src/weatherreport/src/weatherreport.erl Outdated Show resolved Hide resolved
src/weatherreport/src/weatherreport.app.src Outdated Show resolved Hide resolved
src/weatherreport/rebar.config Outdated Show resolved Hide resolved
src/weatherreport/doc/overview.edoc Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@jaydoane
Copy link
Contributor Author

For a first pass at documentation, I ported the README.

@jaydoane
Copy link
Contributor Author

@wohali is there anything else needed here?

@bessbd
Copy link
Member

bessbd commented Mar 29, 2021

@jaydoane : should we merge this so that it's available in the upcoming 3.x?

@wohali wohali added this to the 3.2.0 milestone Mar 29, 2021
@wohali
Copy link
Member

wohali commented Mar 29, 2021

I have +1'ed this for 3.2.0... is there any reason it shouldn't land?

@jaydoane
Copy link
Contributor Author

jaydoane commented Mar 30, 2021

I have +1'ed this for 3.2.0... is there any reason it shouldn't land?

I need to squash it somewhat first, but also not merging it makes Day Job ™️ somewhat easier in the near term. It may be time to bite that bullet though, if it's really needed in 3.2.

@wohali
Copy link
Member

wohali commented Mar 30, 2021

I don't understand... if you can let me know what you mean on IRC or email I'd appreciate it? Not merging something is really confusing. Yes I'd like this in 3.2.

@nickva
Copy link
Contributor

nickva commented Mar 30, 2021

It might be nice to squash the 250+ commits to just a few or even one, just from an ergonomic point of view. Most of the commits I don't see us needing to ever check out individually, like say caring that rebar was upgraded in 2011 or handling riak specific checks and such.

@jaydoane
Copy link
Contributor Author

It might be nice to squash the 250+ commits to just a few or even one, just from an ergonomic point of view. Most of the commits I don't see us needing to ever check out individually, like say caring that rebar was upgraded in 2011 or handling riak specific checks and such.

Yeah, that's a really good idea. We certainly don't need to keep track of all those early Basho commits!

roidrage and others added 11 commits April 19, 2021 00:32
I think I'm done with shell scripts for now.

Port to Perl.

Some cleaning up of the Perl code.

Add check for ring_creation_size != num_partitions.

Check if the Riak node is running.

Prefer riak over riaksearch when looking for commands.

Make it a proper hash.

Fetch number of nodes in the ring.

Look for crash dumps and emfile errors.

Add checks for number of partitions vs. nodes and for not being part of the ring.

Align status output for easier parsing.

Check number of connected nodes vs. ring members.

Use a better ratio for partitions vs. nodes.

Add some checks if the node is running to prevent errors in checks.

Start porting Riaknostic to Erlang.

Remove io:format that's broken anyway.

Remove shell script version.

Fetch and start printing data from the Riak instance.

Add Riak installation detection.

Find Riak logs.

Less code is good. Just tail-recurse it.

Change run code to use a config dict.

Add module to test for ring membership.

Add more check modules.

Add module to check for connected nodes.

Fixed ping riak to work more consistantly.

General code cleanup.

Moved log directories into the riaknostic.app.

Removed unneccassary filter code.

Updated rebar to the current version

Scriptized. Name and cookie can be passed in as parameters.

Added a README

Added type specs.

Added more output to nodes connected.

Added node to Config dict.

Added initial version of disk check.

Added noatime check for all mounted disks.

Removed flag to specify vm name because it's not needed.

Exrcised riaknostic node from connected node list.

Removed perl script.

Added dizzy's bitcask large value check.

Better output from nodes connected

Got rid of unnecessary sup.

More readable output.

Fixed incorrect application start callback return.

Memory use stats

Improved organization
Added util library

Added ability to output warnings and errors from riaknostic modules.

Added a gen_server for logging.

Improved logging with more logical strucutre

Using list:keyfind for OTP release per Sean's comment

Added conversion from binary to float
Fixed issue with higher memory usage check

vm.args can now be parsed in
A bit of cleanup

Integrated lager

Improved code organization
Moved from dicts to basic prop lists
Added all ebin directories in riak lib to path
Riaknostics are discovered via their run/1 methods

Added lots of command line configuration

Added sibling and vclock options to large value check

Added a guard to bitcask_threshold_check function

Key vals are binary_to_termed before printing.

Inserted tabs in readme - usage

Fixed broken lager:warning call.

Add license headers to all source files. Closes #3.

Upgrade rebar.

Add a Makefile, copied from lager.

Add check-module behavior according to plan.

Starting refactor of check modules.

Refactor memory use, add TODOs.

Refactor ring membership.

Added noatime check.

Getting the DataDir from riaknostic_config won't work, but if DataDir is set correctly, it is a valid noatime check.

Refactor nodes connected and fix some compilation bugs.

Refactor ring size check.

Fix typo/syntax error.

Rename disk check module.

Add ability to identify modules that are checks.

Update TODOs.

Refactor Joe's disk check module.

Add a little documentation to the private functions.

Add getopt, cleanup unused or antiquated modules.

Switching to use a global notion of the config, probably app env but TBD.

Do a little line-wrapping.

WIP riaknostic_config accessors.

Add top-level script with getopt and check descriptions.

Remove high-impact bitcask check.

Implement a huge swath of the runner, disk check works!

application:get_env/2 returns {ok, Value}.

Absolutize data directories.

Expose base_dir/0 and etc_dir/0.

Adjust crash dump detector to use base_dir().

Recognize -sname switch and distinguish between short and long names.

Added riaknostic_node module for interacting with the local/cluster nodes.

Fix a few bugs and enhance debugging information.

* Messages are properly sorted now.
* Match output of ps command properly.
* Add debug logging of node-connection logic.
* Improve detection of node connectivity.

Fix docs target, ignore generated docs.

Add edoc overview, initial stylesheet.

Finish up some styles and documentation, more detail on behaviour needed.

Add a more verbose description of the behaviour.

Make clear that this stylesheet is for edoc.

Initial version of the landing page.

Don't need to link to edoc, reflow some of those paragraphs.

Ignore parts of the gh-pages branch.

WIP make pages.

Add forkme ribbon.

Make sure to ignore root PNG files and add the new image to the pages.

Remove useless memsup info.

Added Dr. Basho. Thanks @jgnewman! Closes #13.

Build package tarball. Closes #12.

Update the README.

Minor wording correction, add missing docs to riaknostic_node.

Add a word of caution.

Solaris ps doesn't understand -o command and we don't use it anyway.

Forgot to stage this line.

Use -nocookie to prevent usage of the .erlang.cookie file. Closes #16.

Fedora installs Riak libraries to /usr/lib64. Closes #18

Setup dialyzer.

Fix dialyzer warnings.

Check for:
* Ring sizes not a multiple of 2
* Deployments where vnodes/node < 3% of ring size
* Deployments where vnodes/node > 70% of ring size

Change ring size inappropriate check from multiple of 2 to power of 2.

Leave out the ring size/vnode messages until we have a better
understanding of the relationship and can give better advice.

Fix a few mistakes.

Fix cluster_command

  1 Added check for ring preflists satisfying n_val

Check whether search is enabled on all nodes

v1.0.1

Update lager dependency

Changed regex split to string tokens. Fixes failure in Ubuntu

Fix xargs argument for Linux

eaccess -> eacces to catch the error correctly

Add can_connect_all to check if all nodes are available.

The reason for this, is that for search we're checking if search is enabled
or disabled on all nodes. If a node is down, this is not a valid test, and
errors out otherwise.

Check if connected first before running all connected

Travis CI config

Ignore .eunit folder

Add meck as a dependency

Initial eunit test for riaknostic_check_ring using meck

v1.0.2

Add Travis CI Build Status to README.md

some work on the docs re: 26 & 29

add lines for the autosaves of the one true editor

Added some reassuring output.

Just a few lines so that the runner of the command knows that riaknostic
is running and that it exited without error.

Update README.md

SmartOS has different paths from standard Solaris, using pkg_add package with current version 1.2

add basic sysctl checking

Removed freeBSD stuff

end of the day temp commit, code still kind of broken

move stuff to zip rather than os:cmd
added multiple platform support.
a couple of bugs/features:
 - we also need to be able to just grab
   a copy of a file
 - we need a list of tests for each platform
 - need cases for sunos and freebsd
 - fold in regular diagnostic messages (once I
   land the fix for #14).
 - there is a bug in shelling out, only some of the
   output is actually recorded.

another broken checkin, so I can work on something else

added the ability to copy out named files
changed the where the files were stored before
cleanup to CWD.

update to flesh out the export command a bit more.
still needs much testing, especially on smartos

clean up, fix some bugs, add directory-grabbing

added a (bad) first pass at machine-readable output

midstream checking to get back to work on export

added a (bad) first pass at machine-readable output

midstream checking to get back to work on export

Changed getopt version, one other fix

Fix default output

Added some comments and TODO's

Fix lager dependency version now that lager was updated

Fixate lager dependency on 1.2.1

Change dep on lager to 1.2.2 to match the rest of riak

Roll version riaknostic 1.1.0

Clarify that Riak 1.3 already has Riaknostic installed

removed misplaced parathesis

Add OpenBSD bits

Update lager dep to 2.0.0rc2

Lager to 2.0.0 final

Un-escriptize riaknostic and modify for lager 2.0 compatability

Add an extra log line for clarity when running non-existent checks

newline fix

Restore riaknostic output to console

When riaknostic became part of Riak instead of a separate app, its
output (through lager) ended up in the node's console.log instead of
being output by 'riak-admin diag'. Among other things, this broke the
riaknostic_rt riak test. This adds a layer on top of lager, so messages
can be directed to the console again, simply by using io:format. This
way, messages are sent to the group_leader instead of the user process,
which is what the lager backend does. When riaknostic is invoked through
RPC by riak-admin, the caller becomes the group leader and picks up
those messages. I wish there was a cleaner way to do this leveraging
something in lager, but I couldn't find any.

Roll riaknostic version 1.2.0

Pin meck dependency to a specific tag

Remove sysctl checks

Sysctl checks are now handled by the riak_kv_env module.

Standardize meck dep

Standardize on a rebar.config dep format to reduce conflicts

pull app.config and vm.args from init:get_arguments

added extra -vm_args to CONFIG_ARGS for easy access by erlang vm

Roll riaknostic 1.2.1 to pull in lager 2.0.1

Fix rebar.config url to stay consistent

Bump lager dep to 2.0.2

Bump lager dep to 2.0.3

- Adds a check for strong consistency configuration -- warning when
  strong consistency is enabled and AAE is disabled
  (defect basho/riak_kv#959)
Part 1/8

This commit removes parts of riaknostic that we don't want to
port over.

This work forks the develop branch of riaknostic at:

 - https://github.com/basho/riaknostic/tree/develop

The base commit is:

 - basho/riaknostic@7f02e64
Part 2/8

This commit renames the riaknostic files which are to be ported
to weatherreport and adds a modification note to the notices at
the top of each file.

This is groundwork for the following commits where the
functionality of each file will be modified to work with Apache
CouchDB.

It should be noted that this commit completely breaks the build.
Part 3/8

Remove various Makefile targets that we don't want to port over.
Part 4/8

Remove lager and add the following dependencies:

 - twig (logging application for CouchDB)
 - config (config application for CouchDB)
Part 5/8

This commit updates the renamed files in src/ to perform
functinality that it appropriate for CouchDB. As well as the
obvious substitutions of weatherreport for riaknostic, this
commit also makes the following changes:

src/weatherreport.erl

 - Removed unused options and re-escriptize

src/weatherreport_check_membership.erl

 - Updated to use mem3 instead of riak_kv_status to get the
   list of member nodes

src/weatherreport_check_nodes_connected.erl

 - Updated to use mem3 instead of riak_kv_status to get the
   list of member nodes

src/weatherreport_config.erl

 - Removed app_env functions (this is handled by the config app
   instead)
 - Updated data_directories/0 to return the data and view
   directories for CouchDB
 - Removed unused base_dir/0 function
 - Updated user/0 to get the current user via `whoami` (this makes
   an assumption that the script is run under the same user that
   is running CouchDB)
 - Replaced lager with twig
 - Updated load_app_config/0 to start the config app

src/weatherreport_node.erl

 - Updated cluster_command/4 to use mem3 instead of riak_kv_status
   to get the list of member nodes
 - Made nodename/0 a public function so it can be used by checks
 - Removed stats functions

src/weatherreport_util.erl

 - Made should_log/1 public so it can be used in weatherreport.erl
 - Replaced lager with twig
Part 8/8

Make docs and comments refer to WeatherReport and CouchDB instead
of Riaknostic and Riak.
Add a markdown document which takes potential contributers through
the process of adding a check.
Add a diagnostic check which verifies there is a registered
mem3_sync process running on the cluster node.
mikewallace1979 and others added 27 commits April 19, 2021 00:35
This commit generalises the logic that checks for process
attributes and moves it to weatherreport_utils. This is
groundwork for the following commit which will add a check
for processes by memory usage by re-using the generalised code.

We also take the opportunity to rename fold_processes to
something that is more descriptive and also correct.

BugzID: 32875
This commit adds a diagnostic check that checks for processes
where memory usage exceeds a set threshold (100MB).

BugzID: 32875
The initial design of Riaknostic (and hence WeatherReport) was
that checks ran in the script and made RPC calls to the local
cluster node when needed. Due to a requirement to allow checks
to run on multiple cluster nodes without guarantees of SSH
connectivity, the design has changed such that each check is
now run entirely on the cluster node.

This means it is no longer necessary to use
weatherreport_node:local_command/3 to interact with the cluster,
the calls can just be made directly.

This commit removes the redundant calls to
weatherreport_node:local_command/3 which makes the intent of the
code a bit clearer.

BugzID: 34016
Unless it is explicitly required for a check, avoid returning
the node name in diagnostic messages. The node name is added by
the logging code so including it in the messages is not necessary.
Custodian will output a message if there are one or more
conflicting revisions for a shard map. Previously
weatherreport_check_custodian failed to match that message which
caused the check to fail.

This commit adds new function clauses to report_to_message/1 and
format/1 so that the conflict messages are output to the console
at the warning log level.

BugzID: 34157
This commit adds a check for the number of pending internal
replication jobs on a node. A large number of pending internal
replication jobs indicates that internal replication is falling
behind. A warning message is returned if the number of jobs
exceeds a hard-coded threshold, otherwise an info message is
returned.

BugzID: 32872
Check the absolute statistics obtained by recon:node_stats/4
over a one second period. The values are sampled ten times and
the mean is returned.

For run_queue and process_count the mean is compared to
hard-coded thresholds which determine whether a warning or
info message is returned. For all other statistics an info
message is always returned.

BugzID: 32877
This commit downgrades messages about process call counts to
notice level.

The previous level of warning was inappropriate for many
processes as it really just indicated normal operation of the
system.

Until this check is made a bit smarter, with custom thresholds
per MFA, we will log at nothing more severe than notice so that
we don't confuse operators into thinking something is wrong.

BugzID: 37593
Port fork of custom `recon` functions for checking process calls.
weatherreport previously relied on Cloudant's IOQ implementation.
This adds support for the default IOQ so that it works with either.
In default CouchDB, search is disabled by default, so a failure to
connect to clouseau should only be a warning.
@jaydoane jaydoane merged commit 88d1824 into 3.x Apr 20, 2021
@jaydoane jaydoane deleted the weatherreport branch April 20, 2021 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants