This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-645] Add flakiness checker #11572

Merged
merged 10 commits into apache:master on Jul 24, 2018

Conversation

cetsai
Contributor

@cetsai cetsai commented Jul 5, 2018

Description

Added a new script under tools called flakiness checker, which runs tests a large number of times to check for flakiness. @haojin2 @eric-haibin-lin @azai91 @anirudh2290
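For context, a minimal sketch of the approach, pieced together from the code excerpts quoted in the review below (the function name and its exact signature here are illustrative, not the script's actual API):

# Hedged sketch: run one test many times under nosetests by setting
# MXNET_TEST_COUNT / MXNET_TEST_SEED in the child process environment.
# run_flakiness_check and its signature are illustrative, not the real API.
import os
import subprocess

def run_flakiness_check(test_path, num_trials=500, seed=None):
    new_env = os.environ.copy()
    new_env["MXNET_TEST_COUNT"] = str(num_trials)
    if seed is not None:
        new_env["MXNET_TEST_SEED"] = str(seed)
    # e.g. test_path = "tests/python/unittest/test_operator.py:test_sigmoid"
    code = subprocess.call(["nosetests", "-s", "--verbose", test_path], env=new_env)
    print("nosetests completed with return code " + str(code))
    return code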

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Added flakiness checker

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@cetsai cetsai requested a review from szha as a code owner July 5, 2018 23:01
@cetsai cetsai changed the title add flakiness checker Add flakiness checker Jul 5, 2018
"MXNET_TEST_COUNT, defaults to 500")

parser.add_argument("-s", "--seed",
help="random seed, passed as MXNET_TEST_SEED")
Contributor

Maybe also elaborate that if no seed is provided then a random seed is used.
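A sketch of how the suggested clarification could look in the argparse help text (this is an assumption about the wording, not the code that was merged):

# Hedged sketch: document that omitting --seed leaves MXNET_TEST_SEED unset,
# so a random seed is drawn instead.
import argparse

parser = argparse.ArgumentParser(description="flakiness checker")
parser.add_argument("-s", "--seed", default=None,
                    help="random seed, passed as MXNET_TEST_SEED "
                         "(if not provided, a random seed is used)")
args = parser.parse_args([])  # empty argv, just to show the default
print(args.seed)              # None -> MXNET_TEST_SEED stays unset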

test_file += ".py"
test_path = test_file
top = str(subprocess.check_output(["git", "rev-parse", "--show-toplevel"]),
errors= "strict").strip()
Contributor

align the arguments

new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))
Contributor

I'm not sure if we need to unset the environment variables after the run? Zero effect on the user's environment variables is preferred.

Contributor

No need, the modified environment variables are local to the spawned process. The user's actual environment is not modified.
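A quick self-contained demonstration of that point (not part of the PR):

# Passing env= to subprocess only affects the child process; the parent's
# os.environ is untouched.
import os
import subprocess
import sys

new_env = os.environ.copy()
new_env["MXNET_TEST_SEED"] = "42"
subprocess.call([sys.executable, "-c",
                 "import os; print(os.environ['MXNET_TEST_SEED'])"],
                env=new_env)            # child process prints 42
print("MXNET_TEST_SEED" in os.environ)  # parent env unchanged (False unless set beforehand)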

@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu Please take a look and share your comments on this; it could be a useful tool for contributors to check flakiness with an out-of-the-box experience.


parser.add_argument("test", action=NameAction,
help="file name and and function name of test, "
"provided in the format: file_name.test_name")
Contributor

@haojin2 haojin2 Jul 6, 2018

I think there're two formats? One is the <path-to-file>:<test-name> while the other is the <name-of-test-file>.<test_name>?

Contributor Author

I've added this in my most recent commit, and updated the help message accordingly

@marcoabreu
Contributor

I'm currently on vacation and will be back next Friday. I'll do a review then.

@marcoabreu marcoabreu self-assigned this Jul 6, 2018
@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu We are not deploying this yet but using it as a developer tool. As this is the work of our intern @cetsai, who is only staying with us for a short period of time, is it okay if we do a review within our team and merge this?

import argparse


DEFAULT_NUM_TRIALS = 500
Contributor

Flakiness tests should be at least 10000 runs.

Contributor

0.99^500 < 0.01, which means that a flaky test that succeeds with 99% probability has less than a 1% chance of succeeding in all 500 runs.
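A quick sanity check of the probabilities quoted in this thread (plain arithmetic, not part of the PR):

# Probability that a flaky test passes every one of N independent trials.
pass_rate = 0.99
trials = 500
print(pass_rate ** trials)   # ~0.0066 < 0.01, so 500 trials almost surely expose it

# The 99.9% pass-rate figure mentioned later in the thread:
print(0.999 ** 7000)         # ~0.0009 < 0.001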

Contributor

BTW this is just the default value; for this tool you can always provide an argument to specify how many trials you want.

Contributor

Our biggest issues are random seeds as well as race conditions - both cases that are not directly related to the number of runs but are really just pure randomness. My experience has shown that sometimes we only reveal a problem on the 8000th iteration.

My concern is that people trust the default values (and they should). If everyone has to change the value themselves, we should just increase it directly. I don't know of anybody using 500 as an iteration count.

Contributor

@haojin2 haojin2 Jul 6, 2018

Actually some unit tests may take more than 30 seconds each; if you're running one 10000 times by default, that's 300000 seconds, which is ~3.5 days. And if a test's failure is related to random seeds, it's because input values are generated from a seed that is itself randomly drawn, so each run fails with a roughly fixed probability. BTW 0.999^7000 < 0.001, so we don't even need 10000 runs, I guess...

Contributor

Our software runs in services that serve millions of requests - even a 0.1% chance can be devastating. We should err on the side of caution as a default.

Contributor

On the other hand, I think the main purpose of running tests is to check your own implementation rather than making tests pass all the time. For discovering problems in your own code, a smaller number of trials is sufficient. In addition, when we deploy this on CI we also don't want it to introduce much overhead. As stated in the design doc of Carl's project, the gold standard for checking flakiness is running a test an infinite number of times, or at least with all possible seeds (0-2147483647), but that is impossible in practice.

Contributor

Well, there are a lot of unit tests that would last for more than 30 seconds... even if a test only lasts 3.5 seconds, 10000 runs would cost ~10 hrs.

Contributor

Well, the purpose of this script is specifically to detect flaky tests, right? I'd expect the expectation would then be that a test passes all the time.

Sure, we won't achieve perfect coverage and at some point you get diminishing returns, but I have seen that even you usually use 10000 runs as the standard value for your tests. Time should also not be that much of a factor (even a few hours are fine) because this is intended to be run in the background. Robustness tests are a time-intensive task and they should be designed to leave as little room for error as possible (with costs in mind).

For CI you don't have to worry; this will run asynchronously and we'll make a proper assessment of what the best values are for each case - it will probably be a time-boxed execution. But we are not at that stage yet.

FYI, we've got 455 tests < 0.1s, 509 < 1s, 565 < 3s and 63 > 3s. 628 tests in total.

Contributor

But is this supposed to be a required check for every PR? If it's taking too long, then it becomes an obstacle for PRs.

new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, the user environment variables are local to the spawned process. The actual environment is not modified.

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))

def find_test_path(test_file):
Contributor

Could you add some documentation? From the name alone it was not clear to me what exactly this does. Also, how is it going to handle duplicates and the fact that we import tests into other test files, e.g. test_operator_gpu?

Contributor

If you take a closer look at the code, you'll see that the file name and test name are both provided by the user. So if you would like to run the gpu version of test_sigmoid, you'll do <path-to-test_operator_gpu.py>:test_sigmoid or test_operator_gpu.test_sigmoid. The question you're really asking is probably "if we modified some tests in test_operator.py, how do we detect that we also need to run the gpu version of it?", which is designed to be handled by other components in the project (this is stated in the design doc too).


for (path, names, files) in os.walk(top):
if test_file in files:
test_path = path + "/" + test_path
Contributor

Use os.path.join
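A sketch of the suggested change, assuming the find_test_path structure quoted above (passing the repository root in as `top` is an assumption made to keep the example self-contained):

# Build the path with os.path.join instead of string concatenation.
import os

def find_test_path(test_file, top):
    """Return the full path of test_file somewhere under the repo root `top`."""
    for (path, dirs, files) in os.walk(top):
        if test_file in files:
            return os.path.join(path, test_file)
    raise FileNotFoundError("could not find " + test_file)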

@marcoabreu
Contributor

Sure, my review is non-blocking. Please just address my comments.
Also, please add some documentation somewhere so other people can find this tool. The best place would be the wiki page that describes how to reproduce test results.

else:
new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path],
Member

We should consider passing the logging level as an argument and setting it here, or using a default logging-level argument.

Contributor Author

Good point, I'll add that in my next commit.
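A sketch of what such an option could look like (an assumption about the shape of the change, not the code that was actually merged in this PR):

# Hedged sketch: expose the logging level as a command-line argument.
import argparse
import logging

parser = argparse.ArgumentParser(description="flakiness checker")
parser.add_argument("-v", "--verbosity", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                    help="logging level for the checker output")
args = parser.parse_args([])  # empty argv for illustration
logging.basicConfig(level=getattr(logging, args.verbosity))
logging.info("logging configured at level %s", args.verbosity)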


def run_test_trials(args):
test_path = args.test_path + ":" + args.test_name
print("testing: " + test_path)

Contributor

Try to use logging instead of print. This allows you to make log messages without having to concatenate strings
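Concretely, the suggestion looks like this (illustrative, not the PR's final code):

import logging

logging.basicConfig(level=logging.INFO)
test_path = "test_operator.py:test_sigmoid"  # example value
logging.info("testing: %s", test_path)       # formatting deferred, no concatenation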

If a directory was provided as part of the argument, the directory will be
joined with cwd unless it was an absolute path, in which case, the
absolute path will be used instead.
"""
Contributor

I don't know if the behaviour described in this docstring is actually matched by the code. I'm a bit concerned about the absolute path, because you always join with cwd. Also, you should only append the .py if it isn't present yet.

Contributor Author

os.path.join will only use the last absolute path provided, so if the user provides an absolute path it will override cwd; I've tested this and it does work this way. Currently, I'm using re.split to process the command-line argument with the test file and test name, and I'm removing .py when it's present.
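A sketch of the parsing and path-joining behavior described above (the helper name is hypothetical, and plain string methods stand in here for the re.split call mentioned in the comment):

import os

def parse_test_spec(spec):
    """Accepts 'path/to/test_file.py:test_name' or 'test_file.test_name'."""
    if ":" in spec:
        test_file, test_name = spec.split(":", 1)
    else:
        test_file, test_name = spec.rsplit(".", 1)
    if test_file.endswith(".py"):
        test_file = test_file[:-len(".py")]
    return test_file, test_name

print(parse_test_spec("tests/python/unittest/test_operator.py:test_sigmoid"))
print(parse_test_spec("test_operator.test_sigmoid"))

# os.path.join keeps only the last absolute component, so an absolute path
# supplied by the user overrides the current working directory (on POSIX):
print(os.path.join(os.getcwd(), "/tmp/test_operator.py"))  # -> /tmp/test_operator.py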

Contributor

Awesome! Very nice catch and approach!

Contributor

@marcoabreu marcoabreu left a comment

Thanks a lot! This is a great step towards more stable tests!

@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu Thanks for the quick reviews!

@larroy
Contributor

larroy commented Jul 7, 2018

Could you do something more sophisticated that can be automated?

With nose internals you could grab all the tests, maybe blacklist some, run them and time how long they take, then set a time budget and run each test for a number of iterations with different seeds to detect flakiness.

Example of using nose internals:

In [4]: from nose import loader
In [5]: tl = loader.TestLoader()
In [6]: ti = tl.loadTestsFromDir('.')
In [7]: tests = list(ti)
In [8]: len(tests)

http://nose.readthedocs.io/en/latest/api/suite.html
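A rough sketch of the time-budgeted sweep described above (the helper name and budgeting policy are illustrative, not an agreed-upon design):

import os
import random
import subprocess
import time

def sweep_for_flakiness(test_paths, budget_seconds=3600):
    """Re-run the given tests with fresh random seeds until the time budget
    is exhausted; report any test that ever fails."""
    flaky = set()
    start = time.time()
    while time.time() - start < budget_seconds:
        for test_path in test_paths:
            env = os.environ.copy()
            env["MXNET_TEST_SEED"] = str(random.randint(0, 2**31 - 1))
            if subprocess.call(["nosetests", "-s", test_path], env=env) != 0:
                flaky.add(test_path)
    return flaky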

@haojin2
Contributor

haojin2 commented Jul 7, 2018

@larroy I would like to clarify the scope and use case of this script: it is only part of the final automated checker bot for flaky tests, and it's being submitted now as a standalone helper tool so developers have an easier way to check their tests. The reason we're submitting this is that we have seen the need for such a tool during the flaky test fixes. What you suggest is nice to have, but it probably does not fit within the scope of this tool. I can share the link to the design doc of this project if you would like more details on the whole project and where this small tool stands within it.

@cetsai cetsai changed the title Add flakiness checker [MXNET-645] Add flakiness checker Jul 9, 2018
@marcoabreu
Contributor

@larroy are we good to move ahead?

@haojin2
Contributor

haojin2 commented Jul 18, 2018

@larroy @marcoabreu Is this good for merge? The second batch of flakiness fixes is about to start and we'd like to make this available to all developers involved.

@marcoabreu marcoabreu merged commit f407345 into apache:master Jul 24, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* add flakiness checker

* fixed style and argument parsing

* added verbosity option, further documentation, etc.

* added logging

* Added check for invalid argument

* updated error message for specificity

* fixed help message