This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-645] Add flakiness checker #11572

Merged
merged 10 commits into apache:master on Jul 24, 2018

Conversation

cetsai
Contributor

@cetsai cetsai commented Jul 5, 2018

Description

Added a new script under tools called flakiness checker, which runs tests a large number of times to check for flakiness. @haojin2 @eric-haibin-lin @azai91 @anirudh2290
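For context, a minimal sketch of the approach, pieced together from the code excerpts quoted in the review below (the function name and its exact signature here are illustrative, not the script's actual API):

# Hedged sketch: run one test many times under nosetests by setting
# MXNET_TEST_COUNT / MXNET_TEST_SEED in the child process environment.
# run_flakiness_check and its signature are illustrative, not the real API.
import os
import subprocess

def run_flakiness_check(test_path, num_trials=500, seed=None):
    new_env = os.environ.copy()
    new_env["MXNET_TEST_COUNT"] = str(num_trials)
    if seed is not None:
        new_env["MXNET_TEST_SEED"] = str(seed)
    # e.g. test_path = "tests/python/unittest/test_operator.py:test_sigmoid"
    code = subprocess.call(["nosetests", "-s", "--verbose", test_path], env=new_env)
    print("nosetests completed with return code " + str(code))
    return code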

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Added flakiness checker

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@cetsai cetsai requested a review from szha as a code owner July 5, 2018 23:01
@cetsai cetsai changed the title add flakiness checker Add flakiness checker Jul 5, 2018
"MXNET_TEST_COUNT, defaults to 500")

parser.add_argument("-s", "--seed",
help="random seed, passed as MXNET_TEST_SEED")
Contributor

Maybe also elaborate that if no seed is provided then a random seed is used.
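A sketch of how the suggested clarification could look in the argparse help text (this is an assumption about the wording, not the code that was merged):

# Hedged sketch: document that omitting --seed leaves MXNET_TEST_SEED unset,
# so a random seed is drawn instead.
import argparse

parser = argparse.ArgumentParser(description="flakiness checker")
parser.add_argument("-s", "--seed", default=None,
                    help="random seed, passed as MXNET_TEST_SEED "
                         "(if not provided, a random seed is used)")
args = parser.parse_args([])  # empty argv, just to show the default
print(args.seed)              # None -> MXNET_TEST_SEED stays unset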

test_file += ".py"
test_path = test_file
top = str(subprocess.check_output(["git", "rev-parse", "--show-toplevel"]),
errors= "strict").strip()
Contributor

align the arguments

new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))
Contributor

I'm not sure if we need to unset the environment variables after the run? Zero effect on the user's environment variables is preferred.

Contributor

No need, the modified environment variables are local to the spawned process. The user's actual environment is not modified.
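A quick self-contained demonstration of that point (not part of the PR):

# Passing env= to subprocess only affects the child process; the parent's
# os.environ is untouched.
import os
import subprocess
import sys

new_env = os.environ.copy()
new_env["MXNET_TEST_SEED"] = "42"
subprocess.call([sys.executable, "-c",
                 "import os; print(os.environ['MXNET_TEST_SEED'])"],
                env=new_env)            # child process prints 42
print("MXNET_TEST_SEED" in os.environ)  # parent env unchanged (False unless set beforehand)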

@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu Please take a look and share your comments on this; it could be a useful tool for contributors to check flakiness with an out-of-the-box experience.


parser.add_argument("test", action=NameAction,
help="file name and and function name of test, "
"provided in the format: file_name.test_name")
Contributor

@haojin2 haojin2 Jul 6, 2018

I think there're two formats? One is the <path-to-file>:<test-name> while the other is the <name-of-test-file>.<test_name>?

Contributor Author

I've added this in my most recent commit, and updated the help message accordingly

@marcoabreu
Contributor

I'm currently on vacation and will be back next Friday. I'll do a review then.

@marcoabreu marcoabreu self-assigned this Jul 6, 2018
@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu We are not deploying this yet but using it as a developer tool. As this is the work of our intern @cetsai, who is only staying with us for a short period of time, is it okay if we do a review within our team and merge this?

import argparse


DEFAULT_NUM_TRIALS = 500
Contributor

Flakiness tests should be at least 10000 runs.

Contributor

0.99^500 < 0.01, which means that a flaky test that succeeds with 99% probability has less than a 1% chance of succeeding in all 500 runs.
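A quick sanity check of the probabilities quoted in this thread (plain arithmetic, not part of the PR):

# Probability that a flaky test passes every one of N independent trials.
pass_rate = 0.99
trials = 500
print(pass_rate ** trials)   # ~0.0066 < 0.01, so 500 trials almost surely expose it

# The 99.9% pass-rate figure mentioned later in the thread:
print(0.999 ** 7000)         # ~0.0009 < 0.001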

Contributor

BTW this is just the default value; for this tool you can always provide an argument to specify how many trials you want.

Contributor

Our biggest issues are random seeds as well as race conditions - both cases that are not directly related to the number of runs but are really just pure randomness. My experience has shown that sometimes we only reveal a problem on the 8000th iteration.

My concern is that people trust the default values (and they should). If everyone has to change the value themselves, we should just increase it directly. I don't know of anybody using 500 as an iteration count.

Contributor

@haojin2 haojin2 Jul 6, 2018

Actually some unit tests may take more than 30 seconds each; if you're running one 10000 times by default, that's 300000 seconds, which is ~3.5 days. And if a test's failure is related to random seeds, it's because input values are generated from a seed that is itself randomly drawn, so each run fails with a roughly fixed probability. BTW 0.999^7000 < 0.001, so we don't even need 10000 runs, I guess...

Contributor

Our software runs in services that serve millions of requests - even a 0.1% chance can be devastating. We should err on the side of caution as a default.

Contributor

On the other hand, I think the main purpose of running tests is to check your own implementation rather than making tests pass all the time. For discovering problems in your own code, a smaller number of trials is sufficient. In addition, when we deploy this on CI we also don't want it to introduce much overhead. As stated in the design doc of Carl's project, the gold standard for checking flakiness is running a test an infinite number of times, or at least with all possible seeds (0-2147483647), but that is impossible in practice.

Contributor

Well, there are a lot of unit tests that would last for more than 30 seconds... even if a test only lasts 3.5 seconds, 10000 runs would cost ~10 hrs.

Contributor

Well, the purpose of this script is specifically to detect flaky tests, right? I'd expect the expectation would then be that a test passes all the time.

Sure, we won't achieve perfect coverage and at some point you get diminishing returns, but I have seen that even you usually use 10000 runs as the standard value for your tests. Time should also not be that much of a factor (even a few hours are fine) because this is intended to be run in the background. Robustness tests are a time-intensive task and they should be designed to leave as little room for error as possible (with costs in mind).

For CI you don't have to worry; this will run asynchronously and we'll make a proper assessment of what the best values are for each case - it will probably be a time-boxed execution. But we are not at that stage yet.

FYI, we've got 455 tests < 0.1s, 509 < 1s, 565 < 3s and 63 > 3s. 628 tests in total.

Contributor

But is this supposed to be a required check for every PR? If it's taking too long, then it becomes an obstacle for PRs.

new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, the user environment variables are local to the spawned process. The actual environment is not modified.

code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env)
print("nosetests completed with return code " + str(code))

def find_test_path(test_file):
Contributor

Could you add some documentation? From the name alone it was not clear to me what exactly this does. Also, how is it going to handle duplicates and the fact that we import tests into other test files, e.g. test_operator_gpu?

Contributor

If you take a closer look at the code, you'll see that the file name and test name are both provided by the user. So if you would like to run the gpu version of test_sigmoid, you'll do <path-to-test_operator_gpu.py>:test_sigmoid or test_operator_gpu.test_sigmoid. The question you're really asking is probably "if we modified some tests in test_operator.py, how do we detect that we also need to run the gpu version of it?", which is designed to be handled by other components in the project (this is stated in the design doc too).


for (path, names, files) in os.walk(top):
if test_file in files:
test_path = path + "/" + test_path
Contributor

Use os.path.join
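A sketch of the suggested change, assuming the find_test_path structure quoted above (passing the repository root in as `top` is an assumption made to keep the example self-contained):

# Build the path with os.path.join instead of string concatenation.
import os

def find_test_path(test_file, top):
    """Return the full path of test_file somewhere under the repo root `top`."""
    for (path, dirs, files) in os.walk(top):
        if test_file in files:
            return os.path.join(path, test_file)
    raise FileNotFoundError("could not find " + test_file)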

@marcoabreu
Contributor

Sure, my review is non-blocking. Please just address my comments.
Also, please add some documentation somewhere so other people can find this tool. The best place would be the wiki page that describes how to reproduce test results.

else:
new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s","--verbose",test_path],
Member

We should consider passing the logging level as an argument and setting it here, or using a default logging-level argument.

Contributor Author

Good point, I'll add that in my next commit.
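A sketch of what such an option could look like (an assumption about the shape of the change, not the code that was actually merged in this PR):

# Hedged sketch: expose the logging level as a command-line argument.
import argparse
import logging

parser = argparse.ArgumentParser(description="flakiness checker")
parser.add_argument("-v", "--verbosity", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                    help="logging level for the checker output")
args = parser.parse_args([])  # empty argv for illustration
logging.basicConfig(level=getattr(logging, args.verbosity))
logging.info("logging configured at level %s", args.verbosity)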


def run_test_trials(args):
test_path = args.test_path + ":" + args.test_name
print("testing: " + test_path)

Contributor

Try to use logging instead of print. This allows you to make log messages without having to concatenate strings
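Concretely, the suggestion looks like this (illustrative, not the PR's final code):

import logging

logging.basicConfig(level=logging.INFO)
test_path = "test_operator.py:test_sigmoid"  # example value
logging.info("testing: %s", test_path)       # formatting deferred, no concatenation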

If a directory was provided as part of the argument, the directory will be
joined with cwd unless it was an absolute path, in which case, the
absolute path will be used instead.
"""
Contributor

I don't know if the behaviour described in this docstring is actually matched by the code. I'm a bit concerned about the absolute path, because you always join with cwd. Also, you should only append the .py if it isn't present yet.

Contributor Author

os.path.join will only use the last absolute path provided, so if the user provides an absolute path it will override cwd; I've tested this and it does work this way. Currently, I'm using re.split to process the command-line argument with the test file and test name, and I'm removing .py when it's present.
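A sketch of the parsing and path-joining behavior described above (the helper name is hypothetical, and plain string methods stand in here for the re.split call mentioned in the comment):

import os

def parse_test_spec(spec):
    """Accepts 'path/to/test_file.py:test_name' or 'test_file.test_name'."""
    if ":" in spec:
        test_file, test_name = spec.split(":", 1)
    else:
        test_file, test_name = spec.rsplit(".", 1)
    if test_file.endswith(".py"):
        test_file = test_file[:-len(".py")]
    return test_file, test_name

print(parse_test_spec("tests/python/unittest/test_operator.py:test_sigmoid"))
print(parse_test_spec("test_operator.test_sigmoid"))

# os.path.join keeps only the last absolute component, so an absolute path
# supplied by the user overrides the current working directory (on POSIX):
print(os.path.join(os.getcwd(), "/tmp/test_operator.py"))  # -> /tmp/test_operator.py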

Contributor

Awesome! Very nice catch and approach!

Contributor

@marcoabreu marcoabreu left a comment

Thanks a lot! This is a great step towards more stable tests!

@haojin2
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu Thanks for the quick reviews!

@larroy
Contributor

larroy commented Jul 7, 2018

Could you do something more sophisticated that can be automated?

With nose internals you could grab all the tests, maybe blacklist some, run them and time how long they take, then set a time budget and run each test for a number of iterations with different seeds to detect flakiness.

Example of using nose internals:

In [4]: from nose import loader
In [5]: tl = loader.TestLoader()
In [6]: ti = tl.loadTestsFromDir('.')
In [7]: tests = list(ti)
In [8]: len(tests)

http://nose.readthedocs.io/en/latest/api/suite.html
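A rough sketch of the time-budgeted sweep described above (the helper name and budgeting policy are illustrative, not an agreed-upon design):

import os
import random
import subprocess
import time

def sweep_for_flakiness(test_paths, budget_seconds=3600):
    """Re-run the given tests with fresh random seeds until the time budget
    is exhausted; report any test that ever fails."""
    flaky = set()
    start = time.time()
    while time.time() - start < budget_seconds:
        for test_path in test_paths:
            env = os.environ.copy()
            env["MXNET_TEST_SEED"] = str(random.randint(0, 2**31 - 1))
            if subprocess.call(["nosetests", "-s", test_path], env=env) != 0:
                flaky.add(test_path)
    return flaky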

@haojin2
Contributor

haojin2 commented Jul 7, 2018

@larroy I would like to clarify the scope and use case of this script: it is only part of the final automated checker bot for flaky tests, and it's being submitted now as a standalone helper tool so developers have an easier way to check their tests. The reason we're submitting this is that we have seen the need for such a tool during the flaky test fixes. What you suggest is nice to have, but it probably does not fit within the scope of this tool. I can share the link to the design doc of this project if you would like more details on the whole project and where this small tool stands within it.

@cetsai cetsai changed the title Add flakiness checker [MXNET-645] Add flakiness checker Jul 9, 2018
@marcoabreu
Contributor

@larroy are we good to move ahead?

@haojin2
Contributor

haojin2 commented Jul 18, 2018

@larroy @marcoabreu Is this good for merge? The second batch of flakiness fixes is about to start and we'd like to make this available to all developers involved.

@marcoabreu marcoabreu merged commit f407345 into apache:master Jul 24, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* add flakiness checker

* fixed style and argument parsing

* added verbosity option, further documentation, etc.

* added logging

* Added check for invalid argument

* updated error message for specificity

* fixed help message