Add machine_util.py for tuning optimal benchmark settings #183

Merged
wconstab merged 14 commits into master from wconstab/add_machine_util on Jan 27, 2021

Conversation

wconstab

Applies some configuration settings automatically, but verifies more.

Future work

  • utilize the query part from inside benchmark to embed queried values in json results
  • additional optimizations like nohz, other nvidia settings, disabling IRQ
  • add temp/clock query for gpu; try querying nvidia-smi performance mode


@xuzhao9 xuzhao9 left a comment

LGTM with small comments.

        isolcpus.add(int(chunk))
    return list(isolcpus)

def get_nvidia_graphics_clock(device_id=0):

Does it make sense to get all available graphics clocks by default?


Yep, I have already made that change. I'm still adding more and will update the PR in a little bit.
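For readers following the thread, here is a minimal sketch of what "get all available graphics clocks" could look like. It is not the code from this PR: the function names are hypothetical, and it assumes `nvidia-smi` is on the PATH and supports `--query-supported-clocks=gr` with CSV output.

```python
import subprocess


def parse_clock_values(text):
    """Parse one integer MHz value per non-empty line of CSV output."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def get_nvidia_graphics_clocks(device_id=0):
    """Return the supported graphics clocks (MHz) for one GPU.

    Assumption: nvidia-smi is installed and accepts these query flags.
    """
    cmd = [
        "nvidia-smi",
        "-i", str(device_id),
        "--query-supported-clocks=gr",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.check_output(cmd, text=True)
    return parse_clock_values(out)
```

Keeping the parsing separate from the subprocess call makes the parser testable on machines without a GPU.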

- debugging issue where GOMP_CPU_AFFINITY and taskset are
  properly specified with a range e.g. 4-47  on the command line
  for pytest, but the machine util sees process affinity as just
  4; this does not happen when using the same command line with
  python instead of pytest and invoking the machine util

- need to verify the extra json file size incurred by logging
  machine state every benchmark
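The affinity mismatch described above can be checked with a small helper. This is a sketch under stated assumptions, not the PR's implementation: `parse_cpu_list` and `affinity_matches` are hypothetical names, and `os.sched_getaffinity` is only available on Linux.

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as '0,4-7' into a sorted list of ints."""
    cpus = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            # A range like '4-47' is inclusive on both ends.
            start, end = chunk.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(chunk))
    return sorted(cpus)


def affinity_matches(expected_spec, pid=0):
    """Compare a process's affinity mask with an expected CPU list (Linux only)."""
    return sorted(os.sched_getaffinity(pid)) == parse_cpu_list(expected_spec)
```

Calling `affinity_matches("4-47")` from inside the pytest process and again from a plain `python` process would show whether the two launch paths see different masks.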
wconstab and others added 8 commits January 11, 2021 18:05
- be more explicit about checking the machine os config and bail gracefully outside amazon linux
- restore limited isolated cpu checks modulo pytorch issue #49971
- implement nvidia clock setting
- add 'ignore_machine_config' option to enable running without checks
- bail out before running tests rather than after
- log machine info in the json
since this CI runner is not one of the performance tuned machines, just used for correctness.
        f.write(content)

def check_intel_turbo_state(turbo_file='/sys/devices/system/cpu/intel_pstate/no_turbo'):
    return int(read_sys_file(turbo_file))

I think it should be 1 - int(read_sys_file(turbo_file)): when "no_turbo" is 1, turbo is disabled, so this function should return 0.


oh yea good point.
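A minimal sketch of the inversion suggested in this review thread follows. The body of `read_sys_file` is an assumption (only its name appears in the diff), and the merged code may differ.

```python
def read_sys_file(path):
    """Return the stripped text of a sysfs-style file (assumed helper body)."""
    with open(path) as f:
        return f.read().strip()


def check_intel_turbo_state(turbo_file='/sys/devices/system/cpu/intel_pstate/no_turbo'):
    """Return 1 if turbo is enabled, 0 if it is disabled.

    The kernel file is named no_turbo, so its value is inverted here.
    """
    return 1 - int(read_sys_file(turbo_file))
```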

wconstab and others added 3 commits January 26, 2021 14:30
torchbenchmark isn't installed in a proper way, so it doesn't work
to use relative imports from a script with a main function that is
nested inside the torchbenchmark package. work around this for now.

make machine_config script agnostic to cpu core pinning when run
as a script, since pinning is  checked by conftest during benchmarking
and it's pointless to make the user pin the configuration script.
@wconstab wconstab merged commit f9e5a75 into master Jan 27, 2021
@wconstab wconstab deleted the wconstab/add_machine_util branch January 27, 2021 03:56
xuzhao9 added a commit to xuzhao9/benchmark that referenced this pull request Jan 28, 2021
Applies some configuration settings automatically, but verifies more.

Only works on amazon linux so far; tries to bail gracefully on other platforms.

Usage:
- as a command line script, machine_config.py will check or configure the machine
sudo `which python` <path to machine_config.py> --configure

- as a library, provides functions for use e.g. in conftest.py
  - asserts benchmark script is run with configured settings
  - logs machine settings to benchmark data file
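As a rough illustration of the library usage described above, the following sketch asserts that the machine is configured and embeds the settings in the benchmark data. All names here (`REQUIRED_SETTINGS`, `assert_configured`, `embed_machine_info`, the `machine_config` key) are hypothetical and do not come from machine_config.py.

```python
import json

# Hypothetical expected values; the real checks live in machine_config.py.
REQUIRED_SETTINGS = {"intel_turbo_enabled": 0}


def assert_configured(machine_info, ignore_machine_config=False):
    """Raise before any benchmark runs if a required setting is wrong."""
    if ignore_machine_config:
        return
    for key, expected in REQUIRED_SETTINGS.items():
        actual = machine_info.get(key)
        if actual != expected:
            raise RuntimeError(f"{key} is {actual!r}, expected {expected!r}")


def embed_machine_info(results, machine_info):
    """Return a copy of the results dict with machine settings attached."""
    merged = dict(results)
    merged["machine_config"] = dict(machine_info)
    return merged


if __name__ == "__main__":
    info = {"intel_turbo_enabled": 0}
    assert_configured(info)
    print(json.dumps(embed_machine_info({"benchmarks": []}, info)))
```

Failing in `assert_configured` before the run starts mirrors the "bail out before running tests rather than after" change listed in the earlier commits.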

Other miscellaneous fixes:
* Add score plot, nightly sweep scripts
* Add legend to sweep result plotting script. (pytorch#193)

Moves compute_score and some other utils around.  Needs more work.

torchbenchmark isn't installed in a proper way, so it doesn't work
to use relative imports from a script with a main function that is
nested inside the torchbenchmark package. work around this for now.

make machine_config script agnostic to cpu core pinning when run
as a script, since pinning is  checked by conftest during benchmarking
and it's pointless to make the user pin the configuration script.

Co-authored-by: xz <[email protected]>
xuzhao9 added a commit that referenced this pull request Jan 29, 2021
* Add machine_util.py for tuning optimal benchmark settings 
Cherry-picked from master branch PR #183

Applies some configuration settings automatically, but verifies more.

Only works on amazon linux so far; tries to bail gracefully on other platforms.

Usage:
- as a command line script, machine_config.py will check or configure the machine
sudo `which python` <path to machine_config.py> --configure

- as a library, provides functions for use e.g. in conftest.py
  - asserts benchmark script is run with configured settings
  - logs machine settings to benchmark data file

Other miscellaneous fixes:
* Add score plot, nightly sweep scripts
* Add legend to sweep result plotting script. (#193)

Moves compute_score and some other utils around.  Needs more work.

torchbenchmark isn't installed in a proper way, so it doesn't work
to use relative imports from a script with a main function that is
nested inside the torchbenchmark package. work around this for now.

make machine_config script agnostic to cpu core pinning when run
as a script, since pinning is  checked by conftest during benchmarking
and it's pointless to make the user pin the configuration script.

Co-authored-by: xz <[email protected]>

* Fix CI script paths under the new directory hierarchy.

Co-authored-by: Will Constable <[email protected]>