[Roadmap WIP] Standardize and increase coverage for TorchBench #1293
Comments
To better support different stakeholders, we are migrating away from … Benefits of using TorchBench userbenchmark:
Therefore, I believe the above section "Fit for typical user scenarios" should be developed as a new TorchBench userbenchmark, instead of modifying … We are still keeping … I am happy to answer any questions about TorchBench userbenchmarks from Intel; please feel free to reach out on Slack, or here on GitHub.
@xuzhao9 Thanks for the information! We will look into it.
@yanbing-j My answers:
Hi @xuzhao9, thanks for sharing the future plan. May I know whether there is any guideline or document demonstrating how to enable a new benchmark under …?
And could you also share a rough timeline for this CPU userbenchmark delivery plan? In your view, would you like @yanbing-j and me to work based on this CPU userbenchmark in the near future, or can we define a new one?
@chuanqi129 We plan to deliver the first CPU userbenchmark within a month; it will measure the stability of CPU latency across all TorchBench models. I suggest Intel work on their own userbenchmark (a new one). I created a userbenchmark doc here: #1328
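For reference, a userbenchmark is (roughly) a module exposing a `run(args)` entry point that returns metrics. A minimal sketch, assuming the interface documented in #1328; the metric name `demo-eval_latency` and the timing body are illustrative stand-ins, not the real benchmark:

```python
# Hedged sketch of a userbenchmark entry point. TorchBench discovers
# userbenchmark/<name>/run.py and invokes its run(args) function; everything
# below is a hypothetical minimal version, not the actual API.
import time

def run(args):
    start = time.perf_counter()
    sum(i * i for i in range(100_000))      # stand-in for a model test
    latency_ms = (time.perf_counter() - start) * 1e3
    return {"name": "cpu", "metrics": {"demo-eval_latency": latency_ms}}
```

In the real suite, the returned metrics are written out as a JSON report under `.userbenchmark/<name>/`.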
Thanks @xuzhao9 for the update, I will check it.
Summary: In order to standardize the performance evaluation and increase coverage, added `channels_last` support for all torchbench models, and enabled it via the `run.py` entry point for debugging. This PR is the first step to standardize and increase coverage for TorchBench, working toward the roadmap in #1293. Take `alexnet` as an example, run on a CLX 8280L (28 cores):

```shell
python run.py alexnet -d cpu -m eager -t eval
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 108.189 milliseconds
```

```shell
python run.py alexnet -d cpu -m eager -t eval --channels-last
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 72.930 milliseconds
```

Pull Request resolved: #1371
Reviewed By: davidberard98
Differential Revision: D43273579
Pulled By: xuzhao9
fbshipit-source-id: 9597d996d27dd228445e3e8122e5e7131cc93669
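The `--channels-last` flag above boils down to switching the model and its inputs to the NHWC memory format. A minimal standalone sketch of that conversion (not TorchBench's actual implementation):

```python
# Hedged sketch: convert a model and its inputs to channels_last (NHWC)
# memory format, which is what --channels-last enables per model.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()
x = torch.randn(8, 3, 224, 224)

# Convert weights and activations to the channels_last layout.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)
```

On CPU, convolution-heavy models often run noticeably faster in this layout, which matches the alexnet speedup shown above.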
Summary: Enabled fuser selection for the jit backend and added fuser3 for the LLGA path. Works for Roadmap #1293 for jit support; below is an example on a CLX machine:

```
$ python run.py alexnet -d cpu -m jit -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 70.918 milliseconds
Correctness: True
$ python run.py alexnet -d cpu -m jit --fuser fuser0 -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 70.741 milliseconds
Correctness: True
$ python run.py alexnet -d cpu -m jit --fuser fuser3 -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 64.179 milliseconds
Correctness: True
```

Pull Request resolved: #1449
Reviewed By: davidberard98
Differential Revision: D43837782
Pulled By: xuzhao9
fbshipit-source-id: 313578112f1a406d42bc5d0d599e5fc20f4bfd0b
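As a hedged illustration of what the `--fuser` option selects (not TorchBench's own code): PyTorch exposes a `torch.jit.fuser(...)` context manager that picks the fusion backend for scripted code. `fuser1` below is the NNC CPU fuser; the LLGA path added in this PR is exposed as `fuser3`.

```python
import torch

@torch.jit.script
def fused_op(x: torch.Tensor) -> torch.Tensor:
    # A pointwise chain that a fuser can compile into a single kernel.
    return torch.relu(x * 2.0 + 1.0)

# Select the NNC fuser ("fuser1") for code executed inside this context;
# run.py's --fuser flag plays a similar role for whole-model runs.
with torch.jit.fuser("fuser1"):
    out = fused_op(torch.randn(4, 4))
```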
Summary: Enable fx int8 for most models on the cpu device. Works for Roadmap #1293 for fx int8 support; below is an example on a CLX machine:

```
$ python run.py alexnet -d cpu -t eval --precision fp32 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 93.586 milliseconds
CPU Peak Memory: 6.3857 GB
$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 21.892 milliseconds
CPU Peak Memory: 1.4150 GB
$ python run.py alexnet -d cpu -t eval --precision fp32 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 70.556 milliseconds
CPU Peak Memory: 1.5918 GB
Correctness: True
$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 21.176 milliseconds
CPU Peak Memory: 1.6758 GB
Correctness: True
$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit --quant-engine fbgemm
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time: 29.487 milliseconds
CPU Peak Memory: 1.6777 GB
Correctness: True
```

Pull Request resolved: #1485
Reviewed By: weiwangmeta
Differential Revision: D44256938
Pulled By: xuzhao9
fbshipit-source-id: 1754028660b6908e66616531a42571e9c08690e6
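The flow behind `--precision fx_int8` can be sketched with PyTorch's FX graph-mode quantization API. This is a minimal standalone example, not TorchBench's implementation; the engine selection mirrors what `--quant-engine` controls.

```python
# Hedged sketch of FX graph-mode int8 quantization on CPU.
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Pick a quantization engine the current CPU supports (fbgemm on x86,
# qnnpack elsewhere) -- the analogue of run.py's --quant-engine flag.
engine = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = engine

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

qconfig_mapping = get_default_qconfig_mapping(engine)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)          # calibration pass on sample data
quantized = convert_fx(prepared)   # swap in int8 kernels
out = quantized(*example_inputs)
```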
Summary: This PR adds typical GNN workloads, one of the tasks in #1293. The task includes:

- Add models in `torchbenchmark`: `GCN`, `GraphSage`, and `GAT`.
- Use real datasets as inputs: split a subgraph from `Reddit`.
- Add metrics.

Pull Request resolved: #1422
Reviewed By: weiwangmeta
Differential Revision: D43946504
Pulled By: xuzhao9
fbshipit-source-id: 6e979ed871c2d3bffa16ca1c6b713d71f56bb7e3
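For intuition, the graph-convolution step at the heart of these models can be sketched in plain PyTorch. The real models in #1422 use actual GNN layers and the Reddit dataset; `SimpleGCNLayer` here is a hypothetical minimal version.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: average neighbor features via the
    adjacency matrix, then apply a learned linear transform."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees
        return torch.relu(self.lin((adj @ x) / deg))      # mean-aggregate

x = torch.randn(5, 16)                     # 5 nodes, 16 features each
adj = (torch.rand(5, 5) > 0.5).float()     # random dense adjacency
layer = SimpleGCNLayer(16, 8)
out = layer(x, adj)
```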
Summary: Fix #1548. Works for Roadmap #1293 to increase benchmark coverage.

Before:

```bash
python run.py llama -d cpu
Traceback (most recent call last):
  File "run.py", line 298, in <module>
    m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 20, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/workspace/benchmark/torchbenchmark/models/llama/__init__.py", line 16, in __init__
    super().__init__(test=test, device=device, jit=jit, batch_size=batch_size, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 84, in __init__
    self.determine_batch_size(batch_size)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 216, in determine_batch_size
    raise NotImplementedError(f"Test {self.test} is not implemented.")
NotImplementedError: Test eval is not implemented.
```

After:

```bash
python run.py llama -d cpu --bs 32
Running eval method from llama on cpu in eager mode with input batch size 32.
CPU Total Wall Time: 11.997 milliseconds
CPU Peak Memory: 1.3799 GB
python run.py llama -d cpu --bs 16
Running eval method from llama on cpu in eager mode with input batch size 16.
CPU Total Wall Time: 9.870 milliseconds
CPU Peak Memory: 1.3770 GB
```

Pull Request resolved: #1549
Reviewed By: aaronenyeshi
Differential Revision: D45005325
Pulled By: xuzhao9
fbshipit-source-id: 265532b33f83e87fecf94eac95e29f65ad8083f4
Summary: This PR adds amp support for CPU in TorchBench, which contributes to #1293. To be compatible with the current amp implementation, we add the following options to `--precision`:

- `--precision bf16`: use `enable_bf16` to convert the model and inputs to bf16
- `--precision amp_bf16`: use `torch.cpu.amp.autocast(dtype=torch.bfloat16)` (can extend to cuda bf16 when ready)
- `--precision amp_fp16`: use `torch.cuda.amp.autocast(dtype=torch.float16)` (can extend to cpu fp16 when ready)
- `--precision amp`: use `torch.autocast(device)`, same as `--amp`

### Performance

Tested on a Copper Lake machine.

```shell
$ python run.py alexnet -d cpu -m eager -t eval --precision fp32
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time: 92.600 milliseconds
CPU Peak Memory: 1.1299 GB
$ python run.py alexnet -d cpu -m eager -t eval --precision bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time: 56.580 milliseconds
CPU Peak Memory: 0.6934 GB
$ python run.py alexnet -d cpu -m eager -t eval --precision amp_bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time: 71.385 milliseconds
CPU Peak Memory: 0.9922 GB
$ python run.py alexnet -d cpu -m eager -t train --precision fp32
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time: 306.164 milliseconds
CPU Peak Memory: 2.0977 GB
$ python run.py alexnet -d cpu -m eager -t train --precision bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time: 180.958 milliseconds
CPU Peak Memory: 1.2686 GB
$ python run.py alexnet -d cpu -m eager -t train --precision amp_bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time: 233.332 milliseconds
CPU Peak Memory: 2.0117 GB
```

Pull Request resolved: #1516
Reviewed By: aaronenyeshi
Differential Revision: D44883144
Pulled By: xuzhao9
fbshipit-source-id: 75251f9eec128b3a1dbca39540193b89059ec183
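For reference, the wrapping that `--precision amp_bf16` performs can be sketched as follows. This is a minimal standalone example, not TorchBench's code; `torch.autocast(device_type="cpu", ...)` is the generic spelling of `torch.cpu.amp.autocast`.

```python
import torch

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# --precision amp_bf16 roughly corresponds to running the forward pass
# under a CPU bf16 autocast region: eligible ops (e.g. linear/matmul)
# execute in bfloat16, the rest stay in fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
```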
Summary: Add an initial cpu userbenchmark for torchbench. Works for Roadmap #1293, extending the cpu userbenchmark with the functions below:

- [x] Add a core-binding option, supporting multi-instance tests.
- [x] Add a gomp/iomp option.
- [x] Add a memory-allocator option.
- [x] Support all enabled cpu feature tests based on torchbench models, e.g. channels-last / fx_int8 / jit with fusers.
- [x] Support latency and cpu_peak_mem metrics for now; will extend to an fps-like report.
- [x] Add `README.md`.

For example, in the command line below, we tested fx_int8 inference for 2 models with batch size 8 on CLX socket 0, running 4 instances at the same time.

```shell
$ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4"
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
[Done] [Done] [Done] [Done]
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'alexnet', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:53,444 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:53,444 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:53,444 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:53,444 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...
[Done] [Done] [Done] [Done]
```

We can find the test results in `.userbenchmark/cpu/cpu-20230420004336`. The `cpu` userbenchmark creates a subfolder for each test and aggregates all test results into `metrics-20230420004336.json`. Each subfolder contains per-instance logs named with the instance PID for that model test.

```shell
$ ls .userbenchmark/cpu/cpu-20230420004336
eval_alexnet_eager/  eval_resnet50_eager/
$ ls .userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager/
metrics-3347653.json  metrics-3347654.json  metrics-3347655.json  metrics-3347656.json
$ cat .userbenchmark/cpu/metrics-20230420004336.json
{
  "name": "cpu",
  "environ": {
    "pytorch_git_version": "de1114554c38322273c066c091d455519d45472d"
  },
  "metrics": {
    "alexnet-eval-eager_latency": 58.309660750000006,
    "alexnet-eval-eager_cmem": 0.416259765625,
    "resnet50-eval-eager_latency": 335.04970325,
    "resnet50-eval-eager_cmem": 0.90673828125
  }
}
```

Pull Request resolved: #1559
Reviewed By: aaronenyeshi
Differential Revision: D45450175
Pulled By: xuzhao9
fbshipit-source-id: 8e7528f4d694eae182ee601cd80bc6e57cd14e3c
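The aggregation step described above (per-instance `metrics-<pid>.json` files combined into one report) can be sketched with stdlib tools. The directory layout and the `latency` key mirror the example output, but this is not the userbenchmark's actual code.

```python
# Hedged sketch: average the per-instance latency metrics of one test
# directory into a single metric entry keyed by the test name.
import json
from pathlib import Path
from statistics import mean
from tempfile import TemporaryDirectory

def aggregate(test_dir: Path) -> dict:
    """Read every metrics-*.json in test_dir and average the latencies."""
    latencies = [json.loads(f.read_text())["latency"]
                 for f in sorted(test_dir.glob("metrics-*.json"))]
    return {f"{test_dir.name}_latency": mean(latencies)}

# Demo with two fake instance files.
with TemporaryDirectory() as tmp:
    d = Path(tmp) / "eval_alexnet_eager"
    d.mkdir()
    (d / "metrics-1.json").write_text(json.dumps({"latency": 50.0}))
    (d / "metrics-2.json").write_text(json.dumps({"latency": 70.0}))
    result = aggregate(d)
```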
Hi @chuanqi129, I tried running your userbenchmark on our CI runner, and it failed with this error: https://github.com/pytorch/benchmark/actions/runs/4885488240 Also, please let me know which runner you would like to deploy your benchmark on.
Thanks @xuzhao9 for your great support of the cpu userbenchmark. Sorry about the late reply; I was out of office for the Labor Day holiday and annual leave. I will focus on this in the coming days and fix the CI runner failures. Replies to the comments in #1559 are below.
I have double-checked the instance types in the PyTorch CI node pool: …
The 8259CL also belongs to the 2nd Generation Intel® Xeon® Scalable Processors (CLX), so it also supports fp32/int8. We can use … BTW, ideally, it would be great if we could deploy the cpu benchmark on …
@chuanqi129 I am wondering, does the dynamo cpu dashboard work on GitHub Actions? Can I have the GitHub Actions workflow file?
No, the dynamo cpu dashboard is maintained on our side. Attached are the Dockerfile and scripts used in this test; in this test, all needed components are built from source. I also think it would be great if we could integrate this dynamo cpu dashboard test into PyTorch GitHub Actions, but it needs …
Summary: Works for Roadmap #1293 to increase benchmark coverage. For these 5 models (tacotron2, yolov3, nvidia_deeprecommender, LearningToPaint, and pytorch_CycleGAN_and_pix2pix), running on custom devices other than CPU and CUDA (e.g. XPU) raises an error, as the CPU/CUDA backends are hard-coded. In this PR, we accept the device argument as a parameter in the training and inference processes, covering model initialization and data movement for these custom devices.

Pull Request resolved: #2230
Reviewed By: aaronenyeshi
Differential Revision: D56097643
Pulled By: xuzhao9
fbshipit-source-id: deba28fee42b5119f62dbddc15e017bf00eb6843
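The device-agnostic pattern this PR applies can be sketched as follows: thread the device string through model construction and input creation instead of hard-coding `cpu`/`cuda`. This is a minimal illustration, not the patched model code.

```python
import torch

def make_model_and_inputs(device: str):
    """Build the model and its inputs on the requested device, so custom
    backends such as "xpu" work without code changes (hedged sketch)."""
    model = torch.nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    return model, x

model, x = make_model_and_inputs("cpu")
out = model(x)
```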
Summary: This Pull Request relates to Roadmap Issue #1293 by enhancing our benchmark coverage. Currently, TorchBench utilizes a custom random-seed function that is incompatible with the XPU device backend. This incompatibility affects models that include random data-augmentation operations, leading to accuracy-check failures due to variations in input data across two separate runs. In this PR, we introduce support for setting a random seed for the XPU backend.

Pull Request resolved: #2270
Reviewed By: davidberard98
Differential Revision: D57884498
Pulled By: xuzhao9
fbshipit-source-id: 24234674333945782b233191e42a8b344e90d74c
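A device-aware seeding helper of the kind this PR introduces might look like the sketch below. This is hypothetical code, not the actual PR; the `torch.xpu` calls assume an XPU-enabled PyTorch build whose RNG API mirrors `torch.cuda`, and are guarded accordingly.

```python
import random
import torch

def set_random_seed(seed: int, device: str = "cpu") -> None:
    """Seed every RNG the benchmark touches, dispatching on the device
    backend instead of assuming CUDA (hedged sketch)."""
    random.seed(seed)
    torch.manual_seed(seed)
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    elif device == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)   # assumption: mirrors torch.cuda

# Re-seeding makes random inputs reproducible across two runs, which is
# what the accuracy check relies on.
set_random_seed(1234)
a = torch.randn(3)
set_random_seed(1234)
b = torch.randn(3)
```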
Summary: We are adding documentation on how to develop a TorchBench userbenchmark for our customers. Related to pytorch/benchmark#1293

Pull Request resolved: pytorch/benchmark#1328
Reviewed By: davidberard98
Differential Revision: D41778804
Pulled By: xuzhao9
fbshipit-source-id: 89e019ad2014aa0b60b82c7104b93fa63b71e169
… (#1371) Summary: In order to standardize the performance evluation and increase coverage, added the `channels_last` for all torchbench models, and enable it with `run.py` entry for debug using. This PR as a first step to standardize and increase coverage for TorchBench, which works for the roadmap pytorch/benchmark#1293. Took `alexnet` as an example, which run on CLX 8280L (28cc) ```shell python run.py alexnet -d cpu -m eager -t eval Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 108.189 milliseconds ``` ```shell python run.py alexnet -d cpu -m eager -t eval --channels-last Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 72.930 milliseconds ``` Pull Request resolved: pytorch/benchmark#1371 Reviewed By: davidberard98 Differential Revision: D43273579 Pulled By: xuzhao9 fbshipit-source-id: 9597d996d27dd228445e3e8122e5e7131cc93669
Summary: Enabled fuser for jit backend and added fuser3 for llga path Works for Roadmap pytorch/benchmark#1293 for jit support, below is an example on CLX machine ``` $ python run.py alexnet -d cpu -m jit -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.918 milliseconds Correctness: True $ python run.py alexnet -d cpu -m jit --fuser fuser0 -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.741 milliseconds Correctness: True $ python run.py alexnet -d cpu -m jit --fuser fuser3 -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 64.179 milliseconds Correctness: True ``` Pull Request resolved: pytorch/benchmark#1449 Reviewed By: davidberard98 Differential Revision: D43837782 Pulled By: xuzhao9 fbshipit-source-id: 313578112f1a406d42bc5d0d599e5fc20f4bfd0b
Summary: Enable fx int8 for most models on cpu device Works for Roadmap pytorch/benchmark#1293 for fx int8 support, below is an example on CLX machine ``` $ python run.py alexnet -d cpu -t eval --precision fp32 -m eager Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 93.586 milliseconds CPU Peak Memory: 6.3857 GB $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m eager Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 21.892 milliseconds CPU Peak Memory: 1.4150 GB $ python run.py alexnet -d cpu -t eval --precision fp32 -m jit Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.556 milliseconds CPU Peak Memory: 1.5918 GB Correctness: True $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 21.176 milliseconds CPU Peak Memory: 1.6758 GB Correctness: True $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit --quant-engine fbgemm Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 29.487 milliseconds CPU Peak Memory: 1.6777 GB Correctness: True ``` Pull Request resolved: pytorch/benchmark#1485 Reviewed By: weiwangmeta Differential Revision: D44256938 Pulled By: xuzhao9 fbshipit-source-id: 1754028660b6908e66616531a42571e9c08690e6
Summary: This PR is to add typical GNN workloads which is one task in pytorch/benchmark#1293. This task includes: - Add models in `torchbenchmark`: including `GCN`, `GraphSage` and `GAT`. - Use real datasets as inputs: Split subgraph from `Reddit`. - Add metrics Pull Request resolved: pytorch/benchmark#1422 Reviewed By: weiwangmeta Differential Revision: D43946504 Pulled By: xuzhao9 fbshipit-source-id: 6e979ed871c2d3bffa16ca1c6b713d71f56bb7e3
Summary: Fix pytorch/benchmark#1548 . Works for Roadmap pytorch/benchmark#1293 for Increase benchmark coverage, Before: ```bash python run.py llama -d cpu Traceback (most recent call last): File "run.py", line 298, in <module> m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args) File "/workspace/benchmark/torchbenchmark/util/model.py", line 20, in __call__ obj = type.__call__(cls, *args, **kwargs) File "/workspace/benchmark/torchbenchmark/models/llama/__init__.py", line 16, in __init__ super().__init__(test=test, device=device, jit=jit, batch_size=batch_size, extra_args=extra_args) File "/workspace/benchmark/torchbenchmark/util/model.py", line 84, in __init__ self.determine_batch_size(batch_size) File "/workspace/benchmark/torchbenchmark/util/model.py", line 216, in determine_batch_size raise NotImplementedError(f"Test {self.test} is not implemented.") NotImplementedError: Test eval is not implemented. ``` After: ```bash python run.py llama -d cpu --bs 32 Running eval method from llama on cpu in eager mode with input batch size 32. CPU Total Wall Time: 11.997 milliseconds CPU Peak Memory: 1.3799 GB python run.py llama -d cpu --bs 16 Running eval method from llama on cpu in eager mode with input batch size 16. CPU Total Wall Time: 9.870 milliseconds CPU Peak Memory: 1.3770 GB ``` Pull Request resolved: pytorch/benchmark#1549 Reviewed By: aaronenyeshi Differential Revision: D45005325 Pulled By: xuzhao9 fbshipit-source-id: 265532b33f83e87fecf94eac95e29f65ad8083f4
Summary: This PR is to add amp support in CPU in TorchBench, which contributes to pytorch/benchmark#1293. To be compatible with current amp implementation, we add 3 options in `--precision`: `--precision bf16`: use `enable_bf16` to convert model and inputs to bf16 `--precision amp_bf16`: use `torch.cpu.amp.autocast(dtype=torch.bfloat16)` (can extend to cuda bf16 when ready) `--precision amp_fp16`: use `torch.cuda.amp.autocast(dtype=torch.float16)` (can extend to cpu fp16 when ready) `--precision amp`: use torch.autocast(device), same as --amp ### Performance Test in Copper Lake machine. $ python run.py alexnet -d cpu -m eager -t eval --precision fp32 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision fp32. CPU Total Wall Time: 92.600 milliseconds CPU Peak Memory: 1.1299 GB $ python run.py alexnet -d cpu -m eager -t eval --precision bf16 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision bf16. CPU Total Wall Time: 56.580 milliseconds CPU Peak Memory: 0.6934 GB $ python run.py alexnet -d cpu -m eager -t eval --precision amp_bf16 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16. CPU Total Wall Time: 71.385 milliseconds CPU Peak Memory: 0.9922 GB $ python run.py alexnet -d cpu -m eager -t train --precision fp32 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision fp32. CPU Total Wall Time: 306.164 milliseconds CPU Peak Memory: 2.0977 GB $ python run.py alexnet -d cpu -m eager -t train --precision bf16 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision bf16. CPU Total Wall Time: 180.958 milliseconds CPU Peak Memory: 1.2686 GB $ python run.py alexnet -d cpu -m eager -t train --precision amp_bf16 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16. 
CPU Total Wall Time: 233.332 milliseconds CPU Peak Memory: 2.0117 GB Pull Request resolved: pytorch/benchmark#1516 Reviewed By: aaronenyeshi Differential Revision: D44883144 Pulled By: xuzhao9 fbshipit-source-id: 75251f9eec128b3a1dbca39540193b89059ec183
Summary: Add initial cpu userbenchmark for torchbench Works for Roadmap pytorch/benchmark#1293 for cpu userbenchmark extend with below functions. - [x] Add core binding option, support multi-instances test. - [x] Add gomp/iomp option. - [x] Add memory allocator option. - [x] Support all enabled cpu features test based on torchbench models, e.g. channels-last / fx_int8 / jit with fusers - [x] Support latency and cpu_peak_mem metrics for now, will extend to fps-like report - [x] Add `README.md` For example, in below cml, we tested 2 models fx_int8 inference with batch size 8 on CLX socket 0 and 4 instances at the same time. ```shell $ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4" Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')] 2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator 2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7 2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP 2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1 2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so 2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 
Summary: We are adding documentation on how to develop a TorchBench userbenchmark for our customers. Related to pytorch/benchmark#1293 Pull Request resolved: pytorch/benchmark#1328 Reviewed By: davidberard98 Differential Revision: D41778804 Pulled By: xuzhao9 fbshipit-source-id: 89e019ad2014aa0b60b82c7104b93fa63b71e169
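For readers who want a feel for what a userbenchmark produces, here is a minimal, hypothetical sketch (the function name and field values are illustrative, modeled on the metrics JSON shown later in this thread, not the actual `torchbenchmark` helpers) of writing a result file in the expected shape:

```python
import json
import tempfile
import time
from pathlib import Path

def write_userbenchmark_metrics(name, metrics, environ=None, out_dir=None):
    """Write results in the {"name", "environ", "metrics"} shape that
    userbenchmarks emit under .userbenchmark/<name>/ (sketch only)."""
    out_dir = Path(out_dir or tempfile.gettempdir())
    payload = {"name": name, "environ": environ or {}, "metrics": metrics}
    out = out_dir / f"metrics-{time.strftime('%Y%m%d%H%M%S')}.json"
    out.write_text(json.dumps(payload, indent=4))
    return out

# Record eager-mode latency/peak-memory metrics for one model test.
path = write_userbenchmark_metrics(
    "cpu",
    {"alexnet-eval-eager_latency": 58.31, "alexnet-eval-eager_cmem": 0.42},
    environ={"pytorch_git_version": "unknown"},  # placeholder value
)
```

The flat `model-test-mode_metric` key naming keeps regression detection simple: a dashboard can diff two such JSON files key by key.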
… (#1371) Summary: In order to standardize the performance evluation and increase coverage, added the `channels_last` for all torchbench models, and enable it with `run.py` entry for debug using. This PR as a first step to standardize and increase coverage for TorchBench, which works for the roadmap pytorch/benchmark#1293. Took `alexnet` as an example, which run on CLX 8280L (28cc) ```shell python run.py alexnet -d cpu -m eager -t eval Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 108.189 milliseconds ``` ```shell python run.py alexnet -d cpu -m eager -t eval --channels-last Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 72.930 milliseconds ``` Pull Request resolved: pytorch/benchmark#1371 Reviewed By: davidberard98 Differential Revision: D43273579 Pulled By: xuzhao9 fbshipit-source-id: 9597d996d27dd228445e3e8122e5e7131cc93669
Summary: Enabled fuser for jit backend and added fuser3 for llga path Works for Roadmap pytorch/benchmark#1293 for jit support, below is an example on CLX machine ``` $ python run.py alexnet -d cpu -m jit -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.918 milliseconds Correctness: True $ python run.py alexnet -d cpu -m jit --fuser fuser0 -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.741 milliseconds Correctness: True $ python run.py alexnet -d cpu -m jit --fuser fuser3 -t eval Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 64.179 milliseconds Correctness: True ``` Pull Request resolved: pytorch/benchmark#1449 Reviewed By: davidberard98 Differential Revision: D43837782 Pulled By: xuzhao9 fbshipit-source-id: 313578112f1a406d42bc5d0d599e5fc20f4bfd0b
Summary: Enable fx int8 for most models on cpu device Works for Roadmap pytorch/benchmark#1293 for fx int8 support, below is an example on CLX machine ``` $ python run.py alexnet -d cpu -t eval --precision fp32 -m eager Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 93.586 milliseconds CPU Peak Memory: 6.3857 GB $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m eager Running eval method from alexnet on cpu in eager mode with input batch size 128. CPU Total Wall Time: 21.892 milliseconds CPU Peak Memory: 1.4150 GB $ python run.py alexnet -d cpu -t eval --precision fp32 -m jit Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 70.556 milliseconds CPU Peak Memory: 1.5918 GB Correctness: True $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 21.176 milliseconds CPU Peak Memory: 1.6758 GB Correctness: True $ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit --quant-engine fbgemm Running eval method from alexnet on cpu in jit mode with input batch size 128. CPU Total Wall Time: 29.487 milliseconds CPU Peak Memory: 1.6777 GB Correctness: True ``` Pull Request resolved: pytorch/benchmark#1485 Reviewed By: weiwangmeta Differential Revision: D44256938 Pulled By: xuzhao9 fbshipit-source-id: 1754028660b6908e66616531a42571e9c08690e6
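The speedup and memory savings above come from storing tensors as 8-bit integers. The affine quantize/dequantize arithmetic that int8 backends apply per tensor can be sketched in plain Python (an illustration of the numerics only, not the fbgemm/onednn kernels):

```python
def choose_qparams(xs, qmin=-128, qmax=127):
    """Pick scale/zero-point so the observed range maps onto [qmin, qmax]."""
    lo, hi = min(min(xs), 0.0), max(max(xs), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid 0 scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(xs, scale, zp, qmin=-128, qmax=127):
    # q = clamp(round(x / scale) + zp, qmin, qmax)
    return [max(qmin, min(qmax, round(x / scale) + zp)) for x in xs]

def dequantize(qs, scale, zp):
    # x ≈ (q - zp) * scale
    return [(q - zp) * scale for q in qs]

xs = [-1.0, 0.0, 0.5, 2.0]
scale, zp = choose_qparams(xs)
roundtrip = dequantize(quantize(xs, scale, zp), scale, zp)
```

Each value survives the round trip to within one quantization step (`scale`), which is why int8 inference is viable for many models despite the reduced precision.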
Summary: This PR is to add typical GNN workloads which is one task in pytorch/benchmark#1293. This task includes: - Add models in `torchbenchmark`: including `GCN`, `GraphSage` and `GAT`. - Use real datasets as inputs: Split subgraph from `Reddit`. - Add metrics Pull Request resolved: pytorch/benchmark#1422 Reviewed By: weiwangmeta Differential Revision: D43946504 Pulled By: xuzhao9 fbshipit-source-id: 6e979ed871c2d3bffa16ca1c6b713d71f56bb7e3
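For intuition, the core aggregation a GCN layer performs, H' = D^-1/2 (A + I) D^-1/2 H (learned weights and activation omitted), can be sketched without any framework. This is a toy dense-matrix illustration, not the PR's PyG-based implementation:

```python
import math

def gcn_propagate(adj, feats):
    """One GCN aggregation step over a dense 0/1 adjacency matrix:
    add self-loops, then average neighbor features with symmetric
    degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        out.append([
            sum(a_hat[i][j] * feats[j][f] / math.sqrt(deg[i] * deg[j])
                for j in range(n))
            for f in range(len(feats[0]))
        ])
    return out

# Two connected nodes with identical features stay identical after smoothing.
adj = [[0, 1], [1, 0]]
feats = [[1.0], [1.0]]
smoothed = gcn_propagate(adj, feats)
```

GraphSage and GAT differ mainly in this aggregation step (sampled mean vs. learned attention weights), which is why the three models make a representative GNN set.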
Summary: Fix pytorch/benchmark#1548 . Works for Roadmap pytorch/benchmark#1293 for Increase benchmark coverage, Before: ```bash python run.py llama -d cpu Traceback (most recent call last): File "run.py", line 298, in <module> m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args) File "/workspace/benchmark/torchbenchmark/util/model.py", line 20, in __call__ obj = type.__call__(cls, *args, **kwargs) File "/workspace/benchmark/torchbenchmark/models/llama/__init__.py", line 16, in __init__ super().__init__(test=test, device=device, jit=jit, batch_size=batch_size, extra_args=extra_args) File "/workspace/benchmark/torchbenchmark/util/model.py", line 84, in __init__ self.determine_batch_size(batch_size) File "/workspace/benchmark/torchbenchmark/util/model.py", line 216, in determine_batch_size raise NotImplementedError(f"Test {self.test} is not implemented.") NotImplementedError: Test eval is not implemented. ``` After: ```bash python run.py llama -d cpu --bs 32 Running eval method from llama on cpu in eager mode with input batch size 32. CPU Total Wall Time: 11.997 milliseconds CPU Peak Memory: 1.3799 GB python run.py llama -d cpu --bs 16 Running eval method from llama on cpu in eager mode with input batch size 16. CPU Total Wall Time: 9.870 milliseconds CPU Peak Memory: 1.3770 GB ``` Pull Request resolved: pytorch/benchmark#1549 Reviewed By: aaronenyeshi Differential Revision: D45005325 Pulled By: xuzhao9 fbshipit-source-id: 265532b33f83e87fecf94eac95e29f65ad8083f4
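The traceback shows `determine_batch_size` raising when a model defines no default batch size for a test; the resolution order behind the `--bs` workaround can be sketched as follows (a hypothetical standalone helper, not the actual `torchbenchmark/util/model.py` code):

```python
def determine_batch_size(user_bs, default_bs, test="eval"):
    """Resolution order sketched from the traceback above: an explicit
    --bs wins; otherwise fall back to the model's declared default;
    if the model declares none for this test, fail with a clear hint."""
    if user_bs is not None:
        return user_bs
    if default_bs is not None:
        return default_bs
    raise NotImplementedError(
        f"Test {test} has no default batch size; pass one explicitly with --bs."
    )

assert determine_batch_size(32, None) == 32    # the workaround in this PR
assert determine_batch_size(None, 128) == 128  # normal default path
```

Raising with an actionable message (rather than the bare "Test eval is not implemented") would make this class of failure self-explanatory for new models.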
Summary: This PR is to add amp support in CPU in TorchBench, which contributes to pytorch/benchmark#1293. To be compatible with current amp implementation, we add 3 options in `--precision`: `--precision bf16`: use `enable_bf16` to convert model and inputs to bf16 `--precision amp_bf16`: use `torch.cpu.amp.autocast(dtype=torch.bfloat16)` (can extend to cuda bf16 when ready) `--precision amp_fp16`: use `torch.cuda.amp.autocast(dtype=torch.float16)` (can extend to cpu fp16 when ready) `--precision amp`: use torch.autocast(device), same as --amp ### Performance Test in Copper Lake machine. $ python run.py alexnet -d cpu -m eager -t eval --precision fp32 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision fp32. CPU Total Wall Time: 92.600 milliseconds CPU Peak Memory: 1.1299 GB $ python run.py alexnet -d cpu -m eager -t eval --precision bf16 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision bf16. CPU Total Wall Time: 56.580 milliseconds CPU Peak Memory: 0.6934 GB $ python run.py alexnet -d cpu -m eager -t eval --precision amp_bf16 Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16. CPU Total Wall Time: 71.385 milliseconds CPU Peak Memory: 0.9922 GB $ python run.py alexnet -d cpu -m eager -t train --precision fp32 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision fp32. CPU Total Wall Time: 306.164 milliseconds CPU Peak Memory: 2.0977 GB $ python run.py alexnet -d cpu -m eager -t train --precision bf16 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision bf16. CPU Total Wall Time: 180.958 milliseconds CPU Peak Memory: 1.2686 GB $ python run.py alexnet -d cpu -m eager -t train --precision amp_bf16 Running train method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16. 
CPU Total Wall Time: 233.332 milliseconds CPU Peak Memory: 2.0117 GB Pull Request resolved: pytorch/benchmark#1516 Reviewed By: aaronenyeshi Differential Revision: D44883144 Pulled By: xuzhao9 fbshipit-source-id: 75251f9eec128b3a1dbca39540193b89059ec183
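The bf16 numbers above roughly halve memory because bfloat16 keeps only the fp32 sign, exponent, and top 7 mantissa bits. The conversion can be sketched in plain Python (a round-to-nearest-even sketch of the numerics, not how PyTorch stores tensors):

```python
import struct

def to_bf16(x):
    """fp32 -> bf16 -> fp32 round trip: keep the top 16 bits of the
    IEEE-754 encoding, rounding to nearest-even on the dropped half."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

vals = [1.0, 3.14159, -2.5]
bf16_vals = [to_bf16(v) for v in vals]
```

Because bf16 keeps the full 8-bit fp32 exponent, its dynamic range matches fp32, so models rarely need loss-scaling the way fp16 autocast does; only the mantissa precision drops.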
Summary: Add initial cpu userbenchmark for torchbench Works for Roadmap pytorch/benchmark#1293 for cpu userbenchmark extend with below functions. - [x] Add core binding option, support multi-instances test. - [x] Add gomp/iomp option. - [x] Add memory allocator option. - [x] Support all enabled cpu features test based on torchbench models, e.g. channels-last / fx_int8 / jit with fusers - [x] Support latency and cpu_peak_mem metrics for now, will extend to fps-like report - [x] Add `README.md` For example, in below cml, we tested 2 models fx_int8 inference with batch size 8 on CLX socket 0 and 4 instances at the same time. ```shell $ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4" Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')] 2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator 2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7 2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP 2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1 2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so 2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... 
[Done] [Done] [Done] [Done] Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'alexnet', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')] 2023-04-20 00:43:53,444 - __main__ - INFO - Use JeMalloc memory allocator 2023-04-20 00:43:53,444 - __main__ - INFO - OMP_NUM_THREADS=7 2023-04-20 00:43:53,444 - __main__ - INFO - Using Intel OpenMP 2023-04-20 00:43:53,444 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 2023-04-20 00:43:53,444 - __main__ - INFO - KMP_BLOCKTIME=1 2023-04-20 00:43:53,444 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so 2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u 
/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336 Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done] [Done] [Done] [Done] ``` We can find the test results in `.userbenchmark/cpu/cpu-20230420004336`, `cpu` userbenchmark will create a subfolder for each test, and aggregate all test results into `metrics-20230420004336.json`. For each sub-folder, it contains instances logs named with instance PID for that model test. ```shell $ ls .userbenchmark/cpu/cpu-20230420004336 eval_alexnet_eager/ eval_resnet50_eager/ $ ls .userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager/ metrics-3347653.json metrics-3347654.json metrics-3347655.json metrics-3347656.json $ cat .userbenchmark/cpu/metrics-20230420004336.json { "name": "cpu", "environ": { "pytorch_git_version": "de1114554c38322273c066c091d455519d45472d" }, "metrics": { "alexnet-eval-eager_latency": 58.309660750000006, "alexnet-eval-eager_cmem": 0.416259765625, "resnet50-eval-eager_latency": 335.04970325, "resnet50-eval-eager_cmem": 0.90673828125 } } ``` Pull Request resolved: pytorch/benchmark#1559 Reviewed By: aaronenyeshi Differential Revision: D45450175 Pulled By: xuzhao9 fbshipit-source-id: 8e7528f4d694eae182ee601cd80bc6e57cd14e3c
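The `--ninstances 4` log lines above show node 0's 28 cores split into contiguous ranges (`0-6`, `7-13`, `14-20`, `21-27`), each pinned with `numactl`. That partitioning and the resulting command line can be sketched as (simplified; the real `torch.backends.xeon.run_cpu` launcher also sets up the memory allocator and OMP environment variables):

```python
def partition_cores(num_cores, ninstances):
    """Split a NUMA node's cores into contiguous per-instance ranges
    (assumes an even split, as in the 28-core/4-instance run above)."""
    if num_cores % ninstances:
        raise ValueError("cores must divide evenly across instances")
    per = num_cores // ninstances
    return [f"{i * per}-{i * per + per - 1}" for i in range(ninstances)]

def numactl_cmd(core_range, node_id, script, extra_args):
    # Mirrors the "numactl -C <cores> -m <node> python -u ..." lines above.
    return ["numactl", "-C", core_range, "-m", str(node_id),
            "python", "-u", script, *extra_args]

ranges = partition_cores(28, 4)
cmds = [numactl_cmd(r, 0, "run_config.py", ["-m", "alexnet"]) for r in ranges]
```

Pinning each instance to its own cores on one node avoids cross-socket memory traffic and OpenMP thread migration, which is what makes multi-instance CPU latency numbers reproducible.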
Motivation

TorchBench is a collection of open-source benchmarks used to evaluate PyTorch performance. It provides a standardized API for benchmark drivers, covering both inference (eager/JIT) and training, and it includes many popular models, making it convenient for users to debug and profile. In order to standardize performance evaluation and increase coverage, TorchBench can be enhanced on CPU in the following three aspects:

Detailed proposal
Fit for typical user scenarios (especially in userbenchmark)

- add a new userbenchmark with CPU runtime configuration options, and also enable those configurations in `test.py`/`run.py` for sanity checks or debugging
- support performance metrics in the new CPU userbenchmark
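As a concrete example of the kind of performance metric such a userbenchmark could report, a warmup-plus-median latency measurement can be sketched as (an illustrative helper, not the benchmark's actual timer):

```python
import time
from statistics import median

def measure_latency_ms(fn, warmup=3, iters=10):
    """Sketch of a latency metric: discard warmup iterations (caches,
    lazy init), then report the median wall time per call in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return median(samples)

lat = measure_latency_ms(lambda: sum(range(10_000)))
```

The median is preferable to the mean here because CPU runs are easily skewed by one-off scheduler or frequency-scaling spikes.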
Integrate new PyTorch features in a timely manner
Increase benchmark coverage
Increase model coverage
Port OpBench to TorchBench