[Roadmap WIP] Standardize and increase coverage for TorchBench #1293

Open · 12 of 17 tasks
yanbing-j opened this issue Nov 9, 2022 · 11 comments

yanbing-j commented Nov 9, 2022

Motivation

TorchBench is a collection of open-source benchmarks used to evaluate PyTorch performance. It provides a standardized API for benchmark drivers, both for evaluation (eager/jit) and training. TorchBench includes many popular models, making it convenient for users to debug and profile them.

In order to standardize performance evaluation and increase coverage, TorchBench can be enhanced on CPU in the following three aspects:

  • Fit for typical user scenarios
  • Integrate new PyTorch features
  • Increase benchmark coverage

Detailed proposal

Fit for typical user scenarios (especially in userbenchmark)

Add a new userbenchmark with CPU runtime configuration options, and also enable those configurations in test.py/run.py for sanity checking or debugging

  • Add a core-binding option (may leverage the torch CPU launcher)
  • Add gomp/iomp option
  • Add memory allocator option

Support performance metrics in the new CPU userbenchmark (see the sketch after this list)

  • Add throughput: Samples / Total time
  • Add latency: Total time / samples
  • Add fps-like report
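
A minimal sketch of how these metrics could be computed from per-iteration wall time follows; the helper name `measure` and the warm-up/iteration counts are illustrative, not the actual userbenchmark API.

```python
# Hypothetical helper illustrating the throughput/latency definitions above;
# names and iteration counts are placeholders, not TorchBench internals.
import time
import torch

def measure(model, example_inputs, num_iter=20, warmup=3):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up excludes one-time costs
            model(*example_inputs)
        start = time.perf_counter()
        for _ in range(num_iter):
            model(*example_inputs)
        total = time.perf_counter() - start
    samples = num_iter * example_inputs[0].shape[0]
    return {
        "throughput": samples / total,        # Samples / Total time (fps-like)
        "latency_ms": total / samples * 1e3,  # Total time / samples
    }
```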

Integrate new PyTorch features

  • Enable bf16 datatype support both for inference and training
  • Fully support channels_last both for inference and training
  • Extend a compiler option to support Dynamo (see the sketch after this list)
  • Support JIT tracing and cover more models with JIT support
  • Enable quantization support
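
Below is a hedged sketch of how several of these features combine on a single model (channels_last, bf16 autocast on CPU, and Dynamo via torch.compile); the actual flag plumbing in run.py may differ, and torchvision's resnet50 is used only as a stand-in model.

```python
# Hedged illustration of channels_last + CPU bf16 autocast + torch.compile (Dynamo).
import torch
import torchvision.models as models

model = models.resnet50().eval().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224).contiguous(memory_format=torch.channels_last)

compiled = torch.compile(model)  # Dynamo/Inductor as the compiler backend

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = compiled(x)
```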

Increase benchmark coverage

Increase model coverage

  • Add popular models from the community (e.g., RNN-T)
  • Add models from real customers (Multi-Band MelGAN, ViT and Wav2vec)
  • Fix models not implemented on CPU (e.g., DALLE2_pytorch, moco, pytorch_struct, tacotron2, timm_efficientdet, vision_maskrcnn)
  • Add typical GNN workloads

Port OpBench to TorchBench

  • Increase OpBench coverage
  • Complete support for dtypes, memory formats, and in-place variants of ops (see the sketch below)
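
As a sketch of what an operator-level benchmark in the spirit of OpBench could look like, the snippet below uses torch.utils.benchmark to sweep one op over two dtypes; the real OpBench harness and its op/dtype/memory-format registry are not shown here, and the op and shapes are illustrative only.

```python
# Hedged operator microbenchmark sketch using torch.utils.benchmark.
import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

results = []
for dtype in (torch.float32, torch.bfloat16):
    a, b = x.to(dtype), y.to(dtype)
    timer = benchmark.Timer(
        stmt="torch.add(a, b)",
        globals={"torch": torch, "a": a, "b": b},
        label="add",
        sub_label=str(dtype),
    )
    results.append(timer.blocked_autorange(min_run_time=0.2))

benchmark.Compare(results).print()
```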
xuzhao9 commented Nov 9, 2022

To better serve different stakeholders, we are migrating away from test_bench.py to the new "userbenchmark" approach. In benchmark/userbenchmark (https://github.com/pytorch/benchmark/tree/main/userbenchmark), we encourage users to develop their own customized benchmarks with TorchBench models and use the "run_benchmark.py" driver to run them.

Benefits of using TorchBench userbenchmark:

  • We can decouple benchmarks from their infrastructure and run different benchmarks on different machines. For example, CPU benchmarks don't need GPU machines, and GPU benchmarks don't need an especially powerful CPU.
  • We are also decoupling the benchmark model code from the experiments we run on it. We believe this design is much clearer and makes it easy to attribute code to the correct owners.
  • We can easily support a specific userbenchmark at the PyTorch PR level. For example, specifying "RUN_TORCHBENCH: " runs a userbenchmark in a PyTorch PR for A/B testing, with result visualization on PyTorch HUD (e.g., https://hud.pytorch.org/userbenchmark_view?url=https:%2F%2Fossci-metrics.s3.amazonaws.com%2Ftorchbench-pr-test%2Fpr84626%2Fresult.csv)
  • In the future we can also bisect userbenchmark metrics to pinpoint problematic commits.

Therefore, I believe the above section "Fit for typical user scenarios" should be developed as a new TorchBench userbenchmark, instead of modifying test_bench.py.
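
For illustration, a minimal, hypothetical sketch of what such a userbenchmark entry point might look like is shown below; the exact module layout and entry-point signature that run_benchmark.py expects should be taken from the userbenchmark documentation, and everything here (file path, argument names, placeholder metrics) is an assumption.

```python
# Hypothetical userbenchmark/cpu/__init__.py sketch; the real contract that
# run_benchmark.py expects is defined by the userbenchmark docs, not this snippet.
import argparse
import json
from pathlib import Path

def run(args):
    parser = argparse.ArgumentParser("cpu userbenchmark")
    parser.add_argument("--model", required=True)
    parser.add_argument("--test", default="eval", choices=["eval", "train"])
    opts = parser.parse_args(args)

    # ... load the TorchBench model by name, bind cores, run it, collect metrics ...
    metrics = {f"{opts.model}-{opts.test}_latency": 0.0}  # placeholder values

    out_dir = Path(".userbenchmark/cpu")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps({"name": "cpu", "metrics": metrics}))
```

It would then be driven by something like `python run_benchmark.py cpu --model resnet50`, assuming the driver forwards the remaining arguments to the benchmark.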

We are still keeping test.py for unit testing purposes for now.

I am happy to answer any questions about TorchBench userbenchmarks from Intel; please feel free to reach out on Slack or here on GitHub.

yanbing-j commented Nov 9, 2022

@xuzhao9 Thanks for the information! We will look into userbenchmark and update this roadmap.
I have some questions: will userbenchmark completely replace test_bench for guaranteeing PR quality in the future? And what about torchbenchmark? Does any userbenchmark support CPU at present? Thanks!

yanbing-j changed the title from "[Roadmap] Standardize and increase coverage for TorchBench" to "[Roadmap WIP] Standardize and increase coverage for TorchBench" on Nov 9, 2022
xuzhao9 commented Nov 11, 2022

@yanbing-j My answers:

  1. Yes, the plan is to use userbenchmark to replace test_bench.py. However, the PR quality will still be guaranteed with test.py, which is what we are doing right now. We will keep test.py but deprecate test_bench.py.
  2. torchbenchmark describes the benchmark model code, and we won't change that.
  3. The "release-test" (https://github.com/pytorch/benchmark/tree/main/userbenchmark/release-test) userbenchmark tests both CPU and GPU performance on a couple of models in pytorch/examples. We are also working to deliver more CPU userbenchmarks soon. The next one will measure the stability of both CPU and GPU tests across torchbench.

chuanqi129 commented Nov 17, 2022

Hi @xuzhao9, thanks for sharing the future plan. May I know whether there is any guideline or document demonstrating how to enable a new benchmark under userbenchmark?

> We are also working to deliver more CPU userbenchmarks soon.

Could you also share a rough timeline for delivering these CPU userbenchmarks? And from your perspective, would you like @yanbing-j and me to build on this CPU userbenchmark in the near future, or can we define a new one?

xuzhao9 commented Nov 29, 2022

@chuanqi129 We plan to deliver the first CPU userbenchmark within a month; it will be about the stability of CPU latency across all TorchBench models. I suggest Intel work on their own userbenchmark (a new one).

I created a userbenchmark doc here: #1328

@chuanqi129

Thanks @xuzhao9 for the update; I will check it.

facebook-github-bot pushed a commit that referenced this issue Dec 9, 2022
Summary:
We are adding documentation on how to develop a TorchBench userbenchmark for our customers.

Related to #1293

Pull Request resolved: #1328

Reviewed By: davidberard98

Differential Revision: D41778804

Pulled By: xuzhao9

fbshipit-source-id: 89e019ad2014aa0b60b82c7104b93fa63b71e169
facebook-github-bot pushed a commit that referenced this issue Feb 15, 2023
…#1371)

Summary:
In order to standardize performance evaluation and increase coverage, this PR adds `channels_last` support for all torchbench models and enables it via the `run.py` entry point for debugging. This is a first step toward standardizing and increasing coverage for TorchBench, and works toward the roadmap #1293.

Take `alexnet` as an example, run on a CLX 8280L (28cc):
```shell
python run.py alexnet -d cpu -m eager -t eval
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 108.189 milliseconds
```

```shell
python run.py alexnet -d cpu -m eager -t eval --channels-last
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  72.930 milliseconds
```
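
For reference, a minimal sketch of the transformation the `--channels-last` flag presumably applies is shown below (the standard PyTorch recipe); the actual hook in torchbenchmark/util may differ, and the helper name is illustrative.

```python
# Standard channels_last recipe, shown as a hedged stand-in for the run.py flag.
import torch

def to_channels_last(model, example_inputs):
    model = model.to(memory_format=torch.channels_last)
    example_inputs = tuple(
        t.contiguous(memory_format=torch.channels_last) if t.dim() == 4 else t
        for t in example_inputs
    )
    return model, example_inputs
```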

Pull Request resolved: #1371

Reviewed By: davidberard98

Differential Revision: D43273579

Pulled By: xuzhao9

fbshipit-source-id: 9597d996d27dd228445e3e8122e5e7131cc93669
facebook-github-bot pushed a commit that referenced this issue Mar 7, 2023
Summary:
Enabled fuser selection for the JIT backend and added fuser3 for the LLGA path.

Works toward Roadmap #1293 for JIT support; below is an example on a CLX machine.

```
$ python run.py alexnet -d cpu -m jit    -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.918 milliseconds
Correctness:                         True
$ python run.py alexnet -d cpu -m jit  --fuser fuser0  -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.741 milliseconds
Correctness:                         True
$ python run.py alexnet -d cpu -m jit  --fuser fuser3  -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  64.179 milliseconds
Correctness:                         True
```
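
A hedged sketch of how the `--fuser` option maps onto TorchScript fuser selection is shown below: fuser0-2 use the `torch.jit.fuser` context, while the LLGA/oneDNN Graph path (called fuser3 here) is toggled separately. The exact wiring inside run.py may differ, and alexnet is only a stand-in model.

```python
# Illustrative TorchScript fuser selection for CPU inference.
import torch
import torchvision.models as models

model = models.alexnet().eval()
x = torch.randn(128, 3, 224, 224)

with torch.no_grad():
    # fuser0 (legacy) / fuser1 (NNC) / fuser2 (NVFuser) via the context manager
    with torch.jit.fuser("fuser0"):
        traced = torch.jit.freeze(torch.jit.trace(model, x))
        traced(x); traced(x)  # warm-up runs trigger profiling and fusion

    # "fuser3": oneDNN Graph (LLGA) fusion for CPU inference
    torch.jit.enable_onednn_fusion(True)
    traced_llga = torch.jit.freeze(torch.jit.trace(model, x))
    traced_llga(x); traced_llga(x)
```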

Pull Request resolved: #1449

Reviewed By: davidberard98

Differential Revision: D43837782

Pulled By: xuzhao9

fbshipit-source-id: 313578112f1a406d42bc5d0d599e5fc20f4bfd0b
facebook-github-bot pushed a commit that referenced this issue Mar 21, 2023
Summary:
Enable FX int8 for most models on the CPU device.

Works toward Roadmap #1293 for FX int8 support; below is an example on a CLX machine.

```
$ python run.py alexnet -d cpu -t eval --precision fp32 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  93.586 milliseconds
CPU Peak Memory:                6.3857 GB

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  21.892 milliseconds
CPU Peak Memory:                1.4150 GB

$ python run.py alexnet -d cpu -t eval --precision fp32 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.556 milliseconds
CPU Peak Memory:                1.5918 GB
Correctness:                         True

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  21.176 milliseconds
CPU Peak Memory:                1.6758 GB
Correctness:                         True

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit --quant-engine fbgemm
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  29.487 milliseconds
CPU Peak Memory:                1.6777 GB
Correctness:                         True
```
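
For context, a hedged sketch of the FX graph-mode int8 path that `--precision fx_int8` presumably exercises is shown below; the actual TorchBench calibration loop and qconfig selection are not shown, and the fbgemm backend is assumed.

```python
# FX graph-mode post-training quantization sketch (fbgemm backend assumed).
import torch
import torchvision.models as models
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = models.alexnet().eval()
example_inputs = (torch.randn(128, 3, 224, 224),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
with torch.no_grad():
    prepared(*example_inputs)        # calibration pass
quantized = convert_fx(prepared)     # int8 model
out = quantized(*example_inputs)
```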

Pull Request resolved: #1485

Reviewed By: weiwangmeta

Differential Revision: D44256938

Pulled By: xuzhao9

fbshipit-source-id: 1754028660b6908e66616531a42571e9c08690e6
facebook-github-bot pushed a commit that referenced this issue Mar 22, 2023
Summary:
This PR adds typical GNN workloads, which is one of the tasks in #1293.

This task includes:

- Add models in `torchbenchmark`: including `GCN`, `GraphSage` and `GAT`.
- Use real datasets as inputs: Split subgraph from `Reddit`.
- Add metrics
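
Below is a self-contained, hedged GCN sketch in the spirit of the added GNN workloads; the actual TorchBench model wrappers and the Reddit subgraph loader differ, and the synthetic graph here is only a stand-in input.

```python
# Minimal two-layer GCN with PyTorch Geometric; shapes are illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

x = torch.randn(100, 64)                      # 100 nodes, 64 features
edge_index = torch.randint(0, 100, (2, 500))  # random synthetic edges
model = GCN(64, 128, 41).eval()
with torch.no_grad():
    out = model(x, edge_index)
```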

Pull Request resolved: #1422

Reviewed By: weiwangmeta

Differential Revision: D43946504

Pulled By: xuzhao9

fbshipit-source-id: 6e979ed871c2d3bffa16ca1c6b713d71f56bb7e3
facebook-github-bot pushed a commit that referenced this issue Apr 14, 2023
Summary:
Fixes #1548. Works toward Roadmap #1293 by increasing benchmark coverage.

Before:
```bash
python run.py llama -d cpu
Traceback (most recent call last):
  File "run.py", line 298, in <module>
    m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 20, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/workspace/benchmark/torchbenchmark/models/llama/__init__.py", line 16, in __init__
    super().__init__(test=test, device=device, jit=jit, batch_size=batch_size, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 84, in __init__
    self.determine_batch_size(batch_size)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 216, in determine_batch_size
    raise NotImplementedError(f"Test {self.test} is not implemented.")
NotImplementedError: Test eval is not implemented.
```

After:

```bash
python run.py llama -d cpu --bs 32
Running eval method from llama on cpu in eager mode with input batch size 32.
CPU Total Wall Time:  11.997 milliseconds
CPU Peak Memory:                1.3799 GB

python run.py llama -d cpu --bs 16
Running eval method from llama on cpu in eager mode with input batch size 16.
CPU Total Wall Time:   9.870 milliseconds
CPU Peak Memory:                1.3770 GB
```

Pull Request resolved: #1549

Reviewed By: aaronenyeshi

Differential Revision: D45005325

Pulled By: xuzhao9

fbshipit-source-id: 265532b33f83e87fecf94eac95e29f65ad8083f4
facebook-github-bot pushed a commit that referenced this issue Apr 24, 2023
Summary:
This PR adds AMP support on CPU in TorchBench, which contributes to #1293.

To be compatible with the current amp implementation, we add 3 options to `--precision` (see the sketch after this list):

- `--precision bf16`: use `enable_bf16` to convert the model and inputs to bf16
- `--precision amp_bf16`: use `torch.cpu.amp.autocast(dtype=torch.bfloat16)` (can extend to cuda bf16 when ready)
- `--precision amp_fp16`: use `torch.cuda.amp.autocast(dtype=torch.float16)` (can extend to cpu fp16 when ready)

`--precision amp`: use torch.autocast(device), same as --amp
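
A minimal sketch of the difference between the `bf16` and `amp_bf16` paths on CPU is shown below (the standard explicit-cast vs. autocast recipes); the actual hooks inside TorchBench may differ, and alexnet is only a stand-in model.

```python
# bf16 (explicit cast) vs amp_bf16 (CPU autocast) illustration.
import torch
import torchvision.models as models

model = models.alexnet().eval()
x = torch.randn(128, 3, 224, 224)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out_amp = model(x)                           # roughly what --precision amp_bf16 does

model_bf16 = model.to(torch.bfloat16)
with torch.no_grad():
    out_bf16 = model_bf16(x.to(torch.bfloat16))  # roughly what --precision bf16 does
```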

### Performance

Tested on a Cooper Lake machine.

```shell
$ python run.py alexnet -d cpu -m eager -t eval --precision fp32
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time:  92.600 milliseconds
CPU Peak Memory:                1.1299 GB

$ python run.py alexnet -d cpu -m eager -t eval --precision bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time:  56.580 milliseconds
CPU Peak Memory:                0.6934 GB

$ python run.py alexnet -d cpu -m eager -t eval --precision amp_bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time:  71.385 milliseconds
CPU Peak Memory:                0.9922 GB

$ python run.py alexnet -d cpu -m eager -t train --precision fp32
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time: 306.164 milliseconds
CPU Peak Memory:                2.0977 GB

$ python run.py alexnet -d cpu -m eager -t train --precision bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time: 180.958 milliseconds
CPU Peak Memory:                1.2686 GB

$ python run.py alexnet -d cpu -m eager -t train --precision amp_bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time: 233.332 milliseconds
CPU Peak Memory:                2.0117 GB
```

Pull Request resolved: #1516

Reviewed By: aaronenyeshi

Differential Revision: D44883144

Pulled By: xuzhao9

fbshipit-source-id: 75251f9eec128b3a1dbca39540193b89059ec183
facebook-github-bot pushed a commit that referenced this issue May 2, 2023
Summary:
Add initial cpu userbenchmark for torchbench

Works toward Roadmap #1293, extending the cpu userbenchmark with the functions below.

- [x] Add core binding option, support multi-instances test.
- [x] Add gomp/iomp option.
- [x] Add memory allocator option.
- [x] Support all enabled cpu features test based on torchbench models, e.g. channels-last / fx_int8 / jit with fusers
- [x] Support latency and cpu_peak_mem metrics for now, will extend to fps-like report
- [x] Add `README.md`

For example, with the command line below, we tested fx_int8 inference for 2 models with batch size 8 on CLX socket 0, running 4 instances at the same time.
```shell
$ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4"
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
 [Done]
 [Done]
 [Done]
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'alexnet', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:53,444 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:53,444 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:53,444 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:53,444 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
 [Done]
 [Done]
 [Done]
```
We can find the test results in `.userbenchmark/cpu/cpu-20230420004336`. The `cpu` userbenchmark creates a subfolder for each test and aggregates all test results into `metrics-20230420004336.json`. Each sub-folder contains per-instance logs named with the instance PID for that model test.
```shell
$ ls .userbenchmark/cpu/cpu-20230420004336
eval_alexnet_eager/  eval_resnet50_eager/
$ ls .userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager/
metrics-3347653.json  metrics-3347654.json  metrics-3347655.json  metrics-3347656.json
$ cat .userbenchmark/cpu/metrics-20230420004336.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "de1114554c38322273c066c091d455519d45472d"
    },
    "metrics": {
        "alexnet-eval-eager_latency": 58.309660750000006,
        "alexnet-eval-eager_cmem": 0.416259765625,
        "resnet50-eval-eager_latency": 335.04970325,
        "resnet50-eval-eager_cmem": 0.90673828125
    }
}
```
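
As a rough illustration of the aggregation step described above, the sketch below averages per-instance latency files into a single summary; the real cpu userbenchmark's file schema and aggregation logic may differ, and the `latency` key in the per-instance files is an assumption.

```python
# Hypothetical aggregation of per-instance metrics-<PID>.json files; schema assumed.
import json
from pathlib import Path
from statistics import mean

def aggregate(run_dir: Path) -> dict:
    metrics = {}
    for test_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        latencies = [
            json.loads(f.read_text()).get("latency", 0.0)
            for f in test_dir.glob("metrics-*.json")
        ]
        if latencies:
            metrics[f"{test_dir.name}_latency"] = mean(latencies)
    return {"name": "cpu", "metrics": metrics}
```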

Pull Request resolved: #1559

Reviewed By: aaronenyeshi

Differential Revision: D45450175

Pulled By: xuzhao9

fbshipit-source-id: 8e7528f4d694eae182ee601cd80bc6e57cd14e3c
xuzhao9 commented May 5, 2023

Hi @chuanqi129 , I tried running your userbenchmark on our CI runner, and it failed with error: https://github.com/pytorch/benchmark/actions/runs/4885488240

Also, please let me know which runner you would like to deploy your benchmark on.

@chuanqi129

> Hi @chuanqi129 , I tried running your userbenchmark on our CI runner, and it failed with error: https://github.com/pytorch/benchmark/actions/runs/4885488240
>
> Also, please let me know which runner you would like to deploy your benchmark on.

Thanks @xuzhao9 for your great support for the cpu userbenchmark. I'm sorry for the late reply; I was out of office for the labor holiday and annual leave. I will focus on it in the coming days and fix the CI runner failures.

Some replies to comments in #1559 below:

> For on-demand AWS instances, can you check whether any of the AWS instances in https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml can be used? It is the preferred approach.

I have double-checked the instance types in the PyTorch CI node pool. The c5 instances are based on 2nd Generation Intel® Xeon® Scalable Processors (CLX), which support fp32 and int8 but not the bf16 datatype. We can try the linux.24xlarge instance for an initial test.

> If not (for example, if we can't reach a reasonably low noise level), we can use the AWS metal instance, which is AWS g4dn.metal, with an Intel(R) Xeon(R) Platinum 8259CL CPU. Does it support fp32/int8?

The 8259CL also belongs to the 2nd Generation Intel® Xeon® Scalable Processors (CLX), so it also supports fp32/int8. We can use the linux.24xlarge instance first; if it has large noise, we can try this metal one later.

BTW, ideally it would be great if we could deploy the cpu benchmark on a c6i.16xlarge instance, because that is the same instance type used by our dynamo cpu dashboard. (Nice to have)

xuzhao9 commented May 11, 2023

@chuanqi129 I am wondering whether the dynamo cpu dashboard works on GitHub Actions. Can I have the GitHub Actions workflow file?

@chuanqi129

> @chuanqi129 I am wondering whether the dynamo cpu dashboard works on GitHub Actions. Can I have the GitHub Actions workflow file?

No, the dynamo cpu dashboard is maintained on our side. Attached are the Dockerfile and scripts used for this test; all needed components are built from source. I also think it would be great to integrate this dynamo cpu dashboard test into PyTorch GitHub Actions, but it needs a c6i instance.

facebook-github-bot pushed a commit that referenced this issue Jul 31, 2023
Summary:
Fixed the cpu userbenchmark in jit mode.

Works for #1293

Pull Request resolved: #1797

Reviewed By: FindHao

Differential Revision: D47917179

Pulled By: xuzhao9

fbshipit-source-id: fcb95f7c9e9b5a3cf199afe1e3d6a1e19036884d
facebook-github-bot pushed a commit that referenced this issue Sep 20, 2023
Summary:
To stabilize the benchmark results of the cpu userbenchmark.

Also works for #1293

Pull Request resolved: #1908

Reviewed By: davidberard98

Differential Revision: D49418232

Pulled By: xuzhao9

fbshipit-source-id: b4d0aa97fa06ffaf12984fa5dece6a0d21759fe8
facebook-github-bot pushed a commit that referenced this issue Apr 13, 2024
Summary:
Works for Roadmap #1293 to increase benchmark coverage.

For these 5 models: tacotron2, yolov3, nvidia_deeprecommender, LearningToPoint, and pytorch_CycleGAN_and_pix2pix,
running on custom devices other than CPU and CUDA (e.g. XPU) raises an error because the CPU/CUDA backends are hard-coded.
In this PR, we accept the device argument as a parameter in the training and inference processes, covering model initialization and data movement to these custom devices, as sketched below.
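
A hedged illustration of the change: take the device from the benchmark arguments instead of hard-coding "cpu"/"cuda", so custom backends such as XPU work. The helper name below is illustrative, not the actual model code.

```python
# Device-agnostic setup sketch: move model and inputs to whatever device is requested.
import torch

def prepare(model, example_inputs, device: str):
    model = model.to(device)
    example_inputs = tuple(t.to(device) for t in example_inputs)
    return model, example_inputs
```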

Pull Request resolved: #2230

Reviewed By: aaronenyeshi

Differential Revision: D56097643

Pulled By: xuzhao9

fbshipit-source-id: deba28fee42b5119f62dbddc15e017bf00eb6843
facebook-github-bot pushed a commit that referenced this issue May 30, 2024
Summary:
This Pull Request relates to Roadmap Issue #1293 by enhancing our benchmark coverage.

Currently, Torchbench utilizes a custom random seed function that is incompatible with the XPU device backend.
This incompatibility affects models that include random data augmentation operations, leading to accuracy check failures due to variations in input data across two separate runs.

In this PR, we introduce support for setting a random seed for the XPU backend.
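
A hedged sketch of device-aware seeding in the spirit of this change is shown below; the actual TorchBench seed function differs, and whether `torch.xpu.manual_seed_all` is available depends on the PyTorch/IPEX build, so it is guarded.

```python
# Illustrative device-aware seed helper (not the actual TorchBench implementation).
import random
import numpy as np
import torch

def set_random_seed(seed: int, device: str = "cpu"):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    elif device == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)
```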

Pull Request resolved: #2270

Reviewed By: davidberard98

Differential Revision: D57884498

Pulled By: xuzhao9

fbshipit-source-id: 24234674333945782b233191e42a8b344e90d74c