[Roadmap WIP] Standardize and increase coverage for TorchBench #1293

Open · 12 of 17 tasks
yanbing-j opened this issue Nov 9, 2022 · 11 comments

yanbing-j commented Nov 9, 2022

Motivation

TorchBench is a collection of open-source benchmarks used to evaluate PyTorch performance. It provides a standardized API for benchmark drivers, both for evaluation (eager/jit) and training. TorchBench includes many popular models, making it convenient for users to debug and profile them.

In order to standardize performance evaluation and increase coverage, TorchBench can be enhanced on CPU in the following three aspects:

  • Fit for typical user scenarios
  • Integrate new PyTorch features
  • Increase benchmark coverage

Detailed proposal

Fit for typical user scenarios (especially in userbenchmark)

Add a new userbenchmark with CPU runtime configuration options, and also enable those configurations in test.py/run.py for sanity checking or debugging

  • Add a core-binding option (may leverage the torch CPU launcher)
  • Add gomp/iomp option
  • Add memory allocator option

Support performance metrics in the new CPU userbenchmark (see the sketch after this list)

  • Add throughput: Samples / Total time
  • Add latency: Total time / samples
  • Add fps-like report
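
A minimal sketch of how these metrics could be computed from per-iteration wall time follows; the helper name `measure` and the warm-up/iteration counts are illustrative, not the actual userbenchmark API.

```python
# Hypothetical helper illustrating the throughput/latency definitions above;
# names and iteration counts are placeholders, not TorchBench internals.
import time
import torch

def measure(model, example_inputs, num_iter=20, warmup=3):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up excludes one-time costs
            model(*example_inputs)
        start = time.perf_counter()
        for _ in range(num_iter):
            model(*example_inputs)
        total = time.perf_counter() - start
    samples = num_iter * example_inputs[0].shape[0]
    return {
        "throughput": samples / total,        # Samples / Total time (fps-like)
        "latency_ms": total / samples * 1e3,  # Total time / samples
    }
```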

Integrate new PyTorch features

  • Enable bf16 datatype support both for inference and training
  • Fully support channels_last both for inference and training
  • Extend a compiler option to support Dynamo (see the sketch after this list)
  • Support JIT tracing and cover more models with JIT support
  • Enable quantization support
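
Below is a hedged sketch of how several of these features combine on a single model (channels_last, bf16 autocast on CPU, and Dynamo via torch.compile); the actual flag plumbing in run.py may differ, and torchvision's resnet50 is used only as a stand-in model.

```python
# Hedged illustration of channels_last + CPU bf16 autocast + torch.compile (Dynamo).
import torch
import torchvision.models as models

model = models.resnet50().eval().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224).contiguous(memory_format=torch.channels_last)

compiled = torch.compile(model)  # Dynamo/Inductor as the compiler backend

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = compiled(x)
```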

Increase benchmark coverage

Increase model coverage

  • Add popular models from the community (e.g., RNN-T)
  • Add models from real customers (Multi-Band MelGAN, ViT and Wav2vec)
  • Fix models not implemented on CPU (e.g., DALLE2_pytorch, moco, pytorch_struct, tacotron2, timm_efficientdet, vision_maskrcnn)
  • Add typical GNN workloads

Port OpBench to TorchBench

  • Increase OpBench coverage
  • Complete support for dtypes, memory formats, and in-place variants of ops (see the sketch below)
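
As a sketch of what an operator-level benchmark in the spirit of OpBench could look like, the snippet below uses torch.utils.benchmark to sweep one op over two dtypes; the real OpBench harness and its op/dtype/memory-format registry are not shown here, and the op and shapes are illustrative only.

```python
# Hedged operator microbenchmark sketch using torch.utils.benchmark.
import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

results = []
for dtype in (torch.float32, torch.bfloat16):
    a, b = x.to(dtype), y.to(dtype)
    timer = benchmark.Timer(
        stmt="torch.add(a, b)",
        globals={"torch": torch, "a": a, "b": b},
        label="add",
        sub_label=str(dtype),
    )
    results.append(timer.blocked_autorange(min_run_time=0.2))

benchmark.Compare(results).print()
```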
xuzhao9 commented Nov 9, 2022

To better serve different stakeholders, we are migrating away from test_bench.py to the new "userbenchmark" approach. In benchmark/userbenchmark (https://github.com/pytorch/benchmark/tree/main/userbenchmark), we encourage users to develop their own customized benchmarks with TorchBench models and use the "run_benchmark.py" driver to run them.

Benefits of using TorchBench userbenchmark:

  • We can decouple benchmarks from their infrastructure and run different benchmarks on different machines. For example, CPU benchmarks don't need GPU machines, and GPU benchmarks don't need an especially powerful CPU.
  • We are also decoupling the benchmark model code from the experiments we run on it. We believe this design is much clearer and makes it easy to attribute code to the correct owners.
  • We can easily support a specific userbenchmark at the PyTorch PR level. For example, specifying "RUN_TORCHBENCH: " runs a userbenchmark in a PyTorch PR for A/B testing, with result visualization on PyTorch HUD (e.g., https://hud.pytorch.org/userbenchmark_view?url=https:%2F%2Fossci-metrics.s3.amazonaws.com%2Ftorchbench-pr-test%2Fpr84626%2Fresult.csv)
  • In the future we can also bisect userbenchmark metrics to pinpoint problematic commits.

Therefore, I believe the above section "Fit for typical user scenarios" should be developed as a new TorchBench userbenchmark, instead of modifying test_bench.py.
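
For illustration, a minimal, hypothetical sketch of what such a userbenchmark entry point might look like is shown below; the exact module layout and entry-point signature that run_benchmark.py expects should be taken from the userbenchmark documentation, and everything here (file path, argument names, placeholder metrics) is an assumption.

```python
# Hypothetical userbenchmark/cpu/__init__.py sketch; the real contract that
# run_benchmark.py expects is defined by the userbenchmark docs, not this snippet.
import argparse
import json
from pathlib import Path

def run(args):
    parser = argparse.ArgumentParser("cpu userbenchmark")
    parser.add_argument("--model", required=True)
    parser.add_argument("--test", default="eval", choices=["eval", "train"])
    opts = parser.parse_args(args)

    # ... load the TorchBench model by name, bind cores, run it, collect metrics ...
    metrics = {f"{opts.model}-{opts.test}_latency": 0.0}  # placeholder values

    out_dir = Path(".userbenchmark/cpu")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps({"name": "cpu", "metrics": metrics}))
```

It would then be driven by something like `python run_benchmark.py cpu --model resnet50`, assuming the driver forwards the remaining arguments to the benchmark.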

We are still keeping test.py for unit testing purposes for now.

I am happy to answer any questions about TorchBench userbenchmarks from Intel; please feel free to reach out on Slack or here on GitHub.

yanbing-j commented Nov 9, 2022

@xuzhao9 Thanks for the information! We will look into userbenchmark and update this roadmap.
I have some questions: will userbenchmark completely replace test_bench for guaranteeing PR quality in the future? And what about torchbenchmark? Does any userbenchmark support CPU at present? Thanks!

yanbing-j changed the title from "[Roadmap] Standardize and increase coverage for TorchBench" to "[Roadmap WIP] Standardize and increase coverage for TorchBench" on Nov 9, 2022
xuzhao9 commented Nov 11, 2022

@yanbing-j My answers:

  1. Yes, the plan is to use userbenchmark to replace test_bench.py. However, the PR quality will still be guaranteed with test.py, which is what we are doing right now. We will keep test.py but deprecate test_bench.py.
  2. torchbenchmark describes the benchmark model code, and we won't change that.
  3. The "release-test" (https://github.com/pytorch/benchmark/tree/main/userbenchmark/release-test) userbenchmark tests both CPU and GPU performance on a couple of models in pytorch/examples. We are also working to deliver more CPU userbenchmarks soon. The next one will measure the stability of both CPU and GPU tests across torchbench.

chuanqi129 commented Nov 17, 2022

Hi @xuzhao9, thanks for sharing the future plan. May I know whether there is any guideline or document demonstrating how to enable a new benchmark under userbenchmark?

> We are also working to deliver more CPU userbenchmarks soon.

Could you also share a rough timeline for delivering these CPU userbenchmarks? And from your perspective, would you like @yanbing-j and me to build on this CPU userbenchmark in the near future, or can we define a new one?

xuzhao9 commented Nov 29, 2022

@chuanqi129 We plan to deliver the first CPU userbenchmark within a month; it will be about the stability of CPU latency across all TorchBench models. I suggest Intel work on their own userbenchmark (a new one).

I created a userbenchmark doc here: #1328

@chuanqi129

Thanks @xuzhao9 for the update; I will check it.

facebook-github-bot pushed a commit that referenced this issue Dec 9, 2022
Summary:
We are adding documentation on how to develop a TorchBench userbenchmark for our customers.

Related to #1293

Pull Request resolved: #1328

Reviewed By: davidberard98

Differential Revision: D41778804

Pulled By: xuzhao9

fbshipit-source-id: 89e019ad2014aa0b60b82c7104b93fa63b71e169
facebook-github-bot pushed a commit that referenced this issue Feb 15, 2023
…#1371)

Summary:
In order to standardize performance evaluation and increase coverage, this PR adds `channels_last` support for all torchbench models and enables it via the `run.py` entry point for debugging. This is a first step toward standardizing and increasing coverage for TorchBench, and works toward the roadmap #1293.

Take `alexnet` as an example, run on a CLX 8280L (28cc):
```shell
python run.py alexnet -d cpu -m eager -t eval
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time: 108.189 milliseconds
```

```shell
python run.py alexnet -d cpu -m eager -t eval --channels-last
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  72.930 milliseconds
```
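
For reference, a minimal sketch of the transformation the `--channels-last` flag presumably applies is shown below (the standard PyTorch recipe); the actual hook in torchbenchmark/util may differ, and the helper name is illustrative.

```python
# Standard channels_last recipe, shown as a hedged stand-in for the run.py flag.
import torch

def to_channels_last(model, example_inputs):
    model = model.to(memory_format=torch.channels_last)
    example_inputs = tuple(
        t.contiguous(memory_format=torch.channels_last) if t.dim() == 4 else t
        for t in example_inputs
    )
    return model, example_inputs
```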

Pull Request resolved: #1371

Reviewed By: davidberard98

Differential Revision: D43273579

Pulled By: xuzhao9

fbshipit-source-id: 9597d996d27dd228445e3e8122e5e7131cc93669
facebook-github-bot pushed a commit that referenced this issue Mar 7, 2023
Summary:
Enabled fuser selection for the JIT backend and added fuser3 for the LLGA path.

Works toward Roadmap #1293 for JIT support; below is an example on a CLX machine.

```
$ python run.py alexnet -d cpu -m jit    -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.918 milliseconds
Correctness:                         True
$ python run.py alexnet -d cpu -m jit  --fuser fuser0  -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.741 milliseconds
Correctness:                         True
$ python run.py alexnet -d cpu -m jit  --fuser fuser3  -t eval
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  64.179 milliseconds
Correctness:                         True
```
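
A hedged sketch of how the `--fuser` option maps onto TorchScript fuser selection is shown below: fuser0-2 use the `torch.jit.fuser` context, while the LLGA/oneDNN Graph path (called fuser3 here) is toggled separately. The exact wiring inside run.py may differ, and alexnet is only a stand-in model.

```python
# Illustrative TorchScript fuser selection for CPU inference.
import torch
import torchvision.models as models

model = models.alexnet().eval()
x = torch.randn(128, 3, 224, 224)

with torch.no_grad():
    # fuser0 (legacy) / fuser1 (NNC) / fuser2 (NVFuser) via the context manager
    with torch.jit.fuser("fuser0"):
        traced = torch.jit.freeze(torch.jit.trace(model, x))
        traced(x); traced(x)  # warm-up runs trigger profiling and fusion

    # "fuser3": oneDNN Graph (LLGA) fusion for CPU inference
    torch.jit.enable_onednn_fusion(True)
    traced_llga = torch.jit.freeze(torch.jit.trace(model, x))
    traced_llga(x); traced_llga(x)
```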

Pull Request resolved: #1449

Reviewed By: davidberard98

Differential Revision: D43837782

Pulled By: xuzhao9

fbshipit-source-id: 313578112f1a406d42bc5d0d599e5fc20f4bfd0b
facebook-github-bot pushed a commit that referenced this issue Mar 21, 2023
Summary:
Enable FX int8 for most models on the CPU device.

Works toward Roadmap #1293 for FX int8 support; below is an example on a CLX machine.

```
$ python run.py alexnet -d cpu -t eval --precision fp32 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  93.586 milliseconds
CPU Peak Memory:                6.3857 GB

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m eager
Running eval method from alexnet on cpu in eager mode with input batch size 128.
CPU Total Wall Time:  21.892 milliseconds
CPU Peak Memory:                1.4150 GB

$ python run.py alexnet -d cpu -t eval --precision fp32 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  70.556 milliseconds
CPU Peak Memory:                1.5918 GB
Correctness:                         True

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  21.176 milliseconds
CPU Peak Memory:                1.6758 GB
Correctness:                         True

$ python run.py alexnet -d cpu -t eval --precision fx_int8 -m jit --quant-engine fbgemm
Running eval method from alexnet on cpu in jit mode with input batch size 128.
CPU Total Wall Time:  29.487 milliseconds
CPU Peak Memory:                1.6777 GB
Correctness:                         True
```
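
For context, a hedged sketch of the FX graph-mode int8 path that `--precision fx_int8` presumably exercises is shown below; the actual TorchBench calibration loop and qconfig selection are not shown, and the fbgemm backend is assumed.

```python
# FX graph-mode post-training quantization sketch (fbgemm backend assumed).
import torch
import torchvision.models as models
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = models.alexnet().eval()
example_inputs = (torch.randn(128, 3, 224, 224),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
with torch.no_grad():
    prepared(*example_inputs)        # calibration pass
quantized = convert_fx(prepared)     # int8 model
out = quantized(*example_inputs)
```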

Pull Request resolved: #1485

Reviewed By: weiwangmeta

Differential Revision: D44256938

Pulled By: xuzhao9

fbshipit-source-id: 1754028660b6908e66616531a42571e9c08690e6
facebook-github-bot pushed a commit that referenced this issue Mar 22, 2023
Summary:
This PR adds typical GNN workloads, which is one of the tasks in #1293.

This task includes:

- Add models in `torchbenchmark`: including `GCN`, `GraphSage` and `GAT`.
- Use real datasets as inputs: Split subgraph from `Reddit`.
- Add metrics
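
Below is a self-contained, hedged GCN sketch in the spirit of the added GNN workloads; the actual TorchBench model wrappers and the Reddit subgraph loader differ, and the synthetic graph here is only a stand-in input.

```python
# Minimal two-layer GCN with PyTorch Geometric; shapes are illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

x = torch.randn(100, 64)                      # 100 nodes, 64 features
edge_index = torch.randint(0, 100, (2, 500))  # random synthetic edges
model = GCN(64, 128, 41).eval()
with torch.no_grad():
    out = model(x, edge_index)
```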

Pull Request resolved: #1422

Reviewed By: weiwangmeta

Differential Revision: D43946504

Pulled By: xuzhao9

fbshipit-source-id: 6e979ed871c2d3bffa16ca1c6b713d71f56bb7e3
facebook-github-bot pushed a commit that referenced this issue Apr 14, 2023
Summary:
Fixes #1548. Works toward Roadmap #1293 by increasing benchmark coverage.

Before:
```bash
python run.py llama -d cpu
Traceback (most recent call last):
  File "run.py", line 298, in <module>
    m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 20, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/workspace/benchmark/torchbenchmark/models/llama/__init__.py", line 16, in __init__
    super().__init__(test=test, device=device, jit=jit, batch_size=batch_size, extra_args=extra_args)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 84, in __init__
    self.determine_batch_size(batch_size)
  File "/workspace/benchmark/torchbenchmark/util/model.py", line 216, in determine_batch_size
    raise NotImplementedError(f"Test {self.test} is not implemented.")
NotImplementedError: Test eval is not implemented.
```

After:

```bash
python run.py llama -d cpu --bs 32
Running eval method from llama on cpu in eager mode with input batch size 32.
CPU Total Wall Time:  11.997 milliseconds
CPU Peak Memory:                1.3799 GB

python run.py llama -d cpu --bs 16
Running eval method from llama on cpu in eager mode with input batch size 16.
CPU Total Wall Time:   9.870 milliseconds
CPU Peak Memory:                1.3770 GB
```

Pull Request resolved: #1549

Reviewed By: aaronenyeshi

Differential Revision: D45005325

Pulled By: xuzhao9

fbshipit-source-id: 265532b33f83e87fecf94eac95e29f65ad8083f4
facebook-github-bot pushed a commit that referenced this issue Apr 24, 2023
Summary:
This PR adds AMP support on CPU in TorchBench, which contributes to #1293.

To be compatible with the current amp implementation, we add 3 options to `--precision` (see the sketch after this list):

- `--precision bf16`: use `enable_bf16` to convert the model and inputs to bf16
- `--precision amp_bf16`: use `torch.cpu.amp.autocast(dtype=torch.bfloat16)` (can extend to cuda bf16 when ready)
- `--precision amp_fp16`: use `torch.cuda.amp.autocast(dtype=torch.float16)` (can extend to cpu fp16 when ready)

`--precision amp`: use torch.autocast(device), same as --amp
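
A minimal sketch of the difference between the `bf16` and `amp_bf16` paths on CPU is shown below (the standard explicit-cast vs. autocast recipes); the actual hooks inside TorchBench may differ, and alexnet is only a stand-in model.

```python
# bf16 (explicit cast) vs amp_bf16 (CPU autocast) illustration.
import torch
import torchvision.models as models

model = models.alexnet().eval()
x = torch.randn(128, 3, 224, 224)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out_amp = model(x)                           # roughly what --precision amp_bf16 does

model_bf16 = model.to(torch.bfloat16)
with torch.no_grad():
    out_bf16 = model_bf16(x.to(torch.bfloat16))  # roughly what --precision bf16 does
```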

### Performance

Tested on a Cooper Lake machine.

```shell
$ python run.py alexnet -d cpu -m eager -t eval --precision fp32
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time:  92.600 milliseconds
CPU Peak Memory:                1.1299 GB

$ python run.py alexnet -d cpu -m eager -t eval --precision bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time:  56.580 milliseconds
CPU Peak Memory:                0.6934 GB

$ python run.py alexnet -d cpu -m eager -t eval --precision amp_bf16
Running eval method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time:  71.385 milliseconds
CPU Peak Memory:                0.9922 GB

$ python run.py alexnet -d cpu -m eager -t train --precision fp32
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision fp32.
CPU Total Wall Time: 306.164 milliseconds
CPU Peak Memory:                2.0977 GB

$ python run.py alexnet -d cpu -m eager -t train --precision bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision bf16.
CPU Total Wall Time: 180.958 milliseconds
CPU Peak Memory:                1.2686 GB

$ python run.py alexnet -d cpu -m eager -t train --precision amp_bf16
Running train method from alexnet on cpu in eager mode with input batch size 128 and precision amp_bf16.
CPU Total Wall Time: 233.332 milliseconds
CPU Peak Memory:                2.0117 GB
```

Pull Request resolved: #1516

Reviewed By: aaronenyeshi

Differential Revision: D44883144

Pulled By: xuzhao9

fbshipit-source-id: 75251f9eec128b3a1dbca39540193b89059ec183
facebook-github-bot pushed a commit that referenced this issue May 2, 2023
Summary:
Add initial cpu userbenchmark for torchbench

Works toward Roadmap #1293, extending the cpu userbenchmark with the functions below.

- [x] Add core binding option, support multi-instances test.
- [x] Add gomp/iomp option.
- [x] Add memory allocator option.
- [x] Support all enabled cpu features test based on torchbench models, e.g. channels-last / fx_int8 / jit with fusers
- [x] Support latency and cpu_peak_mem metrics for now, will extend to fps-like report
- [x] Add `README.md`

For example, with the command line below, we tested fx_int8 inference for 2 models with batch size 8 on CLX socket 0, running 4 instances at the same time.
```shell
$ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4"
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
 [Done]
 [Done]
 [Done]
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'alexnet', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:53,444 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:53,444 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:53,444 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:53,444 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
 [Done]
 [Done]
 [Done]
```
We can find the test results in `.userbenchmark/cpu/cpu-20230420004336`. The `cpu` userbenchmark creates a subfolder for each test and aggregates all test results into `metrics-20230420004336.json`. Each sub-folder contains per-instance logs named with the instance PID for that model test.
```shell
$ ls .userbenchmark/cpu/cpu-20230420004336
eval_alexnet_eager/  eval_resnet50_eager/
$ ls .userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager/
metrics-3347653.json  metrics-3347654.json  metrics-3347655.json  metrics-3347656.json
$ cat .userbenchmark/cpu/metrics-20230420004336.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "de1114554c38322273c066c091d455519d45472d"
    },
    "metrics": {
        "alexnet-eval-eager_latency": 58.309660750000006,
        "alexnet-eval-eager_cmem": 0.416259765625,
        "resnet50-eval-eager_latency": 335.04970325,
        "resnet50-eval-eager_cmem": 0.90673828125
    }
}
```
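
As a rough illustration of the aggregation step described above, the sketch below averages per-instance latency files into a single summary; the real cpu userbenchmark's file schema and aggregation logic may differ, and the `latency` key in the per-instance files is an assumption.

```python
# Hypothetical aggregation of per-instance metrics-<PID>.json files; schema assumed.
import json
from pathlib import Path
from statistics import mean

def aggregate(run_dir: Path) -> dict:
    metrics = {}
    for test_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        latencies = [
            json.loads(f.read_text()).get("latency", 0.0)
            for f in test_dir.glob("metrics-*.json")
        ]
        if latencies:
            metrics[f"{test_dir.name}_latency"] = mean(latencies)
    return {"name": "cpu", "metrics": metrics}
```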

Pull Request resolved: #1559

Reviewed By: aaronenyeshi

Differential Revision: D45450175

Pulled By: xuzhao9

fbshipit-source-id: 8e7528f4d694eae182ee601cd80bc6e57cd14e3c
xuzhao9 commented May 5, 2023

Hi @chuanqi129 , I tried running your userbenchmark on our CI runner, and it failed with error: https://github.com/pytorch/benchmark/actions/runs/4885488240

Also, please let me know which runner you would like to deploy your benchmark on.

@chuanqi129

> Hi @chuanqi129 , I tried running your userbenchmark on our CI runner, and it failed with error: https://github.com/pytorch/benchmark/actions/runs/4885488240
>
> Also, please let me know which runner you would like to deploy your benchmark on.

Thanks @xuzhao9 for your great support for the cpu userbenchmark. I'm sorry for the late reply; I was out of office for the labor holiday and annual leave. I will focus on it in the coming days and fix the CI runner failures.

Some replies to comments in #1559 below:

> For on-demand AWS instances, can you check whether any of the AWS instances in https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml can be used? It is the preferred approach.

I have double-checked the instance types in the PyTorch CI node pool. The c5 instances are based on 2nd Generation Intel® Xeon® Scalable Processors (CLX), which support fp32 and int8 but not the bf16 datatype. We can try the linux.24xlarge instance for an initial test.

> If not (for example, if we can't reach a reasonably low noise level), we can use the AWS metal instance, which is AWS g4dn.metal, with an Intel(R) Xeon(R) Platinum 8259CL CPU. Does it support fp32/int8?

The 8259CL also belongs to the 2nd Generation Intel® Xeon® Scalable Processors (CLX), so it also supports fp32/int8. We can use the linux.24xlarge instance first; if it has large noise, we can try this metal one later.

BTW, ideally it would be great if we could deploy the cpu benchmark on a c6i.16xlarge instance, because that is the same instance type used by our dynamo cpu dashboard. (Nice to have)

xuzhao9 commented May 11, 2023

@chuanqi129 I am wondering whether the dynamo cpu dashboard works on GitHub Actions. Can I have the GitHub Actions workflow file?

@chuanqi129

> @chuanqi129 I am wondering whether the dynamo cpu dashboard works on GitHub Actions. Can I have the GitHub Actions workflow file?

No, the dynamo cpu dashboard is maintained on our side. Attached are the Dockerfile and scripts used for this test; all needed components are built from source. I also think it would be great to integrate this dynamo cpu dashboard test into PyTorch GitHub Actions, but it needs a c6i instance.

facebook-github-bot pushed a commit that referenced this issue Jul 31, 2023
Summary:
Fixed the cpu userbenchmark in jit mode.

Works for #1293

Pull Request resolved: #1797

Reviewed By: FindHao

Differential Revision: D47917179

Pulled By: xuzhao9

fbshipit-source-id: fcb95f7c9e9b5a3cf199afe1e3d6a1e19036884d
facebook-github-bot pushed a commit that referenced this issue Sep 20, 2023
Summary:
To stabilize the benchmark results of the cpu userbenchmark.

Also works for #1293

Pull Request resolved: #1908

Reviewed By: davidberard98

Differential Revision: D49418232

Pulled By: xuzhao9

fbshipit-source-id: b4d0aa97fa06ffaf12984fa5dece6a0d21759fe8
facebook-github-bot pushed a commit that referenced this issue Apr 13, 2024
Summary:
Works for Roadmap #1293 to increase benchmark coverage.

For these 5 models: tacotron2, yolov3, nvidia_deeprecommender, LearningToPoint, and pytorch_CycleGAN_and_pix2pix,
running on custom devices other than CPU and CUDA (e.g. XPU) raises an error because the CPU/CUDA backends are hard-coded.
In this PR, we accept the device argument as a parameter in the training and inference processes, covering model initialization and data movement to these custom devices, as sketched below.
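
A hedged illustration of the change: take the device from the benchmark arguments instead of hard-coding "cpu"/"cuda", so custom backends such as XPU work. The helper name below is illustrative, not the actual model code.

```python
# Device-agnostic setup sketch: move model and inputs to whatever device is requested.
import torch

def prepare(model, example_inputs, device: str):
    model = model.to(device)
    example_inputs = tuple(t.to(device) for t in example_inputs)
    return model, example_inputs
```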

Pull Request resolved: #2230

Reviewed By: aaronenyeshi

Differential Revision: D56097643

Pulled By: xuzhao9

fbshipit-source-id: deba28fee42b5119f62dbddc15e017bf00eb6843
facebook-github-bot pushed a commit that referenced this issue May 30, 2024
Summary:
This Pull Request relates to Roadmap Issue #1293 by enhancing our benchmark coverage.

Currently, Torchbench utilizes a custom random seed function that is incompatible with the XPU device backend.
This incompatibility affects models that include random data augmentation operations, leading to accuracy check failures due to variations in input data across two separate runs.

In this PR, we introduce support for setting a random seed for the XPU backend.
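
A hedged sketch of device-aware seeding in the spirit of this change is shown below; the actual TorchBench seed function differs, and whether `torch.xpu.manual_seed_all` is available depends on the PyTorch/IPEX build, so it is guarded.

```python
# Illustrative device-aware seed helper (not the actual TorchBench implementation).
import random
import numpy as np
import torch

def set_random_seed(seed: int, device: str = "cpu"):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    elif device == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)
```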

Pull Request resolved: #2270

Reviewed By: davidberard98

Differential Revision: D57884498

Pulled By: xuzhao9

fbshipit-source-id: 24234674333945782b233191e42a8b344e90d74c