
Faster CPython Benchmark Infrastructure

🔒 ▶️ START A BENCHMARK RUN

Results

Here are some recent and important revisions. 👉 Complete list of results.

Key: 📄: table, 📈: time plot, 🧠: memory plot

Most recent pystats on main (151934a)

linux aarch64 (arminc)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.22x ↑ 📄📈 | 1.06x ↓ 📄📈 | 1.07x ↓ 📄📈 | 1.09x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (NOGIL) | 1.21x ↓ 📄📈 | 1.56x ↓ 📄📈 | 1.57x ↓ 📄📈 | 1.61x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.34x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.02x ↑ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.22x ↑ 📄📈 | 1.06x ↓ 📄📈 | 1.07x ↓ 📄📈 | 1.00x ↑ 📄📈🧠 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.34x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.34x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.02x ↑ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.22x ↑ 📄📈 | 1.06x ↓ 📄📈 | 1.07x ↓ 📄📈 | 1.10x ↓ 📄📈🧠 |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.22x ↑ 📄📈 | 1.05x ↓ 📄📈 | 1.06x ↓ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.23x ↑ 📄📈 | 1.05x ↓ 📄📈 | 1.06x ↓ 📄📈 | |

linux x86_64 (linux)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-05 | faster-cpython/use_attributes_to_gu | 3d7e23f | 1.43x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-05 | python/1422500d020bd199b263 | 1422500 | 1.43x ↑ 📄📈 | 1.09x ↑ 📄📈 | 1.05x ↑ 📄📈 | |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.43x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↑ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (NOGIL) | 1.05x ↓ 📄📈 | 1.36x ↓ 📄📈 | 1.41x ↓ 📄📈 | 1.47x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.42x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.43x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↑ 📄📈🧠 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.43x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.43x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.42x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-02 | python/main | d57f8a9 (T2) | 1.23x ↑ 📄📈 | 1.07x ↓ 📄📈 | 1.10x ↓ 📄📈 | |
| 2024-08-01 | brandtbucher/no_progress_needed | 0197884 (JIT) | 1.42x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.04x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-01 | python/df13a1821a90fcfb75ec | df13a18 (JIT) | 1.42x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.04x ↑ 📄📈 | |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.42x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.04x ↑ 📄📈 | 1.01x ↓ 📄📈🧠 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.43x ↑ 📄📈 | 1.09x ↑ 📄📈 | 1.05x ↑ 📄📈 | |

linux x86_64 (pythonperf2)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.33x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (NOGIL) | 1.11x ↓ 📄📈 | 1.44x ↓ 📄📈 | 1.44x ↓ 📄📈 | 1.47x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.34x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.02x ↑ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.33x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.01x ↓ 📄📈🧠 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.34x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.01x ↓ 📄📈🧠 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.35x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.03x ↑ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.34x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.01x ↓ 📄📈🧠 |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.33x ↑ 📄📈 | 1.02x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.00x ↑ 📄📈🧠 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.33x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.01x ↑ 📄📈 | |

windows amd64 (pythonperf1)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.25x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.00x ↓ 📄📈 | 1.09x ↑ 📄📈 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.14x ↑ 📄📈 | 1.04x ↓ 📄📈 | 1.09x ↓ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.24x ↑ 📄📈 | 1.05x ↑ 📄📈 | 1.01x ↓ 📄📈 | 1.08x ↑ 📄📈 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.15x ↑ 📄📈 | 1.04x ↓ 📄📈 | 1.09x ↓ 📄📈 | 1.01x ↑ 📄📈 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.14x ↑ 📄📈 | 1.04x ↓ 📄📈 | 1.10x ↓ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.25x ↑ 📄📈 | 1.06x ↑ 📄📈 | 1.00x ↓ 📄📈 | 1.10x ↑ 📄📈 |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.26x ↑ 📄📈 | 1.07x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.00x ↑ 📄📈 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.26x ↑ 📄📈 | 1.06x ↑ 📄📈 | 1.00x ↑ 📄📈 | |

windows x86 (pythonperf1_win32)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.21x ↑ 📄📈 | 1.22x ↑ 📄📈 | 1.04x ↑ 📄📈 | 1.09x ↑ 📄📈 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.11x ↑ 📄📈 | 1.11x ↑ 📄📈 | 1.05x ↓ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.20x ↑ 📄📈 | 1.22x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.10x ↑ 📄📈 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.09x ↑ 📄📈 | 1.10x ↑ 📄📈 | 1.06x ↓ 📄📈 | 1.01x ↓ 📄📈 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.10x ↑ 📄📈 | 1.11x ↑ 📄📈 | 1.06x ↓ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.20x ↑ 📄📈 | 1.21x ↑ 📄📈 | 1.03x ↑ 📄📈 | 1.09x ↑ 📄📈 |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.21x ↑ 📄📈 | 1.22x ↑ 📄📈 | 1.04x ↑ 📄📈 | 1.01x ↑ 📄📈 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.20x ↑ 📄📈 | 1.22x ↑ 📄📈 | 1.03x ↑ 📄📈 | |

darwin arm64 (darwin)

| date | fork/ref | hash/flags | vs. 3.10.4 | vs. 3.12.0 | vs. 3.13.0b2 | vs. base |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (JIT) | 1.27x ↑ 📄📈 | 1.06x ↑ 📄📈 | 1.02x ↓ 📄📈 | 1.02x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a (NOGIL) | 1.18x ↓ 📄📈 | 1.38x ↓ 📄📈 | 1.47x ↓ 📄📈 | 1.48x ↓ 📄📈🧠 |
| 2024-08-04 | python/151934a324789c58cca9 | 151934a | 1.29x ↑ 📄📈 | 1.08x ↑ 📄📈 | 1.00x ↑ 📄📈 | |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e (JIT) | 1.27x ↑ 📄📈 | 1.07x ↑ 📄📈 | 1.01x ↓ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-08-02 | python/7aca84e557d0a6d242f3 | 7aca84e | 1.30x ↑ 📄📈 | 1.09x ↑ 📄📈 | 1.01x ↑ 📄📈 | 1.00x ↑ 📄📈🧠 |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d | 1.30x ↑ 📄📈 | 1.09x ↑ 📄📈 | 1.01x ↑ 📄📈 | |
| 2024-08-02 | python/498376d7a7d6f704f22a | 498376d (JIT) | 1.27x ↑ 📄📈 | 1.07x ↑ 📄📈 | 1.01x ↓ 📄📈 | 1.02x ↓ 📄📈🧠 |
| 2024-07-27 | brandtbucher/faster_jit_builds | 60b7e71 (JIT) | 1.27x ↑ 📄📈 | 1.07x ↑ 📄📈 | 1.01x ↓ 📄📈 | 1.00x ↓ 📄📈🧠 |
| 2024-07-25 | python/5f6001130f8ada871193 | 5f60011 (JIT) | 1.27x ↑ 📄📈 | 1.07x ↑ 📄📈 | 1.01x ↓ 📄📈 | |

* indicates that the exact same version of pyperformance was not used.

For the results above, the "faster/slower" result is the geometric mean across all of the benchmarks. The "reliability (rel)" number is the likelihood that the change is faster or slower, based on the Hierarchical Performance Testing (HPT) method. For more details, visit each individual result's README.md.
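
Concretely, if a run produces per-benchmark speedups $r_1, \ldots, r_n$ against a reference version, the headline number is their geometric mean:

$$\left(\prod_{i=1}^{n} r_i\right)^{1/n}$$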

Longitudinal results

Below are longitudinal timing results. There are also 🧠 longitudinal memory results.

Longitudinal speed improvement

Effect of different configurations

Improvement of the HPT score of key merged benchmarks, computed with pyperf compare. The results have a resolution of 0.01 (1%).

  • linux: Intel® Xeon® W-2255 CPU @ 3.70GHz, running Ubuntu 20.04 LTS, gcc 9.4.0
  • linux2: 12th Gen Intel® Core™ i9-12900 @ 2.40 GHz, running Ubuntu 22.04 LTS, gcc 11.3.0
  • linux-aarch64: ARM Neoverse N1, running Ubuntu 22.04 LTS, gcc 11.4.0
  • macos: M1 arm64 Mac® Mini, running macOS 13.2.1, clang 1400.0.29.202
  • windows: 12th Gen Intel® Core™ i9-12900 @ 2.40 GHz, running Windows 11 Pro (21H2, 22000.1696), MSVC v143

Data changes

This is a CHANGELOG of how any derived data has changed:

  • 2024-06-27: The HPT values (and the longitudinal plots that are based on them) now correctly exclude any benchmarks in excluded_benchmarks.txt.

Documentation

Running benchmarks from the GitHub web UI

Visit the 🔒 benchmark action and click the "Run Workflow" button.

The available parameters are:

  • fork: The fork of CPython to benchmark. If benchmarking a pull request, this would normally be your GitHub username.
  • ref: The branch, tag or commit SHA to benchmark. If a SHA, it must be the full SHA, since finding it by a prefix is not supported.
  • machine: The machine to run on. One of linux-amd64 (default), windows-amd64, darwin-arm64 or all.
  • benchmark_base: If checked, the base of the selected branch will also be benchmarked. The base is determined by running git merge-base upstream/main $ref.
  • pystats: If checked, collect the pystats from running the benchmarks.

To watch the progress of the benchmark, select it from the 🔒 benchmark action page. It may be canceled from there as well. To show only your benchmark workflows, select your GitHub ID from the "Actor" dropdown.

When the benchmarking is complete, the results are published to this repository and will appear in the master table. Each set of benchmarks will have:

  • The raw .json results from pyperformance.
  • Comparisons against important reference releases, as well as the merge base of the branch if benchmark_base was selected. These include:
    • A markdown table produced by pyperf compare_to.
    • A set of "violin" plots showing the distribution of results for each benchmark.
    • A set of plots showing the memory change for each benchmark (for immediate bases only, on non-Windows platforms).

The most convenient way to get results locally is to clone this repo and git pull from it.

Running benchmarks from the GitHub CLI

To automate benchmarking runs, it may be more convenient to use the GitHub CLI. Once you have gh installed and configured, you can run benchmarks by cloning this repository and then, from inside it, running:

$ gh workflow run benchmark.yml -f fork=me -f ref=my_branch

Any of the parameters described above can be passed on the command line using the -f key=value syntax.
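
For example, a hypothetical invocation that also selects a machine and requests a run of the merge base (all values here are placeholders):

$ gh workflow run benchmark.yml -f fork=me -f ref=my_branch -f machine=linux-amd64 -f benchmark_base=true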

Collecting Linux perf profiling data

To collect Linux perf sampling profile data for a benchmarking run, run the _benchmark action and check the perf checkbox.

Creating a custom comparison

If the default comparisons generated by this tool aren't sufficient, you can check out the repo and use the same infrastructure to generate any arbitrary comparison.

Check out a local copy of this repo:

$ git clone https://github.com/faster-cpython/benchmarking-public

Create a new virtual environment, activate it and install the dependencies into it:

$ cd benchmarking-public
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Run bench_runner's compare tool:

usage:
        Generate a set of comparisons between arbitrary commits. The commits
        must already exist in the dataset.

       [-h] --output-dir OUTPUT_DIR [--type {1:n,n:n}] commit [commit ...]

positional arguments:
  commit                Commits to compare. Must be a git commit hash prefix. May optionally have a friendly name
                        after a comma, e.g. c0ffee,main. If ends with a "T", use the Tier 2 run for that commit. If
                        ends with a "J", use the JIT run for that commit. If ends with a "N", use the NOGIL run for
                        that commit.

options:
  -h, --help            show this help message and exit
  --output-dir OUTPUT_DIR
                        Directory to output results to.
  --type {1:n,n:n}      Compare the first commit to all others, or do the full product of all commits

For example:

$ python -m bench_runner compare e418fc3,default e418fc3J,jit --output comparison --type 1:n
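
The run-type suffixes and --type n:n can be combined. For example, a full cross-product comparison of the default and NOGIL runs of a single (placeholder) commit might look like:

$ python -m bench_runner compare c0ffee,default c0ffeeN,nogil --output-dir comparison --type n:n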

Developer docs

The infrastructure to make all of this work is the bench_runner project. Look there for more detailed developer docs.

Details about how results are collected

The easiest way to reproduce what is here is to use the bench_runner project library directly, but if you want to run parts of it in a different context or better understand how the numbers are calculated, this section describes some of the things that the benchmarking infrastructure does.

Benchmarks from pyperformance and python-macrobenchmarks

These results combine benchmarks that live in the pyperformance and pyston/python-macrobenchmarks projects, so running the default set from pyperformance will definitely produce different results. To combine these benchmarks in the same run, clone both repos side-by-side in the same directory and use a manifest file to combine them. This file should be passed to pyperformance run:

pyperformance run --manifest benchmarks.manifest
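
A minimal sketch of that side-by-side layout, assuming both projects are cloned from their upstream GitHub locations:

$ git clone https://github.com/python/pyperformance
$ git clone https://github.com/pyston/python-macrobenchmarks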

Different configurations

Benchmarks and stats collection can happen in three different configurations. Here "configuration" may be a combination of both build-time and run-time flags:

  • Default: A PGO build of CPython (./configure --enable-optimizations --with-lto=yes).
  • Tier 2: The same build as above, but with the PYTHON_UOPS environment variable set at runtime to use the Tier 2 interpreter.
  • JIT: A JIT and PGO build of CPython (./configure --enable-optimizations --with-lto=yes --enable-experimental-jit).

Information about the configuration of the run is in the README.md at the root of each run directory. The directory name will also include PYTHON_UOPS for Tier 2 and JIT for JIT.

To reduce the number of unknown variables when comparing results, runs are always compared against runs of the same configuration. Be aware that the base commit on main may sometimes predate the configuration becoming available, for example, before the JIT compiler was merged into main. (An exception to this rule is the weekly benchmarks of upstream main, where Tier 2 and JIT configurations are compared against the default configuration of the same commit; this isn't relevant for the common case of testing a pull request.)

An additional sharp edge is that, by default, pyperformance does not pass environment variables to the child process that actually does the work. Therefore, for a Tier 2 configuration, the --inherit-environ=PYTHON_UOPS flag must be passed to pyperformance run when running benchmarks.
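
For example, a Tier 2 run could be invoked roughly as follows, assuming PYTHON_UOPS=1 is what enables the Tier 2 interpreter on the build being benchmarked:

$ PYTHON_UOPS=1 pyperformance run --inherit-environ=PYTHON_UOPS --manifest benchmarks.manifest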

For detailed information, see how configurations affect build-time flags in the GitHub Actions configuration.

Timing benchmarks

Timing benchmarks are notoriously noisy. We use a few techniques to reduce this noise:

  • Where available (on Linux), we use pyperf tune to set CPU affinity and other system settings that make the benchmarks more reproducible (see the example after this list). For this reason, the benchmarks are more predictable on Linux than on the other platforms.
  • pyperf performs "warmup" runs while caches are filling and other parts of the system are still stabilizing; these runs are excluded from the timing results. This is generally effective at reducing variability, but it may also exclude real work, for example work done during optimization.
  • We use the Hierarchical Performance Testing (HPT) method (see below) to statistically reduce the effect of benchmarks that have more variability. This is a different method than the simple geometric mean that pyperf uses by default. We provide both numbers in our results.
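
The system tuning mentioned in the first bullet is done with pyperf's system tune subcommand, which requires root privileges. A minimal sketch (not necessarily the exact invocation used by the runners):

$ sudo python -m pyperf system tune    # disable turbo boost, set the CPU governor, etc.
$ sudo python -m pyperf system reset   # undo the tuning once the run is finished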

pystats

pystats are a set of counters in CPython that measure things like the number of times each bytecode instruction is executed. (Detailed documentation of all of the counters should be added to CPython in the future).

Collecting pystats requires a special build of CPython with pystats enabled (./configure --enable-pystats).

pystats must also be enabled at runtime, either using the -X pystats command line argument or sys._stats_on(). pyperformance/pyperf handles this step automatically when running on a pystats-enabled build. Stats collection is enabled during the actual benchmarking code, and disabled while running the "benchmarking harness" code in pyperf itself. pyperf has the concept of "warmup" runs, which allow things like cache lines to warm up before timing begins. While these warmup runs aren't included in the timing results, they are included in pystats collection, since Tier 2/JIT traces are often created during warmup and we don't want the stats to appear as if the traces ran but were never created.

Any statistics collected are then dumped at exit to the /tmp/py_stats directory with a random filename. Lastly, the Tools/scripts/summarize_stats.py script (in the CPython repo) is used to read all of the files from /tmp/py_stats and produce a human-readable markdown summary and a JSON file with aggregate data. Because of this design, it is imperative that:

  • The /tmp/py_stats directory is cleared before data collection.
  • No other Python processes are run that could also produce pystats data. In particular, this means benchmarks cannot run in parallel.
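
Putting those pieces together, a manual pystats collection could look roughly like the following sketch (my_benchmark.py is a placeholder; the real infrastructure drives this through pyperformance/pyperf):

$ ./configure --enable-pystats && make
$ rm -rf /tmp/py_stats && mkdir -p /tmp/py_stats   # start from a clean slate
$ ./python -X pystats my_benchmark.py              # counters are dumped to /tmp/py_stats at exit
$ ./python Tools/scripts/summarize_stats.py        # aggregate into a markdown summary and JSON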

For more information, see the actual code to collect pystats.

HPT

Hierarchical performance testing (HPT) is a method introduced in this paper:

T. Chen, Y. Chen, Q. Guo, O. Temam, Y. Wu and W. Hu, "Statistical performance comparisons of computers," IEEE International Symposium on High-Performance Comp Architecture, New Orleans, LA, USA, 2012, pp. 1-12, doi: 10.1109/HPCA.2012.6169043.

From the abstract:

In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance measurements are compared regardless of the variability), or in the few cases where it is factored in using parametric confidence techniques, the confidence is either erroneously computed based on the distribution of performance measurements (with the implicit assumption that it obeys the normal law), instead of the distribution of sample mean of performance measurements, or too few measurements are considered for the distribution of sample mean to be normal. … We propose a non-parametric Hierarchical Performance Testing (HPT) framework for performance comparison, which is significantly more practical than standard parametric techniques because it does not require to collect a large number of measurements in order to achieve a normal distribution of the sample mean.

For each result, we compute a reliability score, as well as the estimated speedup at the 90th, 95th and 99th percentile.

The inclusion of HPT scores is considered experimental as we learn about their usefulness for decision-making.
