1BRC in Python

I am a sucker for optimization, so when I heard about 1BRC I got intrigued and started experimenting with Python implementations. I achieved my goal: the fastest implementation running on CPython, without any external libraries.

For my implementation I followed these principles:

  • Python
    • Python is the primary language I use at work, so I wanted to see how far I could push the performance of a native Python app.
  • CPython Interpreter
    • CPython remains the most popular interpreter with the widest support, so optimizing for it means the results translate directly to day-to-day work.
  • Don't use any 3rd-party packages like DuckDB or Polars
    • Even though DuckDB- and Polars-based implementations are highly suited to this problem and also the fastest, my aim was to stick to native Python libraries and find the best pure-Python implementation.
    • This also applies to C extensions and the use of SIMD or vector instructions.
      • I want to keep the code portable and maintainable. SIMD instructions can be very fast and take full advantage of the processor without relying on compiler optimization, but they are not easy to maintain and the code needs to be rewritten for each processor architecture.
  • Use a CPU profiler (like line_profiler)
    • My aim was to use a Python profiler to guide the optimization as much as possible (see the sketch after this list).
  • Use GPT for code generation
    • I wanted to generate most of the code using GPT to jump-start my implementation.
  • Take inspiration from existing Python attempts
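As a concrete illustration of the profiling workflow (a minimal sketch with made-up file and function names, not code from this repository): install line_profiler, decorate the hot function with @profile, and run the script through kernprof to get per-line hit counts and timings.

    # profiling_example.py -- illustrative file and function names
    # Run with: kernprof -l -v profiling_example.py   (requires: pip install line_profiler)

    @profile  # injected into builtins by kernprof; no import needed
    def aggregate(lines):
        stats = {}
        for line in lines:
            city, _, value = line.partition(";")
            stats[city] = stats.get(city, 0.0) + float(value)
        return stats

    if __name__ == "__main__":
        aggregate(["Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2"])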

Performance Benchmark (MacBook Pro M2 Pro, 32 GB RAM) (1 Billion Rows)

File cached in memory

| Interpreter | File | Time (sec) |
| --- | --- | --- |
| Python3 | py_1brc_final.py | 24.882 |
| Python3 | py_1brc_mypyc.py (process_chunk.py precompiled using mypyc) | 24.441 |
| Python3 | calculateAverage.py (from https://github.com/ifnesi/1brc) | 36.303 |
| Python3 | calculateAveragePyPy.py (from https://github.com/ifnesi/1brc) | 60.60 |
| Python3 | calculateAverageDuckDB.py (from https://github.com/ifnesi/1brc) | 9.556 |
| Python3 | calculateAveragePolars.py (from https://github.com/ifnesi/1brc) | 12.058 |
| Python3 | doug_booty4.py (from https://github.com/dougmercer-yt/1brc) | 62.91 |

| Interpreter | File | Time (sec) |
| --- | --- | --- |
| PyPy3 | py_1brc_final.py | 16.741 |
| PyPy3 | py_1brc_mypyc.py | 16.726 |
| PyPy3 | calculateAverage.py (from https://github.com/ifnesi/1brc) | 28.647 |
| PyPy3 | calculateAveragePyPy.py (from https://github.com/ifnesi/1brc) | 15.249 |
| PyPy3 | doug_booty4.py (from https://github.com/dougmercer-yt/1brc) | 8.507 |

File not cached in memory (run sudo purge before each run)

| Interpreter | File | Time (sec) |
| --- | --- | --- |
| Python3 | py_1brc_final.py | 29.608 |
| Python3 | py_1brc_mypyc.py (process_chunk.py precompiled using mypyc) | 29.654 |
| Python3 | calculateAverage.py (from https://github.com/ifnesi/1brc) | 53.255 |
| Python3 | calculateAveragePyPy.py (from https://github.com/ifnesi/1brc) | 61.83 |
| Python3 | calculateAverageDuckDB.py (from https://github.com/ifnesi/1brc) | 9.726 |
| Python3 | calculateAveragePolars.py (from https://github.com/ifnesi/1brc) | 16.799 |
| Python3 | doug_booty4.py (from https://github.com/dougmercer-yt/1brc) | 69.68 |

| Interpreter | File | Time (sec) |
| --- | --- | --- |
| PyPy3 | py_1brc_final.py | 20.473 |
| PyPy3 | py_1brc_mypyc.py | 20.333 |
| PyPy3 | calculateAverage.py (from https://github.com/ifnesi/1brc) | 43.903 |
| PyPy3 | calculateAveragePyPy.py (from https://github.com/ifnesi/1brc) | 16.374 |
| PyPy3 | doug_booty4.py (from https://github.com/dougmercer-yt/1brc) | 11.162 |

Generation of measurements.txt

Follow the instructions from here

Execution

  • I started by asking GPT to generate Python code using the prompt below:

    For a file with 1 billion rows which has each row having a city name and the corresponding temperature, where the city temperature can be repeated multiple times. Provide a python code which prints the min, mean and max temperature for each city in alphabetical fashion. Make the code most optimal without any dependency on external libraries.

    • As you can see, the above prompt did not specify the file format, did not ask to optimize for speed rather than memory, and gave no direction on multi-processing. After providing those inputs in subsequent prompts I got the implementation I needed.
    • GPT kept steering towards solutions using numpy and pandas, and I had to reinforce in subsequent prompts that it should not rely on third-party libraries.
    • In the end I got code which spawned multiple processes and read the file using memory mapping. The code, even though functional, was far from optimized. The main issues were:
      • The memory-mapped file was mapped at the full file size by every child process, which led to very inefficient read performance.
      • Data was shared between processes using multiprocessing.managers.DictProxy, which was very inefficient because it synchronizes access to the dictionary, and that synchronization was not needed here.
      • The access pattern for reading and storing data in the dictionary was sub-optimal:
            combined_data[city]['sum'] += data['sum']
            combined_data[city]['count'] += data['count']
            combined_data[city]['min'] = min(combined_data[city]['min'], data['min'])
            combined_data[city]['max'] = max(combined_data[city]['max'], data['max'])
        
  • I used a memory-mapped file to read the data in chunks divided equally among the processes (see the sketch after this list).

  • Using a profiler (line_profiler) I was able to identify and fix performance issues in multiple places.

    • For sharing data between the main and child processes, it is more efficient to use Pool.starmap or Pool.map.
    • The code below is more efficient when updating dictionary elements (inspired by https://github.com/ifnesi/1brc):
          city_info = combined_data[city]
          city_info['sum'] += data['sum']
          city_info['count'] += data['count']
          if city_info['min'] > data['min']:
              city_info['min'] = data['min']
          if city_info['max'] < data['max']:
              city_info['max'] = data['max']
      
    • Type annotations improve readability and can help performance (for example, when compiling with mypyc).
    • Removed a few redundant if checks.
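To make the final flow concrete, below is a minimal sketch of the pattern described above: each worker reads one contiguous chunk of the file, aggregates per-city statistics into a plain dict, and the parent merges the partial dicts returned by Pool.starmap. The names process_chunk and merge_results are illustrative rather than the exact code in this repository, and the chunk boundaries are assumed to come from a newline-aligned helper such as the identify_chunks sketch shown under Learnings below.

    import multiprocessing as mp

    def process_chunk(filename: str, start: int, end: int) -> dict:
        """Read one contiguous byte range and aggregate per-city stats."""
        results = {}  # city -> [min, max, total, count]
        with open(filename, "rb") as f:
            f.seek(start)
            data = f.read(end - start)
        for line in data.splitlines():
            city, _, value = line.partition(b";")
            temp = float(value)
            entry = results.get(city)
            if entry is None:
                results[city] = [temp, temp, temp, 1]
            else:
                if temp < entry[0]:
                    entry[0] = temp
                if temp > entry[1]:
                    entry[1] = temp
                entry[2] += temp
                entry[3] += 1
        return results

    def merge_results(partials: list) -> dict:
        """Combine the per-process dictionaries into one."""
        merged = {}
        for partial in partials:
            for city, stats in partial.items():
                entry = merged.get(city)
                if entry is None:
                    merged[city] = list(stats)
                else:
                    if stats[0] < entry[0]:
                        entry[0] = stats[0]
                    if stats[1] > entry[1]:
                        entry[1] = stats[1]
                    entry[2] += stats[2]
                    entry[3] += stats[3]
        return merged

    if __name__ == "__main__":
        filename = "measurements.txt"
        num_procs = mp.cpu_count()
        # identify_chunks (sketched under Learnings below) returns newline-aligned (start, end) pairs
        chunks = identify_chunks(filename, num_procs)
        with mp.Pool(num_procs) as pool:
            partials = pool.starmap(process_chunk, [(filename, s, e) for s, e in chunks])
        merged = merge_results(partials)
        for city in sorted(merged):
            lo, hi, total, count = merged[city]
            print(f"{city.decode()}={lo:.1f}/{total / count:.1f}/{hi:.1f}")

Returning plain dictionaries from the workers and merging them once in the parent avoids the per-item synchronization cost of multiprocessing.managers.DictProxy.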

Learnings

  • Python multiprocessing is very powerful for enabling multi-core processing.
    • Using Pool.starmap to execute tasks in independent processes is a very effective way to parallelize work.
    • Using multiprocessing.managers.DictProxy to transfer data between processes can be very slow, as the runtime synchronizes access to the dictionary.
  • Code generated by GPT is most likely going to be sub-optimal; you need to spend time understanding the core logic and optimizing it.
  • PyPy gives a good boost over CPython, but its compatibility with external libraries is a limiting factor.
  • Mypyc compilation might not give the performance boost you expect, so always measure after making the change.
  • Reading and parsing each line, finding the semicolon, and casting to float are operations you will perform one billion times, so doing them efficiently has a large impact on program execution.
  • The same applies to the choice of data structure for storing the values, as you will be doing a lookup in the map/dictionary/hashtable a billion times.
  • The order in which data is read from the file has an impact on performance, so optimize for it. Random access hurts, so when reading from multiple threads/processes ensure that each one reads a contiguous portion of the file sequentially.
    • The identify_chunks method identifies the portions of the file to read in individual threads/processes (see the sketch at the end of this section).
  • Some optimizations that did not work:
    • In CPython, custom integer parsing (as shown below) was slower than simply casting the byte values to float.
      def to_int(x: bytes) -> int:
          # Parse sign
          if x[0] == 45:  # ASCII for "-"
              sign = -1
              idx = 1
          else:
              sign = 1
              idx = 0
          # Check the position of the decimal point
          if x[idx + 1] == 46:  # ASCII for "."
              # -#.# or #.#
              # 528 == ord("0") * 11
              result = sign * ((x[idx] * 10 + x[idx + 2]) - 528)
          else:
              # -##.# or ##.#
              # 5328 == ord("0") * 111
              result = sign * ((x[idx] * 100 + x[idx + 1] * 10 + x[idx + 3]) - 5328)
      
          return result
      
    • Mypyc compilation was not any faster than the default CPython implementation.
    • Using a custom data class City did not improve performance over a List[float].
  • Optimizing for PyPy does not make the implementation any faster in CPython, but optimizing for CPython does make the implementation faster in PyPy.
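For reference, here is a minimal sketch of what a newline-aligned chunk splitter in the spirit of identify_chunks can look like. The function name comes from this repository, but the body below is an assumption rather than the actual implementation: it divides the file into roughly equal byte ranges, then pushes each boundary forward to the end of the current line so that no record is split across two processes.

    import os

    def identify_chunks(filename: str, num_chunks: int) -> list:
        """Split the file into (start, end) byte ranges that end on newline boundaries."""
        file_size = os.path.getsize(filename)
        target = max(1, file_size // num_chunks)
        chunks = []
        with open(filename, "rb") as f:
            start = 0
            for _ in range(num_chunks - 1):
                f.seek(min(start + target, file_size))
                f.readline()  # advance to the end of the current line
                end = f.tell()
                chunks.append((start, end))
                start = end
            chunks.append((start, file_size))
        return chunks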
