
Optimize data preprocessing by using numpy #771

Merged
merged 1 commit into from
Jan 20, 2023

Conversation

zhuzilin
Contributor

This PR accelerates `preprocess_data.py`. There are two main optimizations:

  • pass a `numpy.ndarray` to `add_item` of `IndexedDatasetBuilder` to avoid an intermediate `tensor.numpy()` conversion.
  • use `np.cumsum` instead of a Python loop to calculate the pointer offsets.
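The first optimization can be sketched as follows. This is a minimal illustration with a hypothetical list of token ids; the old torch round-trip is shown only as a comment, since the point of the change is that it is unnecessary:

```python
import numpy as np

# Hypothetical list of token ids produced during preprocessing.
doc_ids = [101, 2023, 2003, 102]

# Old path (sketch): round-trip through a torch tensor, then back to numpy:
#   np_array = np.array(torch.IntTensor(doc_ids).numpy(), dtype=np.int64)
# New path: build the int64 array directly from the Python list.
np_array = np.array(doc_ids, dtype=np.int64)
```

Building the array directly skips one full copy of the data through a torch tensor.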

Thank you for taking the time to review this PR :)

@StellaAthena
Member

Thank you for this contribution! Can you provide some stats on the speed-up provided in your testing?

@zhuzilin
Contributor Author

zhuzilin commented Jan 19, 2023

@StellaAthena I didn't benchmark the change from `np.array(torch.IntTensor(list, dtype=torch.int64).numpy(), dtype=np.int64)` to `np.array(list, dtype=np.int64)`, because dropping the tensor round-trip is a clear win. As for the `_get_pointers` method, here are the results of a unit-test comparison, followed by the script:

| size | old/s | new/s | speedup |
|------|-------|-------|---------|
| 1e5  | 0.007 | 0.002 | 3.159x  |
| 1e6  | 0.082 | 0.024 | 3.405x  |
| 1e7  | 0.831 | 0.238 | 3.490x  |
| 1e8  | 8.725 | 2.436 | 3.582x  |

code:

```python
import numpy as np
from time import time

def _get_pointers(sizes):
    # Old implementation: accumulate byte offsets in a Python loop.
    dtype_size = 4
    address = 0
    pointers = []

    for size in sizes:
        pointers.append(address)
        address += size * dtype_size

    pointers = np.array(pointers, dtype=np.int64)
    return pointers

def _get_pointers_new(sizes):
    # New implementation: a single vectorized cumulative sum.
    dtype_size = 4
    address = np.zeros(len(sizes), dtype=np.int64)
    sizes = np.array(sizes, dtype=np.int64)

    # Write the running sum of all but the last size into address[1:],
    # leaving address[0] = 0, then scale by the element size in bytes.
    np.cumsum(sizes[:-1], dtype=np.int64, out=address[1:])
    pointers = address * dtype_size
    return pointers

if __name__ == "__main__":
    stats = []
    for n in range(5, 9):
        length = 10 ** n
        sizes = np.random.randint(1000, 2000, size=(length,), dtype=np.int64).tolist()

        # Time each implementation, averaged over three runs.
        start = time()
        for _ in range(3):
            output = _get_pointers(sizes)
        end = time()

        start_new = time()
        for _ in range(3):
            output_new = _get_pointers_new(sizes)
        end_new = time()

        stats.append((n, (end - start) / 3, (end_new - start_new) / 3))
        assert np.array_equal(output, output_new)

    print("\tsize\told/s\tnew/s\tspeedup")
    for n, origin_time, new_time in stats:
        print(f"\t1e{n}\t{origin_time:.3f}\t{new_time:.3f}\t{origin_time / new_time:.3f}x")
```

The new method is roughly 3 to 3.6 times as fast as the old one, with the advantage growing slightly as the input gets larger.

@StellaAthena StellaAthena merged commit d36f623 into EleutherAI:main Jan 20, 2023