
Optimize data preprocessing by using numpy #771

Merged
merged 1 commit into from
Jan 20, 2023

Conversation

zhuzilin
Contributor

This PR accelerates `preprocess_data.py`. There are two main optimizations:

  • pass a `numpy.ndarray` to `add_item` of `IndexedDatasetBuilder` to avoid an intermediate `tensor.numpy()` conversion.
  • use `np.cumsum` instead of a Python loop to calculate the pointer offsets.
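The first optimization can be sketched as follows. This is a minimal illustration with a hypothetical list of token ids; the old torch round-trip is shown only as a comment, since the point of the change is that it is unnecessary:

```python
import numpy as np

# Hypothetical list of token ids produced during preprocessing.
doc_ids = [101, 2023, 2003, 102]

# Old path (sketch): round-trip through a torch tensor, then back to numpy:
#   np_array = np.array(torch.IntTensor(doc_ids).numpy(), dtype=np.int64)
# New path: build the int64 array directly from the Python list.
np_array = np.array(doc_ids, dtype=np.int64)
```

Building the array directly skips one full copy of the data through a torch tensor.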

Thank you for taking the time to review this PR :)

@StellaAthena
Member

Thank you for this contribution! Can you provide some stats on the speed-up provided in your testing?

@zhuzilin
Contributor Author

zhuzilin commented Jan 19, 2023

@StellaAthena I didn't benchmark the change from `np.array(torch.IntTensor(list, dtype=torch.int64).numpy(), dtype=np.int64)` to `np.array(list, dtype=np.int64)`, because dropping the tensor round-trip is a clear win. As for the `_get_pointers` method, here are the results of a unit-test comparison, followed by the script:

| size | old/s | new/s | speedup |
|------|-------|-------|---------|
| 1e5  | 0.007 | 0.002 | 3.159x  |
| 1e6  | 0.082 | 0.024 | 3.405x  |
| 1e7  | 0.831 | 0.238 | 3.490x  |
| 1e8  | 8.725 | 2.436 | 3.582x  |

code:

```python
import numpy as np
from time import time

def _get_pointers(sizes):
    # Old implementation: accumulate byte offsets in a Python loop.
    dtype_size = 4
    address = 0
    pointers = []

    for size in sizes:
        pointers.append(address)
        address += size * dtype_size

    pointers = np.array(pointers, dtype=np.int64)
    return pointers

def _get_pointers_new(sizes):
    # New implementation: a single vectorized cumulative sum.
    dtype_size = 4
    address = np.zeros(len(sizes), dtype=np.int64)
    sizes = np.array(sizes, dtype=np.int64)

    # Write the running sum of all but the last size into address[1:],
    # leaving address[0] = 0, then scale by the element size in bytes.
    np.cumsum(sizes[:-1], dtype=np.int64, out=address[1:])
    pointers = address * dtype_size
    return pointers

if __name__ == "__main__":
    stats = []
    for n in range(5, 9):
        length = 10 ** n
        sizes = np.random.randint(1000, 2000, size=(length,), dtype=np.int64).tolist()

        # Time each implementation, averaged over three runs.
        start = time()
        for _ in range(3):
            output = _get_pointers(sizes)
        end = time()

        start_new = time()
        for _ in range(3):
            output_new = _get_pointers_new(sizes)
        end_new = time()

        stats.append((n, (end - start) / 3, (end_new - start_new) / 3))
        assert np.array_equal(output, output_new)

    print("\tsize\told/s\tnew/s\tspeedup")
    for n, origin_time, new_time in stats:
        print(f"\t1e{n}\t{origin_time:.3f}\t{new_time:.3f}\t{origin_time / new_time:.3f}x")
```

The new method is roughly 3 to 3.6 times as fast as the old one, with the advantage growing slightly as the input gets larger.

@StellaAthena StellaAthena merged commit d36f623 into EleutherAI:main Jan 20, 2023