
Change from concurrent.futures to multiprocessing #354

Open
wants to merge 1 commit into base: master
Conversation

donglihe-hub
Contributor

@donglihe-hub commented Jan 23, 2024

What does this PR do?

Change from a higher-level interface (concurrent.futures) to a lower-level one (multiprocessing) because:

  1. As described in python/cpython#105829 (concurrent.futures.ProcessPoolExecutor pool deadlocks when submitting many tasks), when many tasks are submitted to a concurrent.futures.ProcessPoolExecutor pool, there is a chance of a deadlock in CPython. The same example run with multiprocessing.pool.Pool has no such problem.
  2. The issue has been fixed in Python >= 3.11.6 and >= 3.12.1. For compatibility with earlier Python versions, I rewrote the parallel tokenization using multiprocessing (a minimal sketch of the change follows this list).
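
A minimal sketch of the change, assuming a hypothetical module-level `tokenize` function and a list of raw `texts`; the actual function and parameter names in this PR may differ:

```python
import multiprocessing
import re


def tokenize(text):
    # Placeholder for the regex-based tokenization used in the timings below.
    return re.findall(r"\w+", text.lower())


def parallel_tokenize(texts, num_processes=16):
    # Before: list(concurrent.futures.ProcessPoolExecutor(num_processes).map(tokenize, texts))
    # After: multiprocessing.pool.Pool, which does not hit the deadlock from cpython#105829.
    with multiprocessing.Pool(processes=num_processes) as pool:
        return pool.map(tokenize, texts, chunksize=1000)
```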

Finding the best num_processes

I tested tokenization with various num_processes (using RegexTokenization):

linux, fork

| num_processes | AmazonCat-13K | EUR-Lex | Wiki10-31K |
| --- | --- | --- | --- |
| no parallel | 108.49 s | 6.66 s | 10.31 s |
| 2 | 92.31 s | 5.81 s | 9.92 s |
| 4 | 61.62 s | 3.80 s | 6.01 s |
| 8 | 59.44 s | 3.20 s | 4.65 s |
| 16 | 49.05 s | 3.90 s | 5.72 s |
| 32 | 52.18 s | 3.44 s | 4.96 s |
| 64 | 57.61 s | 5.37 s | 7.09 s |
| 128 | 86.96 s | 8.54 s | 11.16 s |
| 256 | 119.51 s | 14.57 s | 19.10 s |

Add-on: I re-ran the code. This time 16 had the best performance in all cases.

| num_processes | AmazonCat-13K | EUR-Lex | Wiki10-31K |
| --- | --- | --- | --- |
| no parallel | 97.36 s | 6.50 s | 10.57 s |
| 8 | 62.09 s | 4.09 s | 6.29 s |
| 16 | 42.34 s | 3.59 s | 5.67 s |
| 32 | 69.60 s | 4.72 s | 6.70 s |

Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the lowest running time among the tested settings.

Having said that, the results are device- and system-specific, so the best choice for num_processes might differ, for example, on an Intel CPU or on Windows (I'm using an AMD server CPU and Linux).

I tested multiprocessing on Windows. Since Windows and macOS do not use "fork" as the start method (Windows does not support it at all), the run takes longer with "spawn" as the start method (spawn takes more time to start workers than fork).
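
For illustration only (not part of this PR): a start method can be forced explicitly via `multiprocessing.get_context`, which is how a fork-vs-spawn comparison like the one below can be set up; the stand-in tokenizer is hypothetical.

```python
import multiprocessing


def whitespace_tokenize(text):
    # Stand-in for the actual tokenizer; must be module-level so it can be pickled.
    return text.split()


def tokenize_with_start_method(texts, start_method, num_processes=16):
    # start_method is "fork" (the default on Linux) or "spawn" (the default on
    # Windows and macOS); spawn starts fresh interpreters, so worker startup is slower.
    ctx = multiprocessing.get_context(start_method)
    with ctx.Pool(processes=num_processes) as pool:
        return pool.map(whitespace_tokenize, texts)
```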

win32, spawn

| num_processes | EUR-Lex | Wiki10-31K |
| --- | --- | --- |
| no parallel | 6.02 s | 15.61 s |
| 2 | 14.85 s | 23.68 s |
| 4 | 17.57 s | 19.42 s |
| 8 | 20.08 s | 23.44 s |
| 12 | 26.21 s | 35.86 s |

I also tested spawn on Linux.

linux, spawn

| num_processes | EUR-Lex | Wiki10-31K |
| --- | --- | --- |
| no parallel | 6.61 s | 10.64 s |
| 2 | 8.41 s | 12.67 s |
| 4 | 6.07 s | 8.60 s |
| 8 | 5.88 s | 8.65 s |
| 12 | 5.36 s | 6.81 s |
| 16 | 5.42 s | 6.85 s |
| 32 | 6.36 s | 8.86 s |

It turned out the support for multiprocessing is more complicated than I thought, so I'll limit the use of multiprocessing to Linux only.
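
A minimal sketch of the Linux-only guard described above; the function and parameter names are illustrative, not the actual code in this PR:

```python
import sys
import multiprocessing


def tokenize_corpus(texts, tokenize, num_processes=16):
    # Parallel tokenization is enabled only on Linux, where "fork" is the
    # default start method; other platforms fall back to a sequential loop.
    # `tokenize` must be a picklable (module-level) function.
    if sys.platform == "linux" and num_processes > 1:
        with multiprocessing.Pool(processes=num_processes) as pool:
            return pool.map(tokenize, texts)
    return [tokenize(text) for text in texts]
```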

Test CLI & API (bash tests/autotest.sh)

Test APIs used by main.py.

  • Test Pass
    • (Copy and paste the last outputted line here.)
  • Not Applicable (i.e., the PR does not include API changes.)

Check API Document

If any new APIs are added, please check whether the description of the APIs has been added to the API document.

  • API document is updated (linear, nn)
  • Not Applicable (i.e., the PR does not include API changes.)

Test quickstart & API (bash tests/docs/test_changed_document.sh)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.

@donglihe-hub force-pushed the multiprocessing branch 3 times, most recently from 5c58aa6 to 206f6f5 on January 23, 2024 11:55
@donglihe-hub marked this pull request as ready for review on January 23, 2024 12:13
@donglihe-hub force-pushed the multiprocessing branch 5 times, most recently from 7941e41 to 763d476 on January 24, 2024 09:51