#385 improve documentation on chunks and chunk size parameter #423

Merged
MaxBenChrist merged 3 commits into blue-yonder:master from earthgecko:improve_chunk_size_docs
Sep 3, 2018

Conversation

earthgecko
Contributor

  • Updated docs with more info on chunksize
  • Updated chunksize docstrings with more info on chunksize

Modified:
docs/text/parallelization.rst
tsfresh/convenience/relevant_extraction.py
tsfresh/feature_extraction/extraction.py
tsfresh/feature_selection/relevance.py
tsfresh/feature_selection/selection.py
tsfresh/transformers/feature_augmenter.py
tsfresh/transformers/relevant_feature_augmenter.py

@coveralls

coveralls commented Aug 30, 2018

Coverage Status

Coverage remained the same at 97.444% when pulling 8ab8d12 on earthgecko:improve_chunk_size_docs into abb3237 on blue-yonder:master.

Restored a blank line that had been removed during some other testing
Modified:
tsfresh/feature_extraction/extraction.py
@MaxBenChrist
Collaborator

MaxBenChrist commented Sep 3, 2018

Good idea! Can you change that description a little bit? I would write

The chunksize is a :class:`multiprocessing.Pool` parallelisation parameter. One data chunk is defined as a singular time series for one id and one kind. The chunksize is the number of chunks that are submitted as one task to one worker process.  
If you set the chunksize to 10, then one worker task corresponds to calculating all features for 10 id/kind time series combinations.  
If it is set to None, heuristics that depend on the distributor are used to find the optimal chunksize.
The chunksize can have a crucial influence on cluster performance and should be optimised in benchmarks for the problem at hand.
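
To make the definition above concrete, here is a minimal sketch of how data chunks could be grouped into worker tasks. It is not tsfresh's actual implementation: the `partition` helper, the ids, and the kind names are hypothetical and serve only to illustrate the arithmetic.

```python
from itertools import islice


def partition(chunks, chunksize):
    """Group data chunks into tasks holding at most `chunksize` chunks each.

    Hypothetical helper for illustration only, not part of tsfresh's API.
    """
    iterator = iter(chunks)
    while True:
        task = list(islice(iterator, chunksize))
        if not task:
            return
        yield task


# Assumed example data: 23 ids and 2 kinds give 46 (id, kind) data chunks,
# each standing for one singular time series.
data_chunks = [(ts_id, kind) for ts_id in range(23) for kind in ("temperature", "pressure")]

# With chunksize=10, each worker task covers up to 10 id/kind combinations.
tasks = list(partition(data_chunks, chunksize=10))
task_sizes = [len(task) for task in tasks]
print(task_sizes)
```

With these assumed numbers the 46 chunks are split into four full tasks and one smaller remainder task. A smaller chunksize yields more, lighter tasks, while a larger one yields fewer, heavier tasks, which is why the value is worth benchmarking for the problem at hand.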

@earthgecko
Contributor Author

@MaxBenChrist modified as requested, all done.

@MaxBenChrist
Collaborator

Thx!

@MaxBenChrist MaxBenChrist merged commit 925dd64 into blue-yonder:master Sep 3, 2018