This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

How can I create batch instances (converting input text fields to tensor objects) faster or more efficiently? #2962

Closed
ahnjaewoo opened this issue Jun 18, 2019 · 7 comments

@ahnjaewoo

ahnjaewoo commented Jun 18, 2019

System

  • OS: Linux
  • Python version: 3.6.8
  • AllenNLP version: v0.8.4
  • PyTorch version: 1.1.0

Question

I have two questions (really, problems) about making batch instances.

First of all, generating batch instances with any iterator is too slow when the input tokens are quite large. Specifically, the input tokens for a single batch are 80 sentences with 25 words per sentence, with a batch size of 32. While training a seq2seq attention model, the average training speed is about 3-4 iterations/sec, which is considerably slower than the result from the equivalent TensorFlow code.

Second, I found that there is a caching option for batch generation. When I turned on the cache mode, the speed indeed increased to about 20 iterations/sec. However, I realized that while the order of the batches differs from epoch to epoch, the composition of each batch is always the same. I think this is a serious problem, because fixed batch compositions could make the model appear to converge faster than it actually would.
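(For reference, this is roughly how I turn the cache on; a minimal sketch, and the flag name cache_instances on the iterator is from memory, so it may differ in other versions.)

from allennlp.data.iterators import BucketIterator

# Sketch only: enable instance caching on the iterator (flag name assumed
# to be `cache_instances`; check the DataIterator signature in your version).
iterator = BucketIterator(
    sorting_keys=[('context', 'num_tokens')],  # hypothetical sorting key for my fields
    batch_size=32,
    cache_instances=True,  # reuse the tensorized batches after the first pass
)
iterator.index_with(vocab)  # `vocab` is built from the training instances beforehand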

For these reasons, I am asking for ways to generate batch instances efficiently.

@DeNeutoy
Contributor

Hi, this is an annoying problem in AllenNLP. The reason it happens is that the code which generates batches is very general and generic, so we cannot specialise it to particular types of indexing which might be faster.

Firstly, one thing you can do to make this problem less severe is to use a MultiprocessIterator

https://allenai.github.io/allennlp-docs/api/allennlp.data.iterators.html#multiprocess-iterator

in combination with a MultiprocessDatasetReader

https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.multiprocess_dataset_reader.html

with num_processes == 1, just to make the batching happen in a separate process.
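A rough sketch of the wiring (just illustrative; the exact constructor argument names, e.g. num_workers vs. num_processes, differ between versions, so check the linked docs):

from allennlp.data.dataset_readers import MultiprocessDatasetReader
from allennlp.data.iterators import BucketIterator, MultiprocessIterator

# Wrap an existing reader (`my_reader` below is a placeholder for yours) so that
# reading and tokenization happen in a worker process.
reader = MultiprocessDatasetReader(base_reader=my_reader, num_workers=1)

# Wrap the batching iterator so that batching/tensorization also happens off the main process.
base_iterator = BucketIterator(sorting_keys=[('context', 'num_tokens')], batch_size=32)
iterator = MultiprocessIterator(base_iterator, num_workers=1)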

However, in general we would like to speed up the iterators wherever possible. It seems you've done some speed investigations - if you have any short scripts which you have used to diagnose these problems, we would very much like to see them!

@ahnjaewoo
Author

ahnjaewoo commented Jun 21, 2019

Thanks for the useful information.

I think the problem arises from the iterator, not the dataset reader.
And actually, I had already tried the MultiprocessIterator, but it wasn't as stable as I expected.

By the way, instead of sharing short scripts, let me explain the details.
I am using the Wizard of Wikipedia dataset (https://github.com/facebookresearch/ParlAI/tree/master/projects/wizard_of_wikipedia),
and during the "read" process in the dataset reader, there are several text fields, as below.

fields = {
    'context': TextField(context, self._token_indexers),
    'response': TextField(response, self._token_indexers),
    'chosen_topic': TextField(chosen_topic, self._token_indexers),
    'knowledge_titles': ListField([TextField(title, self._token_indexers) for title in knowledge_titles]),
    'knowledge_sentences': ListField([TextField(sentence, self._token_indexers) for sentence in knowledge_sentences]),
    'checked_title': TextField(checked_title, self._token_indexers),
    'checked_sentence': TextField(checked_sentence, self._token_indexers),
    'checked_sentence_idx': LabelField(checked_sentence_idx, skip_indexing=True),
}
yield self.text_to_instance(fields)

But the problem happens when these fields enter the iterator. While they are being converted to tensor objects, the work is not parallelized; it is just a series of Python for-loops.

As a result, I think the bottleneck is this single-threaded Python for-loop code.
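To be concrete, the per-batch work looks roughly like this (a simplified sketch of the 0.8.x code path, not the exact implementation):

from allennlp.data.dataset import Batch

# Simplified view of what the iterator does for every batch: each of these
# steps loops over instances and fields in plain Python.
batch = Batch(instances)                    # `instances` is one batch worth of Instances
batch.index_instances(vocab)                # token-to-id lookup, field by field
padding_lengths = batch.get_padding_lengths()
tensor_dict = batch.as_tensor_dict(padding_lengths)  # pad and build the tensors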

@DeNeutoy
Contributor

Have you tried using the MultiprocessIterator as well? Typically we have found that it resolves the problems with data generation speed, even with num_processes=1.

Any profiling you have of the data generation process would be great - I'm sure there are places where the code could be made a bit more efficient.
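Even something as simple as the following would help (a rough sketch; the reader and data path are placeholders):

import cProfile

from allennlp.data import Vocabulary
from allennlp.data.iterators import BucketIterator

instances = list(my_reader.read('train.json'))  # placeholder reader and path
vocab = Vocabulary.from_instances(instances)
iterator = BucketIterator(sorting_keys=[('context', 'num_tokens')], batch_size=32)
iterator.index_with(vocab)

# Profile one full pass of batch generation, tensorization included.
profiler = cProfile.Profile()
profiler.enable()
for _ in iterator(instances, num_epochs=1):
    pass
profiler.disable()
profiler.print_stats(sort='cumtime')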

@ahnjaewoo
Author

ahnjaewoo commented Jun 22, 2019

Yep, I did try the MultiprocessIterator, as I mentioned above.
I also set num_workers=1 in the MultiprocessIterator and trained with it, but the average training speed was about 1-3 iterations/sec.

I can provide profiling of the data-generation process, but I think the point is not to improve the speed by a slight margin, but rather to fundamentally restructure the iterator functions to support proper parallelization.

Consequently, I realize that batch generation is indeed an annoying problem in AllenNLP, and I am now asking the AllenNLP team to optimize this function, even if it will take a long time.

Thanks!

@mnschmit
Contributor

mnschmit commented Aug 21, 2019

I also have problems with inefficient tensorization/batching. Likewise, using MultiprocessIterator and MultiprocessDatasetReader did not bring the same speed-up as using the cache_instances flag on the underlying iterator (which, by the way, I can no longer use with the MultiprocessIterator).

Since caching "solves" the problem for now, I wondered whether making the cache persistent (saving the tensors with HDF5 or similar) would be an option. Transforming the input into tensors would then only have to happen once, and we would get an immediate speedup from the first epoch onwards, i.e., for every experiment except the first one, assuming, of course, that loading the cache does not take too long.
Are there any downsides or possible issues with this idea, at least as a workaround until a better long-term solution exists?
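What I have in mind is roughly the following (a sketch using torch.save instead of HDF5 just for illustration; the function and all names here are made up):

import os

import torch

def cached_tensor_batches(iterator, instances, cache_path):
    """Tensorize once, save the batches to disk, and reload them on later runs."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    batches = list(iterator(instances, num_epochs=1))  # list of tensor dicts
    torch.save(batches, cache_path)
    return batches

Of course this also freezes the batch composition and padding across runs, which is the same concern about constant batches raised earlier in this thread.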

@matt-gardner
Contributor

You have to totally redo how we iterate over data if you want to cache the tensors directly, so it's a whole lot of work. That's the main downside.

brendan-ai2 added a commit that referenced this issue Aug 29, 2019
- Adds a script to benchmark iterators.
  - Average speed
  - Introspects queues
- Removes a bottleneck when `MultiprocessDatasetReader` and `MultiprocessIterator` are used in conjunction.
  - Specifically, removes a redundant queue that was populated by a single process.
- Removes managers which have significant overhead.
- Results on training_config/bidirectional_language_model.jsonnet:
  - original code, no multiprocessing: 0.047 s/b over 10000 batches
  - original code, workers = 1: 0.073 s/b over 10000 batches
  - original code, workers = 10: 0.078 s/b over 10000 batches
  - this PR (-queue), workers = 1: 0.073 s/b over 10000 batches
  - this PR (-queue), workers = 10: 0.046 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 1: 0.063 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 10: 0.020 s/b over 10000 batches
  - Notably, previously we did not see any benefit from scaling to multiple workers. Now we do, albeit worse than linearly. More work required there.
- Related issues: #2962, #1890
reiyw pushed a commit to reiyw/allennlp that referenced this issue Nov 12, 2019
@matt-gardner
Contributor

Closing as largely addressed by the new data loaders (#3700).
