This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

How can I create batch instances (converting input text fields to tensor objects) faster or more efficiently? #2962

Closed
ahnjaewoo opened this issue Jun 18, 2019 · 7 comments

@ahnjaewoo

ahnjaewoo commented Jun 18, 2019

System

  • OS: Linux
  • Python version: 3.6.8
  • AllenNLP version: v0.8.4
  • PyTorch version: 1.1.0

Question

I have two questions (really, problems) about making batch instances.

First of all, generating batch instances with any iterator is too slow when the input tokens are quite large. Specifically, the input tokens for a single batch are 80 sentences with 25 words per sentence, with a batch size of 32. While training a seq2seq attention model, the average training speed is about 3-4 iterations/sec, which is considerably slower than the result from the equivalent TensorFlow code.

Second, I found that there is a caching option for batch generation. When I turned on the cache mode, the speed indeed increased to about 20 iterations/sec. However, I realized that while the order of the batches differs from epoch to epoch, the composition of each batch is always the same. I think this is a serious problem, because fixed batch compositions could make the model appear to converge faster than it actually would.
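(For reference, this is roughly how I turn the cache on; a minimal sketch, and the flag name cache_instances on the iterator is from memory, so it may differ in other versions.)

from allennlp.data.iterators import BucketIterator

# Sketch only: enable instance caching on the iterator (flag name assumed
# to be `cache_instances`; check the DataIterator signature in your version).
iterator = BucketIterator(
    sorting_keys=[('context', 'num_tokens')],  # hypothetical sorting key for my fields
    batch_size=32,
    cache_instances=True,  # reuse the tensorized batches after the first pass
)
iterator.index_with(vocab)  # `vocab` is built from the training instances beforehand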

For these reasons, I am asking for ways to generate batch instances efficiently.

@DeNeutoy
Contributor

Hi, this is an annoying problem in AllenNLP. The reason it happens is that the code which generates batches is very general and generic, so we cannot specialise it to particular types of indexing which might be faster.

Firstly, one thing you can do to make this problem less severe is to use a MultiprocessIterator

https://allenai.github.io/allennlp-docs/api/allennlp.data.iterators.html#multiprocess-iterator

in combination with a MultiprocessDatasetReader

https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.multiprocess_dataset_reader.html

with num_processes == 1, just to make the batching happen in a separate process.
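A rough sketch of the wiring (just illustrative; the exact constructor argument names, e.g. num_workers vs. num_processes, differ between versions, so check the linked docs):

from allennlp.data.dataset_readers import MultiprocessDatasetReader
from allennlp.data.iterators import BucketIterator, MultiprocessIterator

# Wrap an existing reader (`my_reader` below is a placeholder for yours) so that
# reading and tokenization happen in a worker process.
reader = MultiprocessDatasetReader(base_reader=my_reader, num_workers=1)

# Wrap the batching iterator so that batching/tensorization also happens off the main process.
base_iterator = BucketIterator(sorting_keys=[('context', 'num_tokens')], batch_size=32)
iterator = MultiprocessIterator(base_iterator, num_workers=1)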

However, in general we would like to speed up the iterators wherever possible. It seems you've done some speed investigations - if you have any short scripts which you have used to diagnose these problems, we would very much like to see them!

@ahnjaewoo
Author

ahnjaewoo commented Jun 21, 2019

Thanks for the useful information.

I think the problem arises from the iterator, not the dataset reader.
And actually, I had already tried the MultiprocessIterator, but it wasn't as stable as I expected.

By the way, instead of sharing short scripts, let me explain the details.
I am using the Wizard of Wikipedia dataset (https://github.com/facebookresearch/ParlAI/tree/master/projects/wizard_of_wikipedia),
and during the "read" process in the dataset reader, there are several text fields, as below.

fields = {
    'context': TextField(context, self._token_indexers),
    'response': TextField(response, self._token_indexers),
    'chosen_topic': TextField(chosen_topic, self._token_indexers),
    'knowledge_titles': ListField([TextField(title, self._token_indexers) for title in knowledge_titles]),
    'knowledge_sentences': ListField([TextField(sentence, self._token_indexers) for sentence in knowledge_sentences]),
    'checked_title': TextField(checked_title, self._token_indexers),
    'checked_sentence': TextField(checked_sentence, self._token_indexers),
    'checked_sentence_idx': LabelField(checked_sentence_idx, skip_indexing=True),
}
yield self.text_to_instance(fields)

But the problem happens when these fields enter the iterator. While they are being converted to tensor objects, the work is not parallelized; it is just a series of Python for-loops.

As a result, I think the bottleneck is this single-threaded Python for-loop code.
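To be concrete, the per-batch work looks roughly like this (a simplified sketch of the 0.8.x code path, not the exact implementation):

from allennlp.data.dataset import Batch

# Simplified view of what the iterator does for every batch: each of these
# steps loops over instances and fields in plain Python.
batch = Batch(instances)                    # `instances` is one batch worth of Instances
batch.index_instances(vocab)                # token-to-id lookup, field by field
padding_lengths = batch.get_padding_lengths()
tensor_dict = batch.as_tensor_dict(padding_lengths)  # pad and build the tensors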

@DeNeutoy
Contributor

Have you tried using the MultiprocessIterator as well? Typically we have found that it resolves the problems with data generation speed, even with num_processes=1.

Any profiling you have of the data generation process would be great - I'm sure there are places where the code could be made a bit more efficient.
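Even something as simple as the following would help (a rough sketch; the reader and data path are placeholders):

import cProfile

from allennlp.data import Vocabulary
from allennlp.data.iterators import BucketIterator

instances = list(my_reader.read('train.json'))  # placeholder reader and path
vocab = Vocabulary.from_instances(instances)
iterator = BucketIterator(sorting_keys=[('context', 'num_tokens')], batch_size=32)
iterator.index_with(vocab)

# Profile one full pass of batch generation, tensorization included.
profiler = cProfile.Profile()
profiler.enable()
for _ in iterator(instances, num_epochs=1):
    pass
profiler.disable()
profiler.print_stats(sort='cumtime')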

@ahnjaewoo
Author

ahnjaewoo commented Jun 22, 2019

Yep, I did try the MultiprocessIterator, as I mentioned above.
I also set num_workers=1 in the MultiprocessIterator and trained with it, but the average training speed was about 1-3 iterations/sec.

I can provide profiling of the data-generation process, but I think the point is not to improve the speed by a slight margin, but rather to fundamentally restructure the iterator functions to support proper parallelization.

Consequently, I realize that batch generation is indeed an annoying problem in AllenNLP, and I am now asking the AllenNLP team to optimize this function, even if it will take a long time.

Thanks!

@mnschmit
Contributor

mnschmit commented Aug 21, 2019

I also have problems with inefficient tensorization/batching. Likewise, using MultiprocessIterator and MultiprocessDatasetReader did not bring the same speed-up as using the cache_instances flag on the underlying iterator (which, by the way, I can no longer use with the MultiprocessIterator).

Since caching "solves" the problem for now, I wondered whether making the cache persistent (saving the tensors with HDF5 or similar) would be an option. Transforming the input into tensors would then only have to happen once, and we would get an immediate speedup from the first epoch onwards, i.e., for every experiment except the first one, assuming, of course, that loading the cache does not take too long.
Are there any downsides or possible issues with this idea, at least as a workaround until a better long-term solution exists?
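What I have in mind is roughly the following (a sketch using torch.save instead of HDF5 just for illustration; the function and all names here are made up):

import os

import torch

def cached_tensor_batches(iterator, instances, cache_path):
    """Tensorize once, save the batches to disk, and reload them on later runs."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    batches = list(iterator(instances, num_epochs=1))  # list of tensor dicts
    torch.save(batches, cache_path)
    return batches

Of course this also freezes the batch composition and padding across runs, which is the same concern about constant batches raised earlier in this thread.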

@matt-gardner
Contributor

You have to totally redo how we iterate over data if you want to cache the tensors directly, so it's a whole lot of work. That's the main downside.

brendan-ai2 added a commit that referenced this issue Aug 29, 2019
- Adds a script to benchmark iterators.
  - Average speed
  - Introspects queues
- Removes a bottleneck when `MultiprocessDatasetReader` and `MultiprocessIterator` are used in conjunction.
  - Specifically, removes a redundant queue that was populated by a single process.
- Removes managers which have significant overhead.
- Results on training_config/bidirectional_language_model.jsonnet:
  - original code, no multiprocessing: 0.047 s/b over 10000 batches
  - original code, workers = 1: 0.073 s/b over 10000 batches
  - original code, workers = 10: 0.078 s/b over 10000 batches
  - this PR (-queue), workers = 1: 0.073 s/b over 10000 batches
  - this PR (-queue), workers = 10: 0.046 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 1: 0.063 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 10: 0.020 s/b over 10000 batches
  - Notably, previously we did not see any benefit from scaling to multiple workers. Now we do, albeit worse than linearly. More work required there.
- Related issues: #2962, #1890
reiyw pushed a commit to reiyw/allennlp that referenced this issue Nov 12, 2019
@matt-gardner
Contributor

Closing as largely addressed by the new data loaders (#3700).
