How can I make batch instances (converting input text fields to tensor objects) faster or more efficient? #2962
Comments
Hi, this is an annoying problem in allennlp. The reason it happens is that the code which generates batches is very general and generic, so we cannot specialise to particular types of indexing which might be faster. Firstly, one thing you can do to make the problem less severe is to use a MultiprocessIterator (https://allenai.github.io/allennlp-docs/api/allennlp.data.iterators.html#multiprocess-iterator) in combination with a MultiprocessDatasetReader (https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.multiprocess_dataset_reader.html) with num_processes == 1, just to make the batching happen in a separate process. However, in general we would like to speed up the iterators wherever possible. It seems you've done some speed investigations - if you have any short scripts which you have used to diagnose these problems, we would very much like to see them!
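For reference, a minimal sketch of that setup in Python, assuming the 0.8.x-era API (the exact argument names, e.g. `num_workers` vs. `num_processes`, may differ slightly, so check the linked docs; `MyDialogueReader` is a hypothetical stand-in for your own reader):

```python
# Rough sketch, not a drop-in recipe: wrap an existing reader and iterator so that
# reading and batching run in a separate process. Argument names are assumed from
# the 0.8.x API and may differ -- see the linked documentation.
from allennlp.data.dataset_readers import MultiprocessDatasetReader
from allennlp.data.iterators import BucketIterator, MultiprocessIterator

base_reader = MyDialogueReader()                      # hypothetical: your own DatasetReader
reader = MultiprocessDatasetReader(base_reader=base_reader, num_workers=1)

base_iterator = BucketIterator(
    sorting_keys=[("context", "num_tokens")],         # assumes a "context" TextField
    batch_size=32,
)
iterator = MultiprocessIterator(base_iterator=base_iterator, num_workers=1)
```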
Thanks for the useful information. I think that the problem arises from the iterator, not the dataset reader. By the way, instead of sharing the short scripts, I can explain the details. The relevant part of my dataset reader looks like this:

```python
from allennlp.data.fields import LabelField, ListField, TextField

# Inside the dataset reader's _read(), roughly:
fields = {
    'context': TextField(context, self._token_indexers),
    'response': TextField(response, self._token_indexers),
    'chosen_topic': TextField(chosen_topic, self._token_indexers),
    'knowledge_titles': ListField(
        [TextField(title, self._token_indexers) for title in knowledge_titles]),
    'knowledge_sentences': ListField(
        [TextField(sentence, self._token_indexers) for sentence in knowledge_sentences]),
    'checked_title': TextField(checked_title, self._token_indexers),
    'checked_sentence': TextField(checked_sentence, self._token_indexers),
    'checked_sentence_idx': LabelField(checked_sentence_idx, skip_indexing=True),
}
yield self.text_to_instance(fields)
```

The problem happens when these fields enter the iterator. While they are being converted to tensor objects, the work is not parallelized; it consists of several plain Python for-loops. As a result, I think the slowdown comes from this single-threaded Python code (i.e. the for-loops).
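To make that concrete, here is a rough pseudocode sketch of the per-batch work an iterator does, assuming the 0.8.x-era `Instance`/`Batch` API (this is not the library's exact code, and the import location is an assumption):

```python
# Hedged sketch of what happens to every batch of Instances before training:
# all of this is plain single-process Python, looping over fields and tokens,
# which is why large ListFields of TextFields make it slow.
from allennlp.data.dataset import Batch  # location assumed from the 0.8.x layout

def tensorize(instances, vocab):
    for instance in instances:
        instance.index_fields(vocab)                 # token -> id lookup, field by field
    batch = Batch(instances)
    padding_lengths = batch.get_padding_lengths()    # another pass over every field
    return batch.as_tensor_dict(padding_lengths)     # builds the padded tensors
```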
Have you tried using the `MultiprocessIterator`? Any profiling you have of the data generation process would be great - I'm sure there are places the code can be made a bit more efficient.
Yeap, I tried to use it. I can offer profiling of the data-generation process, but I think the point is not to improve the speed by a slight margin, but rather to modify the iterator functions fundamentally so that they support proper parallelization. Consequently, I realized that batch generation is an annoying problem in allennlp, and I am now asking the AllenNLP team to optimize it, even if that will take a long time. Thanks!
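For completeness, a minimal sketch of how such profiling of batch generation could look (the reader and the data path are hypothetical placeholders; the iterator API is assumed from the 0.8.x `DataIterator` interface):

```python
# Hedged profiling sketch: measures only the batch-generation side, independent of
# the model, to see where tensorisation spends its time.
import cProfile
import pstats

from allennlp.data import Vocabulary
from allennlp.data.iterators import BasicIterator

# `reader` is a placeholder for your own DatasetReader; "train.jsonl" is a hypothetical path.
instances = list(reader.read("train.jsonl"))
vocab = Vocabulary.from_instances(instances)

iterator = BasicIterator(batch_size=32)
iterator.index_with(vocab)               # the iterator needs a vocabulary to index the fields

def generate_batches(num_batches=200):
    # Pull batches and throw them away, so only batch generation is profiled.
    for _, _batch in zip(range(num_batches), iterator(instances, num_epochs=1, shuffle=False)):
        pass

cProfile.run("generate_batches()", "iterator.prof")
pstats.Stats("iterator.prof").sort_stats("cumulative").print_stats(20)
```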
I also have problems with inefficient tensorification/batching. Likewise, I am using the caching option. As caching "solves" the problem for now, I wondered whether making the cache persistent (saving the tensors with hdf5 or so) would be an option. Then transforming the input into tensors would have to happen only once, and we would get a speedup from the first epoch onwards, i.e. for every experiment except the very first one. Assuming, of course, that loading the cache does not take too long.
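A minimal sketch of the persistent-cache idea (assumptions: the cached objects are padded tensor dicts as produced by the usual `as_tensor_dict()` step, and `torch.save`/`torch.load` stand in for hdf5; the paths and function names are hypothetical):

```python
# Hedged sketch: tensorise once, save the padded tensor dicts to disk, and reload them
# in later runs so the slow Python-loop tensorisation only ever happens once.
import os
import torch

def save_tensor_cache(tensor_batches, path):
    # tensor_batches: a list of dicts of padded tensors, one dict per batch
    torch.save(tensor_batches, path)

def load_or_build_tensor_cache(path, build_fn):
    if os.path.exists(path):
        return torch.load(path)           # later runs: skip tensorisation entirely
    tensor_batches = list(build_fn())     # first run: pay the tensorisation cost once
    save_tensor_cache(tensor_batches, path)
    return tensor_batches
```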
You'd have to totally redo how we iterate over data if you want to cache the tensors directly, so it's a whole lot of work. That's the main downside.
- Adds a script to benchmark iterators.
  - Average speed
  - Introspects queues
- Removes a bottleneck when `MultiprocessDatasetReader` and `MultiprocessIterator` are used in conjunction.
  - Specifically, removes a redundant queue that was populated by a single process.
  - Removes managers, which have significant overhead.
- Results on training_config/bidirectional_language_model.jsonnet:
  - original code, no multiprocessing: 0.047 s/b over 10000 batches
  - original code, workers = 1: 0.073 s/b over 10000 batches
  - original code, workers = 10: 0.078 s/b over 10000 batches
  - this PR (-queue), workers = 1: 0.073 s/b over 10000 batches
  - this PR (-queue), workers = 10: 0.046 s/b over 10000 batches
  - this PR (-queue, -manager), workers = 1: 0.063 s/b over 10000 batches
  - this PR (-queue, -manager), workers = 10: 0.020 s/b over 10000 batches
- Notably, previously we did not see any benefit from scaling to multiple workers. Now we do, albeit worse than linearly. More work required there.
- Related issues: #2962, #1890
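For anyone who wants to reproduce a seconds-per-batch measurement like the numbers above, a minimal timing sketch (this is not the PR's actual benchmark script, which also introspects queues; `iterator` and `instances` are placeholders from the snippets earlier in the thread):

```python
# Hedged sketch of measuring seconds-per-batch for an iterator, in the spirit of
# the s/b figures quoted above.
import time

def seconds_per_batch(iterator, instances, num_batches=10000):
    generator = iterator(instances, num_epochs=None, shuffle=True)  # yields batches indefinitely
    start = time.perf_counter()
    for _ in range(num_batches):
        next(generator)                  # pulling a batch is the work being timed
    return (time.perf_counter() - start) / num_batches

# print(f"{seconds_per_batch(iterator, instances):.3f} s/b")
```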
Closing as largely addressed by the new data loaders (#3700).
Question
I have two questions (really, problems) about making batch instances.
First of all, generating batch instances with any iterator is too slow when the inputs are large. Specifically, each input consists of 80 sentences with 25 words per sentence, and the batch size is 32. While training a seq2seq attention model, the average training speed is about 3-4 iterations/sec, which is considerably slower than the same model implemented in TensorFlow.
Second, I found that there is a caching option for batch generation. When I turned on the cache mode, the speed increased to about 20 iterations/sec. However, I noticed that although the order of the batches differs from epoch to epoch, the composition of each batch is always the same. I think this is a serious problem, because fixed batch compositions could make the model appear to converge faster than it would with properly reshuffled batches.
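(For context, the caching option referred to here is presumably the iterator's instance-caching flag. A hedged sketch, assuming the 0.8.x-era `cache_instances` argument, which may be named differently in other versions:)

```python
# Hedged sketch (parameter name assumed from the 0.8.x DataIterator API):
# caching keeps the tensorised batches around after the first epoch, which would
# explain the jump from ~3-4 to ~20 iterations/sec reported above.
from allennlp.data.iterators import BucketIterator

iterator = BucketIterator(
    sorting_keys=[("context", "num_tokens")],   # assumes a "context" TextField
    batch_size=32,
    cache_instances=True,
)
```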
For these reasons, I am asking for ways to build batch instances efficiently.