This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

How to register a MultiProcessDatasetReader and config it in config.json? #3794

Closed
haoyuan80s opened this issue Feb 17, 2020 · 3 comments

@haoyuan80s

Hi,

I have a dataset reader configured as follows in a config.json file:

"dataset_reader" : {
    "type": "my_data_reader"
}

I am trying to make it multi-process. I am able to do it in Python as follows:

reader_ = DatasetReader.by_name(config['dataset_reader'].pop('type'))()
reader = MultiprocessDatasetReader(reader_, num_workers=32)

How can I configure a multi-process dataset reader in my config.json file?

Thanks!

@DeNeutoy
Contributor

DeNeutoy commented Feb 18, 2020

Hi @haoyuan80s,

Unfortunately, the correct answer at the moment is not to bother, because you won't see any effective speedup. We are working very hard on fixing this problem; see #3386, #3529, #3700, etc.

If you have found it to be faster, then you should be able to use something like this config file:
https://github.com/allenai/allennlp/blob/v0.9.0/training_config/bidirectional_language_model.jsonnet#L41
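
For illustration, here is a minimal sketch of what that wrapping looks like when applied to the config from the question. It assumes the wrapper is registered under the type name "multiprocess" and takes "base_reader" and "num_workers" keys, which is my understanding of AllenNLP v0.9.0; please verify against your installed version. The helper name wrap_in_multiprocess is hypothetical and only builds the JSON, it does not import AllenNLP.

```python
import json


def wrap_in_multiprocess(base_reader_config, num_workers):
    """Build a dataset_reader config that wraps an existing reader config.

    Assumption: "multiprocess", "base_reader" and "num_workers" match the
    MultiprocessDatasetReader registration in AllenNLP v0.9.0.
    """
    return {
        "type": "multiprocess",
        # Copy so the caller's original config is left untouched.
        "base_reader": dict(base_reader_config),
        "num_workers": num_workers,
    }


# The config from the question.
original = {"dataset_reader": {"type": "my_data_reader"}}

# Same config, with the reader wrapped for 32 worker processes.
wrapped = dict(original)
wrapped["dataset_reader"] = wrap_in_multiprocess(
    original["dataset_reader"], num_workers=32
)

# Print the JSON that would go into config.json.
print(json.dumps(wrapped, indent=4))
```

The printed "dataset_reader" block can be pasted into config.json in place of the original one. If I remember the v0.9.0 docstring correctly, the wrapped reader also expects the data path to be a glob matching several files, which the workers split between them.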

@haoyuan80s
Author

There is actually no speedup. OK, I will just wait.

Thanks

@dirkgr
Member

dirkgr commented Feb 21, 2020

I will close this issue then, since there is no action other than what @DeNeutoy is already working on.

@dirkgr dirkgr closed this as completed Feb 21, 2020