Note about chunking data #3
For those having trouble pre-processing, this should help:
For simplicity, we originally provided code to produce a single `train.bin`, `val.bin` and `test.bin` file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. `train_000.bin`, `train_001.bin`, ..., `train_287.bin`). In the interest of reproducibility, `make_datafiles.py` has now been updated to also produce chunked data, saved in `finished_files/chunked`, and the README for the Tensorflow code now gives instructions for chunked data. If you've already run `make_datafiles.py` to obtain `train.bin`/`val.bin`/`test.bin` files, then just run the chunking step in Python, from the `cnn-dailymail` directory, to get the chunked files (it takes a few seconds). To use your chunked datafiles with the Tensorflow code, point the data path at the chunked files. You don't have to restart training from the beginning to switch to the chunked datafiles.
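The chunking step itself amounts to reading the length-prefixed records out of the single `.bin` file and rewriting them in groups of 1000. Here is a minimal, hypothetical sketch of that step (the 8-byte little-endian length prefix is an assumption about how `make_datafiles.py` writes the `.bin` files, and `read_records`/`chunk_file` are illustrative names, not the repository's actual API):

```python
import os
import struct

CHUNK_SIZE = 1000  # examples per chunk, matching the paper's setup

def read_records(path):
    """Yield raw length-prefixed records from a .bin file.

    Assumes each record is an 8-byte little-endian length followed by
    that many bytes of serialized example data.
    """
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if not header:
                return
            (length,) = struct.unpack('<q', header)
            yield header + f.read(length)

def _flush(records, out_dir, set_name, idx):
    """Write one chunk file, e.g. train_000.bin."""
    out_path = os.path.join(out_dir, '%s_%03d.bin' % (set_name, idx))
    with open(out_path, 'wb') as f:
        for r in records:
            f.write(r)

def chunk_file(in_path, out_dir, set_name, chunk_size=CHUNK_SIZE):
    """Split e.g. train.bin into train_000.bin, train_001.bin, ..."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_idx, buf = 0, []
    for record in read_records(in_path):
        buf.append(record)
        if len(buf) == chunk_size:
            _flush(buf, out_dir, set_name, chunk_idx)
            chunk_idx, buf = chunk_idx + 1, []
    if buf:  # final, possibly smaller chunk
        _flush(buf, out_dir, set_name, chunk_idx)
```

Because the records are copied verbatim (header and payload together), the chunked files contain exactly the same examples as the original file, just split every `chunk_size` records.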
Why does it matter?
The multi-threaded batcher code is originally from the TextSum project. The idea is that each input thread calls `example_generator`, which randomizes the order of the chunks and then reads from the chunks in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, then the multi-threaded batcher will result in 16 threads concurrently filling the input queue with examples drawn, in order, from the same single `.bin` file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results from reading through it in randomized chunks. (From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)
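As a rough sketch of that idea (this is not the actual TextSum/batcher code; `example_generator` here is an illustrative reimplementation, and the 8-byte length-prefixed record format is an assumption about the `.bin` files):

```python
import glob
import random
import struct

def _read_records(path):
    """Yield serialized examples from one chunk file.

    Assumes each record is an 8-byte little-endian length prefix
    followed by that many bytes of serialized example data.
    """
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if not header:
                return
            (length,) = struct.unpack('<q', header)
            yield f.read(length)

def example_generator(data_path, single_pass=False):
    """Yield examples, re-shuffling the chunk order on every pass.

    data_path is a glob pattern such as 'finished_files/chunked/train_*'.
    Each input thread gets its own generator, so each thread visits the
    chunks in its own random order.
    """
    while True:
        filelist = glob.glob(data_path)
        assert filelist, 'empty filelist at %s' % data_path
        if single_pass:
            filelist = sorted(filelist)   # deterministic order for eval/decode
        else:
            random.shuffle(filelist)      # randomized chunk order for training
        for path in filelist:
            for record in _read_records(path):
                yield record
        if single_pass:
            return
```

With a single `.bin` file, `glob` returns one path, so the shuffle is a no-op and every thread reads the same file front to back, which is exactly the duplicate-and-ordering problem described above.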