
Note about chunking data #3

Closed
abisee opened this issue May 30, 2017 · 2 comments
abisee commented May 30, 2017

For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data, saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for using chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run

import make_datafiles
make_datafiles.chunk_all()

in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
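If you're curious what the chunking actually does, here is a minimal sketch of the idea (not the exact code in make_datafiles.py; the function name and constants are illustrative). It assumes each .bin file stores examples as an 8-byte length prefix followed by the serialized tf.Example bytes, which is how make_datafiles.py writes them, and rewrites the stream into files of 1000 examples each:

import os
import struct

CHUNK_SIZE = 1000  # examples per chunk, matching the chunk size used for the paper

def chunk_file(in_path, out_dir, set_name):
    # Read length-prefixed serialized examples from in_path and rewrite them
    # into out_dir/<set_name>_000.bin, out_dir/<set_name>_001.bin, ...
    with open(in_path, 'rb') as reader:
        chunk, finished = 0, False
        while not finished:
            out_path = os.path.join(out_dir, '%s_%03d.bin' % (set_name, chunk))
            with open(out_path, 'wb') as writer:
                for _ in range(CHUNK_SIZE):
                    len_bytes = reader.read(8)
                    if not len_bytes:
                        finished = True
                        break
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_bytes = reader.read(str_len)
                    writer.write(struct.pack('q', str_len))
                    writer.write(example_bytes)
            chunk += 1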

To use your chunked datafiles with the Tensorflow code, set e.g.

--data_path=/path/to/chunked/train_*

You don't have to restart training from the beginning to switch to the chunked datafiles.
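For example, a training invocation of the pointer-generator code might look roughly like this (the script name and the other flags follow that repo's README; your paths will differ):

python run_summarization.py --mode=train --data_path=/path/to/finished_files/chunked/train_* --vocab_path=/path/to/finished_files/vocab --log_root=/path/to/log --exp_name=my_experiment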

Why does it matter?

The multi-threaded batcher code originally comes from the TextSum project. The idea is that each input thread calls example_generator, which randomizes the order of the chunks and then reads through them in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, the multi-threaded batcher will have 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results from reading through it in randomized chunks.
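As a rough sketch of that generator (a simplified version of what the Tensorflow code does; details may differ from the actual data.py), each thread shuffles the list of chunk files independently and then reads the examples within each chunk in order:

import glob
import random
import struct

from tensorflow.core.example import example_pb2

def example_generator(data_path, single_pass=False):
    # data_path is a glob pattern, e.g. /path/to/chunked/train_*
    while True:
        filelist = glob.glob(data_path)
        assert filelist, 'Error: empty filelist at %s' % data_path
        if single_pass:
            filelist = sorted(filelist)
        else:
            random.shuffle(filelist)  # each thread visits the chunks in its own random order
        for f in filelist:
            with open(f, 'rb') as reader:
                while True:
                    len_bytes = reader.read(8)
                    if not len_bytes:
                        break  # end of this chunk
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_bytes = reader.read(str_len)
                    yield example_pb2.Example.FromString(example_bytes)
        if single_pass:
            break

With a single unchunked train.bin, the glob matches only one file, so every thread ends up reading that same file from the start, which is exactly where the duplicate and ordering issues above come from.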

  • If you're concerned about duplicate examples in batches, either chunk your data or switch the batcher to single-threaded by setting
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1  # num threads to fill batch queue

(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)

  • If you're concerned about reproducibility and the effect on training of reading the data in randomized chunks vs. from a single file, then chunk your data.
tifoit commented Jul 30, 2018

joanlamrack commented Dec 17, 2021

For anyone having trouble pre-processing the data, this should help:
https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
