
Note about chunking data #3

Closed
abisee opened this issue May 30, 2017 · 2 comments
abisee commented May 30, 2017

For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data, saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for using chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run

import make_datafiles
make_datafiles.chunk_all()

in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
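If you're curious what the chunking actually does, here is a minimal sketch of the idea (not the exact code in make_datafiles.py; the function name and constants are illustrative). It assumes each .bin file stores examples as an 8-byte length prefix followed by the serialized tf.Example bytes, which is how make_datafiles.py writes them, and rewrites the stream into files of 1000 examples each:

import os
import struct

CHUNK_SIZE = 1000  # examples per chunk, matching the chunk size used for the paper

def chunk_file(in_path, out_dir, set_name):
    # Read length-prefixed serialized examples from in_path and rewrite them
    # into out_dir/<set_name>_000.bin, out_dir/<set_name>_001.bin, ...
    with open(in_path, 'rb') as reader:
        chunk, finished = 0, False
        while not finished:
            out_path = os.path.join(out_dir, '%s_%03d.bin' % (set_name, chunk))
            with open(out_path, 'wb') as writer:
                for _ in range(CHUNK_SIZE):
                    len_bytes = reader.read(8)
                    if not len_bytes:
                        finished = True
                        break
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_bytes = reader.read(str_len)
                    writer.write(struct.pack('q', str_len))
                    writer.write(example_bytes)
            chunk += 1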

To use your chunked datafiles with the Tensorflow code, set e.g.

--data_path=/path/to/chunked/train_*

You don't have to restart training from the beginning to switch to the chunked datafiles.
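For example, a training invocation of the pointer-generator code might look roughly like this (the script name and the other flags follow that repo's README; your paths will differ):

python run_summarization.py --mode=train --data_path=/path/to/finished_files/chunked/train_* --vocab_path=/path/to/finished_files/vocab --log_root=/path/to/log --exp_name=my_experiment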

Why does it matter?

The multi-threaded batcher code originally comes from the TextSum project. The idea is that each input thread calls example_generator, which randomizes the order of the chunks and then reads through them in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, the multi-threaded batcher will have 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results from reading through it in randomized chunks.
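As a rough sketch of that generator (a simplified version of what the Tensorflow code does; details may differ from the actual data.py), each thread shuffles the list of chunk files independently and then reads the examples within each chunk in order:

import glob
import random
import struct

from tensorflow.core.example import example_pb2

def example_generator(data_path, single_pass=False):
    # data_path is a glob pattern, e.g. /path/to/chunked/train_*
    while True:
        filelist = glob.glob(data_path)
        assert filelist, 'Error: empty filelist at %s' % data_path
        if single_pass:
            filelist = sorted(filelist)
        else:
            random.shuffle(filelist)  # each thread visits the chunks in its own random order
        for f in filelist:
            with open(f, 'rb') as reader:
                while True:
                    len_bytes = reader.read(8)
                    if not len_bytes:
                        break  # end of this chunk
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_bytes = reader.read(str_len)
                    yield example_pb2.Example.FromString(example_bytes)
        if single_pass:
            break

With a single unchunked train.bin, the glob matches only one file, so every thread ends up reading that same file from the start, which is exactly where the duplicate and ordering issues above come from.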

  • If you're concerned about duplicate examples in batches, either chunk your data or switch the batcher to single-threaded by setting
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1  # num threads to fill batch queue

(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)

  • If you're concerned about reproducibility and the effect on training of reading the data in randomized chunks vs. from a single file, then chunk your data.
tifoit commented Jul 30, 2018

joanlamrack commented Dec 17, 2021

For anyone having trouble pre-processing the data, this should help:
https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
