
Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) #624

Merged

Conversation

@zhangguanheng66 (Contributor) commented Oct 23, 2019

Re-write the language modeling datasets with a new pattern that was first applied to the text classification datasets in the v0.4.0 release.

Motivation

The motivation for the new pattern is to simplify data processing in torchtext and to grant users more flexibility in building their pipelines. There are three major issues we want to solve:

  • Remove the dependency on the Field class, which couples tokenization, vocabulary, splitting, batching and sampling, padding, and numericalization together. It acts like a "black box" and confuses users about what is going on inside it. With the new pattern, these components become basic building blocks, and users can compose the data processing pipeline from orthogonal components.
  • Incompatibility with DataLoader and Sampler in torch.utils.data. Some duplicate functionality (e.g. Iterator, Batch, splits in torchtext) should be replaced by the corresponding components in torch.utils.data to reduce the maintenance effort.
  • Unnecessary data structures. For example, the Example class adds no structure and should be replaced with a tuple/dict or namedtuple.

API for new language modeling datasets

To support "one-command" data loading, we have built a pipeline here that supports PennTreebank, WikiText103, and WikiText2. Users are welcome to build their own datasets if they follow the pattern. To load one of the new datasets, simply call the dataset API, as follows:

from torchtext.datasets import WikiText2
train_dataset, valid_dataset, test_dataset = WikiText2()

If you want to use a specific tokenizer:

from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")
train_dataset, valid_dataset, test_dataset = WikiText2(tokenizer=tokenizer)

If you just want the valid set, reuse the vocabulary built from the training set:

vocab = train_dataset.get_vocab()
valid_dataset, = WikiText2(tokenizer=tokenizer, vocab=vocab, data_select='valid')

Legacy code

We have decided to move the old language modeling datasets (i.e. PennTreebank, WikiText103, WikiText2) to a legacy folder, torchtext.legacy.datasets. In the past, you may have used those datasets as follows:

import torchtext.data as data
from torchtext.datasets import WikiText2
TEXT = data.Field(lower=True, batch_first=True)
train_dataset, valid_dataset, test_dataset = WikiText2.splits(TEXT)

You can still use the legacy datasets, as follows:

import torchtext.data as data
from torchtext.legacy.datasets import WikiText2
TEXT = data.Field(lower=True, batch_first=True)
train_dataset, valid_dataset, test_dataset = WikiText2.splits(TEXT)

Differences

  • With the old pattern, users have to create a Field object that includes a specific tokenizer. With the new dataset API, users can pass the tokenizer directly to the dataset constructor:
from torchtext.data.utils import get_tokenizer
# Old pattern
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"))
# New pattern
train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=get_tokenizer("spacy"))
  • In the old dataset, the vocab object is associated with the Field class, and there is no way to apply a pre-existing vocab object. In the new dataset, the vocab object can be obtained by
vocab = train_dataset.get_vocab()

and applied to generate new datasets:

train_dataset, test_dataset, valid_dataset = WikiText103(vocab=vocab)
  • The datasets with the new pattern return a tensor of token IDs, instead of the tokens returned by the old pattern. If users would like to inspect the tokens, simply use the following:
train_vocab = train_dataset.get_vocab()
tokens = [train_vocab.itos[token_id] for token_id in train_dataset]
  • Unlike the old pattern, which used BucketIterator.splits, users are encouraged to use torch.utils.data.DataLoader to generate batches of data. You can specify exactly how the samples should be batched via collate_fn. For example, to batch 64 token IDs into 8x8 tensors:
import torch
from torch.utils.data import DataLoader

num_row = 8

def generate_rows(data):
    # Reshape each batch of 64 token IDs into an 8x8 tensor (64 / num_row columns).
    return torch.tensor(data).view(num_row, -1).t().contiguous()

dataloader = DataLoader(train_dataset, batch_size=64, num_workers=4,
                        collate_fn=generate_rows)
for batch in dataloader:
    ...  # Send batch to the model.

@zhangguanheng66 zhangguanheng66 changed the title Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) [WIP] Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Oct 23, 2019
@zhangguanheng66 zhangguanheng66 changed the title [WIP] Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Oct 31, 2019


def _read_text_iterator(data_path, tokenizer):
Contributor

These two functions look similar to what's in the classification datasets - is there a way to share this code via some simple abstractions?

Contributor Author

I was thinking about that. The only difference is the label in the text classification datasets.
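
For illustration, one hypothetical way to share the reader (the has_label flag and the tab-separated label column are assumptions, not the merged code):

import io

def _read_text_iterator(data_path, tokenizer, has_label=False):
    # Yield tokenized lines; splitting off a leading label column is the
    # only extra step the text classification datasets would need.
    with io.open(data_path, encoding="utf8") as f:
        for row in f:
            if has_label:
                label, text = row.split("\t", 1)
                yield int(label), tokenizer(text)
            else:
                yield tokenizer(row)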


def _create_data_from_iterator(vocab, iterator, include_unk):
    _data = []
    with tqdm(unit_scale=0, unit='lines') as t:
Contributor

Is this noisy by default? Printing progress bars can cause a lot of noise.

@zhangguanheng66 (Contributor Author) commented Nov 5, 2019

I could remove it, but I've seen issues where users complained about getting no feedback while the data loads. They wanted to see the progress.

Contributor

Let's make it quiet by default, because otherwise it'll spam the logs of programs running outside of an interactive environment.
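
A minimal sketch of quiet-by-default progress reporting (the show_progress flag is an assumption, not the merged API):

from tqdm import tqdm

def _create_data_from_iterator(vocab, iterator, removed_tokens=None,
                               show_progress=False):
    _data = []
    # disable=True silences tqdm entirely, so non-interactive logs stay clean.
    with tqdm(unit_scale=0, unit='lines', disable=not show_progress) as t:
        for tokens in iterator:
            _data += [vocab[token] for token in tokens
                      if removed_tokens is None or token not in removed_tokens]
            t.update(1)
    return _data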

@cpuhrsch (Contributor) commented Nov 5, 2019

Generally I think this looks good. We need to write clear documentation that describes how to migrate to this new dataset from the old one.

yield tokens


def _create_data_from_iterator(vocab, iterator, include_unk):
Contributor Author

This function could be unified as well.

@zhangguanheng66 (Contributor Author)

> Generally I think this looks good. We need to write clear documentation that describes how to migrate to this new dataset from the old one.

Yes. I could write some instructions once the IMDB datasets are done.

@@ -178,33 +176,31 @@ def read_text_iterator(path, tokenizer):
yield tokens


def create_data_from_iterator(vocab, iterator, include_unk):
r"""Create data from an token iterator.
def create_data_from_iterator(vocab, iterator, removed_tokens=None):
Contributor

Make sure to keep track of things that break BC for the release notes. As far as I can tell this does.

Contributor Author

It's the first time we've added create_data_from_iterator to torchtext.data.functional.
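
For context, a hedged usage sketch based on the signatures shown in this PR's diffs (vocab, train_path, and the '<unk>' removal are illustrative):

tokenizer = get_tokenizer("basic_english")
train_iter = create_data_from_iterator(
    vocab, read_text_iterator(train_path, tokenizer), removed_tokens=['<unk>'])
train_data = []
for tokens in train_iter:
    # Each yielded item is an iterator of token IDs; += extends the list.
    train_data += tokens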

self.assertEqual(tokens_ids, [2, 285, 502, 699])

# Delete the dataset after we're done to save disk space on CI
if os.environ.get("TRAVIS") == "true":
Contributor

This seems like a generic cleanup pattern. You should use the capabilities the test framework brings along for this.
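
For example, unittest's addCleanup hook runs unconditionally, even on failure (a sketch; the test name and the '.data' directory are illustrative):

import shutil
import unittest

from torchtext.datasets import WikiText2

class TestWikiText2(unittest.TestCase):
    def test_vocab(self):
        # Registered cleanup runs whether or not the test passes,
        # with no need to check for a CI environment variable.
        self.addCleanup(shutil.rmtree, '.data', ignore_errors=True)
        train_dataset, test_dataset, valid_dataset = WikiText2()
        self.assertIsNotNone(train_dataset.get_vocab())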


return (LanguageModelingDataset(torch.Tensor(train_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(test_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(valid_data).long(), vocab))
Contributor

A "dataset" is therefore a list of 3 datasets corresponding to train/test/validation? I assume the division is dictated by the source itself, right?

Could the user ever want to just get one of train/test/validation? What do you think of letting the user get one at a time when building a dataset?

Contributor

For reference: librispeech in audio simply lets the user choose which one to download.

Contributor Author

That could be an option, with an extra input argument for it.
But it's kind of a convention to generate the three datasets together (when they exist), and most users will need all of them for training and inference.

@vincentqb (Contributor) commented Nov 8, 2019

What do you think of having a function that returns each split separately, and then wrapping all three with the API here?

Something like

vocab = build_vocab(train)
train = _setup_datasets_part("train", vocab)
test = _setup_datasets_part("test", vocab)
valid = _setup_datasets_part("valid", vocab)
return train, test, valid

Contributor

The flag is probably a good idea. Someone might only want to run validation or testing in a separate process. Either accept a string or a tuple of strings, I'd say.

Contributor Author

We talked about this offline. The decision is to add a keyword argument (a tuple of strings) to identify which datasets will be generated. The default value is a tuple of ('train', 'test'). Users have the flexibility to choose a single dataset with a proper vocab object.
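
Under that decision, usage might look like this (a sketch consistent with the API section in the description above):

# Build the vocab on the training split first...
train_dataset, = WikiText2(data_select=('train',))
vocab = train_dataset.get_vocab()
# ...then a separate process can load only the split it needs.
test_dataset, = WikiText2(vocab=vocab, data_select=('test',))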

Contributor

Ok.

In CommonVoice, the implementation doesn't know about the meaning of train/test/etc. Instead, the zip file contains many tsv files (e.g. for train, test, ...), and the user simply specifies which file to use.

An advantage is that the user could decide to create their own custom tsv file (say, for training) and load that one.

Contributor Author

> Ok.
>
> In CommonVoice, the implementation doesn't know about the meaning of train/test/etc. Instead, the zip file contains many tsv files (e.g. for train, test, ...), and the user simply specifies which file to use.
>
> An advantage is that the user could decide to create their own custom tsv file (say, for training) and load that one.

Yes, that's kind of the same case for the translation datasets, so we need to provide an option to choose the file.
However, for most text datasets, the raw data files have been properly tagged with train, test, and valid.

Contributor

Some datasets in audio even have "train_100", "train_500", etc. built in, see this. With only "train", "test", and "valid", how would you like to deal with that here?

Contributor Author

In my earlier response, I proposed adding three more flags, train_filename, valid_filename, and test_filename, which give you the flexibility to choose a specific file for training. I think either way should work; we just need to stay consistent across the datasets within a domain.

    vocab, read_text_iterator(valid_path, tokenizer), removed_tokens)
valid_data = []
for tokens in valid_iter:
    valid_data += tokens
Contributor

nit: I don't know that it's valuable to simplify this block, but it'd be more readable by just adding a space between line 91 and 92 :)


def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
                    root='.data', vocab=None, removed_tokens=['<unk>']):
    if dataset_name == 'PennTreebank':
Contributor

If we use that function for future datasets, is there a more general strategy than just making PennTreebank a special case?
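
One possible general strategy (a sketch, not the merged code): keep per-dataset split metadata in a table instead of branching on the dataset name. URLS and download_from_url are the objects used elsewhere in this PR; _SPLIT_INDEX is hypothetical.

# Datasets served as separate per-split files list their URL indices here.
_SPLIT_INDEX = {'PennTreebank': {'train': 0, 'test': 1, 'valid': 2}}

def _download_split(dataset_name, split, root='.data'):
    if dataset_name in _SPLIT_INDEX:
        # One file per split.
        url = URLS[dataset_name][_SPLIT_INDEX[dataset_name][split]]
    else:
        # A single archive containing all splits.
        url = URLS[dataset_name]
    return download_from_url(url, root=root)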

_data[item] += tokens

return tuple(LanguageModelingDataset(torch.tensor(_data[d]).long(), vocab)
             for d in data_select if _data[d] != [])
Contributor

I think if _data[d] is empty we actually want to raise an error because the data the user requested is empty OR we simply return a dataset that loads nothing.

If the data is empty we might have downloaded something corrupted.

Contributor Author

Done!
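
The check might look something like this (a sketch consistent with the snippet above; the exact exception type is an assumption):

for d in data_select:
    if not _data[d]:
        raise ValueError(
            "Dataset {} is empty; the download may be corrupted.".format(d))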

if 'valid' in data_select:
    extracted_files.append(download_from_url(URLS['PennTreebank'][2], root=root))
else:
    dataset_tar = download_from_url(URLS[dataset_name], root=root)
Contributor

Another thing we could do with these download functions is compare the result to a checksum. We added those capabilities in torchaudio. I think that might actually be worth it, if this checksum calculating is fast. If it's not fast we might need to make this a flag. This could save a lot of grief and is something rather common. Should we do this now? cc @vincentqb
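
A minimal sketch of the kind of verification being discussed (the hash choice and helper name are assumptions, not torchaudio's actual API):

import hashlib

def _validate_download(path, expected_md5):
    # Hash in chunks so large archives never load fully into memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5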

Contributor

Having a checksum is a good idea. However, since the goal is to put the download functions in sync soonish, we should avoid investing time optimizing this particular version. I'd recommend using the one we worked on recently in torchaudio instead.

Contributor

Yes, we should copy that one into here as well. Do you want to create a PR?

Contributor Author

@vincentqb @cpuhrsch I'm fine with having one download function across domains if it doesn't break any current download activity.

Contributor

I think it's worth testing this, since it's our goal down the road :)

@cpuhrsch (Contributor)

Thanks for updating the description! It's much clearer.

Could you also add a paragraph that contrasts the old and new datasets? As in, do we give up on any capabilities, how much does the API differ at each step etc.

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    tokens = list(filter(lambda x: x not in removed_tokens, tokens))
Contributor

I think you can do this on the fly :). filter([...]) should be able to feed into iter and simply skip undesired items.
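
Applied to the snippet above, the lazy version might read (a sketch):

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    # Filter lazily instead of materializing an intermediate list.
    yield iter(vocab[token] for token in
               filter(lambda t: t not in removed_tokens, tokens))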

Contributor Author

Fixed. Thanks.

@zhangguanheng66 (Contributor Author)

> Thanks for updating the description! It's much clearer.
>
> Could you also add a paragraph that contrasts the old and new datasets? As in, do we give up on any capabilities, how much does the API differ at each step etc.

Added a section comparing the differences at the very end. @cpuhrsch

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    yield iter(vocab[token] for token in
Contributor

You can write this more compactly as iter(filter(lambda x: x is not None, [3, 4, None]))

yield tokens


def create_data_from_iterator(vocab, iterator, removed_tokens=None):
Contributor

Is the name of the function clear? convert_tokens_to_ids?

@@ -0,0 +1,5 @@
from . import datasets

__version__ = '0.4.0'
Contributor

nit: What does the version mean here?

README.rst Outdated
Legacy Code
===========

We are currently retiring several datasets as legacy code ```torchtext.legacy```:
Contributor

nit: We have currently retired several datasets and moved them under torchtext.legacy.

@cpuhrsch (Contributor) left a comment

LGTM. See Vincent's comment on create_data_from_iterator.
