
Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) #624

Merged

Conversation

@zhangguanheng66 (Contributor) commented Oct 23, 2019

Re-write the language modeling datasets with a new pattern that was first applied to the text classification datasets in the v0.4.0 release.

Motivation

The motivation for the new pattern is to simplify data processing in torchtext and to grant users more flexibility in building their pipelines. There are three major issues we want to solve:

  • Remove the dependency on the Field class, which couples tokenization, vocabulary, splitting, batching and sampling, padding, and numericalization together. It acts like a "black box" and confuses users about what is going on inside it. With the new pattern, these components become basic building blocks, and users can compose the data processing pipeline from orthogonal components.
  • Incompatibility with DataLoader and Sampler in torch.utils.data. Some duplicate functionality (e.g. Iterator, Batch, splits in torchtext) should be replaced by the corresponding components in torch.utils.data to reduce the maintenance effort.
  • Unnecessary data structures. For example, the Example class adds no structure and should be replaced with a tuple/dict or namedtuple.

API for new language modeling datasets

To support "one-command" data loading, we have built a pipeline here that supports PennTreebank, WikiText103, and WikiText2. Users are welcome to build their own datasets if they follow the pattern. To load one of the new datasets, simply call the dataset API, as follows:

from torchtext.datasets import WikiText2
train_dataset, valid_dataset, test_dataset = WikiText2()

If you want to use a specific tokenizer:

from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")
train_dataset, valid_dataset, test_dataset = WikiText2(tokenizer=tokenizer)

If you just want the valid set, reuse the vocabulary built from the training set:

vocab = train_dataset.get_vocab()
valid_dataset, = WikiText2(tokenizer=tokenizer, vocab=vocab, data_select='valid')

Legacy code

We have decided to move the old language modeling datasets (i.e. PennTreebank, WikiText103, WikiText2) to a legacy folder, torchtext.legacy.datasets. In the past, you may have used those datasets as follows:

import torchtext.data as data
from torchtext.datasets import WikiText2
TEXT = data.Field(lower=True, batch_first=True)
train_dataset, valid_dataset, test_dataset = WikiText2.splits(TEXT)

You can still use the legacy datasets, as follows:

import torchtext.data as data
from torchtext.legacy.datasets import WikiText2
TEXT = data.Field(lower=True, batch_first=True)
train_dataset, valid_dataset, test_dataset = WikiText2.splits(TEXT)

Differences

  • With the old pattern, users have to create a Field object that includes a specific tokenizer. With the new dataset API, users can pass the tokenizer directly to the dataset constructor:
from torchtext.data.utils import get_tokenizer
# Old pattern
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"))
# New pattern
train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=get_tokenizer("spacy"))
  • In the old dataset, the vocab object is associated with the Field class, and there is no way to apply a pre-existing vocab object. In the new dataset, the vocab object can be obtained by
vocab = train_dataset.get_vocab()

and applied to generate new datasets:

train_dataset, test_dataset, valid_dataset = WikiText103(vocab=vocab)
  • The datasets with the new pattern return a tensor of token IDs, instead of the tokens returned by the old pattern. If users would like to inspect the tokens, simply use the following:
train_vocab = train_dataset.get_vocab()
tokens = [train_vocab.itos[token_id] for token_id in train_dataset]
  • Unlike the old pattern, which used BucketIterator.splits, users are encouraged to use torch.utils.data.DataLoader to generate batches of data. You can specify exactly how the samples should be batched via collate_fn. For example, to batch 64 token IDs into 8x8 tensors:
import torch
from torch.utils.data import DataLoader

num_row = 8

def generate_rows(data):
    # Reshape each batch of 64 token IDs into an 8x8 tensor (64 / num_row columns).
    return torch.tensor(data).view(num_row, -1).t().contiguous()

dataloader = DataLoader(train_dataset, batch_size=64, num_workers=4,
                        collate_fn=generate_rows)
for batch in dataloader:
    ...  # Send batch to the model.

@zhangguanheng66 zhangguanheng66 changed the title Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) [WIP] Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Oct 23, 2019
@zhangguanheng66 zhangguanheng66 changed the title [WIP] Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) Oct 31, 2019


def _read_text_iterator(data_path, tokenizer):
Contributor

These two functions look similar to what's in the classification datasets - is there a way to share this code via some simple abstractions?

Contributor Author

I was thinking about that. The only difference is the label in the text classification datasets.
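
For illustration, one hypothetical way to share the reader (the has_label flag and the tab-separated label column are assumptions, not the merged code):

import io

def _read_text_iterator(data_path, tokenizer, has_label=False):
    # Yield tokenized lines; splitting off a leading label column is the
    # only extra step the text classification datasets would need.
    with io.open(data_path, encoding="utf8") as f:
        for row in f:
            if has_label:
                label, text = row.split("\t", 1)
                yield int(label), tokenizer(text)
            else:
                yield tokenizer(row)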


def _create_data_from_iterator(vocab, iterator, include_unk):
    _data = []
    with tqdm(unit_scale=0, unit='lines') as t:
Contributor

Is this noisy by default? Printing progress bars can cause a lot of noise.

@zhangguanheng66 (Contributor Author) commented Nov 5, 2019

I could remove it, but I've seen issues where users complained about getting no feedback while the data loads. They wanted to see the progress.

Contributor

Let's make it quiet by default, because otherwise it'll spam the logs of programs running outside of an interactive environment.
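
A minimal sketch of quiet-by-default progress reporting (the show_progress flag is an assumption, not the merged API):

from tqdm import tqdm

def _create_data_from_iterator(vocab, iterator, removed_tokens=None,
                               show_progress=False):
    _data = []
    # disable=True silences tqdm entirely, so non-interactive logs stay clean.
    with tqdm(unit_scale=0, unit='lines', disable=not show_progress) as t:
        for tokens in iterator:
            _data += [vocab[token] for token in tokens
                      if removed_tokens is None or token not in removed_tokens]
            t.update(1)
    return _data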

@cpuhrsch (Contributor) commented Nov 5, 2019

Generally I think this looks good. We need to write clear documentation that describes how to migrate to this new dataset from the old one.

yield tokens


def _create_data_from_iterator(vocab, iterator, include_unk):
Contributor Author

This function could be unified as well.

@zhangguanheng66 (Contributor Author)

> Generally I think this looks good. We need to write clear documentation that describes how to migrate to this new dataset from the old one.

Yes. I could write some instructions once the IMDB datasets are done.

@@ -178,33 +176,31 @@ def read_text_iterator(path, tokenizer):
yield tokens


def create_data_from_iterator(vocab, iterator, include_unk):
r"""Create data from an token iterator.
def create_data_from_iterator(vocab, iterator, removed_tokens=None):
Contributor

Make sure to keep track of things that break BC for the release notes. As far as I can tell this does.

Contributor Author

It's the first time we've added create_data_from_iterator to torchtext.data.functional.
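
For context, a hedged usage sketch based on the signatures shown in this PR's diffs (vocab, train_path, and the '<unk>' removal are illustrative):

tokenizer = get_tokenizer("basic_english")
train_iter = create_data_from_iterator(
    vocab, read_text_iterator(train_path, tokenizer), removed_tokens=['<unk>'])
train_data = []
for tokens in train_iter:
    # Each yielded item is an iterator of token IDs; += extends the list.
    train_data += tokens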

self.assertEqual(tokens_ids, [2, 285, 502, 699])

# Delete the dataset after we're done to save disk space on CI
if os.environ.get("TRAVIS") == "true":
Contributor

This seems like a generic cleanup pattern. You should use the capabilities the test framework brings along for this.
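
For example, unittest's addCleanup hook runs unconditionally, even on failure (a sketch; the test name and the '.data' directory are illustrative):

import shutil
import unittest

from torchtext.datasets import WikiText2

class TestWikiText2(unittest.TestCase):
    def test_vocab(self):
        # Registered cleanup runs whether or not the test passes,
        # with no need to check for a CI environment variable.
        self.addCleanup(shutil.rmtree, '.data', ignore_errors=True)
        train_dataset, test_dataset, valid_dataset = WikiText2()
        self.assertIsNotNone(train_dataset.get_vocab())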


return (LanguageModelingDataset(torch.Tensor(train_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(test_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(valid_data).long(), vocab))
Contributor

A "dataset" is therefore a list of 3 datasets corresponding to train/test/validation? I assume the division is dictated by the source itself, right?

Could the user ever want to just get one of train/test/validation? What do you think of letting the user get one at a time when building a dataset?

Contributor

For reference: librispeech in audio simply lets the user choose which one to download.

Contributor Author

That could be an option, with an extra input argument for it.
But it's kind of a convention to generate the three datasets together (when they exist), and most users will need all of them for training and inference.

@vincentqb (Contributor) commented Nov 8, 2019

What do you think of having a function that returns each split separately, and then wrapping all three with the API here?

Something like

vocab = build_vocab(train)
train = _setup_datasets_part("train", vocab)
test = _setup_datasets_part("test", vocab)
valid = _setup_datasets_part("valid", vocab)
return train, test, valid

Contributor

The flag is probably a good idea. Someone might only want to run validation or testing in a separate process. Either accept a string or a tuple of strings, I'd say.

Contributor Author

We talked about this offline. The decision is to add a keyword argument (a tuple of strings) to identify which datasets will be generated. The default value is a tuple of ('train', 'test'). Users have the flexibility to choose a single dataset with a proper vocab object.
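
Under that decision, usage might look like this (a sketch consistent with the API section in the description above):

# Build the vocab on the training split first...
train_dataset, = WikiText2(data_select=('train',))
vocab = train_dataset.get_vocab()
# ...then a separate process can load only the split it needs.
test_dataset, = WikiText2(vocab=vocab, data_select=('test',))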

Contributor

Ok.

In CommonVoice, the implementation doesn't know about the meaning of train/test/etc. Instead, the zip file contains many tsv files (e.g. for train, test, ...), and the user simply specifies which file to use.

An advantage is that the user could decide to create their own custom tsv file (say, for training) and load that one.

Contributor Author

> Ok.
>
> In CommonVoice, the implementation doesn't know about the meaning of train/test/etc. Instead, the zip file contains many tsv files (e.g. for train, test, ...), and the user simply specifies which file to use.
>
> An advantage is that the user could decide to create their own custom tsv file (say, for training) and load that one.

Yes, that's kind of the same case for the translation datasets, so we need to provide an option to choose the file.
However, for most text datasets, the raw data files have been properly tagged with train, test, and valid.

Contributor

Some datasets in audio even have "train_100", "train_500", etc. built in, see this. With only "train", "test", and "valid", how would you like to deal with that here?

Contributor Author

In my earlier response, I proposed adding three more flags, train_filename, valid_filename, and test_filename, which give you the flexibility to choose a specific file for training. I think either way should work; we just need to stay consistent across the datasets within a domain.

    vocab, read_text_iterator(valid_path, tokenizer), removed_tokens)
valid_data = []
for tokens in valid_iter:
    valid_data += tokens
Contributor

nit: I don't know that it's valuable to simplify this block, but it'd be more readable by just adding a space between line 91 and 92 :)


def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
                    root='.data', vocab=None, removed_tokens=['<unk>']):
    if dataset_name == 'PennTreebank':
Contributor

If we use that function for future datasets, is there a more general strategy than just making PennTreebank a special case?
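
One possible general strategy (a sketch, not the merged code): keep per-dataset split metadata in a table instead of branching on the dataset name. URLS and download_from_url are the objects used elsewhere in this PR; _SPLIT_INDEX is hypothetical.

# Datasets served as separate per-split files list their URL indices here.
_SPLIT_INDEX = {'PennTreebank': {'train': 0, 'test': 1, 'valid': 2}}

def _download_split(dataset_name, split, root='.data'):
    if dataset_name in _SPLIT_INDEX:
        # One file per split.
        url = URLS[dataset_name][_SPLIT_INDEX[dataset_name][split]]
    else:
        # A single archive containing all splits.
        url = URLS[dataset_name]
    return download_from_url(url, root=root)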

_data[item] += tokens

return tuple(LanguageModelingDataset(torch.tensor(_data[d]).long(), vocab)
             for d in data_select if _data[d] != [])
Contributor

I think if _data[d] is empty we actually want to raise an error because the data the user requested is empty OR we simply return a dataset that loads nothing.

If the data is empty we might have downloaded something corrupted.

Contributor Author

Done!
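
The check might look something like this (a sketch consistent with the snippet above; the exact exception type is an assumption):

for d in data_select:
    if not _data[d]:
        raise ValueError(
            "Dataset {} is empty; the download may be corrupted.".format(d))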

if 'valid' in data_select:
    extracted_files.append(download_from_url(URLS['PennTreebank'][2], root=root))
else:
    dataset_tar = download_from_url(URLS[dataset_name], root=root)
Contributor

Another thing we could do with these download functions is compare the result to a checksum. We added those capabilities in torchaudio. I think that might actually be worth it, if this checksum calculating is fast. If it's not fast we might need to make this a flag. This could save a lot of grief and is something rather common. Should we do this now? cc @vincentqb
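
A minimal sketch of the kind of verification being discussed (the hash choice and helper name are assumptions, not torchaudio's actual API):

import hashlib

def _validate_download(path, expected_md5):
    # Hash in chunks so large archives never load fully into memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5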

Contributor

Having a checksum is a good idea. However, since the goal is to put the download functions in sync soonish, we should avoid investing time optimizing this particular version. I'd recommend using the one we worked on recently in torchaudio instead.

Contributor

Yes, we should copy that one into here as well. Do you want to create a PR?

Contributor Author

@vincentqb @cpuhrsch I'm fine with having one download function across domains if it doesn't break any current download activity.

Contributor

I think it's worth testing this, since it's our goal down the road :)

@cpuhrsch (Contributor)

Thanks for updating the description! It's much clearer.

Could you also add a paragraph that contrasts the old and new datasets? As in, do we give up on any capabilities, how much does the API differ at each step etc.

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    tokens = list(filter(lambda x: x not in removed_tokens, tokens))
Contributor

I think you can do this on the fly :). filter([...]) should be able to feed into iter and simply skip undesired items.
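
Applied to the snippet above, the lazy version might read (a sketch):

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    # Filter lazily instead of materializing an intermediate list.
    yield iter(vocab[token] for token in
               filter(lambda t: t not in removed_tokens, tokens))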

Contributor Author

Fixed. Thanks.

@zhangguanheng66 (Contributor Author)

> Thanks for updating the description! It's much clearer.
>
> Could you also add a paragraph that contrasts the old and new datasets? As in, do we give up on any capabilities, how much does the API differ at each step etc.

Added a section comparing the differences at the very end. @cpuhrsch

if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    yield iter(vocab[token] for token in
Contributor

You can write this more compactly as iter(filter(lambda x: x is not None, [3, 4, None]))

yield tokens


def create_data_from_iterator(vocab, iterator, removed_tokens=None):
Contributor

Is the name of the function clear? convert_tokens_to_ids?

@@ -0,0 +1,5 @@
from . import datasets

__version__ = '0.4.0'
Contributor

nit: What does the version mean here?

README.rst Outdated
Legacy Code
===========

We are currently retiring several datasets as legacy code ```torchtext.legacy```:
Contributor

nit: We have currently retired several datasets and moved them under torchtext.legacy.

@cpuhrsch (Contributor) left a comment

LGTM. See Vincent's comment on create_data_from_iterator.
