This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Data V2 #3700

Merged: 59 commits, Feb 26, 2020

Commits (59)
0c42cb9
example for feedback
DeNeutoy Jan 30, 2020
5ffedfc
Merge branch 'master' into data-v2
DeNeutoy Feb 19, 2020
80049f8
remove all existing multiprocessing
DeNeutoy Feb 19, 2020
6f58c2a
sneak torch datasets inside DatasetReader
DeNeutoy Feb 19, 2020
1b3ad9a
lint
DeNeutoy Feb 19, 2020
effc445
trainer_v2, We Love To See It
DeNeutoy Feb 19, 2020
9d44ad6
datasets have index_with now, not iterators
DeNeutoy Feb 19, 2020
7e89ea6
use iter, custom collate function in allennlp wrapper
DeNeutoy Feb 19, 2020
883b6d7
we don't even need the data in the trainer anymore
DeNeutoy Feb 19, 2020
56d022a
all trainer tests passing
DeNeutoy Feb 20, 2020
01e12f5
black
DeNeutoy Feb 20, 2020
5aea291
make find learning rate work
DeNeutoy Feb 20, 2020
f026946
update test fixtures to new config
DeNeutoy Feb 20, 2020
5973b50
get train command tests mostly working
DeNeutoy Feb 20, 2020
a23f47a
lazily construct samplers, index lazy datasets
DeNeutoy Feb 20, 2020
a76ea0a
Merge branch 'master' into data-v2
DeNeutoy Feb 20, 2020
ebf3854
update some fixtures
DeNeutoy Feb 20, 2020
57a67e5
evaluate tests passing
DeNeutoy Feb 20, 2020
7d21ed8
all command tests passing
DeNeutoy Feb 20, 2020
24a500c
lint
DeNeutoy Feb 20, 2020
fb13769
update model test case, common and module tests passing
DeNeutoy Feb 20, 2020
ef5187f
fix test interdependence introduced by #3762
DeNeutoy Feb 21, 2020
b1ea845
more test interdependence
DeNeutoy Feb 21, 2020
0231616
tests tests tests
DeNeutoy Feb 21, 2020
01d76bb
remove unnecessary brackets
DeNeutoy Feb 21, 2020
12b6efb
Merge branch 'master' into data-v2
DeNeutoy Feb 21, 2020
859d3ca
update a chunk of the configs
DeNeutoy Feb 21, 2020
c22dee3
fix archival test, couple more configs
DeNeutoy Feb 21, 2020
fe5b470
rm pointless gan test
DeNeutoy Feb 21, 2020
7533c91
more tests passing
DeNeutoy Feb 21, 2020
ad45659
add current state of from params changes
DeNeutoy Feb 21, 2020
f944840
Revert "add current state of from params changes"
DeNeutoy Feb 21, 2020
3b12a2f
Merge branch 'master' into data-v2
DeNeutoy Feb 21, 2020
be1f58c
updated understanding of Lazy
DeNeutoy Feb 21, 2020
ebdabe0
add discussion of None comparison to Lazy
DeNeutoy Feb 21, 2020
8693739
lint
DeNeutoy Feb 21, 2020
b9b0650
it's a hard doc life
DeNeutoy Feb 21, 2020
88314c7
pull samplers into separate file
DeNeutoy Feb 21, 2020
14296a1
more docs updates
DeNeutoy Feb 22, 2020
8a08899
fold in #3812
DeNeutoy Feb 22, 2020
3520280
remove torch dataset
DeNeutoy Feb 22, 2020
0f1d8a4
add example to lazy
DeNeutoy Feb 22, 2020
93e1e89
rename to collate
DeNeutoy Feb 22, 2020
40dd695
no kwargs
DeNeutoy Feb 23, 2020
da3b1b4
Revert "fold in #3812"
DeNeutoy Feb 23, 2020
801a8f5
don't break up dataset
DeNeutoy Feb 23, 2020
007fd0c
add comment to iterable dataset len
DeNeutoy Feb 23, 2020
d00e1a9
Merge branch 'master' into data-v2
DeNeutoy Feb 23, 2020
c066804
improve docstrings, build dataloader using partial_objects
DeNeutoy Feb 23, 2020
61c7b14
flake
DeNeutoy Feb 23, 2020
2b56b14
give dataloader a default implementation
DeNeutoy Feb 24, 2020
354010a
safer default for DataLoader init
DeNeutoy Feb 24, 2020
568291d
more coherent dir structure
DeNeutoy Feb 24, 2020
a016103
update imports
DeNeutoy Feb 24, 2020
47db16a
Merge branch 'master' into data-v2
DeNeutoy Feb 24, 2020
04fdb70
add a test for the BucketBatchSampler
DeNeutoy Feb 24, 2020
d1d5c4a
split bucket sampler into own file, tests
DeNeutoy Feb 24, 2020
5f0c8db
PR comments
DeNeutoy Feb 26, 2020
6f63a53
Merge branch 'master' into data-v2
DeNeutoy Feb 26, 2020
lazily construct samplers, index lazy datasets
DeNeutoy committed Feb 20, 2020
commit a23f47a04588a5f49a97ac4628ad49164b3279f7
16 changes: 14 additions & 2 deletions allennlp/data/dataset_readers/dataset_reader.py
@@ -59,13 +59,19 @@ def __iter__(self) -> Iterator[Instance]:
if self.cache_file is not None and os.path.exists(self.cache_file):
with open(self.cache_file) as data_file:
for line in data_file:
yield self.deserialize(line)
instance = self.deserialize(line)
if self.vocab is not None:
instance.index_fields(self.vocab)
yield instance

# Case 2: Need to cache instances
elif self.cache_file is not None:
with open(self.cache_file, "w") as data_file:
for instance in self.instance_generator():
data_file.write(self.serialize(instance))
data_file.write("\n")
if self.vocab is not None:
instance.index_fields(self.vocab)
yield instance
# Case 3: No cache
else:
@@ -74,11 +80,17 @@ def __iter__(self) -> Iterator[Instance]:
raise ConfigurationError(
"For a lazy dataset reader, _read() must return a generator"
)
yield from instances
for instance in instances:
if self.vocab is not None:
instance.index_fields(self.vocab)
yield instance

def index_with(self, vocab: Vocabulary):
self.vocab = vocab

def __len__(self):
return 1
Contributor:
Why does it return 1?

Contributor Author:
Yeah.... This is just moving previous behaviour from the iterator onto the dataset instead:

https://github.com/allenai/allennlp/blob/master/allennlp/data/iterators/data_iterator.py#L305

In a couple of places we rely on calling len on the iterator (or now, the dataloader) not raising an error. If you have an IterableDataset and call len, the pytorch dataloader actually spits out a warning - but we need the call itself not to crash.

Contributor:
I see. Maybe you can add a comment there?
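
For concreteness, a minimal sketch of what such a comment might look like (the wording here is illustrative; a later commit in this PR, "add comment to iterable dataset len", is where a comment was actually added):

```python
def __len__(self):
    # Carried over from the old DataIterator behaviour: a few callers only
    # need len() not to raise. For a lazy reader the true size is unknown,
    # and pytorch's DataLoader only emits a warning when len() is called on
    # an IterableDataset, so return a harmless placeholder.
    return 1
```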



class DatasetReader(Registrable):
"""
34 changes: 22 additions & 12 deletions allennlp/data/samplers/__init__.py
@@ -5,6 +5,7 @@
from allennlp.common.registrable import Registrable

from allennlp.common.util import add_noise_to_dict_values, lazy_groups_of
from allennlp.common.lazy import Lazy
from allennlp.data.batch import Batch as AllennlpBatch
from allennlp.data.instance import Instance
from allennlp.data.vocabulary import Vocabulary
@@ -30,7 +31,7 @@ def __iter__(self) -> Iterable[List[int]]:

@Sampler.register("sequential")
class SequentialSampler(Sampler, data.SequentialSampler):
def __init__(self, data_source: data.Dataset):
def __init__(self, data_source: data.Dataset, **kwargs):
super().__init__(data_source)


@@ -47,7 +48,7 @@ class RandomSampler(Sampler, data.RandomSampler):
"""

def __init__(
self, data_source: data.Dataset, replacement: bool = False, num_samples: int = None
self, data_source: data.Dataset, replacement: bool = False, num_samples: int = None, **kwargs
):
super().__init__(data_source, replacement, num_samples)

@@ -60,7 +61,7 @@ class SubsetRandomSampler(Sampler, data.SubsetRandomSampler):
indices (sequence): a sequence of indices
"""

def __init__(self, indices: List[int]):
def __init__(self, indices: List[int], **kwargs):
super().__init__(indices)


@@ -82,7 +83,7 @@ class WeightedRandomSampler(Sampler, data.WeightedRandomSampler):
[0, 1, 4, 3, 2]
"""

def __init__(self, weights: List[float], num_samples: int, replacement: bool = True):
def __init__(self, weights: List[float], num_samples: int, replacement: bool = True, **kwargs):
super().__init__(weights, num_samples, replacement)


@@ -103,15 +104,15 @@ class BasicBatchSampler(BatchSampler, data.BatchSampler):
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
"""

def __init__(self, sampler: Sampler, batch_size: int, drop_last: bool):
def __init__(self, sampler: Sampler, batch_size: int, drop_last: bool, **kwargs):
super().__init__(sampler, batch_size, drop_last)


@BatchSampler.register("bucket")
class BatchInstanceSampler(BatchSampler):
def __init__(
self,
data: data.Dataset,
data_source: data.Dataset,
batch_size: int,
sorting_keys: List[Tuple[str, str]] = None,
padding_noise: float = 0.1,
@@ -121,7 +122,7 @@ def __init__(
self._sorting_keys = sorting_keys
self._padding_noise = padding_noise
self._batch_size = batch_size
self.data = data
self.data_source = data_source

def _argsort_by_padding(self, instances: List[Instance]) -> List[int]:
"""
@@ -159,7 +160,7 @@ def _argsort_by_padding(self, instances: List[Instance]) -> List[int]:

def __iter__(self) -> Iterable[List[int]]:

indices = self._argsort_by_padding(self.data)
indices = self._argsort_by_padding(self.data_source)
for group in lazy_groups_of(indices, self._batch_size):
yield list(group)

@@ -195,8 +196,8 @@ def __init__(
dataset: data.Dataset,
batch_size: int = 1,
shuffle: bool = False,
sampler: Sampler = None,
batch_sampler: BatchSampler = None,
sampler: Lazy[Sampler] = None,
Contributor:
Having this annotation here is not good. It makes it really hard for someone to just instantiate this themselves, because instead of just constructing the sampler and passing it to DataLoader(), they have to wrap it in a Lazy with some kind of lambda function. It's pretty obnoxious. See

This is only recommended for use when you have registered a `@classmethod` as the constructor
for your class, instead of using `__init__`. Having a `Lazy[]` type annotation on an argument
to an `__init__` method makes your class completely dependent on being constructed using the
`FromParams` pipeline, which is not a good idea.

Instead, keep this constructor, but remove the Lazy annotations and everything but the super().__init__() call. Then add a separate method, called from_partial_objects (or something else, if you can think of a better name for it), which is exactly this method, but with cls instead of self, and calling cls() instead of super().__init__(). You can put a docstring on that method that's something like this:

This method exists so that we can have a documented method to construct this class using
`FromParams`. If you are not using `FromParams` or config files, you can safely ignore this
method.
The reason we can't just use `__init__` with `FromParams` here is because there are
sequential dependencies to this class's arguments. Anything that has a `Lazy[]` type
annotation needs something from one of the non-`Lazy` arguments. The `Optimizer` needs to
have the parameters from the `Model` before it's constructed, and the `Schedulers` need to
have the `Optimizer`. Because of this, the typical way we construct things `FromParams`
doesn't work, so we use `Lazy` to allow for constructing the objects sequentially.
If you're not using `FromParams`, you can just construct these arguments in the right order
yourself in your code and call the constructor directly.
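
For concreteness, a rough sketch of the shape being suggested, assuming this module's existing imports (`data`, `Lazy`, `Sampler`, `BatchSampler`, `allennlp_collocate`) are available; `from_partial_objects` is the provisional name floated above, and the registration plumbing is omitted:

```python
class DataLoader(data.DataLoader):
    def __init__(
        self,
        dataset: data.Dataset,
        batch_size: int = 1,
        shuffle: bool = False,
        sampler: Sampler = None,
        batch_sampler: BatchSampler = None,
        num_workers: int = 0,
        pin_memory: bool = False,
    ):
        # Plain constructor: no Lazy[] annotations, so a sampler can be
        # built by hand and passed straight in without FromParams.
        super().__init__(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            sampler=sampler,
            batch_sampler=batch_sampler,
            num_workers=num_workers,
            collate_fn=allennlp_collocate,
            pin_memory=pin_memory,
        )

    @classmethod
    def from_partial_objects(
        cls,
        dataset: data.Dataset,
        batch_size: int = 1,
        shuffle: bool = False,
        sampler: Lazy[Sampler] = None,
        batch_sampler: Lazy[BatchSampler] = None,
        num_workers: int = 0,
        pin_memory: bool = False,
    ) -> "DataLoader":
        # Exists purely so FromParams has a documented entry point: the
        # samplers need the dataset before they can be constructed, which is
        # why the Lazy[] annotations live here rather than on __init__.
        sampler_ = sampler.construct(dataset=dataset) if sampler is not None else None
        batch_sampler_ = (
            batch_sampler.construct(dataset=dataset) if batch_sampler is not None else None
        )
        return cls(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            sampler=sampler_,
            batch_sampler=batch_sampler_,
            num_workers=num_workers,
            pin_memory=pin_memory,
        )
```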

Contributor Author:
I see what you mean - we should probably even make this class private, because you should never use it - you should just use the pytorch dataloader.

Contributor:
Making it private means changing type annotations in other code to a private object, which is also unfortunate. Probably better to just be really clear in the class docstring why this object exists (purely to get type annotations that will let us construct it using our FromParams pipeline), and that you can safely ignore it and use pytorch's DataLoader directly if you don't need the FromParams stuff.

batch_sampler: Lazy[BatchSampler] = None,
num_workers: int = 0,
collate_fn=None,
pin_memory: bool = False,
Expand All @@ -207,12 +208,21 @@ def __init__(
):

collate_fn = allennlp_collocate
if batch_sampler is not None:
batch_sampler_ = batch_sampler.construct(dataset=dataset)
else:
batch_sampler_ = None
if sampler is not None:
sampler_ = sampler.construct(dataset=dataset)
else:
sampler_ = None

super().__init__(
dataset=dataset,
batch_size=batch_size,
shuffle=shuffle,
sampler=sampler,
batch_sampler=batch_sampler,
sampler=sampler_,
batch_sampler=batch_sampler_,
num_workers=num_workers,
collate_fn=collate_fn,
pin_memory=pin_memory,
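
For illustration, direct (non-FromParams) usage of the pieces in this file might look roughly like the sketch below; `reader`, `vocab`, and the file path are hypothetical stand-ins, and the constructor arguments mirror the signatures shown in this diff:

```python
# Hypothetical direct usage, bypassing the FromParams pipeline entirely.
dataset = reader.read("path/to/train.jsonl")
dataset.index_with(vocab)  # let the dataset index instances with this vocab

batch_sampler = BatchInstanceSampler(
    data_source=dataset,
    batch_size=32,
    sorting_keys=[("tokens", "num_tokens")],
)

loader = data.DataLoader(
    dataset,
    batch_sampler=batch_sampler,
    collate_fn=allennlp_collocate,
)

for batch in loader:
    ...  # each batch is a padded tensor dict produced by allennlp_collocate
```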