Silent failing of batch_sampler when the data points are lists of tensors. #32851

simonverret commented Jan 31, 2020

If a dataset is created from a list of lists of tensors, and a custom batch_sampler is made to sample from the outer list, then the inner tensors get silently mingled together across samples in the resulting batches.

To Reproduce

Consider a dataset consisting of two lists of tensors. The Sampler used as batch_sampler is intended to produce a single batch containing both lists of this dataset.

import torch
from torch.utils.data import Dataset, DataLoader, Sampler

list_of_list_of_tensor_data = [
        [ torch.tensor([0,1,2]),
          torch.tensor([3,4,5]) ],
        [ torch.tensor([6,7,8]),
          torch.tensor([9,0,1]) ]
    ]

class ListData(Dataset):
    def __init__(self, data):
        super().__init__()
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

dataset = ListData(list_of_list_of_tensor_data)

class MyBatchSampler(Sampler):
    def __init__(self, data):
        super().__init__(data)
    def __iter__(self):
        yield [0, 1]

sampler = MyBatchSampler(list_of_list_of_tensor_data)
loader = DataLoader(dataset, batch_sampler=sampler)

for batch in loader:
    print(batch)

As a result, the batch you get is a list of mingled tensors:

 [tensor([[0, 1, 2],
         [6, 7, 8]]), tensor([[3, 4, 5],
         [9, 0, 1]])]

Expected behavior

Of course, the expected behaviour is obtained if you use full tensors instead of lists. Stacking the lists in the dataset's __getitem__ does the trick here:

class TensorData(Dataset):
    def __init__(self, data):
        super().__init__()
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return torch.stack(self.data[idx])

dataset2 = TensorData(list_of_list_of_tensor_data)
loader2 = DataLoader(dataset2, batch_sampler=sampler)

for batch in loader2:
    print(batch)

which yields:

tensor([[[0, 1, 2],
          [3, 4, 5]],

         [[6, 7, 8],
          [9, 0, 1]]])

Environment

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.14.3

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.0.1.post2
[pip3] torchvision==0.2.2.post3
[conda] Could not collect

Additional Context

Sampler is probably not intended to be used in the way described above (see #28743), but since the code runs and produces unexpected results, it was hard to identify what we did wrong when we stumbled on this.

cc @ssnl @cpuhrsch

ezyang added the module: dataloader, triaged, and module: nestedtensor labels on Jan 31, 2020

ezyang commented Jan 31, 2020

I speculatively added nestedtensor on this issue, cc @cpuhrsch

cpuhrsch commented

This issue seems unrelated to NestedTensors.

@simonverret - If I'm understanding you correctly, you want to pass in a Dataset that's simply backed by a list, but observe unexpected behavior when it comes to sampling.

Try using your own collate_fn. This will give you full control over how batches are assembled. There you can then decide whether you want to return a list of Tensors or other constructs.
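
For example, a minimal sketch of such a collate_fn (the identity_collate name is just illustrative), reusing the dataset and sampler from the report above:

from torch.utils.data import DataLoader

def identity_collate(batch):
    # Return the sampled items exactly as the Dataset produced them:
    # a plain list of samples, with no stacking or zipping.
    return batch

loader = DataLoader(dataset, batch_sampler=sampler, collate_fn=identity_collate)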

simonverret commented

Thank you @cpuhrsch. Using the dataloader option collate_fn=lambda x: x indeed yields the desired behavior.

I am now able to explain my confusion. The dataloader interprets the dataset as a list of (input,target) tuples, whereas I designed the dataset to be a list of inputs only, i.e. [input_list1, input_list2, etc.]
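
In other words, the default collate behaves roughly like this sketch (not the actual library code) when each sample is a sequence of tensors:

import torch

def default_collate_sketch(batch):
    # Rough approximation only: zip the samples together and stack
    # corresponding elements, producing one tensor per "field".
    return [torch.stack(group) for group in zip(*batch)]

# For the two-sample batch above, this gives
# [tensor([[0, 1, 2], [6, 7, 8]]), tensor([[3, 4, 5], [9, 0, 1]])],
# which is exactly the mingled output shown earlier.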

For the specific example given above, the behavior therefore seems to be the correct, by-design one, and the mistake was on my side.

However, here is an example where the problem is more obvious: when the dataset's inner lists are longer than two elements:

[
        [ torch.tensor([0,0,0]),
          torch.tensor([1,1,1]),
          torch.tensor([2,2,2]),
          torch.tensor([3,3,3]) ],
        [ torch.tensor([4,4,4]),
          torch.tensor([5,5,5]),
          torch.tensor([6,6,6]),
          torch.tensor([7,7,7]) ],
        [ torch.tensor([8,8,8]),
          torch.tensor([9,9,9]),
          torch.tensor([10,10,10]),
          torch.tensor([11,11,11]) ],
        [ torch.tensor([12,12,12]),
          torch.tensor([13,13,13]),
          torch.tensor([14,14,14]),
          torch.tensor([15,15,15]) ]
    ]

Then, there should be no ambiguity that these are not (input, target) pairs. Yet a sampler defined to get custom batches:

def __iter__(self):
    for batch_idx in [[0, 2], [1, 3]]:
        yield batch_idx

(that is, the first and third lists go in the first batch, and the second and fourth lists go in the second batch) instead yields two batches in which the inner tensors have been stacked pairwise across samples. Batch 1:

[ tensor([[0, 0, 0], [8, 8, 8]]),
  tensor([[1, 1, 1], [9, 9, 9]]),
  tensor([[ 2,  2,  2], [10, 10, 10]]),
  tensor([[ 3,  3,  3], [11, 11, 11]]) ]

batch 2:

[ tensor([[ 4,  4,  4], [12, 12, 12]]),
  tensor([[ 5,  5,  5], [13, 13, 13]]),
  tensor([[ 6,  6,  6], [14, 14, 14]]),
  tensor([[ 7,  7,  7], [15, 15, 15]]) ]

Using the dataloader option collate_fn=lambda x: x corrects this and yields the expected results. Batch 1:

[ [ tensor([0, 0, 0]),
    tensor([1, 1, 1]),
    tensor([2, 2, 2]),
    tensor([3, 3, 3]) ],
  [ tensor([8, 8, 8]),
    tensor([9, 9, 9]),
    tensor([10, 10, 10]),
    tensor([11, 11, 11]) ] ]

and batch 2:

[ [ tensor([4, 4, 4]),
    tensor([5, 5, 5]),
    tensor([6, 6, 6]),
    tensor([7, 7, 7]) ],
  [ tensor([12, 12, 12]),
    tensor([13, 13, 13]),
    tensor([14, 14, 14]),
    tensor([15, 15, 15]) ] ]
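
For reference, here is a self-contained sketch of this second reproduction with the fix applied (the PairBatchSampler name and the *3 variable suffixes are illustrative; ListData is the dataset class defined in the original report):

import torch
from torch.utils.data import DataLoader, Sampler

# Same data as above: four lists of four tensors each.
long_data = [[torch.tensor([j, j, j]) for j in range(4 * i, 4 * i + 4)]
             for i in range(4)]

class PairBatchSampler(Sampler):
    def __init__(self, data):
        super().__init__(data)
    def __iter__(self):
        # First and third samples in one batch, second and fourth in the other.
        for batch_idx in [[0, 2], [1, 3]]:
            yield batch_idx

dataset3 = ListData(long_data)
sampler3 = PairBatchSampler(long_data)
loader3 = DataLoader(dataset3, batch_sampler=sampler3, collate_fn=lambda x: x)

for batch in loader3:
    print(batch)  # each batch is a list of two lists of four tensors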

I believe the latter behavior would be a more intuitive default when the dataset contains lists longer than two elements. What do you think?
