Silent failing of batch_sampler when the data points are lists of tensors. #32851

simonverret commented Jan 31, 2020

If a dataset is created from a list of lists of tensors, and a custom batch_sampler is made to sample from the outer list, then the inner tensors get silently mingled together across samples in the resulting batches.

To Reproduce

Consider a dataset consisting of two lists of tensors. The Sampler used as batch_sampler is intended to produce a single batch containing both lists of this dataset.

import torch
from torch.utils.data import Dataset, DataLoader, Sampler

list_of_list_of_tensor_data = [
        [ torch.tensor([0,1,2]),
          torch.tensor([3,4,5]) ],
        [ torch.tensor([6,7,8]),
          torch.tensor([9,0,1]) ]
    ]

class ListData(Dataset):
    def __init__(self, data):
        super().__init__()
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

dataset = ListData(list_of_list_of_tensor_data)

class MyBatchSampler(Sampler):
    def __init__(self, data):
        super().__init__(data)
    def __iter__(self):
        yield [0, 1]

sampler = MyBatchSampler(list_of_list_of_tensor_data)
loader = DataLoader(dataset, batch_sampler=sampler)

for batch in loader:
    print(batch)

As a result, the batch you get is a list of mingled tensors:

 [tensor([[0, 1, 2],
         [6, 7, 8]]), tensor([[3, 4, 5],
         [9, 0, 1]])]

Expected behavior

Of course, the expected behaviour is obtained if you use full tensors instead of lists. Stacking the lists in the dataset's __getitem__ does the trick here:

class TensorData(Dataset):
    def __init__(self, data):
        super().__init__()
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return torch.stack(self.data[idx])

dataset2 = TensorData(list_of_list_of_tensor_data)
loader2 = DataLoader(dataset2, batch_sampler=sampler)

for batch in loader2:
    print(batch)

which yields:

tensor([[[0, 1, 2],
          [3, 4, 5]],

         [[6, 7, 8],
          [9, 0, 1]]])

Environment

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.14.3

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.0.1.post2
[pip3] torchvision==0.2.2.post3
[conda] Could not collect

Additional Context

Sampler is probably not intended to be used in the way described above (see #28743), but since the code runs and produces unexpected results, it was hard to identify what we did wrong when we stumbled on this.

cc @ssnl @cpuhrsch

ezyang added the module: dataloader, triaged, and module: nestedtensor labels on Jan 31, 2020

ezyang commented Jan 31, 2020

I speculatively added nestedtensor on this issue, cc @cpuhrsch

cpuhrsch commented

This issue seems unrelated to NestedTensors.

@simonverret - If I'm understanding you correctly, you want to pass in a Dataset that's simply backed by a list, but observe unexpected behavior when it comes to sampling.

Try using your own collate_fn. This will give you full control over how batches are assembled. There you can then decide whether you want to return a list of Tensors or other constructs.
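
For example, a minimal sketch of such a collate_fn (the identity_collate name is just illustrative), reusing the dataset and sampler from the report above:

from torch.utils.data import DataLoader

def identity_collate(batch):
    # Return the sampled items exactly as the Dataset produced them:
    # a plain list of samples, with no stacking or zipping.
    return batch

loader = DataLoader(dataset, batch_sampler=sampler, collate_fn=identity_collate)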

simonverret commented

Thank you @cpuhrsch. Using the dataloader option collate_fn=lambda x: x indeed yields the desired behavior.

I am now able to explain my confusion. The dataloader interprets the dataset as a list of (input,target) tuples, whereas I designed the dataset to be a list of inputs only, i.e. [input_list1, input_list2, etc.]
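
In other words, the default collate behaves roughly like this sketch (not the actual library code) when each sample is a sequence of tensors:

import torch

def default_collate_sketch(batch):
    # Rough approximation only: zip the samples together and stack
    # corresponding elements, producing one tensor per "field".
    return [torch.stack(group) for group in zip(*batch)]

# For the two-sample batch above, this gives
# [tensor([[0, 1, 2], [6, 7, 8]]), tensor([[3, 4, 5], [9, 0, 1]])],
# which is exactly the mingled output shown earlier.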

For the specific example given above, the behavior therefore seems to be the correct, by-design one, and the mistake was on my side.

However, here is an example where the problem is more obvious: when the dataset's inner lists are longer than two elements:

[
        [ torch.tensor([0,0,0]),
          torch.tensor([1,1,1]),
          torch.tensor([2,2,2]),
          torch.tensor([3,3,3]) ],
        [ torch.tensor([4,4,4]),
          torch.tensor([5,5,5]),
          torch.tensor([6,6,6]),
          torch.tensor([7,7,7]) ],
        [ torch.tensor([8,8,8]),
          torch.tensor([9,9,9]),
          torch.tensor([10,10,10]),
          torch.tensor([11,11,11]) ],
        [ torch.tensor([12,12,12]),
          torch.tensor([13,13,13]),
          torch.tensor([14,14,14]),
          torch.tensor([15,15,15]) ]
    ]

Then, there should be no ambiguity that these are not (input, target) pairs. Yet a sampler defined to get custom batches:

def __iter__(self):
    for batch_idx in [[0, 2], [1, 3]]:
        yield batch_idx

(that is, the first and third lists go in the first batch, and the second and fourth lists go in the second batch) instead yields two batches in which the inner tensors have been stacked pairwise across samples. Batch 1:

[ tensor([[0, 0, 0], [8, 8, 8]]),
  tensor([[1, 1, 1], [9, 9, 9]]),
  tensor([[ 2,  2,  2], [10, 10, 10]]),
  tensor([[ 3,  3,  3], [11, 11, 11]]) ]

batch 2:

[ tensor([[ 4,  4,  4], [12, 12, 12]]),
  tensor([[ 5,  5,  5], [13, 13, 13]]),
  tensor([[ 6,  6,  6], [14, 14, 14]]),
  tensor([[ 7,  7,  7], [15, 15, 15]]) ]

Using the dataloader option collate_fn=lambda x: x corrects this and yields the expected results. Batch 1:

[ [ tensor([0, 0, 0]),
    tensor([1, 1, 1]),
    tensor([2, 2, 2]),
    tensor([3, 3, 3]) ],
  [ tensor([8, 8, 8]),
    tensor([9, 9, 9]),
    tensor([10, 10, 10]),
    tensor([11, 11, 11]) ] ]

and batch 2:

[ [ tensor([4, 4, 4]),
    tensor([5, 5, 5]),
    tensor([6, 6, 6]),
    tensor([7, 7, 7]) ],
  [ tensor([12, 12, 12]),
    tensor([13, 13, 13]),
    tensor([14, 14, 14]),
    tensor([15, 15, 15]) ] ]
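
For reference, here is a self-contained sketch of this second reproduction with the fix applied (the PairBatchSampler name and the *3 variable suffixes are illustrative; ListData is the dataset class defined in the original report):

import torch
from torch.utils.data import DataLoader, Sampler

# Same data as above: four lists of four tensors each.
long_data = [[torch.tensor([j, j, j]) for j in range(4 * i, 4 * i + 4)]
             for i in range(4)]

class PairBatchSampler(Sampler):
    def __init__(self, data):
        super().__init__(data)
    def __iter__(self):
        # First and third samples in one batch, second and fourth in the other.
        for batch_idx in [[0, 2], [1, 3]]:
            yield batch_idx

dataset3 = ListData(long_data)
sampler3 = PairBatchSampler(long_data)
loader3 = DataLoader(dataset3, batch_sampler=sampler3, collate_fn=lambda x: x)

for batch in loader3:
    print(batch)  # each batch is a list of two lists of four tensors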

I believe the latter behavior would be a more intuitive default when the dataset contains lists longer than two elements. What do you think?
