Silent failure of batch_sampler when the data points are lists of tensors #32851
Comments
I speculatively added the nestedtensor label to this issue, cc @cpuhrsch
This issue seems unrelated to NestedTensors. @simonverret - If I'm understanding you correctly, you want to pass in a Dataset that's simply backed by a list but observe unexpected behavior when it comes to sampling. Try using your own collate_fn. This will give you full control over how batches are assembled. There you can then decide whether you want to return a list of Tensors or other constructs.
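A minimal sketch of that suggestion (the dataset contents and the function name are illustrative assumptions, not code from this thread): a pass-through collate_fn hands back the sampled items exactly as they are, instead of letting the default collation zip them together.

```python
import torch
from torch.utils.data import DataLoader

# Toy dataset: a plain list of lists of tensors (values are made up).
dataset = [[torch.tensor([1.0]), torch.tensor([2.0])],
           [torch.tensor([3.0]), torch.tensor([4.0])]]

# Returning the batch untouched keeps each sample as its original list,
# instead of letting the default collate_fn zip the inner lists together.
def collate_passthrough(batch):
    return batch

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_passthrough)
batch = next(iter(loader))
# batch is a list of the two original inner lists, untouched
```

With this in place, the decision of whether to stack, pad, or keep the tensors as a list is made explicitly inside collate_passthrough rather than by the default collation logic.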
Thank you @cpuhrsch. Using the dataloader option, I am now able to explain my confusion. The dataloader interprets the dataset as a list of (input, target) tuples, whereas I designed the dataset to be a list of inputs only, i.e. [input_list1, input_list2, etc.]. For the specific example given above, the behavior therefore seems to be the correct, by-design one, and the mistake was on my side. However, here is an example where the problem is more obvious: when the dataset's inner lists are longer than two elements:
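For concreteness, here is a sketch of such a dataset (the values are illustrative assumptions, not the ones from the original report):

```python
import torch

# Each sample is a list of four tensors, so it clearly cannot be an
# (input, target) pair. The values are made up for illustration.
dataset = [[torch.tensor([i]), torch.tensor([i + 1]),
            torch.tensor([i + 2]), torch.tensor([i + 3])]
           for i in (10, 20, 30, 40)]
```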
Then there should be no ambiguity that these are not (input, target) pairs. Yet a sampler defined to produce custom batches:
(that is, the first and third lists go in the first batch, then the second and last lists go in the second batch) instead yields two weird batches of pairs. Batch 1:
Batch 2:
Using the dataloader option, I instead get the expected batch 1:
and batch 2:
I believe the latter behavior would be a more intuitive default when the dataset contains lists longer than two elements. What do you think?
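The sampling described above can be sketched like this (a reconstruction under assumed names and values, not the original code); the comments show the mingling produced by the default collate_fn:

```python
import torch
from torch.utils.data import DataLoader, Sampler

# Four samples, each a list of four tensors (values are made up).
dataset = [[torch.tensor([i]), torch.tensor([i + 1]),
            torch.tensor([i + 2]), torch.tensor([i + 3])]
           for i in (10, 20, 30, 40)]

class PairBatchSampler(Sampler):
    """Yields predefined batches of outer indices."""
    def __init__(self, batches):
        self.batches = batches
    def __iter__(self):
        return iter(self.batches)
    def __len__(self):
        return len(self.batches)

# First and third samples in batch 1, second and fourth in batch 2.
loader = DataLoader(dataset, batch_sampler=PairBatchSampler([[0, 2], [1, 3]]))
batches = list(loader)
# The default collate_fn zips the two sampled lists element-wise, so each
# batch is a list of four stacked pairs instead of the two original inner
# lists, e.g. batches[0][0] == tensor([[10], [30]])
```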
If a dataset is created from a list of lists of tensors, and a custom batch_sampler is made to sample from the outer list, then the inner tensors get mingled together in the resulting batches.
To Reproduce
Consider a dataset consisting of two lists of tensors. The Sampler to be used as batch_sampler is intended to produce a single batch of two lists from this dataset. As a result, the batch you get is a list of mingled tensors.
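A reconstruction of this report under assumed names and values (the original snippet is not shown in the thread):

```python
import torch
from torch.utils.data import DataLoader, Sampler

# The dataset is a plain list of two lists of tensors (values are made up).
dataset = [[torch.tensor([1, 2]), torch.tensor([3, 4])],
           [torch.tensor([5, 6]), torch.tensor([7, 8])]]

class OneBatchSampler(Sampler):
    """Yields a single batch containing both outer indices."""
    def __init__(self, indices):
        self.indices = indices
    def __iter__(self):
        yield self.indices
    def __len__(self):
        return 1

loader = DataLoader(dataset, batch_sampler=OneBatchSampler([0, 1]))
batch = next(iter(loader))
# Instead of the two original inner lists, the default collate_fn zips the
# samples element-wise and stacks them:
# batch[0] == tensor([[1, 2], [5, 6]]), batch[1] == tensor([[3, 4], [7, 8]])
```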
Expected behavior
Of course, the expected behaviour is obtained if you use full tensors instead of lists: stacking the lists in the dataset's __getitem__ does the trick here.
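A sketch of that fix (the class and variable names are my own, assumed for illustration): stacking each inner list into one tensor inside __getitem__ means the default collate_fn batches whole samples instead of zipping their elements.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class StackedDataset(Dataset):
    """Wraps a list of lists of tensors, stacking each inner list."""
    def __init__(self, lists_of_tensors):
        self.data = lists_of_tensors
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        # One (n, ...) tensor per sample instead of a list of n tensors.
        return torch.stack(self.data[idx])

data = [[torch.tensor([1, 2]), torch.tensor([3, 4])],
        [torch.tensor([5, 6]), torch.tensor([7, 8])]]
loader = DataLoader(StackedDataset(data), batch_size=2)
batch = next(iter(loader))
# batch is a single tensor of shape (2, 2, 2); each sample stays intact
```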
Environment
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: None
OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.14.3
Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.0.1.post2
[pip3] torchvision==0.2.2.post3
[conda] Could not collect
Additional Context
Probably the Sampler is not intended to be used in the way described above (see #28743), but since the code runs and produces unexpected results, it was hard to identify what we did wrong when we stumbled on this.
cc @ssnl @cpuhrsch