Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Dataset, IterableDataset inheritance #120139

Open
kuraga opened this issue Feb 17, 2024 · 0 comments
Open

On Dataset, IterableDataset inheritance #120139

kuraga opened this issue Feb 17, 2024 · 0 comments
Labels
module: data torch.utils.data triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@kuraga
Copy link
Contributor

kuraga commented Feb 17, 2024

Analysis

In PyTorch

We have (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html):

T_co = TypeVar('T_co', covariant=True)

class Dataset(Generic[T_co]):
    r"""<...> All subclasses should overwrite :meth:`__getitem__`
    <...> Subclasses could also optionally implement :meth:`__getitems__`
    """
    def __getitem__(self, index) -> T_co:
        <...>

class IterableDataset(Dataset[T_co], Iterable[T_co]):
    r"""<...> All subclasses should overwrite :meth:`__iter__`
    """
    # No `def __len__(self)` default? Subclasses raise `TypeError` when needed.
    # See NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]

So:
1.

object <- Dataset <---- IterableDataset
       \               /
        \- Iterable <-/ 
  1. Datasets should implement __getitem__.
  2. IterableDatasets should implement __iter__ and __len__.

In Python

Now, let's look at the Python's Collections Abstract Base Classes documentation.

What does Python offer?

         -- Sized ---------------------
        /                               \
object <--- Iterable <-------------------- Collection <--- Sequence
        \             \                  /               /
         \             - Reversible <-------------------/
          \                            /
           -- Container ---------------
  1. Iterable requires __iter__.
  2. __len__ introduced in Sized. Iterables doesn't have __len__.
  3. Container introduces __contains__.
  4. Reversible introduces __reversed__.
  5. Collection just join the superclasses.
  6. Sequence has real implementations of __iter__, __contains__ , __reversed__ but requires __len__, __getitem__.

Applying back to PyTorch

What do I see?

  1. p.4 => IterableDataset can't be a subclass of Dataset. (It could be vice versa.)
  2. p.6 => p.3 is incorrect, i.e. IterableDatasets shouldn't implement __len__ but Dataset could.
  3. Inheritance could be changed drastically, without big code changes.

Proposition

Variant 1

Ok, we could have:

         -- Sized ---------------------
        /                              \
       /             - IterableDataset  \
      /             /                    \
object <--- Iterable <-------------------- Collection <- Dataset
      \                                  /
        -- Container -------------------
Class Methods to implement
IterableDataset __iter__
Dataset __len__, __iter__, __contains__ , __getitem__, __reversed__

Variant 2

Or even subclass of Sequence for Dataset:

         -- Sized ---------------------
        /                              \
       /             - IterableDataset  \
      /             /                    \
object <--- Iterable <-------------------- Collection <- Sequence <- Dataset
      \             \                    /              /
       \             - Reversible <--------------------/
        \                               /
         -- Container -----------------
Class Methods to implement
IterableDataset __iter__
Dataset __len__, __getitem__

See Also

Sequence could be Mapping

I thought a Dataset as a Sequence (list), i.e. int indices.
But strictly saying indices are untyped. So, it could be a Mapping (dict).

Type Annotations

I'd appreciate to annotate all the code (see above).

NOTE [ Lack of Default __len__ in Python Abstract Base Classes ]

Thinking on this could solve the nasty https://github.com/pytorch/pytorch/blob/v2.2.1/torch/utils/data/sampler.py#L70-L95.
On that, see also #122410, #47055.

Related Discussions, Issues and Commits

Thoughts

Thoughts? :)

cc @VitalyFedyunin @ejguan @dzhulgakov @ssnl

@soulitzer soulitzer added triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module module: data torch.utils.data labels Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: data torch.utils.data triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

2 participants