Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Task objects to defer dataset download #1558

Open
haileyschoelkopf opened this issue Mar 11, 2024 · 2 comments
Open

Allow Task objects to defer dataset download #1558

haileyschoelkopf opened this issue Mar 11, 2024 · 2 comments
Labels
feature request A feature that isn't implemented yet. good first issue Good for newcomers help wanted Contributors and extra help welcome.

Comments

@haileyschoelkopf
Copy link
Contributor

haileyschoelkopf commented Mar 11, 2024

currently Task object initialization causes the tasks to download their datasets upon initialization.

This is not always desirable, so we should allow users to defer download and perform it manually.

Should add a defer_download: bool = False flag to Task and ConfigurableTask init() methods which, when set to true, has the dataset not downloaded, and allow users to easily call a task.download() method with no args that performs the download.

If users attempt to run the task without first downloading the dataset, we should raise an error.

@haileyschoelkopf haileyschoelkopf added help wanted Contributors and extra help welcome. feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Mar 11, 2024
@zafstojano
Copy link
Contributor

Hi @haileyschoelkopf

I have started looking into this. I am confused as to why even though the Task class has a download method:

def download(
self,
data_dir: Optional[str] = None,
cache_dir: Optional[str] = None,
download_mode=None,

the ConfigurableTask class is the only class which inherits from it and overrides the download method. Wouldn't it be simpler to just define the download method in the Task class as follows:

def download(
        self,
        data_dir: Optional[str] = None,
        cache_dir: Optional[str] = None,
        download_mode=None,
        **kwargs,         # <--- allow for additional kwargs
    ) -> None:
        """Downloads and returns the task dataset.
        Override this method to download the dataset from a custom API.

        :param data_dir: str
            Stores the path to a local folder containing the `Task`'s data files.
            Use this to specify the path to manually downloaded data (usually when
            the dataset is not publicly accessible).
        :param cache_dir: str
            The directory to read/write the `Task` dataset. This follows the
            HuggingFace `datasets` API with the default cache directory located at:
                `~/.cache/huggingface/datasets`
            NOTE: You can change the cache location globally for a given process
            by setting the shell environment variable, `HF_DATASETS_CACHE`,
            to another directory:
                `export HF_DATASETS_CACHE="/path/to/another/directory"`
        :param download_mode: datasets.DownloadMode
            How to treat pre-existing `Task` downloads and data.
            - `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
                Reuse download and reuse dataset.
            - `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
                Reuse download with fresh dataset.
            - `datasets.DownloadMode.FORCE_REDOWNLOAD`
                Fresh download and fresh dataset.
        """
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            **kwargs,           # <--- pass the additional kwargs
        )

And remove the download method in the ConfigurableTask.

What do you think?

@haileyschoelkopf
Copy link
Contributor Author

Yes, I think these should be unified.

One thing to check is that despite passing cache_dir directly in the Task.download() method, the HF environment variables should take precedence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request A feature that isn't implemented yet. good first issue Good for newcomers help wanted Contributors and extra help welcome.
Projects
Status: Backlog
Development

No branches or pull requests

2 participants