-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Task objects to defer dataset download #1558
Comments
I have started looking into this. I am confused as to why even though the lm-evaluation-harness/lm_eval/api/task.py Lines 236 to 240 in 86319a9
the def download(
self,
data_dir: Optional[str] = None,
cache_dir: Optional[str] = None,
download_mode=None,
**kwargs, # <--- allow for additional kwargs
) -> None:
"""Downloads and returns the task dataset.
Override this method to download the dataset from a custom API.
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
cache_dir=cache_dir,
download_mode=download_mode,
**kwargs, # <--- pass the additional kwargs
) And remove the What do you think? |
Yes, I think these should be unified. One thing to check is that despite passing |
currently
Task
object initialization causes the tasks to download their datasets upon initialization.This is not always desirable, so we should allow users to defer download and perform it manually.
Should add a
defer_download: bool = False
flag toTask
andConfigurableTask
init() methods which, when set to true, has the dataset not downloaded, and allow users to easily call atask.download()
method with no args that performs the download.If users attempt to run the task without first downloading the dataset, we should raise an error.
The text was updated successfully, but these errors were encountered: