-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi datasets #123
Multi datasets #123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the issues flagged by Pyright? We can't merge before those are fixed.
elk/extraction/prompt_loading.py
Outdated
|
||
# Remove everything but the label column | ||
extra_cols = list(assert_type(Features, ds.features)) | ||
extra_cols.remove(label_column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we actually do want to remove the label column, because when num_classes > 2, the label should be a 0 or 1 (indicating which element of the contrast pair is correct), so we need to modify the label in _convert_to_prompts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a commit for these changes, but also I think it's pretty unclear how we want to deal with MC datasets
yield sample | ||
|
||
|
||
class FewShotSampler: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you used @DataClass for the above class, but not here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had time yet to review the pull request in more detail and to test it. I can do that tomorrow evening, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is a little bit inconsistent but idk how much it matters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably FewShotSampler might be better if we make it not an iterator bc infinite iterators are weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly concerned about the data shuffling.
elk/extraction/balanced_sampler.py
Outdated
self.rng = np.random.default_rng(seed) | ||
|
||
def __iter__(self): | ||
for sample in self.dataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the dataset we're streaming from isn't shuffled (e.g. all the movie reviews about batman come first) our sampling will be distributionally incorrect, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also one of the tests broke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah working on the test thing, it's a pretty bizarre error.
The shuffling isn't really an issue. See the HF docs on this https://huggingface.co/docs/datasets/stream#shuffle. We just need to make sure we're actually calling .shuffle
, which I thought we were, but it's entirely possible I'm just misremembering.
elif label == 1: | ||
pos_buf.append(sample) | ||
else: | ||
raise ValueError(f"Expected label to be 0 or 1, got {label}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we don't support few-shot examples for dbpedia or ag_news (or multiclass datasets)? I guess this is fine until someone wants to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't want to think about how to support that rn
These requests are two weeks old so I assume they're resolved since you did your major refactor.
elk/extraction/balanced_sampler.py
Outdated
from typing import Iterator, Optional | ||
|
||
|
||
@dataclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small detail, but maybe we should just remove @DataClass, if we use init anyway... (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go ahead and change it
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am getting the following error, also before doing the merge, when running elicit gpt2 imdb --max_examples 1000 --num_gpus 2
...
Haven't looked at the error in detail, nor tried it to debug, yet
"""
Traceback (most recent call last):
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1608, in _prepare_split_single
for key, record in generator:
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/generator.py", line 57, in _generate_examples
for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 183, in _extraction_worker
yield from extract_hiddens(**{k: v[0] for k, v in kwargs.items()})
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
response = gen.send(None)
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 119, in extract_hiddens
for example in islice(BalancedSampler(prompt_ds), max_examples):
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/balanced_sampler.py", line 34, in __iter__
for sample in self.data:
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/prompt_loading.py", line 173, in load_prompts
example = next(ds_iterator)
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 926, in __iter__
ex_iterable = self._prepare_ex_iterable_for_iteration()
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 902, in _prepare_ex_iterable_for_iteration
if ex_iterable.n_shards % world_size == 0:
ZeroDivisionError: integer division or modulo by zero
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1349, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/ssd-1/notodai/walter/elk/.venv/bin/elk", line 8, in <module>
sys.exit(run())
File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 29, in run
run.execute()
File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 21, in execute
return self.command.execute()
File "/mnt/ssd-1/notodai/walter/elk/elk/training/train.py", line 60, in execute
train_run = Train(cfg=self, out_dir=self.out_dir)
File "<string>", line 5, in __init__
File "/mnt/ssd-1/notodai/walter/elk/elk/run.py", line 41, in __post_init__
self.dataset = extract(self.cfg.data, num_gpus=self.cfg.num_gpus)
File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 256, in extract
builder.download_and_prepare(num_proc=len(devices))
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1516, in _prepare_split
for job_id, done, content in iflatmap_unordered(
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
[async_result.get() for async_result in async_results]
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
[async_result.get() for async_result in async_results]
File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
raise self._value
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Someone else got the same error?
i'm getting the same error. somehow the world_size in datasets/iterable_dataset.py is 0 edit: fixed here d2c66b0#r1150751121 |
split = split_dataset_by_node(split, world_size, rank) | ||
split = split_dataset_by_node( | ||
dataset=split, rank=rank, world_size=world_size | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function signature is
def split_dataset_by_node(dataset: DatasetType, rank: int, world_size: int) -> DatasetType:
so we were passing world_size to rank and vice versa
Remove print label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, eval, elicit and extract work now.
However, I am wondering why imdb performs a bit worse (AUROC) than on main with the default seed... (Especially the last layer is just a bit over random
for more information, see https://pre-commit.ci
No idea, but we shouldn't merge until we figure out why. What is the exact command you're running to get that result? |
Of course, it's just gpt2, but yeah let's figure that out later... |
Sorry, I misread Noras previous message, somehow. Will revert the pull-request for now, and we might want to look better into a possible performance regression first. |
Adds support for using data from multiple datasets, and also combining hidden states from multiple layers