
Multi datasets #123

Merged: 46 commits merged into main from multi-datasets on Mar 28, 2023
Conversation

Benw8888 (Collaborator)

Adds support for using data from multiple datasets, and for combining hidden states from multiple layers.

@norabelrose norabelrose left a comment (Member)

Could you fix the issues flagged by Pyright? We can't merge before those are fixed.

Review threads (resolved): elk/extraction/prompt_dataset.py, elk/training/train.py, elk/extraction/extraction.py, elk/extraction/prompt_loading.py

# Remove everything but the label column
extra_cols = list(assert_type(Features, ds.features))
extra_cols.remove(label_column)
Collaborator

I think we actually do want to remove the label column, because when num_classes > 2, the label should be a 0 or 1 (indicating which element of the contrast pair is correct), so we need to modify the label in _convert_to_prompts.

Collaborator

I made a commit for these changes, but I also think it's pretty unclear how we want to deal with MC datasets.
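
For context, here is a minimal sketch of the kind of binarization discussed above, assuming a hypothetical helper; the function name and pair layout are illustrative and not the PR's actual _convert_to_prompts logic:

import random

# Hypothetical helper: given a multi-class label, pick a distractor class and
# decide which position in the contrast pair holds the true answer.
def binarize_label(true_label: int, num_classes: int, rng: random.Random) -> tuple[int, int]:
    distractor = rng.choice([c for c in range(num_classes) if c != true_label])
    # The new binary label records whether the true answer is placed first (0) or second (1).
    new_label = rng.randint(0, 1)
    return distractor, new_label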

yield sample


class FewShotSampler:
Collaborator

Any reason why you used @dataclass for the above class, but not here?

Collaborator

I haven't had time yet to review the pull request in more detail and to test it. I can do that tomorrow evening, though.

Member

Yeah, this is a little bit inconsistent, but I'm not sure how much it matters.

Member

Arguably, FewShotSampler might be better if we made it not an iterator, because infinite iterators are weird to work with.
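
A minimal sketch of the non-iterator interface being suggested; the class and method names here are assumptions for illustration, not the class in this PR:

import random
from dataclasses import dataclass, field

@dataclass
class FewShotSamplerSketch:
    """Draws few-shot demonstrations on demand instead of exposing an infinite iterator."""

    examples: list[dict]  # pool of candidate demonstrations
    num_shots: int        # demonstrations per draw
    rng: random.Random = field(default_factory=random.Random)

    def sample(self) -> list[dict]:
        # Each call returns an independent batch of demonstrations.
        return self.rng.sample(self.examples, self.num_shots)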

@AlexTMallen AlexTMallen left a comment (Collaborator)

I'm mostly concerned about the data shuffling.

        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        for sample in self.dataset:
Collaborator

If the dataset we're streaming from isn't shuffled (e.g. all the movie reviews about Batman come first), our sampling will be distributionally incorrect, no?

Collaborator

Also one of the tests broke.

Member

Yeah, I'm working on the test failure; it's a pretty bizarre error.

The shuffling isn't really an issue. See the HF docs on this: https://huggingface.co/docs/datasets/stream#shuffle. We just need to make sure we're actually calling .shuffle, which I thought we were, but it's entirely possible I'm just misremembering.
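
For reference, a minimal example of buffered shuffling on a streamed dataset, following the linked HF docs; the dataset name, seed, and buffer size are illustrative only:

from datasets import load_dataset

# Stream the dataset instead of downloading it in full.
ds = load_dataset("imdb", split="train", streaming=True)

# Buffered shuffle: fills a buffer of buffer_size examples and yields from it in
# random order, breaking up long runs of similar examples in the raw stream.
shuffled = ds.shuffle(seed=42, buffer_size=1000)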

            elif label == 1:
                pos_buf.append(sample)
            else:
                raise ValueError(f"Expected label to be 0 or 1, got {label}")
Collaborator

So we don't support few-shot examples for dbpedia or ag_news (or multiclass datasets in general)? I guess this is fine until someone actually needs it.

Member

Yeah, I don't want to think about how to support that right now.
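
For what it's worth, a hedged sketch of how per-class buffering could generalize the binary pos_buf/neg_buf pattern above; this is one possible approach for illustration, not code from this PR:

from collections import defaultdict, deque
from typing import Iterable, Iterator

def balanced_few_shot(stream: Iterable[dict], num_classes: int, buf_size: int = 32) -> Iterator[list[dict]]:
    """Yield one example per class whenever every class's buffer is non-empty."""
    buffers: dict[int, deque] = defaultdict(lambda: deque(maxlen=buf_size))
    for sample in stream:
        buffers[sample["label"]].append(sample)
        # Emit a balanced group once at least one example of each class is buffered.
        if len(buffers) == num_classes and all(buffers[c] for c in range(num_classes)):
            yield [buffers[c].popleft() for c in range(num_classes)]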

@AlexTMallen AlexTMallen dismissed norabelrose’s stale review March 25, 2023 16:57

These requests are two weeks old, so I assume they're resolved since you did your major refactor.

from typing import Iterator, Optional


@dataclass
Collaborator

Small detail, but maybe we should just remove @dataclass if we define __init__ ourselves anyway?

Member

Go ahead and change it

@lauritowal lauritowal left a comment (Collaborator)

I am getting the following error (also before doing the merge) when running elicit gpt2 imdb --max_examples 1000 --num_gpus 2...
I haven't looked at the error in detail or tried to debug it yet.

"""
Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1608, in _prepare_split_single
    for key, record in generator:
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/generator.py", line 57, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 183, in _extraction_worker
    yield from extract_hiddens(**{k: v[0] for k, v in kwargs.items()})
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 119, in extract_hiddens
    for example in islice(BalancedSampler(prompt_ds), max_examples):
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/balanced_sampler.py", line 34, in __iter__
    for sample in self.data:
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/prompt_loading.py", line 173, in load_prompts
    example = next(ds_iterator)
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 926, in __iter__
    ex_iterable = self._prepare_ex_iterable_for_iteration()
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 902, in _prepare_ex_iterable_for_iteration
    if ex_iterable.n_shards % world_size == 0:
ZeroDivisionError: integer division or modulo by zero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1349, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/bin/elk", line 8, in <module>
    sys.exit(run())
  File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 29, in run
    run.execute()
  File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 21, in execute
    return self.command.execute()
  File "/mnt/ssd-1/notodai/walter/elk/elk/training/train.py", line 60, in execute
    train_run = Train(cfg=self, out_dir=self.out_dir)
  File "<string>", line 5, in __init__
  File "/mnt/ssd-1/notodai/walter/elk/elk/run.py", line 41, in __post_init__
    self.dataset = extract(self.cfg.data, num_gpus=self.cfg.num_gpus)
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 256, in extract
    builder.download_and_prepare(num_proc=len(devices))
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1516, in _prepare_split
    for job_id, done, content in iflatmap_unordered(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Has anyone else gotten the same error?

@thejaminator thejaminator (Collaborator) commented Mar 28, 2023

> I am getting the following error (also before doing the merge) when running elicit gpt2 imdb --max_examples 1000 --num_gpus 2...

I'm getting the same error. Somehow the world_size in datasets/iterable_dataset.py is 0.

Edit: fixed here: d2c66b0#r1150751121

- split = split_dataset_by_node(split, world_size, rank)
+ split = split_dataset_by_node(
+     dataset=split, rank=rank, world_size=world_size
+ )
Collaborator

The function signature is

def split_dataset_by_node(dataset: DatasetType, rank: int, world_size: int) -> DatasetType:

so we were passing world_size as rank and vice versa.

Remove print label
@lauritowal lauritowal left a comment (Collaborator)

Okay, eval, elicit and extract work now.

However, I am wondering why imdb performs a bit worse (AUROC) than on main with the default seed... (Especially the last layer, which is just a bit above random.)

@norabelrose (Member)

> Okay, eval, elicit and extract work now.
>
> However, I am wondering why imdb performs a bit worse (AUROC) than on main with the default seed... (Especially the last layer, which is just a bit above random.)

No idea, but we shouldn't merge until we figure out why. What is the exact command you're running to get that result?

@lauritowal lauritowal (Collaborator) commented Mar 28, 2023

> No idea, but we shouldn't merge until we figure out why.

Alright, let's merge!

> What is the exact command you're running to get that result?

Simply, something like:

elk elicit gpt2 imdb --num_gpus 2
elk eval <run_name> gpt2 imdb --num_gpus 2

Of course, it's just gpt2, but yeah let's figure that out later...

@lauritowal lauritowal merged commit 77e2aeb into main Mar 28, 2023
@lauritowal lauritowal deleted the multi-datasets branch March 28, 2023 23:02
@lauritowal (Collaborator)

Sorry, I somehow misread Nora's previous message. I'll revert the pull request for now; we might want to look more closely into a possible performance regression first.

@lauritowal lauritowal restored the multi-datasets branch March 29, 2023 09:37