
Multi datasets #123

Merged: 46 commits merged into main from multi-datasets on Mar 28, 2023
Conversation

Benw8888 (Collaborator)

Adds support for using data from multiple datasets, and for combining hidden states from multiple layers.

@norabelrose norabelrose left a comment (Member)

Could you fix the issues flagged by Pyright? We can't merge before those are fixed.

Review threads (resolved): elk/extraction/prompt_dataset.py, elk/training/train.py, elk/extraction/extraction.py, elk/extraction/prompt_loading.py

# Remove everything but the label column
extra_cols = list(assert_type(Features, ds.features))
extra_cols.remove(label_column)
Collaborator

I think we actually do want to remove the label column, because when num_classes > 2, the label should be a 0 or 1 (indicating which element of the contrast pair is correct), so we need to modify the label in _convert_to_prompts.

Collaborator

I made a commit for these changes, but I also think it's pretty unclear how we want to deal with MC datasets.
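
For context, here is a minimal sketch of the kind of binarization discussed above, assuming a hypothetical helper; the function name and pair layout are illustrative and not the PR's actual _convert_to_prompts logic:

import random

# Hypothetical helper: given a multi-class label, pick a distractor class and
# decide which position in the contrast pair holds the true answer.
def binarize_label(true_label: int, num_classes: int, rng: random.Random) -> tuple[int, int]:
    distractor = rng.choice([c for c in range(num_classes) if c != true_label])
    # The new binary label records whether the true answer is placed first (0) or second (1).
    new_label = rng.randint(0, 1)
    return distractor, new_label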

yield sample


class FewShotSampler:
Collaborator

Any reason why you used @dataclass for the above class, but not here?

Collaborator

I haven't had time yet to review the pull request in more detail and to test it. I can do that tomorrow evening, though.

Member

Yeah, this is a little bit inconsistent, but I'm not sure how much it matters.

Member

Arguably, FewShotSampler might be better if we made it not an iterator, because infinite iterators are weird to work with.
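
A minimal sketch of the non-iterator interface being suggested; the class and method names here are assumptions for illustration, not the class in this PR:

import random
from dataclasses import dataclass, field

@dataclass
class FewShotSamplerSketch:
    """Draws few-shot demonstrations on demand instead of exposing an infinite iterator."""

    examples: list[dict]  # pool of candidate demonstrations
    num_shots: int        # demonstrations per draw
    rng: random.Random = field(default_factory=random.Random)

    def sample(self) -> list[dict]:
        # Each call returns an independent batch of demonstrations.
        return self.rng.sample(self.examples, self.num_shots)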

@AlexTMallen AlexTMallen left a comment (Collaborator)

I'm mostly concerned about the data shuffling.

        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        for sample in self.dataset:
Collaborator

If the dataset we're streaming from isn't shuffled (e.g. all the movie reviews about Batman come first), our sampling will be distributionally incorrect, no?

Collaborator

Also one of the tests broke.

Member

Yeah, I'm working on the test failure; it's a pretty bizarre error.

The shuffling isn't really an issue. See the HF docs on this: https://huggingface.co/docs/datasets/stream#shuffle. We just need to make sure we're actually calling .shuffle, which I thought we were, but it's entirely possible I'm just misremembering.
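
For reference, a minimal example of buffered shuffling on a streamed dataset, following the linked HF docs; the dataset name, seed, and buffer size are illustrative only:

from datasets import load_dataset

# Stream the dataset instead of downloading it in full.
ds = load_dataset("imdb", split="train", streaming=True)

# Buffered shuffle: fills a buffer of buffer_size examples and yields from it in
# random order, breaking up long runs of similar examples in the raw stream.
shuffled = ds.shuffle(seed=42, buffer_size=1000)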

            elif label == 1:
                pos_buf.append(sample)
            else:
                raise ValueError(f"Expected label to be 0 or 1, got {label}")
Collaborator

So we don't support few-shot examples for dbpedia or ag_news (or multiclass datasets in general)? I guess this is fine until someone actually needs it.

Member

Yeah, I don't want to think about how to support that right now.
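
For what it's worth, a hedged sketch of how per-class buffering could generalize the binary pos_buf/neg_buf pattern above; this is one possible approach for illustration, not code from this PR:

from collections import defaultdict, deque
from typing import Iterable, Iterator

def balanced_few_shot(stream: Iterable[dict], num_classes: int, buf_size: int = 32) -> Iterator[list[dict]]:
    """Yield one example per class whenever every class's buffer is non-empty."""
    buffers: dict[int, deque] = defaultdict(lambda: deque(maxlen=buf_size))
    for sample in stream:
        buffers[sample["label"]].append(sample)
        # Emit a balanced group once at least one example of each class is buffered.
        if len(buffers) == num_classes and all(buffers[c] for c in range(num_classes)):
            yield [buffers[c].popleft() for c in range(num_classes)]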

@AlexTMallen AlexTMallen dismissed norabelrose’s stale review March 25, 2023 16:57

These requests are two weeks old, so I assume they're resolved since you did your major refactor.

from typing import Iterator, Optional


@dataclass
Collaborator

Small detail, but maybe we should just remove @dataclass if we define __init__ ourselves anyway?

Member

Go ahead and change it

@lauritowal lauritowal left a comment (Collaborator)

I am getting the following error (also before doing the merge) when running elicit gpt2 imdb --max_examples 1000 --num_gpus 2...
I haven't looked at the error in detail or tried to debug it yet.

"""
Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1608, in _prepare_split_single
    for key, record in generator:
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/generator.py", line 57, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 183, in _extraction_worker
    yield from extract_hiddens(**{k: v[0] for k, v in kwargs.items()})
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 119, in extract_hiddens
    for example in islice(BalancedSampler(prompt_ds), max_examples):
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/balanced_sampler.py", line 34, in __iter__
    for sample in self.data:
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/prompt_loading.py", line 173, in load_prompts
    example = next(ds_iterator)
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 926, in __iter__
    ex_iterable = self._prepare_ex_iterable_for_iteration()
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 902, in _prepare_ex_iterable_for_iteration
    if ex_iterable.n_shards % world_size == 0:
ZeroDivisionError: integer division or modulo by zero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1349, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ssd-1/notodai/walter/elk/.venv/bin/elk", line 8, in <module>
    sys.exit(run())
  File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 29, in run
    run.execute()
  File "/mnt/ssd-1/notodai/walter/elk/elk/__main__.py", line 21, in execute
    return self.command.execute()
  File "/mnt/ssd-1/notodai/walter/elk/elk/training/train.py", line 60, in execute
    train_run = Train(cfg=self, out_dir=self.out_dir)
  File "<string>", line 5, in __init__
  File "/mnt/ssd-1/notodai/walter/elk/elk/run.py", line 41, in __post_init__
    self.dataset = extract(self.cfg.data, num_gpus=self.cfg.num_gpus)
  File "/mnt/ssd-1/notodai/walter/elk/elk/extraction/extraction.py", line 256, in extract
    builder.download_and_prepare(num_proc=len(devices))
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1516, in _prepare_split
    for job_id, done, content in iflatmap_unordered(
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/mnt/ssd-1/notodai/walter/elk/.venv/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Has anyone else gotten the same error?

@thejaminator thejaminator (Collaborator) commented Mar 28, 2023

> I am getting the following error (also before doing the merge) when running elicit gpt2 imdb --max_examples 1000 --num_gpus 2...

I'm getting the same error. Somehow the world_size in datasets/iterable_dataset.py is 0.

Edit: fixed here: d2c66b0#r1150751121

- split = split_dataset_by_node(split, world_size, rank)
+ split = split_dataset_by_node(
+     dataset=split, rank=rank, world_size=world_size
+ )
Collaborator

The function signature is

def split_dataset_by_node(dataset: DatasetType, rank: int, world_size: int) -> DatasetType:

so we were passing world_size as rank and vice versa.

Remove print label
@lauritowal lauritowal left a comment (Collaborator)

Okay, eval, elicit and extract work now.

However, I am wondering why imdb performs a bit worse (AUROC) than on main with the default seed... (Especially the last layer, which is just a bit above random.)

@norabelrose (Member)

> Okay, eval, elicit and extract work now.
>
> However, I am wondering why imdb performs a bit worse (AUROC) than on main with the default seed... (Especially the last layer, which is just a bit above random.)

No idea, but we shouldn't merge until we figure out why. What is the exact command you're running to get that result?

@lauritowal lauritowal (Collaborator) commented Mar 28, 2023

> No idea, but we shouldn't merge until we figure out why.

Alright, let's merge!

> What is the exact command you're running to get that result?

Simply, something like:

elk elicit gpt2 imdb --num_gpus 2
elk eval <run_name> gpt2 imdb --num_gpus 2

Of course, it's just gpt2, but yeah let's figure that out later...

@lauritowal lauritowal merged commit 77e2aeb into main Mar 28, 2023
@lauritowal lauritowal deleted the multi-datasets branch March 28, 2023 23:02
@lauritowal (Collaborator)

Sorry, I somehow misread Nora's previous message. I'll revert the pull request for now; we might want to look more closely into a possible performance regression first.

@lauritowal lauritowal restored the multi-datasets branch March 29, 2023 09:37