
Multiple datasets refactor #189

Merged
merged 17 commits into from
Apr 14, 2023
Conversation

norabelrose
Member

@norabelrose norabelrose commented Apr 13, 2023

This PR refactors various parts of the code in order to handle the training of reporters on multiple datasets simultaneously. We kinda-sorta "supported" this before but not correctly.

Before, we tried to merge the hidden states from all the datasets into a single HF dataset. This caused a lot of problems; for one thing, it meant that the hiddens weren't cached properly. For example, if you extracted hiddens for imdb and then wanted to fit a reporter on both imdb and amazon_polarity, the cached imdb hiddens wouldn't be reused. Now we do use the cache in cases like this. To accomplish this, I needed to change how reporters are trained and evaluated: the training code now passes around dictionaries where the keys are dataset names and the values are tuples of hiddens and labels. This part could probably be cleaned up a bit further, but I think it's a decent MVP.
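The dictionary layout described above can be sketched as follows (a minimal illustration; the dataset sizes, shapes, and variable names are made up, not the actual elk types):

```python
import torch

# Hypothetical layout: dataset name -> (hiddens, labels).
# Illustrative shape: (num_examples, num_variants, num_choices, hidden_dim).
train_dict = {
    "imdb": (torch.randn(100, 2, 2, 512), torch.randint(0, 2, (100,))),
    "amazon_polarity": (torch.randn(100, 2, 2, 512), torch.randint(0, 2, (100,))),
}

# Training and eval code can iterate per dataset instead of merging
# everything into one HF dataset, so each dataset's cache stays valid.
for name, (hiddens, labels) in train_dict.items():
    assert hiddens.shape[0] == labels.shape[0]
```

This also makes it natural for eval.csv to report per-dataset metrics, since each dataset keeps its own identity through the pipeline.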

Now, eval.csv reports the reporter's metrics separately for each dataset instead of reporting pooled metrics. I take it this is much more useful than the pooled numbers, but maybe we should include those as well, idk.

There are also a few little fixes/enhancements in here, most notably that I switched to using torch.linalg.eigh by default instead of truncated_eigh for the time being. This makes me sad, since I put a lot of work into truncated_eigh, but it fails to converge far too often and prints annoying warnings when that happens. Hopefully we'll be able to revive it eventually.
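For reference, pulling the top eigenpair out of a full torch.linalg.eigh decomposition looks roughly like this (a standalone sketch, not the actual VINC code; the trade-off is that eigh computes the full O(n³) decomposition but converges reliably, whereas a truncated method only computes the leading eigenvectors):

```python
import torch

# eigh assumes a symmetric (Hermitian) matrix, so symmetrize first.
A = torch.randn(8, 8)
A = A + A.T

# Eigenvalues come back in ascending order, so the top eigenpair is last.
eigvals, eigvecs = torch.linalg.eigh(A)
top_val, top_vec = eigvals[-1], eigvecs[:, -1]
```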

This PR also fixes #96, in a different way from PR #170. Basically I needed normalization to be a property of Reporters in order to get multiple datasets to work in a reasonable way. Essentially I don't do any explicit normalization for VINC, since it doesn't actually need it (we already de-mean representations from different classes separately), whereas for CCS I use a novel Normalizer class which is probably slightly over-engineered but whatever.
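A rough idea of a fit-then-apply normalizer in this spirit (hypothetical sketch under assumed shapes; the Normalizer class in the PR is likely more elaborate):

```python
import torch

class Normalizer:
    """Fit per-feature statistics once, then apply them to any tensor."""

    def fit(self, x: torch.Tensor) -> "Normalizer":
        self.mean = x.mean(dim=0)
        self.std = x.std(dim=0).clamp_min(1e-6)  # guard against zero variance
        return self

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std
```

Making this a property of the reporter means the same statistics fitted at train time travel with the checkpoint and get reapplied at eval time, which is what multi-dataset transfer eval needs.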

Collaborator

@AlexTMallen AlexTMallen left a comment

Looks good!

Things blocking approval:

  • Logging needs to be updated to handle multiple datasets
  • Training is very slow (is this just because we're not truncating the eigendecomposition?)
  • --skip_baseline flag isn't used in training
  • (Lower priority) normalization of reporter isn't used for pseudo_auroc

elk/extraction/prompt_loading.py
print(f"Using '{train_name}' for training and '{val_name}' for validation")

print(
# Cyan color for dataset name
Collaborator

this is a nice touch :)

elk/training/train.py

if isinstance(self.cfg.net, CcsReporterConfig):
reporter = CcsReporter(x0.shape[-1], self.cfg.net, device=device)
assert len(train_dict) == 1, "CCS only supports single-task training"
Collaborator

While this isn't high priority, we could be a little more lenient than this and allow training on mixtures of datasets when the shapes of the tensors are the same (same num_variants, num_choices).
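That leniency could amount to a simple shape check before pooling, e.g. (hypothetical helper; `can_pool` is not part of the codebase):

```python
import torch

def can_pool(train_dict: dict[str, tuple[torch.Tensor, torch.Tensor]]) -> bool:
    """True iff all datasets share (num_variants, num_choices, hidden_dim)."""
    shapes = {hiddens.shape[1:] for hiddens, _ in train_dict.values()}
    return len(shapes) == 1
```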

elk/training/train.py
Collaborator

@AlexTMallen AlexTMallen left a comment

LGTM! (I also made a small commit just before reviewing, someone should quickly check it)

Collaborator

@lauritowal lauritowal left a comment

Getting the following error when running eval right now:

(.venv) laurito@ipe-bison:~/elk$ elk eval /home/laurito/elk-reporters/gpt2/imdb\,\ super_glue\ boolq/gracious-mendel/ gpt2 imdb ag_news --num_gpus 1 --max_examples 100
Using 1 of 2 GPUs: [1]
imdb: using 'train' for training and 'test' for validation
Found cached dataset generator (/home/laurito/.cache/huggingface/datasets/generator/default-4e6ce9115ec3adc4/0.0.0)
Found cached dataset generator (/home/laurito/.cache/huggingface/datasets/generator/default-58da717a77241cc1/0.0.0)
Using 1 of 2 GPUs: [1]
ag_news: using 'train' for training and 'test' for validation
Found cached dataset generator (/home/laurito/.cache/huggingface/datasets/generator/default-0d15ca1ac6cd44d9/0.0.0)
Found cached dataset generator (/home/laurito/.cache/huggingface/datasets/generator/default-31f2909bb9ce8ebf/0.0.0)
Output directory at /home/laurito/elk-reporters/gpt2/imdb, super_glue boolq/gracious-mendel/transfer_eval/imdb
Using 1 of 2 GPUs: [1]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:01<00:00,  8.46it/s]
Traceback (most recent call last):
  File "/home/laurito/elk/.venv/bin/elk", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/laurito/elk/elk/__main__.py", line 26, in run
    run.execute()
  File "/home/laurito/elk/elk/__main__.py", line 18, in execute
    return self.command.execute()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/laurito/elk/elk/evaluation/evaluate.py", line 53, in execute
    run.evaluate()
  File "/home/laurito/elk/elk/evaluation/evaluate.py", line 111, in evaluate
    self.apply_to_layers(func=func, num_devices=num_devices)
  File "/home/laurito/elk/elk/run.py", line 151, in apply_to_layers
    df = pd.concat(df_buf).sort_values(by="layer")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/laurito/elk/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 6766, in sort_values
    k = self._get_label_or_level_values(by, axis=axis)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/laurito/elk/.venv/lib/python3.11/site-packages/pandas/core/generic.py", line 1778, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'layer'

Didn't look into it in more detail, yet.
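The KeyError indicates that none of the concatenated frames carries a "layer" column, since pd.concat only raises on sort_values when the key is absent from the combined result. A defensive version of the failing line in elk/run.py might look like this (a sketch, not necessarily how the fix landed):

```python
import pandas as pd

# Illustrative stand-in for the per-layer result frames in elk/run.py.
df_buf = [
    pd.DataFrame({"layer": [1, 0], "auroc": [0.7, 0.6]}),
]

df = pd.concat(df_buf)
# sort_values raises KeyError: 'layer' when no frame has the column,
# so only sort when it actually exists.
if "layer" in df.columns:
    df = df.sort_values(by="layer")
```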

@norabelrose
Member Author

> Getting the following error when running eval right now: … KeyError: 'layer'

Fixed

@norabelrose norabelrose merged commit 16dc1ca into main Apr 14, 2023
@norabelrose norabelrose deleted the multi-ds-eval branch April 14, 2023 23:43

Successfully merging this pull request may close these issues.

Make normalization a property of the Reporter
3 participants