Measurement tampering #33
Conversation
We're now just padding all inputs to the maximum sequence length as a workaround. There's lots of room for improvement here: it would be great if TransformerLens let us pass kwargs through to to_tokens (to save a bunch of copied code), and ideally we'd use the maximum sequence length over all training examples instead. Memory usage also still seems to be increasing slightly, but it's now on the order of 200MB for one epoch on the IMDB dataset with Pythia-14m, so at least some local testing is totally feasible now. For serious training we'll use CUDA anyway (though if the remaining leak has a different source, it might affect that too, so we should keep an eye out).
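The "pad everything to a fixed maximum length" workaround can be sketched in isolation like this. This is a hypothetical, pure-Python illustration (the real code goes through the tokenizer / TransformerLens `to_tokens`); `pad_to_max_len` and its arguments are made up for the example:

```python
def pad_to_max_len(token_ids, max_len, pad_id=0):
    """Right-pad each token-id sequence with pad_id up to max_len,
    so every batch has the same shape regardless of its contents."""
    input_ids, attention_mask = [], []
    for ids in token_ids:
        ids = ids[:max_len]  # truncate overly long sequences
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Fixing `max_len` across all batches (rather than padding to the longest sequence per batch) keeps tensor shapes constant, which is what avoids the reallocation-driven memory growth.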
Mainly better padding, plus an option to freeze the model itself. Not sure yet whether either is necessary, but at least I now get 80% accuracy pretty quickly on a CPU; will run bigger experiments soon.
Left a few minor comments, but mostly looks good for now! (Of course there's stuff to finish up later.) In addition to the comments: I don't think demo.ipynb has any changes that should be committed? measurement tampering.ipynb also seems partially outdated, and we probably don't need it in the repository, but I do like the dataset exploration at the beginning; we can reuse that when we make a nicer demo notebook for measurement tampering at some point.
@dataclass
class TamperingDataConfig(DatasetConfig):
    n_sensors: ClassVar[int] = 3  # not configurable
    train: bool = True  # TODO: how does cupbearer use this?
Re how we use this: there was some magic going on to sometimes automatically get the validation split where this made sense (e.g. if you train a backdoored classifier and then evaluate a detector on that classifier, cupbearer would guess that you want to evaluate the detector on the train=False split, and would check whether the training data config had a train field). But #30 removes that along with most of the other behind-the-scenes magic like it; this kind of thing is now handled explicitly by the user.
from . import DatasetConfig

class TamperingDataset(torch.utils.data.Dataset):
To define a measurement tampering task, I see two approaches (with the task format from #30):
- Add options to TamperingDataset to only return a selection of the data (e.g. only easy training data, or only validation data that has tampering), then use Task.from_separate_data() to recombine them. The advantage is that we get some control over the mixing ratio and use the standard interface; the downside is that it's kind of indirect and complicated.
- Pass trusted_data, untrusted_train_data, and test_data directly to Task, without using MixedData at all. This is closer to the format the datasets are already in anyway, so probably simpler. We'll just have to make sure to return the right label format: trusted_data and untrusted_train_data should just return (text, measurements), whereas test_data should return ((text, measurements), is_clean), I think.

Leaning a bit towards 2. overall, but open to either approach or some third one.
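The label formats in option 2 could look roughly like this. A hypothetical sketch only; `TamperingSplit` and the record layout are invented for illustration and aren't part of the actual codebase:

```python
class TamperingSplit:
    """Wraps (text, measurements, is_clean) records and yields the
    label format each split would hand to Task (option 2 above)."""

    def __init__(self, records, split):
        assert split in ("trusted", "untrusted_train", "test")
        self.records = records
        self.split = split

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        text, measurements, is_clean = self.records[i]
        if self.split == "test":
            # only test data carries the clean/tampered label
            return (text, measurements), is_clean
        # trusted and untrusted train data: just (text, measurements)
        return text, measurements
```

The point of the split-dependent return type is that detectors never see `is_clean` at training time; it only exists for evaluation.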
To be clear, no need to do this before merging, just noting down my thoughts
# get embeddings
embeddings = self.get_embeddings(tokens)

# TODO (odk) se store (doesn't this slow down training?)
(doesn't this slow down training?)
Calling self.store() should be ~free outside a self.capture() context manager.
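The store/capture pattern being described can be sketched as follows. This is a hypothetical reconstruction of the idea, not cupbearer's actual implementation; the class and method names are illustrative:

```python
from contextlib import contextmanager

class ActivationRecorder:
    """store() is a cheap no-op unless a capture() context is active."""

    def __init__(self):
        self._capturing = False
        self.cache = {}

    @contextmanager
    def capture(self):
        # activations stored inside this context end up in self.cache
        self._capturing = True
        try:
            yield self.cache
        finally:
            self._capturing = False

    def store(self, name, value):
        if self._capturing:  # ~free when not capturing: one bool check
            self.cache[name] = value
```

Under this design, sprinkling `store()` calls through a forward pass costs only a boolean check per call during normal training, which is why it shouldn't slow training down.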
(But also, like we discussed, we should switch to pytorch hooks anyway, so it won't be each model's responsibility anymore to expose activations.)
    b, self.n_sensors, self.embed_dim
)
# last token embedding (for aggregate measurement)
last_token_ind = tokens["attention_mask"].sum(dim=1) - 1
We should probably add a check somewhere that padding is on the right just to avoid nasty surprises
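One cheap version of that check: with right padding, the attention mask is non-increasing along the sequence dimension, so any 0 → 1 transition means real tokens appear after padding. A sketch (the helper name is hypothetical):

```python
import torch

def assert_right_padded(attention_mask):
    """Fail loudly if any sequence has padding before a real token."""
    diffs = attention_mask[:, 1:] - attention_mask[:, :-1]
    # a +1 step means a 0 -> 1 transition, i.e. left/interior padding
    assert (diffs <= 0).all(), "found padding on the left or in the middle"

mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
assert_right_padded(mask)
# with right padding guaranteed, this indexing is correct:
last_token_ind = mask.sum(dim=1) - 1
```

Without the check, left padding would silently make `sum(dim=1) - 1` point at a padding position instead of the final real token.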
last_embs = embeddings[torch.arange(b), last_token_ind]
probe_embs = torch.concat([sensor_embs, last_embs.unsqueeze(dim=1)], axis=1)
assert probe_embs.shape == (b, self.n_probes, self.embed_dim)
logits = torch.concat(
Probably simpler to make self.probes a single nn.Linear with n_probes output logits? Though if we ever have non-linear probes, then it actually makes a difference (do we share most of their body between measurements or not); not sure which option we'd want in that case.
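The single-layer variant could look like this. A sketch under the assumption that each probe only reads its own embedding: the rows of one `nn.Linear`'s weight matrix act as the separate probes, and an einsum keeps probe i paired with embedding i (sizes are illustrative):

```python
import torch
from torch import nn

embed_dim, n_probes, b = 16, 4, 3  # illustrative sizes

# one Linear whose n_probes weight rows are the individual linear probes
probes = nn.Linear(embed_dim, n_probes)

probe_embs = torch.randn(b, n_probes, embed_dim)
# contract weight row p with embedding p only (no cross terms),
# instead of looping over a ModuleList of per-measurement probes
logits = torch.einsum("bpe,pe->bp", probe_embs, probes.weight) + probes.bias
```

Note that naively applying the Linear to `probe_embs` would give a `(b, n_probes, n_probes)` tensor (every probe applied to every embedding); the einsum takes just the diagonal of that, which is the behavior equivalent to separate linear probes.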
@oliveradk I merged it after making a few small changes (you might want to double-check them). The remaining open comments seem less important; just take a look at them at some point.
(all tests passing)