
Fit logistic regression models on the GPU #103

Merged · norabelrose merged 8 commits into main on Mar 7, 2023

Conversation

@norabelrose (Member)

Fixes #88
Note: merge #101 first

In main, most of the wall-clock time used by the training process goes to fitting the logistic regression baseline models with scikit-learn, which runs on the CPU only. In this PR, I replace sklearn.LogisticRegression with a simple PyTorch module called Classifier, which uses LBFGS to fit its coefficients to a batch of data. This can speed up training by an order of magnitude.
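For a rough idea of the approach, here's a minimal sketch of such a module, assuming binary labels and BCE loss; the fit signature and max_iter default are illustrative, not the PR's exact implementation:

```python
import torch


class Classifier(torch.nn.Module):
    def __init__(self, input_dim: int, num_classes: int = 1, device=None):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, num_classes, device=device)
        # Zero-initialize the coefficients (see the discussion below)
        self.linear.bias.data.zero_()
        self.linear.weight.data.zero_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

    def fit(self, x: torch.Tensor, y: torch.Tensor, max_iter: int = 100) -> float:
        """Fit the coefficients to a full batch (x, y) with L-BFGS."""
        opt = torch.optim.LBFGS(
            self.parameters(), line_search_fn="strong_wolfe", max_iter=max_iter
        )
        loss_fn = torch.nn.BCEWithLogitsLoss()

        def closure():
            opt.zero_grad()
            loss = loss_fn(self(x).squeeze(-1), y.float())
            loss.backward()
            return loss

        # LBFGS.step returns the loss from the first closure evaluation
        return float(opt.step(closure))
```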

Since Classifier is so fast, I also use it to predict the pseudo-labels (e.g. "Yes", "No") for each contrast pair, and measure its AUROC. This "pseudo-AUROC" is a potentially useful diagnostic for determining how well the normalization is working. If the pseudo-AUROC is high, that's a bad sign: it means the pseudo-labels are linearly separable and the unsupervised algorithm may pick up on that more than anything else. Currently the code prints a warning whenever the pseudo-AUROC is over 0.6, but we might want to tune this or make it customizable in the future. On the small number of (model, dataset) pairs I've looked at, though, it seems like the pseudo-AUROC is very close to 0.5, so I think the naive approach mostly works.
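The check itself is simple; something along these lines, after fitting a Classifier on the pseudo-labels (a sketch, with a hypothetical check_pseudo_auroc helper):

```python
import torch
from sklearn.metrics import roc_auc_score


def check_pseudo_auroc(clf, x: torch.Tensor, pseudo_labels: torch.Tensor) -> float:
    """Warn if the pseudo-labels are linearly separable after normalization."""
    with torch.no_grad():
        scores = clf(x).squeeze(-1).cpu().numpy()

    auroc = roc_auc_score(pseudo_labels.cpu().numpy(), scores)
    if auroc > 0.6:
        print(f"WARNING: pseudo-AUROC is {auroc:.3f}; the unsupervised method "
              "may pick up on the pseudo-labels rather than the truth direction")
    return auroc
```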

Also, I switched the training script from Pool.imap_unordered to normal Pool.imap, which ensures that the entries in the eval CSV are sorted (making it much easier to read at a glance). The downside is that the progress bar can be "jumpier." We might be able to do something better in the future: the stupidly simple idea would be to stick with imap_unordered and only print the eval CSV at the end, but I kinda wanted to avoid that since, especially during debugging, the training script might crash and we want to make sure partial results get written to disk. IDK.
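To illustrate the trade-off (a toy example, not the actual training script):

```python
from multiprocessing import Pool


def evaluate(task: int) -> tuple[int, float]:
    # Stand-in for fitting and evaluating one (model, dataset) pair.
    return task, task ** 0.5


if __name__ == "__main__":
    with Pool(4) as pool:
        # imap yields results in submission order, so rows land in the eval CSV
        # already sorted; imap_unordered yields them as workers finish, which
        # smooths the progress bar but shuffles the rows.
        for task, score in pool.imap(evaluate, range(20)):
            print(task, score)
```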

@FabienRoger (Collaborator)

I really like measuring pseudo-AUROC.
Have you seen it go above 0.5 in some cases?

@FabienRoger (Collaborator)

Using Pool.imap looks good to me. I guess it doesn't make the bar too jumpy as long as the number of tasks is large enough relative to the number of workers, which is hopefully the regime most of us are working in?

@FabienRoger (Collaborator) left a comment

LGTM
(Haven't looked into the promptsource integration)


self.linear = torch.nn.Linear(input_dim, num_classes, device=device)
self.linear.bias.data.zero_()
self.linear.weight.data.zero_()
Collaborator

This made me slightly worried at first, but I guess it's always fine since its output is never used as input to another layer?

@norabelrose (Member, Author)

Oh, I was doing this because I think logistic regression models are usually initialized to zero? It shouldn't matter since this is always a convex problem, although maybe in weird cases it could matter due to the 1e-4 stopping condition.


# Orthogonal projection onto col(A): P = A (A^T A)^{-1} A^T
# https://en.wikipedia.org/wiki/Projection_(linear_algebra)
A = self.linear.weight.data.T
P = A @ torch.linalg.solve(A.mT @ A, A.mT)
Collaborator

You use that because it's more accurate than running Gram-Schmidt and then projecting in the usual way? Looks good to me.
(Note: you drop the gradient; I hope this won't cause trouble for users wanting to do something weird.)

@norabelrose (Member, Author)

Oh, I didn't even think about orthogonalizing A and then projecting. I just saw this formula in the Wikipedia article and it seems to work 😅

And yeah, I think including the gradient could cause more bugs than anything? But idk. I was primarily thinking of using this for INLP: the idea is that you could project the data onto the nullspace of the classifier for the pseudo-labels as a more robust form of normalization. But it seems that, at least on the datasets I've looked at, this isn't actually necessary (the pseudo-labels aren't linearly separable after subtracting the mean).
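For the record, the INLP-style normalization I had in mind would look roughly like this (a sketch; nullspace_project is a hypothetical helper, not code from this PR):

```python
import torch


def nullspace_project(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Remove the components of x that a linear classifier with weights w can use.

    x: [n, dim] data; w: [num_classes, dim] classifier weights.
    """
    A = w.T  # [dim, num_classes]
    # Orthogonal projection onto col(A): P = A (A^T A)^{-1} A^T
    P = A @ torch.linalg.solve(A.mT @ A, A.mT)
    eye = torch.eye(P.shape[0], device=P.device, dtype=P.dtype)
    # I - P projects onto the nullspace of w; P is symmetric, so no transpose needed
    return x @ (eye - P)
```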


# These are users whose datasets should be included in the results returned by
# filter_english_datasets (regardless of their metadata)
INCLUDED_USERS = {"Zaid", "craffel", "lauritowal"}
Collaborator

This file is not up to date, since Christy added a new user here and also added some new templates which we might want to include (?)

@lauritowal (Collaborator) left a comment

Should work now; I've resolved the merge conflicts and committed some small additional changes @norabelrose

@norabelrose merged commit 51a674a into main on Mar 7, 2023
@norabelrose deleted the gpu-classifier branch on March 7, 2023 at 01:32
Successfully merging this pull request may close these issues:

Use Reporter for the supervised baseline, not sklearn LogisticRegression (#88)