
Fit logistic regression models on the GPU #103

Merged · norabelrose merged 8 commits into main on Mar 7, 2023

Conversation

@norabelrose (Member)

Fixes #88
Note: merge #101 first

In main, most of the wall-clock time used by the training process goes to fitting the logistic regression baseline models with scikit-learn, which runs on the CPU only. In this PR, I replace sklearn.LogisticRegression with a simple PyTorch module called Classifier, which uses LBFGS to fit its coefficients to a batch of data. This can speed up training by an order of magnitude.
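For a rough idea of the approach, here's a minimal sketch of such a module, assuming binary labels and BCE loss; the fit signature and max_iter default are illustrative, not the PR's exact implementation:

```python
import torch


class Classifier(torch.nn.Module):
    def __init__(self, input_dim: int, num_classes: int = 1, device=None):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, num_classes, device=device)
        # Zero-initialize the coefficients (see the discussion below)
        self.linear.bias.data.zero_()
        self.linear.weight.data.zero_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

    def fit(self, x: torch.Tensor, y: torch.Tensor, max_iter: int = 100) -> float:
        """Fit the coefficients to a full batch (x, y) with L-BFGS."""
        opt = torch.optim.LBFGS(
            self.parameters(), line_search_fn="strong_wolfe", max_iter=max_iter
        )
        loss_fn = torch.nn.BCEWithLogitsLoss()

        def closure():
            opt.zero_grad()
            loss = loss_fn(self(x).squeeze(-1), y.float())
            loss.backward()
            return loss

        # LBFGS.step returns the loss from the first closure evaluation
        return float(opt.step(closure))
```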

Since Classifier is so fast, I also use it to predict the pseudo-labels (e.g. "Yes", "No") for each contrast pair, and measure its AUROC. This "pseudo-AUROC" is a potentially useful diagnostic for determining how well the normalization is working. If the pseudo-AUROC is high, that's a bad sign: it means the pseudo-labels are linearly separable and the unsupervised algorithm may pick up on that more than anything else. Currently the code prints a warning whenever the pseudo-AUROC is over 0.6, but we might want to tune this or make it customizable in the future. On the small number of (model, dataset) pairs I've looked at, though, it seems like the pseudo-AUROC is very close to 0.5, so I think the naive approach mostly works.
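The check itself is simple; something along these lines, after fitting a Classifier on the pseudo-labels (a sketch, with a hypothetical check_pseudo_auroc helper):

```python
import torch
from sklearn.metrics import roc_auc_score


def check_pseudo_auroc(clf, x: torch.Tensor, pseudo_labels: torch.Tensor) -> float:
    """Warn if the pseudo-labels are linearly separable after normalization."""
    with torch.no_grad():
        scores = clf(x).squeeze(-1).cpu().numpy()

    auroc = roc_auc_score(pseudo_labels.cpu().numpy(), scores)
    if auroc > 0.6:
        print(f"WARNING: pseudo-AUROC is {auroc:.3f}; the unsupervised method "
              "may pick up on the pseudo-labels rather than the truth direction")
    return auroc
```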

Also, I switched the training script from Pool.imap_unordered to normal Pool.imap, which ensures that the entries in the eval CSV are sorted (making it much easier to read at a glance). The downside is that the progress bar can be "jumpier." We might be able to do something better in the future: the stupidly simple idea would be to stick with imap_unordered and only print the eval CSV at the end, but I kinda wanted to avoid that since, especially during debugging, the training script might crash and we want to make sure partial results get written to disk. IDK.
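To illustrate the trade-off (a toy example, not the actual training script):

```python
from multiprocessing import Pool


def evaluate(task: int) -> tuple[int, float]:
    # Stand-in for fitting and evaluating one (model, dataset) pair.
    return task, task ** 0.5


if __name__ == "__main__":
    with Pool(4) as pool:
        # imap yields results in submission order, so rows land in the eval CSV
        # already sorted; imap_unordered yields them as workers finish, which
        # smooths the progress bar but shuffles the rows.
        for task, score in pool.imap(evaluate, range(20)):
            print(task, score)
```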

@FabienRoger (Collaborator)

I really like measuring pseudo-AUROC.
Have you seen it go above 0.5 in some cases?

@FabienRoger (Collaborator)

Using Pool.imap looks good to me. I guess it doesn't make the bar too jumpy as long as the number of tasks is large enough relative to the number of workers, which is hopefully the regime most of us are working in?

@FabienRoger (Collaborator) left a comment

LGTM
(Haven't looked into the promptsource integration)


self.linear = torch.nn.Linear(input_dim, num_classes, device=device)
self.linear.bias.data.zero_()
self.linear.weight.data.zero_()
Collaborator

This made me slightly worried at first, but I guess it's always fine since its output is never used as input to another layer?

@norabelrose (Member, Author)

Oh, I was doing this because I think logistic regression models are usually initialized to zero? It shouldn't matter since this is always a convex problem, although maybe in weird cases it could matter due to the 1e-4 stopping condition.


# Orthogonal projection onto col(A): P = A (A^T A)^{-1} A^T
# https://en.wikipedia.org/wiki/Projection_(linear_algebra)
A = self.linear.weight.data.T
P = A @ torch.linalg.solve(A.mT @ A, A.mT)
Collaborator

You use that because it's more accurate than running Gram-Schmidt and then projecting in the usual way? Looks good to me.
(Note: you drop the gradient; I hope this won't cause trouble for users wanting to do something weird.)

@norabelrose (Member, Author)

Oh, I didn't even think about orthogonalizing A and then projecting. I just saw this formula in the Wikipedia article and it seems to work 😅

And yeah, I think including the gradient could cause more bugs than anything? But idk. I was primarily thinking of using this for INLP: the idea is that you could project the data onto the nullspace of the classifier for the pseudo-labels as a more robust form of normalization. But it seems that, at least on the datasets I've looked at, this isn't actually necessary (the pseudo-labels aren't linearly separable after subtracting the mean).
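For the record, the INLP-style normalization I had in mind would look roughly like this (a sketch; nullspace_project is a hypothetical helper, not code from this PR):

```python
import torch


def nullspace_project(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Remove the components of x that a linear classifier with weights w can use.

    x: [n, dim] data; w: [num_classes, dim] classifier weights.
    """
    A = w.T  # [dim, num_classes]
    # Orthogonal projection onto col(A): P = A (A^T A)^{-1} A^T
    P = A @ torch.linalg.solve(A.mT @ A, A.mT)
    eye = torch.eye(P.shape[0], device=P.device, dtype=P.dtype)
    # I - P projects onto the nullspace of w; P is symmetric, so no transpose needed
    return x @ (eye - P)
```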


# These are users whose datasets should be included in the results returned by
# filter_english_datasets (regardless of their metadata)
INCLUDED_USERS = {"Zaid", "craffel", "lauritowal"}
Collaborator

This file is not up to date, since Christy added a new user here and also added some new templates which we might want to include (?)

@lauritowal (Collaborator) left a comment

Should work now; I've resolved the merge conflicts and committed some small additional changes @norabelrose

@norabelrose merged commit 51a674a into main on Mar 7, 2023
@norabelrose deleted the gpu-classifier branch on March 7, 2023 at 01:32
Successfully merging this pull request may close these issues:

Use Reporter for the supervised baseline, not sklearn LogisticRegression (#88)