Fit logistic regression models on the GPU #103
Conversation
I really like measuring pseudo-AUROC.
Using Pool.imap looks good to me. I guess it doesn't make the bar too jumpy if the number of tasks is large enough relative to the number of workers, which is hopefully the regime most of us are working in?
LGTM
(Haven't looked into the promptsource integration)
```python
self.linear = torch.nn.Linear(input_dim, num_classes, device=device)
self.linear.bias.data.zero_()
self.linear.weight.data.zero_()
```
this made me slightly worried at first, but I guess it's always fine since its output is never used as input to another layer?
Oh, I was doing this because I think logistic regression models are usually initialized to zero. It shouldn't matter since this is always a convex problem, although maybe in weird cases it could matter due to the 1e-4 stopping condition.
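To make the discussion above concrete, here is a minimal sketch of a zero-initialized logistic regression head fit with LBFGS. The class and method names (`Classifier`, `fit`) are illustrative and not necessarily the PR's exact code:

```python
import torch


class Classifier(torch.nn.Module):
    """Sketch of a zero-initialized binary logistic regression head
    fit with LBFGS. Names are illustrative, not the PR's exact code."""

    def __init__(self, input_dim: int, device=None):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, 1, device=device)
        # Zero init: harmless here because logistic regression is a
        # convex problem, and zero is the conventional starting point.
        self.linear.bias.data.zero_()
        self.linear.weight.data.zero_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x).squeeze(-1)

    def fit(self, x: torch.Tensor, y: torch.Tensor, max_iter: int = 100):
        """Fit coefficients to one batch of data with LBFGS."""
        opt = torch.optim.LBFGS(
            self.parameters(), max_iter=max_iter, line_search_fn="strong_wolfe"
        )
        loss_fn = torch.nn.BCEWithLogitsLoss()

        def closure():
            opt.zero_grad()
            loss = loss_fn(self(x), y.float())
            loss.backward()
            return loss

        opt.step(closure)
        return self
```

Because the whole batch fits in one LBFGS call, this avoids the per-epoch minibatch loop entirely, which is where the speedup over CPU-bound scikit-learn comes from.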
```python
# https://en.wikipedia.org/wiki/Projection_(linear_algebra)
A = self.linear.weight.data.T
P = A @ torch.linalg.solve(A.mT @ A, A.mT)
```
You use that because it's more accurate than using Gram-Schmidt and then projecting in the usual way? Looks good to me.
(Note: you drop the gradient; I hope this won't cause trouble for users wanting to do something weird.)
Oh, I didn't even think about orthogonalizing A and then projecting. I just saw this formula in the Wikipedia article and it seems to work 😅
And yeah, I think including the gradient could cause more bugs than anything? But idk. I was primarily thinking of using this for INLP: the idea is that you could project the data onto the nullspace of the classifier fit to the pseudo-labels, as a more robust form of normalization. But it seems that, at least on the datasets I've looked at, this isn't actually necessary (the pseudo-labels aren't linearly separable after subtracting the mean).
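For readers unfamiliar with the formula in the diff: `P = A (AᵀA)⁻¹ Aᵀ` projects onto the column space of `A`, and `I − P` projects onto its orthogonal complement, which is the nullspace projection an INLP-style normalization would use. A small self-contained check (the shapes and variable names here are hypothetical, chosen just for the demo):

```python
import torch

torch.manual_seed(0)
# Stand-in for the transposed classifier weights: 8-dim features, 2 classes.
A = torch.randn(8, 2, dtype=torch.float64)

# Projection onto the column space of A, as in the diff.
P = A @ torch.linalg.solve(A.mT @ A, A.mT)
# Complementary projector onto the nullspace of A's columns.
Q = torch.eye(8, dtype=torch.float64) - P

assert torch.allclose(P @ P, P, atol=1e-10)   # idempotent: a projector
assert torch.allclose(P, P.mT, atol=1e-10)    # symmetric: an orthogonal projector
assert torch.allclose(Q @ A, torch.zeros_like(A), atol=1e-10)  # Q kills A's columns
```

Data projected through `Q` carries no linear information along the classifier's directions, which is exactly the "project onto the nullspace" idea discussed above.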
elk/promptsource/templates.py
Outdated
```python
# These are users whose datasets should be included in the results returned by
# filter_english_datasets (regardless of their metadata)
INCLUDED_USERS = {"Zaid", "craffel", "lauritowal"}
```
This file is not up to date: Christy added a new user here and also added some new templates which we might want to include (?)
Should work now; I've resolved the merge conflicts and committed some small additional changes @norabelrose
Fixes #88

Note: merge #101 first

In `main`, most of the wall-clock time used by the training process goes to training the logistic regression baseline models with scikit-learn. This is because scikit-learn runs on the CPU only. In this PR, I replace `sklearn.LogisticRegression` with a simple PyTorch module called `Classifier`, which uses LBFGS to fit its coefficients to a batch of data. This can speed up training by an order of magnitude.

Since `Classifier` is so fast, I also use it to predict the pseudo-labels (e.g. "Yes", "No") for each contrast pair, and measure its AUROC. This "pseudo-AUROC" is a potentially useful diagnostic for determining how well the normalization is working. If the pseudo-AUROC is high, that's a bad sign: it means the pseudo-labels are linearly separable, and the unsupervised algorithm may pick up on that more than anything else. Currently the code prints a warning whenever the pseudo-AUROC is over 0.6, but we might want to tune this threshold or make it customizable in the future. On the small number of (model, dataset) pairs I've looked at, though, the pseudo-AUROC is very close to 0.5, so I think the naive approach mostly works.

Also, I switched the training script from `Pool.imap_unordered` to plain `Pool.imap`, which ensures that the entries in the eval CSV are sorted (making it much easier to read at a glance). The downside is that the progress bar can be "jumpier." We might be able to do something better in the future; the stupidly simple idea would be to stick with `imap_unordered` and only print the eval CSV at the end, but I wanted to avoid that since the training script might crash (especially during debugging) and we want to make sure partial results get written to disk. IDK.