GLMCV results do not match GLM with same parameters and optimal lambda #377

Closed
whoisnnamdi opened this issue Mar 25, 2020 · 4 comments · Fixed by #378

@whoisnnamdi

Problem
The optimal penalization parameter (lambda) found via GLMCV does not yield matching results when plugged into GLM with otherwise identical parameters.

Example
The script below mostly follows the group lasso example code in the docs, modified for a regular lasso instead of a group lasso.

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-3,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=100, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-3, reg_lambda=opt_lambda,
          score_metric="pseudo_R2",
          alpha=1.0, learning_rate=1, max_iter=100, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

Results
You can tweak the learning rate and number of iterations of the second model, but its results never match those of GLMCV, even with many iterations and a low learning rate.

I understand there may be some inherent instability in how convergence is reached, but this feels like too much. In particular, since an important purpose of the lasso is feature selection, having certain variables with non-zero coefficients in one model but not in the other somewhat defeats that purpose, as the sketch below shows.
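
For concreteness, here is a minimal sketch of how one might surface the disagreement in selected features, reusing gl_glm and glm from the script above:

import numpy as np

# Indices of the features each model selects (nonzero coefficients)
support_cv = np.flatnonzero(gl_glm.beta_)
support_glm = np.flatnonzero(glm.beta_)

# Features selected by exactly one of the two models
mismatch = np.setxor1d(support_cv, support_glm)
print(len(mismatch), "features disagree:", mismatch)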

Thank you for looking into this.

@jasmainak
Member

The differences are really tiny if you reduce tol and increase max_iter. See below:

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-7,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=200, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-7, reg_lambda=opt_lambda,
          score_metric="pseudo_R2",
          alpha=1.0, learning_rate=1., max_iter=200, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

plt.plot(gl_glm.beta_)
plt.plot(glm.beta_, 'r')
plt.show()

I see this:

[Figure: overlaid plots of gl_glm.beta_ and glm.beta_ (red); the two coefficient traces mostly coincide, with small remaining differences]

This is especially the case for the second model, which is not helped by warm start.
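
For context, GLMCV fits models along the whole reg_lambda path and, per the warm-start remark above, initializes each fit from the previous solution, while a standalone GLM starts from scratch at a single lambda. Below is a toy illustration of why a warm-started path and a cold start at the final lambda land close to, but not exactly on, the same solution unless both are fully converged. It uses a plain squared-error lasso with a hand-rolled ISTA solver, not pyglmnet's actual optimizer:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.1 * rng.randn(100)

def soft_threshold(z, t):
    # Proximal operator of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, beta_init, n_iter=200):
    # Proximal gradient descent for (1/2n)||X b - y||^2 + lam * ||b||_1
    beta = beta_init.copy()
    step = len(y) / np.linalg.norm(X, 2) ** 2  # inverse Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Warm start: follow a decreasing lambda path, reusing the previous solution
lambdas = np.logspace(0, -2, 10)
beta = np.zeros(X.shape[1])
for lam in lambdas:
    beta = ista(X, y, lam, beta_init=beta)

# Cold start: solve only at the final lambda, starting from zeros
cold = ista(X, y, lambdas[-1], beta_init=np.zeros(X.shape[1]))

# Small but nonzero gap between the two solutions
print(np.max(np.abs(beta - cold)))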

@whoisnnamdi
Author

Thanks, I noticed this too: changing the tol and max_iter parameters brings the coefficient values closer together.

That said, there are a few instances in your example where GLMCV assigns a non-zero value while GLM does not, or vice versa, which matters if one is using this for variable selection. Is this effectively unavoidable?

@jasmainak
Member

Well, you need to push the convergence even further. You'll currently see this warning:

/Users/mainak/Documents/github_repos/pyglmnet/pyglmnet/pyglmnet.py:900: UserWarning: Reached max number of iterations without convergence.
  "Reached max number of iterations without convergence.")

Use max_iter=500 and you'll see there is almost no difference between the two.

[Figure: with max_iter=500, the two coefficient traces are nearly indistinguishable]
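
As an aside, you can check for that convergence warning programmatically with the standard warnings module (a minimal sketch, reusing glm, Xtrain, and ytrain from above):

import warnings

# Record warnings raised during fitting; pyglmnet emits a UserWarning
# when max_iter is exhausted before the tolerance is met
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    glm.fit(Xtrain, ytrain)

converged = not any("without convergence" in str(w.message) for w in caught)
print("converged:", converged)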

I agree it's a bit hard to debug this. I wouldn't be opposed to adding a plot_convergence method to the GLM object if that helps you.
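
Until such a method exists, here is a rough user-side sketch along those lines. plot_convergence_proxy is a hypothetical helper, not part of pyglmnet; it simply re-fits from scratch with increasing max_iter and plots how much the coefficients still move between fits:

import numpy as np
import matplotlib.pyplot as plt
from pyglmnet import GLM

def plot_convergence_proxy(X, y, max_iters=(50, 100, 200, 400, 800),
                           **glm_kwargs):
    # Hypothetical helper: a crude convergence diagnostic built only on
    # the public fit API
    betas = []
    for mi in max_iters:
        model = GLM(max_iter=mi, **glm_kwargs)
        model.fit(X, y)
        betas.append(model.beta_.copy())
    deltas = [np.max(np.abs(b2 - b1)) for b1, b2 in zip(betas, betas[1:])]
    plt.semilogy(max_iters[1:], deltas, "o-")
    plt.xlabel("max_iter")
    plt.ylabel("max |change in beta_|")
    plt.show()

plot_convergence_proxy(Xtrain, ytrain, distr="binomial", tol=1e-7,
                       reg_lambda=opt_lambda, alpha=1.0, learning_rate=1.)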

@whoisnnamdi
Author

Thanks for the help. Yes, a plot_convergence method would be super helpful if it's not too difficult to implement!
