GLMCV results do not match GLM with same parameters and optimal lambda #377

Closed
whoisnnamdi opened this issue Mar 25, 2020 · 4 comments · Fixed by #378

@whoisnnamdi

Problem
The optimal penalization parameter (lambda) found via GLMCV does not yield matching results when plugged into GLM with otherwise identical parameters.

Example
The script below mostly follows the group lasso example code in the docs, modified for a regular lasso instead of a group lasso.

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-3,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=100, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-3, reg_lambda=opt_lambda,
          score_metric="pseudo_R2",
          alpha=1.0, learning_rate=1, max_iter=100, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

Results
You can tweak the learning rate and number of iterations of the second model, but its results never match those of GLMCV, even with many iterations and a low learning rate.

I understand there may be some inherent instability in how convergence is reached, but this feels like too much. In particular, since an important purpose of the lasso is feature selection, having certain variables with non-zero coefficients in one model but not in the other somewhat defeats that purpose, as the sketch below shows.
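
For concreteness, here is a minimal sketch of how one might surface the disagreement in selected features, reusing gl_glm and glm from the script above:

import numpy as np

# Indices of the features each model selects (nonzero coefficients)
support_cv = np.flatnonzero(gl_glm.beta_)
support_glm = np.flatnonzero(glm.beta_)

# Features selected by exactly one of the two models
mismatch = np.setxor1d(support_cv, support_glm)
print(len(mismatch), "features disagree:", mismatch)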

Thank you for looking into this.

@jasmainak
Member

The differences are really tiny if you reduce tol and increase max_iter. See below:

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-7,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=200, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-7, reg_lambda=opt_lambda,
          score_metric="pseudo_R2",
          alpha=1.0, learning_rate=1., max_iter=200, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

plt.plot(gl_glm.beta_)
plt.plot(glm.beta_, 'r')
plt.show()

I see this:

[Figure: overlaid plots of gl_glm.beta_ and glm.beta_ (red); the two coefficient traces mostly coincide, with small remaining differences]

This is especially the case for the second model, which is not helped by warm start.
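
For context, GLMCV fits models along the whole reg_lambda path and, per the warm-start remark above, initializes each fit from the previous solution, while a standalone GLM starts from scratch at a single lambda. Below is a toy illustration of why a warm-started path and a cold start at the final lambda land close to, but not exactly on, the same solution unless both are fully converged. It uses a plain squared-error lasso with a hand-rolled ISTA solver, not pyglmnet's actual optimizer:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.1 * rng.randn(100)

def soft_threshold(z, t):
    # Proximal operator of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, beta_init, n_iter=200):
    # Proximal gradient descent for (1/2n)||X b - y||^2 + lam * ||b||_1
    beta = beta_init.copy()
    step = len(y) / np.linalg.norm(X, 2) ** 2  # inverse Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Warm start: follow a decreasing lambda path, reusing the previous solution
lambdas = np.logspace(0, -2, 10)
beta = np.zeros(X.shape[1])
for lam in lambdas:
    beta = ista(X, y, lam, beta_init=beta)

# Cold start: solve only at the final lambda, starting from zeros
cold = ista(X, y, lambdas[-1], beta_init=np.zeros(X.shape[1]))

# Small but nonzero gap between the two solutions
print(np.max(np.abs(beta - cold)))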

@whoisnnamdi
Author

Thanks, I noticed this too: changing the tol and max_iter parameters brings the coefficient values closer together.

That said, there are a few instances in your example where GLMCV assigns a non-zero value while GLM does not, or vice versa, which matters if one is using this for variable selection. Is this effectively unavoidable?

@jasmainak
Member

Well, you need to push the convergence even further. You'll currently see this warning:

/Users/mainak/Documents/github_repos/pyglmnet/pyglmnet/pyglmnet.py:900: UserWarning: Reached max number of iterations without convergence.
  "Reached max number of iterations without convergence.")

Use max_iter=500 and you'll see there is almost no difference between the two.

[Figure: with max_iter=500, the two coefficient traces are nearly indistinguishable]
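
As an aside, you can check for that convergence warning programmatically with the standard warnings module (a minimal sketch, reusing glm, Xtrain, and ytrain from above):

import warnings

# Record warnings raised during fitting; pyglmnet emits a UserWarning
# when max_iter is exhausted before the tolerance is met
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    glm.fit(Xtrain, ytrain)

converged = not any("without convergence" in str(w.message) for w in caught)
print("converged:", converged)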

I agree it's a bit hard to debug this. I wouldn't be opposed to adding a plot_convergence method to the GLM object if that helps you.
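
Until such a method exists, here is a rough user-side sketch along those lines. plot_convergence_proxy is a hypothetical helper, not part of pyglmnet; it simply re-fits from scratch with increasing max_iter and plots how much the coefficients still move between fits:

import numpy as np
import matplotlib.pyplot as plt
from pyglmnet import GLM

def plot_convergence_proxy(X, y, max_iters=(50, 100, 200, 400, 800),
                           **glm_kwargs):
    # Hypothetical helper: a crude convergence diagnostic built only on
    # the public fit API
    betas = []
    for mi in max_iters:
        model = GLM(max_iter=mi, **glm_kwargs)
        model.fit(X, y)
        betas.append(model.beta_.copy())
    deltas = [np.max(np.abs(b2 - b1)) for b1, b2 in zip(betas, betas[1:])]
    plt.semilogy(max_iters[1:], deltas, "o-")
    plt.xlabel("max_iter")
    plt.ylabel("max |change in beta_|")
    plt.show()

plot_convergence_proxy(Xtrain, ytrain, distr="binomial", tol=1e-7,
                       reg_lambda=opt_lambda, alpha=1.0, learning_rate=1.)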

@whoisnnamdi
Author

Thanks for the help. Yes, a plot_convergence method would be super helpful if it's not too difficult to implement!
