Regularization in R – Actuarial Outpost

Tagged: glmnet, R, regularization

This topic has 6 replies, 5 voices, and was last updated 3 years, 4 months ago by exekias.


Author

Posts
November 28, 2020 at 8:03 pm #1193
Mel Phant
Participant
I’m practicing with the June 13, 2019 exam and like to play around with different tasks to gain a better understanding of what R is actually doing and how I can use it. I’m up to the regularization task (#9) and am now realizing that, though I learned LASSO and Ridge Regression for SRM, I don’t fully understand what R is doing with glmnet. Are the coefficients/betas it produces for an OLS model (with the penalty factored in)? The other model I’ve created to predict for this project is a GLM using Gamma.

If I decided I wanted to use what R spit out using LASSO versus what I got earlier using stepwise selection, would that mean I am going with a regular linear model (Gaussian and Identity)? Can I use glmnet on a Gamma or other GLM?
November 29, 2020 at 3:03 pm #1198
SamCastillo
Participant
testpost
December 7, 2020 at 1:48 pm #1275
GhostOfMFE
Participant
testpost

it worked.
December 9, 2020 at 3:45 pm #1324
Michael Barr
Participant
Hi – I can go into more detail if you like, but here is a high-level description of a workflow and how to implement it with glmnet (and cv.glmnet).

As you know, lasso, ridge, and elastic net all perform a penalized regression. The penalties are the L1 norm of the coefficients (aka “Manhattan distance”), the squared L2 norm (squared Euclidean distance), and a weighted combination of the two, respectively. The mix between them is controlled by a hyperparameter, called alpha in glmnet, and the overall magnitude or amount of penalty is controlled by the hyperparameter lambda.
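For reference, the objective that glmnet minimizes can be written as below, following the package documentation. The symbols are notation chosen for this sketch: \ell is the negative log-likelihood contribution of observation i, w_i are observation weights, and N is the number of observations.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{N}\sum_{i=1}^{N} w_i\,\ell\!\left(y_i,\ \beta_0 + x_i^{\top}\beta\right)
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

With \alpha = 1 only the L1 term remains (lasso), and with \alpha = 0 only the squared L2 term remains (ridge).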

How do you decide which penalty or what mixture of L1/L2 to use (alpha)? Sometimes that can be decided a priori (e.g. you are primarily interested in feature selection, so you choose the L1 penalty to induce drop-out). In glmnet that is the default: alpha = 1 is lasso, alpha = 0 is ridge, and values in between give an elastic net.
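A minimal sketch of the difference, on simulated data (the data, the seed, and the value s = 0.2 are made up for illustration): lasso sets some coefficients exactly to zero, while ridge only shrinks them.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
# Simulated example: only the first three columns carry signal
y <- as.numeric(x[, 1:3] %*% c(2, -1, 0.5) + rnorm(n))

lasso <- glmnet(x, y, alpha = 1)  # L1 penalty (the glmnet default)
ridge <- glmnet(x, y, alpha = 0)  # squared L2 penalty

# Coefficients at one arbitrary lambda; [-1] drops the intercept.
# The same lambda is not comparable across alpha, it is just a probe here.
b_lasso <- as.matrix(coef(lasso, s = 0.2))[-1, 1]
b_ridge <- as.matrix(coef(ridge, s = 0.2))[-1, 1]

n_zero_lasso <- sum(b_lasso == 0)  # typically several exact zeros
n_zero_ridge <- sum(b_ridge == 0)  # ridge shrinks but does not drop
```
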

How do you decide the optimal amount of penalty (lambda)? Again, sometimes you may know a priori what you want: no penalty (lambda = 0, a vanilla MLE regression), or a penalty that dominates (lambda arbitrarily large, so all coefficients are shrunk to approximately 0 and you get back the null, intercept-only model). But these are special or edge cases, and you wouldn’t be using glmnet if you wanted them. So in practice you always need to tune lambda.
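The two edge cases can be seen at the ends of a fitted path. This is a sketch on simulated Gaussian data (all names and values are illustrative), comparing the smallest lambda on the default grid with an ordinary lm() fit:

```r
library(glmnet)

set.seed(2)
n <- 200; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% c(1, -0.5, 0, 0, 0) + rnorm(n))

fit <- glmnet(x, y)  # lasso path over a decreasing grid of lambda values

# Largest lambda on the grid: no nonzero slopes, i.e. the null model
df_at_lambda_max <- fit$df[1]

# Smallest lambda on the grid: close to (not exactly) the unpenalized fit
b_small <- as.matrix(coef(fit, s = min(fit$lambda)))[, 1]
b_ols   <- coef(lm(y ~ x))
max_gap <- max(abs(b_small - b_ols))
```
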

glmnet proposes k-fold cross-validation for hyperparameter tuning, implemented in cv.glmnet(). So you want to start with this function: by default it builds a grid of lambda values heuristically, or you can provide your own grid through the lambda argument. But since a value of lambda above 0 has no real interpretation on its own (it depends on the number and scale of your predictors), you pretty much want to start by letting the algorithm choose the grid for you.

So you run cv.glmnet() to cross-validate lambda and receive back a cross-validated estimate of prediction error for each value on the grid. Note that cv.glmnet() tunes lambda only, for a fixed alpha; to tune alpha as well, run it once per candidate alpha with the same fold assignments (the foldid argument) and compare. The default error measure is deviance, which is mean squared error for the Gaussian family and -2 × log-likelihood, up to a constant, for the others. You can plot the metric by calling the object’s plot method [simply plot(cv.object)]. Two values of lambda are stored directly: lambda.min, which minimizes the cross-validated error, and lambda.1se, the largest lambda whose error is within one standard error of that minimum. The latter gives a slightly more regularized result and is often chosen since we have a preference for parsimony in models.
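A sketch of that tuning step on simulated data (the data, the seed, and the three candidate alpha values are assumptions for illustration). The alpha loop reuses one set of fold assignments so the cross-validated errors are comparable:

```r
library(glmnet)

set.seed(42)
n <- 300; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:2] %*% c(1.5, -1) + rnorm(n))

# Tune lambda for a fixed alpha (lasso) with 10-fold cross-validation
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
lambda_min <- cvfit$lambda.min   # minimizes the CV error
lambda_1se <- cvfit$lambda.1se   # most regularized fit within 1 s.e.

# Tuning alpha too: same folds for every candidate value
foldid <- sample(rep(1:10, length.out = n))
alphas <- c(0, 0.5, 1)
cv_by_alpha <- lapply(alphas, function(a) {
  cv.glmnet(x, y, alpha = a, foldid = foldid)
})
cv_err <- sapply(cv_by_alpha, function(f) min(f$cvm))
best_alpha <- alphas[which.min(cv_err)]
```

Calling plot(cvfit) on the result draws the error curve with both stored lambda values marked.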

Now, with your selected value of lambda, you get your “final” model fit to all the data. You can call glmnet() directly, but this isn’t strictly necessary: cv.glmnet() also fits the whole lambda path to the full data and stores it in the glmnet.fit component, so calling coef() or predict() on the cross-validation object with s = "lambda.min" or s = "lambda.1se" returns the final coefficients and fitted values. If you do refit with glmnet(), the help file advises fitting the full path and extracting at your chosen lambda rather than supplying a single lambda value.
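A short sketch of pulling the final model out of the cross-validation object, on simulated data (names and values are illustrative). It also checks that the stored full-data path gives the same coefficients:

```r
library(glmnet)

set.seed(7)
n <- 250; p <- 6
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:2] %*% c(1, -1) + rnorm(n))

cvfit <- cv.glmnet(x, y)

# Coefficients and predictions at the 1-s.e. lambda, straight from the CV object
b_cv  <- as.matrix(coef(cvfit, s = "lambda.1se"))
preds <- predict(cvfit, newx = x[1:5, ], s = "lambda.1se")

# The same thing via the full-data path stored inside the CV object
full_path <- cvfit$glmnet.fit
b_path <- as.matrix(coef(full_path, s = cvfit$lambda.1se))
same_coefs <- isTRUE(all.equal(b_cv, b_path))
```
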

HTH
December 9, 2020 at 5:46 pm #1329
Michael Barr
Participant
PS: these methods need your covariates on a common scale, because the penalty acts on coefficient magnitudes. glmnet() handles this through its standardize argument, which defaults to TRUE: the predictors are standardized internally before fitting and the coefficients are returned on the original scale. Set standardize = FALSE only if you have already scaled the data yourself. I usually do this manually with the caret package, along with other preprocessing work. Note that glmnet() takes a numeric matrix (e.g. from model.matrix()) rather than a data.frame.
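To illustrate what the default standardization buys you, here is a sketch on simulated data (the rescaling factor of 1000 and s = 0.1 are arbitrary choices): changing the units of one predictor changes its coefficient by the same factor but leaves the fitted values essentially unchanged.

```r
library(glmnet)

set.seed(3)
n <- 200
x <- matrix(rnorm(n * 4), n, 4)
y <- as.numeric(x %*% c(1, 0.5, 0, 0) + rnorm(n))

x_rescaled <- x
x_rescaled[, 1] <- x[, 1] * 1000  # same predictor, different units

fit_a <- glmnet(x, y, standardize = TRUE)           # TRUE is the default
fit_b <- glmnet(x_rescaled, y, standardize = TRUE)

s <- 0.1
coef_a <- as.matrix(coef(fit_a, s = s))
coef_b <- as.matrix(coef(fit_b, s = s))

# Row 2 is the first predictor (row 1 is the intercept)
ratio <- coef_a[2, 1] / coef_b[2, 1]

pred_a <- predict(fit_a, newx = x, s = s)
pred_b <- predict(fit_b, newx = x_rescaled, s = s)
max_pred_gap <- max(abs(pred_a - pred_b))
```
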
December 9, 2020 at 9:40 pm #1332
Michael Barr
Participant
PPS: the glmnet package allows you to fit a variety of distributions, similar to the glm() R function. Both are likelihood-based rather than OLS (the two coincide only in the Gaussian + identity case you identified). In any case, you will not get the MLE back from glmnet except in the special case lambda = 0 (no weight on the penalty). In all cases, the loss function being minimized is the negative log-likelihood (scaled by the number of observations) + lambda * penalty, where the penalty is ||beta||_1 for lasso, (1/2)*||beta||_2^2 for ridge, or alpha*||beta||_1 + (1 - alpha)*(1/2)*||beta||_2^2 for elastic net.

A vanilla GLM would simply minimize the negative log-likelihood (equivalently, maximize the log-likelihood). But because the penalties measure the magnitudes of the coefficients, including that term in the loss function means that you reduce the magnitudes of the estimates, moving away from the MLE.
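On the original Gamma question, here is a sketch that assumes glmnet 4.0 or later, where the family argument also accepts a glm-style family object. The simulated data, the log link, and the value s = 0.05 are all illustrative choices, not taken from the exam task:

```r
library(glmnet)  # assumes glmnet >= 4.0 for family objects

set.seed(11)
n <- 500; p <- 6
x <- matrix(rnorm(n * p), n, p)
mu <- exp(0.5 + 0.8 * x[, 1] - 0.4 * x[, 2])
y <- rgamma(n, shape = 2, rate = 2 / mu)  # Gamma response with mean mu

# Penalized Gamma GLM with a log link (lasso penalty)
fit_gamma <- glmnet(x, y, family = Gamma(link = "log"), alpha = 1)
b_pen <- as.matrix(coef(fit_gamma, s = 0.05))

# Unpenalized MLE from glm() for comparison
df <- data.frame(y = y, x)
fit_glm <- glm(y ~ ., data = df, family = Gamma(link = "log"))
b_mle <- coef(fit_glm)

comparison <- cbind(penalized = b_pen[, 1], mle = b_mle)
```

In the comparison table the penalized slopes should sit closer to zero than the glm() estimates, which is the shrinkage away from the MLE described above.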
January 6, 2021 at 6:16 am #1799
exekias
Participant
How does using regularization affect inference? If your parameters aren’t the MLE, then the score (the gradient of the log-likelihood) isn’t zero. So it seems like a lot of the assumptions behind the calculation of standard errors and p-values are thrown off. Does anyone have a good reference?