Forum Replies Created

AuthorPosts

Michael Barr, Participant
PPS – the glmnet package lets you fit a variety of distributions, similar to the glm() function in R. Both methods are likelihood-based and do not perform OLS (the two are only equivalent in the Gaussian + identity-link case you identified). In any case, you will not get the MLE estimates back from glmnet except in the special case of \lambda = 0 (no weight on the penalty). In all cases, the loss function used in estimation is negative log-likelihood + \lambda * penalty, where the penalty is L1, L2, or the mixture a*L1 + (1-a)*L2.
A vanilla GLM would simply minimize the negative log-likelihood (equivalently, maximize the likelihood). But because the penalties measure the magnitudes of the coefficients, including that term in the loss function shrinks the estimates toward zero, moving them away from the MLE.
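To make the loss function concrete, here is a rough sketch in plain Python (not R, and not glmnet's own code) of how the penalty term is added to the negative log-likelihood. It assumes the form given in the glmnet help page as I read it, where the L2 part is squared and halved: \alpha * L1 + (1 - \alpha)/2 * L2^2. The coefficient values and the negative log-likelihood value are made up for illustration.

```python
def elastic_net_penalty(beta, alpha):
    """Penalty in the glmnet-style form: alpha * L1 + (1 - alpha) / 2 * squared L2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared


def penalized_loss(neg_log_lik, beta, lam, alpha):
    """Quantity being minimized: negative log-likelihood + lambda * penalty."""
    return neg_log_lik + lam * elastic_net_penalty(beta, alpha)


beta = [2.0, -1.0, 0.0]  # hypothetical coefficients (the intercept is not penalized)
nll = 10.0               # hypothetical negative log-likelihood evaluated at beta

lasso_loss = penalized_loss(nll, beta, lam=0.5, alpha=1.0)  # pure L1 penalty
ridge_loss = penalized_loss(nll, beta, lam=0.5, alpha=0.0)  # pure (squared) L2 penalty
mle_loss = penalized_loss(nll, beta, lam=0.0, alpha=1.0)    # no penalty at all
```

With lam = 0 the penalty vanishes and the loss is just the negative log-likelihood, which is why only that case gives back the MLE.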
Michael Barr, Participant
PS – you generally want to standardize your covariates before using these methods, since the penalty depends on the scale of each coefficient. glmnet() has a standardize argument for this, which defaults to TRUE: it standardizes the covariates internally and reports the coefficients back on the original scale. If you have already standardized them in your data.frame, you can set standardize = FALSE. I usually do this myself using the caret package, along with other preprocessing work.
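If it helps to see what standardizing means here, this is a minimal sketch in plain Python (not what caret or glmnet actually run): each column is centered to mean 0 and scaled to standard deviation 1. Using the 1/n form of the standard deviation is an assumption of this sketch, and the data values are made up.

```python
import math


def standardize_columns(rows):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    # Population (1/n) standard deviation; an assumption made for this sketch.
    # A constant column would give sd = 0 and must be dropped beforehand.
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_cols)
    ]
    scaled = [[(r[j] - means[j]) / sds[j] for j in range(n_cols)] for r in rows]
    return scaled, means, sds


# Two hypothetical covariates on very different scales.
x = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
x_std, means, sds = standardize_columns(x)
```

After scaling, the two columns are on the same footing, so the penalty no longer hits one coefficient harder just because its covariate was measured in larger units.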
Michael Barr, Participant
Hi – I can go into more detail if you like, but here is a high-level description of a workflow and how to implement it with glmnet (and cv.glmnet).
As you know, Lasso/Ridge/Elastic Net perform a penalized regression – the penalties are the L1 norm (aka “Manhattan distance”), the L2 norm (Euclidean distance), and a linear weighting of the two, respectively. The mix between these is controlled by a hyperparameter which glmnet calls \alpha, and the overall amount of penalty is controlled by the hyperparameter \lambda.
How do you decide which penalty or what mixture of L1/L2 to use (\alpha)? Sometimes that can be decided a priori (e.g. you are primarily interested in feature selection, so you choose the L1 penalty, which can shrink coefficients exactly to zero and so drop variables from the model). In glmnet the default is \alpha = 1, i.e. Lasso; \alpha = 0 is Ridge, and values in between give Elastic Net. Double check the help file on that.
How do you decide the optimal amount of penalty (\lambda)? Again, sometimes you may know a priori what you want – you may want no penalty (\lambda = 0, a vanilla MLE regression), or you may want the penalty to dominate (\lambda arbitrarily large, so all coefficients are shrunk to approximately 0 and you get back the null model). But these are special or edge cases, and you wouldn’t be using glmnet if you wanted them. So in practice you nearly always need to tune \lambda.
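The two edge cases are easy to see in the simplest possible setting. The sketch below (plain Python, not glmnet code) assumes a single standardized predictor with a Gaussian loss, where the penalized estimate has a closed form: soft-threshold the unpenalized estimate by \lambda * \alpha, then divide by 1 + \lambda * (1 - \alpha). The starting estimate of 2.0 is made up.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning exactly 0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_1d(b_unpenalized, lam, alpha):
    """Penalized estimate for one standardized predictor (Gaussian loss assumed)."""
    return soft_threshold(b_unpenalized, lam * alpha) / (1.0 + lam * (1.0 - alpha))


b_hat = 2.0  # hypothetical unpenalized (MLE) estimate
for lam in (0.0, 0.5, 1.0, 5.0):
    lasso = elastic_net_1d(b_hat, lam, alpha=1.0)
    ridge = elastic_net_1d(b_hat, lam, alpha=0.0)
    print(f"lambda={lam}: lasso={lasso:.3f}, ridge={ridge:.3f}")
```

At \lambda = 0 both give back the unpenalized estimate. As \lambda grows, the Lasso estimate hits exactly zero, while the Ridge estimate keeps shrinking but never quite reaches it.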
glmnet proposes k-fold cross-validation for hyperparameter tuning, and it is implemented in cv.glmnet(). So you want to start with this method. It will give you a grid of values for \lambda chosen heuristically, or you can provide your own grid of test values. But since there is no real interpretation or significance to a value of \lambda above 0 (it depends on the number and scale of your coefficients), you pretty much want to start by letting the algorithm choose a grid for you.
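For a feel of what a heuristic grid looks like, here is a small sketch in plain Python. As far as I recall, glmnet starts at the smallest \lambda that zeroes out every coefficient and steps down on a log scale to a small fraction of that value (100 values by default); treat those details as my recollection and check the help file. The lambda_max value below is made up, and this function is only an illustration, not glmnet's code.

```python
import math


def lambda_grid(lambda_max, n_lambda=100, min_ratio=1e-4):
    """Log-spaced, decreasing grid from lambda_max down to lambda_max * min_ratio."""
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * min_ratio)
    step = (log_max - log_min) / (n_lambda - 1)
    return [math.exp(log_max - i * step) for i in range(n_lambda)]


# Hypothetical starting value; in practice it would be computed from the data.
grid = lambda_grid(lambda_max=2.0, n_lambda=5, min_ratio=1e-4)
```

The log spacing is the point: the interesting changes in the fitted model tend to happen over orders of magnitude of \lambda, not over equal additive steps.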
So you run cv.glmnet() to cross-validate \lambda and receive back a cross-validated estimate of predictive error for each value. Note that cv.glmnet() tunes \lambda for a fixed \alpha; if you also want to tune \alpha, run it over a grid of \alpha values using the same fold assignments (the foldid argument) so the results are comparable. The default error measure is deviance, which is mean squared error for Gaussian models and otherwise a scaled negative log-likelihood. You can plot the metric by calling the object’s plot method [simply plot(cv.object)]. You can also access two values of \lambda that are stored directly: lambda.min, which gives the minimum cross-validated error, and lambda.1se, the largest \lambda whose error is within 1 s.e. of that minimum, i.e. a slightly more regularized result. The latter is often chosen since we have a preference for parsimony in models.
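The rule behind those two stored values is simple enough to write out. This is a sketch in plain Python of the selection logic as I understand it, not glmnet's code; the \lambda values, mean errors, and standard errors are made up.

```python
def pick_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda cross-validation summaries."""
    # Index of the lambda with the smallest mean cross-validated error.
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Any lambda whose error is within one standard error of the minimum qualifies.
    threshold = cv_mean[best] + cv_se[best]
    # Among those, take the largest lambda, i.e. the most regularized model.
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[best], lambda_1se


# Hypothetical cross-validation results, one entry per lambda on the grid.
lambdas = [1.0, 0.5, 0.25, 0.1, 0.05]
cv_mean = [3.0, 2.2, 2.0, 1.9, 1.95]
cv_se = [0.3, 0.25, 0.2, 0.2, 0.2]

lambda_min, lambda_1se = pick_lambdas(lambdas, cv_mean, cv_se)
```

Here the minimum error is at 0.1, but 0.25 is statistically indistinguishable from it and gives a more heavily penalized model, so that is the one the 1 s.e. rule picks.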
Now, with your selected value of \lambda, you can get your “final” model fit to all the data by using glmnet() directly. This isn’t strictly necessary: cv.glmnet() also fits the full \lambda path to all of the data and stores it in the returned object (the glmnet.fit element), so you can pull coefficients or predictions at your chosen value with coef() or predict() and s = "lambda.min" or s = "lambda.1se". Calling glmnet() yourself with a single \lambda returns a smaller object with just the one set of coefficients.
HTH
