
#231




Glm vs glmnet vs cv.glmnet
Can someone confirm that I have understood glm vs glmnet vs cv.glmnet correctly?
GLM: for when you want to fit a generalized linear model (i.e., the response does not follow a normal distribution, etc.) using any of the family distributions and their respective link functions. All variables will be included in the model. Uses the standard y ~ x formula syntax.

GLMNET: for when you want to apply regularization to shrink your coefficients or remove some of them (depending on whether it's lasso or ridge...). You need to specify alpha and lambda (or, for the latter, allow glmnet to generate default lambda values for you). You need to set up a model.matrix etc. instead of a formula.

CV.GLMNET: using cross-validation, automatically finds an optimal value of lambda.

I'm just unsure which to use first when applying regularization: glmnet or cv.glmnet? Or does it not matter? From Rmd 6.7 it seems cv.glmnet was run first to find lambda.min, which was then used in the glmnet model, but I'm still unsure about this.

Also, can any family be used with glmnet? I know the default is gaussian and that binomial can also be used, but can it be used with poisson, etc.?

Any insight or additions to what I wrote above would be appreciated!

Last edited by rstein; 05-23-2019 at 10:39 AM.. Reason: Another question
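As a side note on the family question: plain glm (base R, no glmnet needed) accepts poisson and other families with the usual formula syntax. A minimal sketch on simulated data (the variable names and numbers here are purely illustrative, not from any exam material):

```{r}
# Simulate a count response whose log-mean is linear in x
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- rpois(100, lambda = exp(0.5 + 0.3 * dat$x))

# Fit a Poisson GLM with the log link via the standard formula interface
fit.pois <- glm(y ~ x, data = dat, family = poisson(link = "log"))
coef(fit.pois)  # intercept and slope estimates, near the simulated 0.5 and 0.3
```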
#232




Quote:
Use Elastic Net Regularization
```{r}
library(glmnet)
set.seed(1234)

# Build the design matrix from a formula
f <- as.formula("RESPONSE ~ PREDICTOR + PREDICTOR + ...")
x.train <- model.matrix(f, DATASET.train)

# Cross-validate to find the optimal lambda
m <- cv.glmnet(x = x.train,
               y = DATASET.train$RESPONSE,
               family = "FAMILY",
               alpha = ALPHA)

# Refit at the chosen lambda
m.best <- glmnet(x = x.train,
                 y = DATASET.train$RESPONSE,
                 family = "FAMILY",
                 lambda = m$lambda.min,
                 alpha = ALPHA)
m.best$beta
```
If you want to see which families are supported, simply run the command ?glmnet
__________________
Last edited by Josh Peck; 05-23-2019 at 11:44 AM.. 
#233




Quote:
I stole this from the FAP modules. It goes into the pros and cons of various accuracy metrics. 
#234




Quote:
https://www.hackingnote.com/en/machi...prosandcons
https://rmartinshort.jimdo.com/2019/...mlalgorithms/
#235




Quote:
1. Use stepwise selection based on AIC for feature selection. Achieve this by running stepAIC(glm, direction = "backward").
2. Use LASSO regularization for feature selection. Do this using Josh's code above (run cv.glmnet > obtain the optimal lambda > run glmnet with this lambda).

Which model performed best? If the first model is still the most accurate, are the others close enough to sacrifice some model accuracy for model interpretability? Use your best judgment, and whatever you decide, just make sure you justify it.
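For step 1, a minimal sketch of backward stepwise selection with MASS::stepAIC on a built-in dataset (mtcars and the predictors chosen here are illustrative, not from the project):

```{r}
library(MASS)  # ships with R; provides stepAIC

# Start from a full model with several candidate predictors
full <- glm(mpg ~ wt + hp + disp + qsec, data = mtcars, family = gaussian())

# Backward elimination: drop variables one at a time while AIC improves
# (note the argument is "backward", not "backwards")
reduced <- stepAIC(full, direction = "backward", trace = FALSE)
formula(reduced)  # only the predictors that survive elimination remain
```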
#236




Quote:
Decision Trees
Pros:
- can handle linear and nonlinear relationships
- robust to correlated features (no need for PCA), feature distributions (no need for centering or scaling), and missing values
- simple to understand
- easy to run
Cons:
- poor accuracy
- prone to overfitting

Random Forests
Pros:
- almost all the pros of decision trees
- much more accurate
- less prone to overfitting (the averaging of many models reduces variance)
Cons:
- hard to interpret
- longer to run
- slight increase in bias

Gradient Boosting Machines
Pros:
- almost all the pros of decision trees
- MUCH more accurate
- reduces bias
Cons:
- hard to interpret
- longer to run
- prone to overfitting
- sensitive to parameter settings ("hunts" for noise if not tuned properly)
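To make the single-tree case concrete, here is a minimal sketch with rpart (ships with R) on the built-in iris data; the dataset choice is mine, purely for illustration:

```{r}
library(rpart)

# Fit a classification tree: no centering, scaling, or dummy coding needed
tree.fit <- rpart(Species ~ ., data = iris, method = "class")

# Training accuracy -- optimistic, which is exactly the overfitting concern
pred <- predict(tree.fit, iris, type = "class")
mean(pred == iris$Species)

# printcp(tree.fit) shows the complexity table used to prune the tree back
```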
#237




Hi Guys,
I was looking at the hospital readmission sample project and came across Task 7, where stepwise AIC selection had to be performed. However, prior to running stepAIC, there were several steps regarding releveling and binarization of the factor variables. I have been struggling to understand the concept behind this. My questions are:

1. What is/are base level(s)? It says the base levels are Race.White, DRG.Med.C, and RaceGender.White.F.
2. Why do we have to separate each factor variable into individual indicator variables (such as Race.Hispanic, Race.Black, DRG.Other.Surg, etc.)?
3. Why, when removing the original factor variables, is the Gender variable retained while DRG, Race, and RaceGender are removed?
4. Finally, after doing all this work, why did the solution decide to again remove all variables associated with Male?

I really need help to grasp this concept. Please, someone help me! Thank you...

Last edited by Inactuary; 05-24-2019 at 04:04 AM..
#238




Quote:
1. The base level refers to the level of the factor with the most observations. To binarize Race, we create 4 indicator variables, one for each race. The problem is that these 4 variables are perfectly correlated; their sum always equals 1. To 'trick' our model and get around this issue of multicollinearity, we remove the base level. We interpret its meaning as: "if the other 3 indicators are 0, then the observation is the base level."

2. When doing variable selection with backward stepAIC, it calculates the AIC of the model with all variables, then the AIC of the model with all variables minus 1 (repeated for each different variable), then decides whether it should remove a variable and which one. If Race were one variable with 4 levels, this process could only ask "is it significant to include the Race variable with all the races?", whereas if we split it up, it can instead ask "is it significant to include a distinction for the White race specifically, or are we fine with just Black, Hispanic, and Other?" (an example of being able to remove just 1 level of the variable and not the whole thing).

3. The factor variables were split into binary indicators, so if we have Race.White, Race.Black, Race.Hispanic, and Race.Other, then we don't also need the original Race. However, Gender is already binary. It doesn't matter whether we call the levels M/F or 0/1. Therefore, there's nothing to binarize/add/remove; it's fine as-is.

4. I don't quite remember the part you're referring to; I'd have to look at it again. But maybe the answers to 1-3 help with this question?
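Point 1 can be seen directly with base R's model.matrix, which does the binarization and drops the base level for you. The toy Race factor below is illustrative of the sample project, not its actual data:

```{r}
# A small factor with White as the most common level
dummy.dat <- data.frame(
  Race = factor(c("White", "White", "Black", "Hispanic", "Other"))
)

# Make the most common level the base (reference) level
dummy.dat$Race <- relevel(dummy.dat$Race, ref = "White")

# Build the design matrix: one indicator per non-base level
mm <- model.matrix(~ Race, data = dummy.dat)
colnames(mm)  # intercept plus RaceBlack, RaceHispanic, RaceOther -- no RaceWhite column
```

A row of all-zero indicators (plus the intercept) is how the model encodes the base level, which is why no RaceWhite column is needed.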
#239




Here is great explanation of GLMs and link functions if anyone wants to take a look
https://www.youtube.com/watch?v=Xix97pw0xY
__________________

#240




Quote:
Coincidentally, I also picked up Applied Predictive Modeling by Kuhn and Johnson. It almost seems that text should be on the syllabus, as it really applies to what the exam is testing for. 

