Actuarial Outpost > Exam PA: Predictive Analytics
  #61  
Old 04-15-2020, 06:24 PM
Sader
Member
SOA
 
Join Date: Feb 2019
Posts: 36
June 19 Exam Task 3 PCA

In Task 2 of the June 19 exam, I combined various factor levels of the data set (including levels of Rd_Conditions, Light, and Weather used in Task 3). However, when doing Task 3, the PCA categories don't use the new levels I created; they use all the old levels. At the end of Task 2 I assigned dat <- dat2 so that my new levels would be used going forward. Is anyone else running into this issue? Is there a reason for this, or is it supposed to happen? I thought the old levels were renamed, so I didn't think this would be possible.
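For concreteness, here is a minimal sketch of the workflow I mean (the level names, the binarization approach, and the PCA chunk's variable names are placeholders, and I'm assuming the three variables are factors). The point is that any object built from dat before the reassignment still carries the old levels and has to be rebuilt afterwards:

# Task 2: combine sparse factor levels (placeholder level names)
dat2 <- dat
levels(dat2$Light)[levels(dat2$Light) %in% c("DAWN", "DUSK")] <- "DAWN_DUSK"
dat <- dat2  # later chunks should now see the combined levels

# Task 3: the PCA only picks up the new levels if its input matrix is
# built from the updated dat. A binarized matrix created *before* the
# reassignment still encodes the old levels and must be remade:
vars.pca <- c("Rd_Conditions", "Light", "Weather")
dat.bin <- model.matrix(~ . - 1, data = dat[, vars.pca])  # re-binarize
pca <- prcomp(dat.bin, center = TRUE, scale. = TRUE)
summary(pca)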
  #62  
Old 04-18-2020, 11:03 AM
Sader
Member
SOA
 
Join Date: Feb 2019
Posts: 36
Backward v Forward Selection stepAIC()

On page 206 of the manual, you comment on when to use forward vs. backward selection. You say, "With the selection criterion held fixed, forward selection is more likely to produce a final model with fewer features and better aligns with the goal of identifying key factors."

So forward selection is like BIC in that it might select fewer predictors? I am trying to better understand how they differ so I can better justify choosing forward vs. backward selection on the exam. Justifying the choice feels hard because they do the same thing in opposite directions: one starts with no features and adds them, while the other starts with all features and removes them. I feel like the only way to justify which is better is to run both and compare the models they produce. Or is there another way to justify the choice?

I ask this because on the June 19 PA exam, Task 6 says to decide on forward or backward selection based on the business problem, not based on which one produces a better-fitting model.
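For reference, here is a rough sketch of how the two directions are set up with MASS::stepAIC() (the model formulas and names are placeholders). Note that the forward run has to start from the null model and be given a scope, and passing k = log(nrow(dat)) swaps the criterion from AIC to BIC:

library(MASS)

# Placeholder null and full models for the crash data
glm.null <- glm(Crash_Score ~ 1, data = dat, family = gaussian)
glm.full <- glm(Crash_Score ~ Rd_Class + Rd_Feature + Light + Weather,
                data = dat, family = gaussian)

# Backward: start from the full model and drop features one at a time
glm.back <- stepAIC(glm.full, direction = "backward")

# Forward: start from the null model and add features; with the same
# criterion this tends to stop at a smaller model
glm.fwd <- stepAIC(glm.null,
                   scope = list(lower = glm.null, upper = glm.full),
                   direction = "forward",
                   k = log(nrow(dat)))  # k = log(n) gives BIC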
  #63  
Old 04-18-2020, 06:43 PM
mnm4156
Member
SOA
 
Join Date: Oct 2015
Posts: 77

Question on determining the AUC for random forest models: in chunk 15 of Section 5.3, the predict() function is used with type = "prob" and fed into the roc() function as follows:

pred.rf.prob <- predict(rf, newdata = test, type = "prob")[, 2]
roc(test$class, pred.rf.prob, auc = TRUE)

The PA learning modules take a slightly different approach to determining the AUC for a random forest, and I just want to understand why the two methods are not equivalent (I am getting different AUCs) and which one is more appropriate to use. The R code using the approach from the PA learning modules would be:

pred.rf.prob.modules <- predict(rf, newdata = test)
roc(as.numeric(test$class), as.numeric(pred.rf.prob.modules), auc = TRUE)
  #64  
Old 04-19-2020, 04:29 AM
ambroselo
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308

Quote:
Originally Posted by Sader View Post
So forward selection is like BIC in that it might select fewer predictors?
To a certain extent, yes.
  #65  
Old 04-19-2020, 06:27 AM
ambroselo
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308

Quote:
Originally Posted by mnm4156 View Post
Question on determining the AUC for random forest models ... which one is more appropriate to use.
Are you referring to CHUNK 39 of PA Module 7? Please read footnote xiii on page 334 of the manual. The code in the module is producing class predictions, but we need probability predictions.
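In code, the distinction looks like this (assuming rf is a fitted randomForest classifier and test$class is a two-level factor):

library(pROC)

# Probability predictions: column 2 of the matrix returned by
# type = "prob" is the predicted probability of the second level
pred.rf.prob <- predict(rf, newdata = test, type = "prob")[, 2]
roc(test$class, pred.rf.prob, auc = TRUE)

# Class predictions: without type = "prob", predict() returns hard
# class labels, so as.numeric() just recodes a 0/1 classification and
# the resulting AUC no longer reflects ranking across all thresholds
pred.rf.class <- predict(rf, newdata = test)
roc(as.numeric(test$class), as.numeric(pred.rf.class), auc = TRUE)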
  #66  
Old 04-19-2020, 09:45 AM
Sader
Member
SOA
 
Join Date: Feb 2019
Posts: 36
Interpreting log(x)

When working through the June 19 exam, I picked the Gaussian model with the identity link, using log(Crash_Score) as Crash_Score_log below. When fitting my final model to the data, I got the following output:

glm(formula = Crash_Score_log ~ Rd_Class + Rd_Feature + Light,
    family = gaussian(link = "identity"), data = datlog)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-6.3105  -0.4034   0.0642   0.4728   2.3232

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             1.705350   0.008111 210.244  < 2e-16 ***
Rd_ClassOTHER          -0.098125   0.010158  -9.660  < 2e-16 ***
Rd_FeatureDRIVEWAY      0.041152   0.016146   2.549   0.0108 *
Rd_FeatureINTERCECTION  0.091280   0.010658   8.565  < 2e-16 ***
Rd_FeatureRAMP_O       -0.048800   0.022520  -2.167   0.0303 *
LightDARK-LIT          -0.086658   0.013172  -6.579 4.84e-11 ***
LightDARK-NOT-LIT      -0.139265   0.026550  -5.245 1.57e-07 ***
LightDAWN              -0.056061   0.058213  -0.963   0.3355
LightDUSK              -0.018240   0.028584  -0.638   0.5234
LightOTHER             -0.215928   0.051388  -4.202 2.66e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


My base Rd_Class is HWY. So, when interpreting the -0.098125, which is correct?

1. Crash_Score_log is expected to be 0.098 lower when the accident doesn't occur on a highway.

2. Crash_Score is expected to be multiplied by exp(-0.098) (roughly a 9.3% reduction) when the accident doesn't occur on a highway.

  #67  
Old 04-19-2020, 09:48 AM
mnm4156
Member
SOA
 
Join Date: Oct 2015
Posts: 77

Quote:
Originally Posted by ambroselo View Post
Are you referring to CHUNK 39 of PA Module 7? Please read footnote xiii on page 334 of the manual. The code in the module is producing class predictions, but we need probability predictions.
Yes, that chunk, and also chunks 11, 12, and 18 of the PA Module 7.3 Rmd; pretty much every time the AUC is calculated.
  #68  
Old 04-19-2020, 10:17 AM
ambroselo
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308

Quote:
Originally Posted by mnm4156 View Post
Yes, that chunk, and also chunks 11, 12, and 18 of the PA Module 7.3 Rmd; pretty much every time the AUC is calculated.
If I am not mistaken, they all suffer from the same problem, i.e., feeding class predictions (converted into numbers) instead of probability predictions into the roc() function. Fortunately, the code for calculating the AUC of classification trees in the Dec 2019 PA exam is correct.
  #69  
Old 04-21-2020, 12:09 PM
mnm4156
Member
SOA
 
Join Date: Oct 2015
Posts: 77

Quote:
Originally Posted by ambroselo View Post
"The default behavior of tree" (tree is another package for fitting decision trees; see Chapter 8 of ISLR) is to use the class with the highest predicted probability (the so-called "majority class") as the predicted class. You can install the tree package and type the command ?predict.tree. In the case of a binary target variable, the predicted class is the one whose predicted probability is higher than 50%.
Hi, just to confirm: there is no way to change the threshold/cutoff for determining class assignments for decision trees the way we did for the logistic regression model (e.g., chunk 12 of Section 4.3)? This is because trees predict class assignments directly (based on a majority vote or a 0.5 threshold), while GLMs do not; they only produce a prediction of the probability that the event of interest will occur, so for GLMs we need to choose the threshold ourselves. Is this correct? Thanks so much.
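For reference, the cutoff mechanics from the GLM case (chunk 12 of Section 4.3) can be applied to any vector of probability predictions, including a tree's. A minimal sketch, assuming an rpart classification tree named tree.fit and a two-level target coded "0"/"1" (placeholder names):

library(rpart)

# Probability predictions from a classification tree: the fraction of
# the positive class in each terminal node
pred.tree.prob <- predict(tree.fit, newdata = test, type = "prob")[, 2]

# Apply a custom cutoff of 0.3 instead of the default 0.5 majority vote
pred.tree.class <- factor(ifelse(pred.tree.prob > 0.3, "1", "0"),
                          levels = levels(test$class))

# Confusion matrix at the chosen cutoff
table(predicted = pred.tree.class, actual = test$class)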


Also, I would appreciate your advice on my study plan for the next two months. I've gone through ACTEX twice (it was great), then skimmed through the modules, took some notes, and made index cards of some key concepts/things to remember. I now plan to practice with the sample projects and past exams. Is there any order in which I should attempt these, in terms of level of difficulty? Also, I plan to take the practice exams under exam conditions (limiting myself to 5 hours and writing the whole assignment out). Do you suggest I do the same with the sample projects, or just take my time and digest the material? I've outlined below the resources I understand to be available:
• Student success sample project
• Hospital Readmission sample project
• Past exams - December 2018, June 2019 & December 2019 (do you recommend doing both days of each of these? I probably will)
• I also understand you’re releasing two practice exams within the next few weeks?

  #70  
Old 04-21-2020, 08:33 PM
ambroselo
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308

Quote:
Originally Posted by Sader View Post
My base Rd_Class is HWY. So, when interpreting the -0.098125, which is correct?

1. Crash_Score_log is expected to be 0.098 lower when the accident doesn't occur on a highway.

2. Crash_Score is expected to be multiplied by exp(-0.098) when the accident doesn't occur on a highway.
Both statements are fine, but strictly speaking, the second statement is only approximately correct. The model equation is for E[log(Crash_Score)], and you want to remove the log by exponentiation. However, by Jensen's inequality, exp(E[log(Crash_Score)]) is generally not the same as E[exp(log(Crash_Score))] = E[Crash_Score].
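A quick numerical illustration (the residual standard deviation below is hypothetical, just to show where the correction enters):

exp(-0.098125)  # ~0.9066: Rd_ClassOTHER multiplies the fitted value on
                # the original scale by about 0.91, a ~9.3% reduction

# Why statement 2 is only approximate: if log(Crash_Score) is normal
# with mean mu and variance sigma^2, then E[Crash_Score] equals
# exp(mu + sigma^2/2), not exp(mu). The exp(beta) factor is exact for
# the median, and for the mean only if sigma^2 is the same across levels.
sigma <- 0.8                 # hypothetical residual SD
exp(1.705350 + sigma^2 / 2)  # mean of Crash_Score at the base levels
exp(1.705350)                # median of Crash_Score at the base levels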