Actuarial Outpost > Exam PA: Predictive Analytics
  #241  
Old 05-26-2019, 05:37 AM
Inactuary Inactuary is offline
SOA
 
Join Date: May 2019
Posts: 2
Default

Quote:
Originally Posted by DjPim View Post
There's a thread for this project that might prove useful. Most of these are answered there, but I'll try to summarize:

1. The base level is the level of the factor with the most observations. To binarize 'Race', we create 4 indicator variables, one for each race. The problem is that these 4 indicators are perfectly collinear: they always sum to 1, which duplicates the model's intercept. To get around this multicollinearity, we remove the base level and read the remaining indicators as 'if the other 3 indicators are all 0, the observation is the base level'. (See the R sketch after this list.)

2. When doing variable selection with backward stepAIC, it calculates the AIC of the model with all variables, then the AIC of each model with one variable removed, and then decides whether to drop a variable and which one. If Race were a single variable with 4 levels, this process could only ask 'is it worth including the Race variable with all its levels?', whereas if we split it into indicators, it can ask 'is it worth keeping a distinction for White specifically, or are we fine with just Black, Hispanic, and Other?' In other words, it can remove a single level of the variable rather than the whole thing.

3. The factor variables were split into binary indicators, so if we have RaceWhite, RaceBlack, RaceHispanic, and RaceOther, we don't also need the original Race. Gender, however, is already binary; it doesn't matter whether we call the levels M/F or 0/1, so there's nothing to binarize, add, or remove, and it's fine as-is.

4. I don't quite remember the part you're referring to, I'd have to look at it again, but maybe the answers to 1-3 help with this question?
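To make points 1 and 2 concrete, here is a rough R sketch, assuming a hypothetical data frame df with a 4-level factor Race and a binary target y (this is not the module's actual code):

```{r}
library(MASS)  # for stepAIC

# With default treatment contrasts, model.matrix() creates an indicator for
# every level of Race except the base (reference) level, which avoids perfect
# collinearity with the intercept.
race_dummies <- model.matrix(~ Race, data = df)[, -1]  # drop the intercept column
df_bin <- cbind(df[, setdiff(names(df), "Race")], race_dummies)

# Backward stepAIC can now drop individual race indicators,
# not just the Race factor as a whole.
full_model <- glm(y ~ ., data = df_bin, family = binomial())
reduced_model <- stepAIC(full_model, direction = "backward")
```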
Thank you! It makes a lot of sense!
Reply With Quote
  #242  
Old 05-26-2019, 01:07 PM
ActuariallyDecentAtBest ActuariallyDecentAtBest is offline
Member
SOA
 
Join Date: Dec 2016
Posts: 383
Default

Haven't gotten started on this sample project yet, still haven't even finished the freaking material.

How hard is the sample exam? Is it very doable?? Seems like we have to memorize more code...
Reply With Quote
  #243  
Old 05-27-2019, 01:51 AM
noone noone is offline
Member
SOA
 
Join Date: Feb 2017
Posts: 138
Default Partial Dependence Plots

Can someone please provide a layman's description of Partial Dependence Plots: what are they good for, and how do you interpret them? Thanks!
Reply With Quote
  #244  
Old 05-27-2019, 10:00 AM
LyActuary LyActuary is online now
Member
SOA
 
Join Date: Sep 2017
Location: Rochester, NY
College: University of Rochester
Posts: 102
Default

Quote:
Originally Posted by jdman929 View Post
Has anyone else gotten stuck on the Student Success Practice Exam Decision Tree portion? I'm trying to run the code provided and I get errors. The code in question is:

library(rpart)
library(rpart.plot)
set.seed(123)
excluded_variables <- c("G3") # List excluded variables

dt <- rpart(G3.Pass.Flag ~ .,
            data = Train.DS[, !(names(Full.DS) %in% excluded_variables)],
            control = rpart.control(minbucket = 5, cp = .001, maxdepth = 20),
            parms = list(split = "gini"))

rpart.plot(dt)

Error in `[.data.frame`(Train.DS, , !(names(Full.DS) %in% excluded_variables)) : undefined columns selected


Does anyone know what's going on?
I think you need to change !(names(Full.DS) %in% excluded_variables) to !(names(Train.DS) %in% excluded_variables). The logical index has to be built from the columns of the data frame you are actually subsetting (Train.DS); if Full.DS has a different set of columns, the index doesn't line up and R complains about undefined columns.
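For reference, the corrected call (same code as quoted above, with names(Full.DS) swapped for names(Train.DS)) would be:

```{r}
dt <- rpart(G3.Pass.Flag ~ .,
            data = Train.DS[, !(names(Train.DS) %in% excluded_variables)],
            control = rpart.control(minbucket = 5, cp = .001, maxdepth = 20),
            parms = list(split = "gini"))

rpart.plot(dt)
```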
Reply With Quote
  #245  
Old 05-27-2019, 10:53 AM
Josh Peck Josh Peck is offline
Member
SOA
 
Join Date: Dec 2016
College: Towson University
Posts: 99
Default

Quote:
Originally Posted by noone View Post
Can someone please provide a layman's description of Partial Dependence Plots: what are they good for, and how do you interpret them? Thanks!
Roughly: take the variable in question and break its range into a bunch of buckets (grid values). For each bucket value, set every observation's value of that predictor to that value, predict the target for all observations, and average the predictions.

The plot then shows how the model's predicted target changes, on average, across the different values of that predictor. (There's a rough sketch below.)

Because these plots take a lot of time to run, I think they would crash the Prometric computer and will not be tested directly. However, you should still understand the idea so you can explain it if they ask how you could benefit from running it if you had more time.
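For anyone who wants to see the mechanics, here is a rough manual sketch, assuming a hypothetical fitted GLM fit, a data frame dat, and a numeric predictor age (the modules use their own plotting code):

```{r}
# Manual partial dependence: for each grid value of the predictor,
# hold every observation at that value, predict, and average the predictions.
pd_grid <- seq(min(dat$age), max(dat$age), length.out = 20)

pd_avg <- sapply(pd_grid, function(v) {
  tmp <- dat
  tmp$age <- v                                           # fix the predictor at v for all rows
  mean(predict(fit, newdata = tmp, type = "response"))   # average prediction (y-hat)
})

plot(pd_grid, pd_avg, type = "l",
     xlab = "age", ylab = "Average predicted target (y-hat)")
```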
__________________
P FM MFE C PA

Last edited by Josh Peck; 05-27-2019 at 11:00 AM..
Reply With Quote
  #246  
Old 05-27-2019, 04:58 PM
noone noone is offline
Member
SOA
 
Join Date: Feb 2017
Posts: 138
Default

           Reference
Prediction  Bad Good
      Bad    79   35
      Good  221  665

The above is the confusion matrix from Rmd 7.3, chunk 21.

According to the results, sensitivity is 0.2633 and specificity is 0.95. Sensitivity is the proportion of true positive predictions among all positive cases, so taking 'Good' as the positive class that would be 665/(665+35) = 0.95, and specificity would be TN/(TN+FP) = 79/(79+221) = 0.2633. It looks like the code switched them up, but I don't see how that could happen. Any thoughts here?
Reply With Quote
  #247  
Old 05-28-2019, 10:06 AM
rstein rstein is offline
SOA
 
Join Date: Jan 2019
Posts: 12
Default

In the exam solution it says that GLMs "cannot capture non-linear relationships", which I'm confusing with the fact that they can model non-normal distributions. I know I must be mixing up two things; can someone explain the difference, or what it means to "not capture non-linear relationships"?

Thanks!
Reply With Quote
  #248  
Old 05-28-2019, 10:55 AM
TranceBrah's Avatar
TranceBrah TranceBrah is offline
Member
SOA
 
Join Date: Mar 2014
Location: Best Coast
Posts: 238
Default

Quote:
Originally Posted by rstein View Post
In the exam solution it says that GLMs "cannot capture non-linear relationships", which I'm confusing with the fact that they can model non-normal distributions. I know I must be mixing up two things; can someone explain the difference, or what it means to "not capture non-linear relationships"?

Thanks!
Normal linear regression models the expected value of a continuous response Y as a linear function of a predictor X: E(Yi) = β0 + β1xi.

A GLM does NOT assume a linear relationship between the response itself and the independent variables, but it does assume a linear relationship between the link-transformed mean of the response and the explanatory variables; e.g., for binary logistic regression, logit(p) = log(p / (1 - p)) = β0 + β1x. The "cannot capture non-linear relationships" comment refers to that linear predictor in x (unless you add transformed terms yourself); modeling a non-normal distribution is a separate feature, describing how the response varies around its mean.
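A quick R illustration of the same point, assuming a made-up data frame dat with a binary target y and a predictor x:

```{r}
# Logistic regression: logit(p) = b0 + b1*x
fit <- glm(y ~ x, data = dat, family = binomial(link = "logit"))

head(predict(fit, type = "link"))      # linear in x (the logit scale)
head(predict(fit, type = "response"))  # S-shaped probabilities between 0 and 1
```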
Reply With Quote
  #249  
Old 05-28-2019, 11:17 AM
Josh Peck Josh Peck is offline
Member
SOA
 
Join Date: Dec 2016
College: Towson University
Posts: 99
Default

Quote:
Originally Posted by noone View Post
           Reference
Prediction  Bad Good
      Bad    79   35
      Good  221  665

The above is the confusion matrix from Rmd 7.3, chunk 21.

According to the results, sensitivity is 0.2633 and specificity is 0.95. Sensitivity is the proportion of true positive predictions among all positive cases, so taking 'Good' as the positive class that would be 665/(665+35) = 0.95, and specificity would be TN/(TN+FP) = 79/(79+221) = 0.2633. It looks like the code switched them up, but I don't see how that could happen. Any thoughts here?
Bad is mapped to 1 (TRUE)

You can see this if you run the following chunk
```{r}
str(credit$Credit)
str(as.factor(credit$Credit))
```

It makes no difference whether the output labels something sensitivity or specificity as long as you know how to interpret it, which it seems you do. It simply depends on which factor level is treated as the positive (TRUE) class.
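If you would rather have "Good" treated as the positive class, caret's confusionMatrix() takes a positive argument; a small sketch, assuming factor vectors pred and actual with levels Bad/Good:

```{r}
library(caret)

# Make "Good" the positive (event) class; the reported sensitivity and
# specificity then swap relative to the default (first level, "Bad").
confusionMatrix(data = pred, reference = actual, positive = "Good")
```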
__________________
P FM MFE C PA
Reply With Quote
  #250  
Old 05-29-2019, 02:15 AM
noone noone is offline
Member
SOA
 
Join Date: Feb 2017
Posts: 138
Default

Quote:
Originally Posted by Josh Peck View Post
It would be like if you took the variable in question and
broke it up into a bunch of buckets
Then predicted the target variable for each of those buckets

Thus, it shows how the target variable is predicted on average for all the different values of that predictor.

Because these plots take a lot of time to run, I think they would crash the prometric computer and will not be tested on.
However, you should still understand the idea so you can explain it if they ask how you could benefit from running it if you had more time.
Thanks. So the examples in module 7 use issage (the predictor) on the x-axis and yhat on the y-axis. What does y-hat mean here? The target variable is claim count and is either C (actual_cnt = 0) or N (actual_cnt >= 1).
Reply With Quote