Actuarial Outpost
 
Exam PA: Predictive Analytics

#221 | 12-04-2019, 12:05 PM | jgrant616

Hi Ambrose,
I have a question on decision trees, pages 303-304, where you break down how the rel error column in the cptable is calculated. It seems to me that your method changed slightly from the first split to the second split.

For the first split, you calculated the RSS across the two nodes as 0.548 and then divided it by two (because there are two nodes) to get 0.274. Then, you subtracted that from the root node's rel error of 1 to get the cp value of 0.726.

For the second split (Example 5.2.1), you calculated the RSS across the two nodes as 0.26. However, in this split, instead of dividing by 2 and then subtracting, you subtracted from the prior RSS of 0.548 to get 0.288 and then divided by 2 to get 0.144.

Am I missing something here? Why did the order of the calculation change?

If I calculate it the way you did in the first split, I'd take 0.26/2 = 0.13 and subtract that from the previous rel error of 0.274, which gets me to 0.144 all the same, but I want to be sure this isn't a coincidence.

Thanks for your help.
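
(For anyone following along in R, a minimal sketch, not from the manual, of where this table comes from: rpart reports rel error as each subtree's RSS relative to the root node's RSS. The data set and cp value here are illustrative.)

Code:
library(rpart)
# Fit a small regression tree and print the complexity-parameter table;
# the "rel error" column is each subtree's RSS divided by the root's RSS.
fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova",
             control = rpart.control(cp = 0.001))
printcp(fit)  # columns: CP, nsplit, rel error, xerror, xstd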
__________________
Prelims: VEE | P | FM | MFE | MLC | C | FAP Modules | PA | APC
#222 | 12-04-2019, 05:41 PM | bnut34

Hi Ambrose,

In case study 4.3, a confusion matrix was used on the training data to assess the performance of the full probit model. Then, after feature selection, the final model was validated using ROC/AUC on the test data. Ideally, wouldn't validation also include picking a cutoff and assessing that cutoff, as well as the final model, on the test data via a confusion matrix? Was that not included here because we were not asked to specify a cutoff (as opposed to the Hospital Readmissions case)?

Thank you
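
(For concreteness, a minimal sketch of doing both on the test set; `fit`, `test`, and the target column `y` are hypothetical, and pROC is one package that computes AUC.)

Code:
library(pROC)
# Predicted probabilities from a fitted binary GLM (e.g., probit)
pred_prob <- predict(fit, newdata = test, type = "response")
auc(roc(test$y, pred_prob))              # threshold-free ranking performance
cutoff <- 0.5                            # illustrative cutoff choice
pred_class <- as.integer(pred_prob > cutoff)
table(Predicted = pred_class, Actual = test$y)  # confusion matrix at the cutoff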
#223 | Today, 02:04 AM | KarimZ

I have a silly question: how do you guys write code in R?

Do you type code in the console and then paste it into the R Markdown file? Or is there a way to save console code directly as an .Rmd file?

Or is it a better strategy to edit the code provided in the exam directly in the Markdown file?
__________________
P FM MFE C LTAM PA | VEEs | FAP Interim FAP Final | APC
#224 | Today, 11:47 AM | Life

Quote: Originally Posted by Yossarian
It's only saying that the observations (in this case, claim sizes) can be negative when you use this model (a normal GLM on claim size with a log link).

The predictions won't be negative because of the log link, but the Gaussian distribution can admit a negative observation.

But just because a negative observation can be admitted, it seems very unlikely in practice that your data would contain a negative claim size.

That said, I think Gaussian would also be an unlikely distribution choice for claim size modeling for that reason (you wouldn't have negative claim sizes, so why would one think it a good option?). Bringing us back full circle to the reason for the note from the SOA: if you chose the Gaussian distribution on the exam, you would have to explain why, and part of that explanation would address this issue.

Note: you did ask "Can someone help explain this to me?", so I didn't wait for Ambrose to reply. When he does respond, if any of this is wrong, I apologize in advance and will either edit or delete this post so as not to create confusion.

If I'm understanding this correctly, the fact that a negative observation can be admitted would only matter if you cared about, say, the minimum of the prediction rather than the mean. But when would we ever care about a metric other than the mean/sd of the prediction?
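
(A minimal sketch, on synthetic data, of the point above: the log link keeps the fitted means positive, while the Gaussian error distribution still admits negative observations.)

Code:
set.seed(1)
x <- runif(200)
y <- rnorm(200, mean = exp(1 + 2 * x), sd = 3)  # Gaussian noise: some y < 0
# gaussian(link = "log") needs positive starting means when any y <= 0
fit <- glm(y ~ x, family = gaussian(link = "log"), mustart = pmax(y, 0.1))
range(fitted(fit))  # fitted means are strictly positive (log link)
sum(y < 0)          # yet the observed "claims" include negative values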
#225 | Today, 12:40 PM | windows7forever

Quote: Originally Posted by KarimZ
I have a silly question: how do you guys write code in R?

Do you type code in the console and then paste it into the R Markdown file? Or is there a way to save console code directly as an .Rmd file?

Or is it a better strategy to edit the code provided in the exam directly in the Markdown file?

You have to write code in the Rmd file directly; there's no other application accessible on the exam computer. You get only the given Rmd file, the Excel or CSV data file, and the Word template. Regular R is not available: you write code and generate output in RStudio directly.
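
(For anyone who hasn't used R Markdown before: the code lives in chunks inside the .Rmd file and runs in place in RStudio. A minimal sketch of a chunk; the chunk name is illustrative.)

Code:
```{r explore-data}
# Run with the chunk's green Run button (or Ctrl+Shift+Enter) in RStudio;
# the output appears inline below the chunk.
summary(cars)  # cars is a built-in data frame
```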
#226 | Today, 12:46 PM | windows7forever

Back to the Hospital Readmissions project's comments in your manual's Component 3: what if the goal of the project is to provide key predictors and interpret them, as in the June 2019 exam? How would you then interpret the logit model's coefficients?

Just saying that an increase in a predictor will likely increase or decrease the target variable is not sufficient, since we need to provide the magnitude too.

The logit model is easier to interpret than the probit model, even if the probit model performs slightly better due to a slightly lower AIC.

Thanks.
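
(One common way to get magnitude from a logit model: exponentiate the coefficients to obtain odds ratios. A minimal sketch assuming a fitted logistic GLM named `fit`.)

Code:
# Each exponentiated coefficient is the multiplicative change in the odds
# of the event for a one-unit increase in that predictor.
exp(coef(fit))
# e.g., a value of 1.25 means a one-unit increase multiplies the odds of
# readmission by 1.25, i.e., a 25% increase in the odds.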
#227 | Today, 02:35 PM | bnut34

Looking at the Dec 2018 exam, the first page notes that the target variable is the number of injuries per 2,000 work hours. My first instinct would have been to create a new variable equal to number of injuries / (number of hours / 2,000), get rid of the two individual variables that feed into it (number of hours and number of injuries), and proceed from there (my GLM would have been geared toward a continuous variable, not a Poisson count). Is there any issue with that?

Asked another way: I believe an offset on the number of hours is appropriate if the target variable is the number of injuries. But it's not. I'm not seeing how an offset on the number of hours is appropriate when hours are part of the target variable itself.

Thank you
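
(For comparison, a minimal sketch of the offset formulation; the data frame `df` and the predictor names are hypothetical. With a log link, a count model with offset(log(hours)) is algebraically a model of the injury rate, while keeping the Poisson count structure.)

Code:
# log E[injuries] = X*beta + log(hours)  <=>  E[injuries]/hours = exp(X*beta),
# so the offset models the rate without turning the target into a ratio.
fit <- glm(injuries ~ size + industry + offset(log(hours)),
           data = df, family = poisson(link = "log"))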
#228 | Today, 03:26 PM | PTActuary

Quote: Originally Posted by windows7forever
I have not seen many examples or practice exercises close to exam style that use undersampling and oversampling methods, but that does not mean the exam will not have them in the future.

Thanks.

I was also wondering about applying over/undersampling methods in R. I understand the rationale behind doing so, but I don't really know how to execute the method.
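
(A minimal sketch in base R, assuming a training frame `train` with a binary target `y`; caret's downSample()/upSample() wrap the same idea.)

Code:
set.seed(42)
minority <- train[train$y == 1, ]
majority <- train[train$y == 0, ]
# Undersample: keep a random subset of the majority class of minority size
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
train_balanced <- rbind(minority, majority_down)
# Oversampling is the mirror image: sample the minority class WITH
# replacement up to the majority-class size.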
#229 | Today, 05:24 PM | bnut34

Hello. For clustering, whether k-means or hierarchical, am I right that it would be rare to have categorical variables involved? Since clusters are determined by distance, and binarized categorical variables can only take on 0 or 1 numerically, it doesn't seem like a process that would help much.
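
(One workaround worth knowing: Gower distance handles mixed numeric/categorical data directly, so clustering is not restricted to 0/1 dummy distances. A minimal sketch using the cluster package and the built-in iris data.)

Code:
library(cluster)
d <- daisy(iris, metric = "gower")   # mixes numeric columns and a factor
hc <- hclust(d, method = "average")
table(cutree(hc, k = 3), iris$Species)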

In a similar vein, if I take a categorical variable with more than 2 levels, binarize it, and use the dummies in a PCA, do I need to include either all or none of the binarized variables for that factor in any feature I create? Or can I pick and choose (say only a couple of the levels have a large loading)? I would think it would get hairy when deleting variables afterwards.
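
(A minimal sketch of the binarize-then-PCA step on built-in data: model.matrix() turns each factor level into its own 0/1 column, and each dummy then gets its own loading in every principal component.)

Code:
X <- model.matrix(~ . - 1, data = iris)  # Species -> three 0/1 dummy columns
pca <- prcomp(X, scale. = TRUE)
round(pca$rotation[, 1:2], 2)            # loadings, one row per dummy level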
#230 | Today, 05:42 PM | SunnyDale

During the data exploration process, when you encounter factor variables with a large number of levels, for example the "PRIMARY" and "US_STATE" variables in the Dec 2018 exam, is it normal practice to just remove those variables even though they may contain useful information for prediction?
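
(Removal is one option, but a common alternative is to group sparse levels before modelling. A minimal sketch assuming a data frame `df` with a high-cardinality factor `US_STATE`; the frequency threshold is illustrative.)

Code:
counts <- table(df$US_STATE)
rare <- names(counts)[counts < 30]   # illustrative frequency threshold
state <- as.character(df$US_STATE)
state[state %in% rare] <- "OTHER"    # collapse sparse levels into one
df$US_STATE <- factor(state)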