Actuarial Outpost ACTEX PA Manual: First Component


#221
12-04-2019, 12:05 PM
 jgrant616 SOA Join Date: Sep 2018 Location: Boston College: Bryant University Favorite beer: A nice crisp lager Posts: 2

Hi Ambrose,
I have a question on Decision trees, page 303-304. This is where you break down how the rel error column in the cptable is calculated. It seems to me that your method changed slightly from the first split to the second split.

For the first split, you calculated the RSS across the two nodes as .548 and then divided it by two (because there are two nodes) to get .2740. Then you subtracted that from the root node's rel error of 1 to get the cp value of .726.

For the second split (Example 5.2.1), you calculated the RSS across the two nodes as .26. However, in this split, instead of dividing by 2 and then subtracting, you subtracted it from the prior RSS of .548 to get .288 and then divided by 2 to get .144.

Am I missing something here? Why did the order of the calculation change?

If I calculate it the way you did in the first split, I'd take .26/2 = .13 and subtract that from the previous rel error of .2740, which gets me to .144 all the same, but I want to be sure this isn't a coincidence.
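For what it's worth, the two orders always agree, since (a - b)/2 = a/2 - b/2. A quick check in R with the numbers quoted above:

```r
rss_split1 <- 0.548  # total RSS across the two nodes after the first split
rss_split2 <- 0.26   # total RSS across the two nodes after the second split

order_a <- (rss_split1 - rss_split2) / 2    # subtract first, then divide
order_b <- rss_split1 / 2 - rss_split2 / 2  # divide first, then subtract
all.equal(order_a, order_b)                 # TRUE: division distributes over subtraction
```

So it shouldn't be a coincidence: both orders are the same calculation written two ways.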

__________________
Prelims: VEE | P | FM | MFE | MLC | C | FAP Modules | PA | APC
#222
12-04-2019, 05:41 PM
 bnut34 Join Date: Feb 2010 College: University of Wisconsin - Stevens Point Posts: 17

Hi Ambrose,

In the case study 4.3, a confusion matrix was used on the training data to assess the performance of the full probit model. Then after feature selection, the final model was validated using ROC / AUC on the test data. Ideally, wouldn't validation also include picking a cutoff and assessing that cutoff, as well as the final model, on the test data via a confusion matrix? Was that not included here because we were not asked to specify a cutoff (as opposed to the Hospital Readmissions case)?
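As a sketch of what that extra validation step could look like (simulated data and a placeholder 0.5 cutoff, not the manual's case study), one would score the test set with the final model, apply the cutoff, and tabulate a confusion matrix:

```r
set.seed(42)
n <- 200
x <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(x)))
train <- dat[1:150, ]
test  <- dat[151:200, ]

# Fit a probit model on the training data
fit <- glm(y ~ x, data = train, family = binomial(link = "probit"))

# Validation step: score the test set, apply a cutoff, tabulate a confusion matrix
pred_prob  <- predict(fit, newdata = test, type = "response")
cutoff     <- 0.5  # placeholder; in practice chosen from the ROC analysis
pred_class <- as.integer(pred_prob > cutoff)
conf_mat   <- table(Predicted = pred_class, Actual = test$y)
conf_mat
```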

Thank you
#223
Today, 02:04 AM
 KarimZ Member SOA Join Date: May 2015 Location: Pakistan College: Graduate, Bsc Accounting and Finance Posts: 201

Have a silly question. How do you guys write code in R?

Do you type code in the console and then paste it into the R Markdown file? Or is there a way to save console code directly as an Rmd file?

Or is it a better strategy to directly edit the code provided in the exam's Rmd file?
__________________
P FM MFE C LTAM PA

VEEs

FAP Interim FAP Final

APC

#224
Today, 11:47 AM
 Life Non-Actuary Join Date: Sep 2016 Posts: 27

Quote:
 Originally Posted by Yossarian It's only saying the observations (in this case, claim size), can be negative when you use this model (normal GLM on claim size with log link). The predictions won't be negative because of the log link, but the Gaussian distribution can admit a negative observation. But just because a negative observation can be admitted, it seems very unlikely in practice that your data would have a negative observation of claim size. That said, I think that Gaussian would also be an unlikely distribution choice for claims size modeling for that reason (you wouldn't have negative claim size, so why would one think that would be a good option?). Bringing us back full circle, the reason for the note from the SOA, if you chose Gaussian distribution in the exam, you would have to explain why you chose that, and part of that would be to address this issue. Note: you did ask "Can someone help explain this to me?" ... so I didn't wait for Ambrose to reply, but when he does respond, if any of this is wrong, I apologize in advance, and I will either edit or delete it so as not to create future confusion.
If I'm understanding this correctly, the fact that a negative observation can be admitted would only matter if you cared about, say, the minimum of the prediction rather than the mean. But when would we ever care about a metric other than the mean/sd of the prediction?
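To see both halves of the point concretely, here's a small simulation (made-up data, not from the manual): with a normal GLM and a log link, every fitted mean is exp(eta) and hence positive, yet the fitted normal density still assigns positive probability to negative observations.

```r
set.seed(1)
x <- runif(300)
claims <- exp(1 + 2 * x) * exp(rnorm(300, sd = 0.3))  # strictly positive claim sizes

# Normal GLM on claim size with a log link
fit   <- glm(claims ~ x, family = gaussian(link = "log"))
mu    <- predict(fit, type = "response")
sigma <- sqrt(summary(fit)$dispersion)

min(mu)                               # every fitted mean is exp(eta) > 0
pnorm(0, mean = min(mu), sd = sigma)  # yet the normal density puts mass below 0
```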
#225
Today, 12:40 PM
 windows7forever Member SOA Join Date: Apr 2016 Posts: 346

Quote:
 Originally Posted by KarimZ Have a silly question. How do you guys write code in R? Do you type codes in console and then paste them in R markdown file? Or is there a way to save code in console directly as an rmd file? Or is it a better strategy to directly edit code provided in exam in the markdown file?
You have to write code directly in the Rmd file. No other application is accessible on the exam computer: only the given Rmd file, the Excel or csv file, and the Word template. Regular R is not available. You write code and generate output in RStudio directly.
#226
Today, 12:46 PM
 windows7forever Member SOA Join Date: Apr 2016 Posts: 346

Back to the Hospital Readmissions project's comments in your manual's Component 3: what if the goal of the project is to identify key predictors and interpret them, as in the June 2019 exam? How would you interpret the logit model's coefficients then?

Just saying that an increase in a predictor will likely increase or decrease the target variable is not sufficient, since we also need to provide the magnitude.

The logit model is easier to interpret than the probit model, even if the probit model performs slightly better due to a slightly lower AIC.
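One common way to supply the magnitude with a logit model is to exponentiate the coefficients into odds ratios. A minimal sketch on simulated data (variable names like `readmit` and `age` are made up, not the Hospital Readmissions data):

```r
set.seed(7)
n   <- 400
age <- rnorm(n, 50, 10)
readmit <- rbinom(n, 1, plogis(-5 + 0.1 * age))  # simulated binary target

fit <- glm(readmit ~ age, family = binomial(link = "logit"))

exp(coef(fit))             # odds ratios: multiplicative change in the odds
exp(confint.default(fit))  # Wald confidence intervals on the same scale
```

Each value in `exp(coef(fit))` is the multiplicative change in the odds of the event for a one-unit increase in that predictor, which gives a magnitude rather than just a direction.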

Thanks.
#227
Today, 02:35 PM
 bnut34 Join Date: Feb 2010 College: University of Wisconsin - Stevens Point Posts: 17

Looking at the Dec 2018 exam, the first page notes that the target variable is number of injuries per 2,000 work hours. My first instinct would've been to just create a new variable that is number of injuries / (number of hours / 2000), and to get rid of the two individual variables that feed into that (number of hours and number of injuries). Then proceed from there (my GLM would've been geared towards a continuous variable, not a Poisson). Is there any issue with that?

Asked another way, I believe that an offset on number of hours is appropriate if the target variable is number of injuries. But it's not. I'm not seeing how an offset on number of hours is appropriate when that is part of the target variable.
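On the offset question, one useful fact is that the two framings coincide for a Poisson GLM: modeling the count with `offset(log(hours / 2000))` gives the same coefficients as modeling the rate directly with the exposure supplied as weights. A sketch on simulated data (all variable names are made up; R may warn about non-integer responses in the rate fit, but the estimates still match):

```r
set.seed(3)
n     <- 300
hours <- runif(n, 1000, 5000)
x     <- rnorm(n)
injuries <- rpois(n, lambda = (hours / 2000) * exp(-1 + 0.3 * x))

# Framing 1: count target, exposure enters as an offset
fit_count <- glm(injuries ~ x + offset(log(hours / 2000)), family = poisson)

# Framing 2: rate target (injuries per 2,000 hours), exposure enters as weights
rate <- injuries / (hours / 2000)
fit_rate <- glm(rate ~ x, family = poisson, weights = hours / 2000)

all.equal(coef(fit_count), coef(fit_rate), tolerance = 1e-6)  # same estimates
```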

Thank you
#228
Today, 03:26 PM
 PTActuary SOA Join Date: Nov 2019 Posts: 12

Quote:
 Originally Posted by windows7forever I have not seen a lot of examples or practice exercises close to exam style that use undersampling and oversampling methods, but that does not mean the exam will not have these methods in the future. Thanks.
I was also wondering about the application of over/under sampling methods in R. I understand the rationale behind doing so, but I don't really know how to execute the method.
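For the mechanics, base R's `sample()` is enough; no special package is needed. A sketch on simulated imbalanced data:

```r
set.seed(10)
dat <- data.frame(x = rnorm(1000), y = rbinom(1000, 1, 0.1))  # ~10% positives
table(dat$y)

minority <- dat[dat$y == 1, ]
majority <- dat[dat$y == 0, ]

# Undersampling: keep every minority row, sample the majority down to match
under <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])
table(under$y)

# Oversampling: keep every majority row, resample the minority up with replacement
over <- rbind(majority, minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
table(over$y)
```

Either way, the resampling is done on the training data only, and the model is then evaluated on the untouched test data.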
#229
Today, 05:24 PM
 bnut34 Join Date: Feb 2010 College: University of Wisconsin - Stevens Point Posts: 17

Hello. For clustering, whether k-means or hierarchical, am I right that it would be rare to have categorical variables involved? Since clusters are determined by distance, and binarized categorical variables can only take on 0 or 1, it doesn't seem like a process that would help much.

In a similar vein, if I take a categorical variable with more than 2 factor levels, binarize it, and use the dummies in a PCA, do I need to include either all or none of the binarized variables for that factor in any feature I create? Or can I pick and choose (say only a couple of the levels have a large loading)? I would think it could get hairy when deleting variables afterwards.
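On the mechanics of the second question, `model.matrix()` binarizes a factor and `prcomp()` then treats each dummy as an ordinary numeric column, so the loadings come out per level, which is where the pick-and-choose question arises. A toy sketch (made-up data):

```r
dat <- data.frame(
  size   = c(1.2, 0.8, 3.5, 2.9, 1.7, 2.2),
  region = factor(c("N", "S", "S", "W", "N", "W"))
)

# Binarize the factor: one dummy column per level (no intercept term)
X <- model.matrix(~ size + region - 1, data = dat)
colnames(X)

# All dummy columns enter the PCA together; the loadings then show how
# much each individual level contributes to each component
pca <- prcomp(X, scale. = TRUE)
round(pca$rotation, 2)
```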
#230
Today, 05:42 PM
 SunnyDale SOA Join Date: Jul 2018 Posts: 28

During the data exploration process, when encountering variables with a large number of levels, for example the "PRIMARY" and "US_STATE" variables in the Dec 2018 exam, is it normal practice to just remove those variables even though they may contain useful information for prediction?