Actuarial Outpost
 
Go Back   Actuarial Outpost > Exams - Please Limit Discussion to Exam-Related Topics > SoA/CAS Preliminary Exams > Exam PA: Predictive Analytics
  #391  
Old 06-17-2020, 02:37 AM
actuarialmeow actuarialmeow is offline
SOA
 
Join Date: Feb 2020
Studying for PA
Posts: 6
Default

Hi Ambrose,

For Practice Exam 1, the results of my models differed horrendously from the sample solution. My Tree1 and Tree2 had only 2 splits (VAgeCat and NCD) and 0 splits, respectively, and my final GLM had only VAgeCat5 and VAgeCat6 as predictors.

My procedure was similar to that of the sample solution; the only main difference is some data prep work (i.e. combining levels).

I am currently having a tough time deciding which final model to use (my Tree1 is more helpful in the sense that it uses 2 variables to predict instead of 1), but I think the GLM makes more sense in this business context. What do you think would be a wise choice here? Even if I did finally pick one and attempt to justify it strongly, I personally still feel that the models are almost ridiculous, which would make them hard to argue for to begin with.

Can you please share some insights/strategies on how to deal with undesirable results (or models)? Or am I overthinking this, and those are actually perfectly acceptable constructs?

Thanks in advance!!!

P.S. My test statistics are comparable to the solution's, so it's not so much a matter of bad model performance as of learning to deal with what seem like ridiculous models..?

Last edited by actuarialmeow; 06-17-2020 at 02:41 AM..
Reply With Quote
  #392  
Old 06-17-2020, 08:21 AM
SweepingRocks SweepingRocks is offline
Member
SOA
 
Join Date: Jun 2017
College: Bentley University (Class of 2019ish)
Posts: 293
Default

A few questions, because I made some mistakes on this exam that gave me similar results:

1.) Did you recreate the partitions for the test and train set after editing the data frame? I forgot to recreate it after deleting some observations which led to weird results.

2.) Are you using the train set to train your model? I believe this exam has by default the models being trained by the full data set, which isn’t appropriate.
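A minimal base-R sketch of the fix in question 1 — rebuild the train/test split *after* editing the data frame (object names here are illustrative, not the exam's actual code; the exam itself uses caret::createDataPartition for the split):

```r
# Rebuild the partition from the *edited* data frame, not the original one
set.seed(2020)                                 # reproducible split
dat <- data.frame(y = rpois(100, 1), x = rnorm(100))
dat <- dat[dat$x > -2, ]                       # example data edit: drop rows
idx <- sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]                            # fit models on train only
test  <- dat[-idx, ]
nrow(train) + nrow(test) == nrow(dat)          # every edited row lands in one set
```

If the split is built from the pre-edit row indices instead, deleted rows leave holes and the two sets no longer partition the data, which is where the weird results come from.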
__________________
FM P MFE STAM LTAM FAP PA
Former Disney World Cast Member, currently no idea what I'm doing

"I think you should refrain from quoting yourself. It sounds pompous." - SweepingRocks

Last edited by SweepingRocks; 06-17-2020 at 08:51 AM..
Reply With Quote
  #393  
Old 06-17-2020, 09:17 AM
actuarialmeow actuarialmeow is offline
SOA
 
Join Date: Feb 2020
Studying for PA
Posts: 6
Default

Quote:
Originally Posted by SweepingRocks View Post
A few questions, because I made some mistakes on this exam that gave me similar results:

1.) Did you recreate the partitions for the test and train set after editing the data frame? I forgot to recreate it after deleting some observations which led to weird results.

2.) Are you using the train set to train your model? I believe this exam has by default the models being trained by the full data set, which isn't appropriate.
Assuming that by your mistakes you were referring to mine, the answers to both questions are yes!
Reply With Quote
  #394  
Old 06-17-2020, 09:46 AM
windows7forever windows7forever is offline
Member
SOA
 
Join Date: Apr 2016
Posts: 531
Default

Quote:
Originally Posted by SweepingRocks View Post
A few questions, because I made some mistakes on this exam that gave me similar results:

1.) Did you recreate the partitions for the test and train set after editing the data frame? I forgot to recreate it after deleting some observations which led to weird results.

2.) Are you using the train set to train your model? I believe this exam has by default the models being trained by the full data set, which isn't appropriate.
Yes, I noticed that when I was doing the practice exam before. I found all the necessary changes before I looked at the solution, except that I removed Vehicle Type and kept PC. Normally I would keep Vehicle Type rather than PC for its multiple levels of variation, but I saw that the mean claim counts were very similar across several levels of Vehicle Type, so I decided to use the consolidated version, PC, instead.

Even after I put Vehicle Type back and removed PC, I got similar results to the solution's on the stepwise forward BIC GLM, but my tree results were much different. That reminded me of one disadvantage of a base tree: it can generate unstable results when the training set changes. I probably still combined one or two levels differently from the solution, but my base tree ended with 2 splits at the minimum xerror, and the simplest tree with xerror below min xerror + xstd had 1 split, just like the solution. My random forest was affected by the base tree results, since it averages the results of 100 base trees fitted in parallel. A difference in the cleaned data from the solution's could affect the bootstrapped training samples used to create the trees in the random forest too.

So I am not very surprised that the result differences occurred in the trees and random forest but not in the GLM models.
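A sketch of the two pruning rules mentioned above (minimum xerror, and the one-standard-error rule), using rpart's bundled kyphosis data — any rpart fit's cptable works the same way:

```r
library(rpart)
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001, minsplit = 5)
cp_tab <- fit$cptable
i_min  <- which.min(cp_tab[, "xerror"])
cp_min <- cp_tab[i_min, "CP"]                  # rule 1: cp with min xerror
# rule 2: simplest tree whose xerror is within one xstd of the minimum
thresh <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]
pruned <- prune(fit, cp = cp_1se)
```

Because cptable is ordered from simplest to most complex, taking the first row under the threshold gives the simplest qualifying tree, so cp_1se is always at least as large as cp_min.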
Reply With Quote
  #395  
Old 06-17-2020, 09:57 AM
elvi11 elvi11 is offline
Member
SOA
 
Join Date: Sep 2014
College: Master
Posts: 43
Default

Quote:
Originally Posted by windows7forever View Post
I went over your solutions to Practice Exam 1 again and have some follow-up questions below. Thank you again in advance.

In data cleaning Task 2, I found that there's no median of Clm_Count in the summary table on P451.

Q1: Was the median important in analyzing the relationship between any factor variable and Clm_Count?

I did see skewness in Clm_Count, but it only had 4 distinct values and most policyholders had 0 claims in a year, so there's no need to check the median to verify any skewness of the mean Clm_Count.

You did not require an interpretation of tree output in the tree-related tasks, which is different from what was expected in the December 2019 exam for classification trees.

Q2 a): Are regression trees interpreted the same way as classification trees, except that the top number on each node represents the mean claim count instead of a proportion?

Q2 b): But then we will not have a focal point of terminal nodes with a proportion of 1s like in the December 2019 exam, since that's only for classification trees. Do we focus first on the terminal nodes with the largest mean claim count, or should a relatively large count of observations also be considered?

Q2 c): What does the ratio .../... in the middle of each node mean? Does it help us interpret the terminal nodes' outcomes?

Q3: Why not use the loglikelihood (LL) as in the December 2018 exams to evaluate prediction performance across all tree and GLM models? I thought the chi-square statistic is a measure of goodness of fit on the training set.


Q4: In addition, I want to know whether the weight-version description of Task 3 that I give below is correct, after reading your solution on the offset specification.

When a count target variable is measured as a frequency (rate) over a group of policyholders rather than for each individual policyholder, the different claim counts may be subject to the different policy effective periods of the different policyholders. In general, we can expect that a policyholder with a longer policy effective period will have a larger claim count than a policyholder with a shorter one, keeping other things equal.

In a GLM, a weight is a variable that can account for the averaging of different claim counts over the different policy effective periods of a group of policyholders, so it can generate more accurate predictions.
-----------------------------------------------------------------------
I had the same questions around Q2 (regression trees) - hopefully someone can answer them today...
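On Q4, a small simulated sketch may help: the offset and weight specifications of a Poisson frequency model are two parameterizations of the same fit (variable names below are made up for the illustration):

```r
set.seed(7)
n <- 500
exposure <- runif(n, 0.2, 1)                    # policy years in force
x <- rnorm(n)
counts <- rpois(n, exposure * exp(0.3 + 0.5 * x))
# offset form: target is the raw count, log(exposure) enters as an offset
fit_off <- glm(counts ~ x, family = poisson, offset = log(exposure))
# weight form: target is the frequency counts/exposure, exposure is the weight
freq   <- counts / exposure
fit_wt <- suppressWarnings(                     # non-integer y triggers a warning
  glm(freq ~ x, family = poisson, weights = exposure))
# both forms solve the same score equations, so coefficients agree
all.equal(coef(fit_off), coef(fit_wt))
```

This matches the intuition in the quote: longer-exposed policyholders are expected to have more claims, and either the offset (on counts) or the weight (on frequencies) carries that exposure information into the model.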
Reply With Quote
  #396  
Old 06-17-2020, 12:36 PM
elvi11 elvi11 is offline
Member
SOA
 
Join Date: Sep 2014
College: Master
Posts: 43
Default under and over sampling

Hello,

What measures should we take if the mean of the target on the training data is considerably different from the mean of the target on the test data?
1. For classification problems with decision trees, I believe we can specify sampling = "down" or "up" in rpart(). But how about GLMs?
2. How about regression models?
Reply With Quote
  #397  
Old 06-17-2020, 05:43 PM
ambroselo ambroselo is offline
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308
Default

Quote:
Originally Posted by SweepingRocks View Post
A few questions, because I made some mistakes on this exam that gave me similar results:

1.) Did you recreate the partitions for the test and train set after editing the data frame? I forgot to recreate it after deleting some observations which led to weird results.

2.) Are you using the train set to train your model? I believe this exam has by default the models being trained by the full data set, which isn't appropriate.
1. We usually do the training/test split after exploring and cleaning the data, but at times there are data adjustments in the middle of a project. Page 2 of the cheat sheet says that:
Quote:
If data adjustments occur after the training/test split (e.g., Dec 2019 Task 6), need to repeat the split to capture changes, but should use the same partition
2. The training set, as its name suggests, is for training your models. It is not uncommon that you have to make some changes to the given Rmd file (that's how the SOA tests you!). In Practice Exam 1, I (deliberately!) set the data argument of model fitting functions like glm() and rpart() to the full dataset and expect you to change it to the training set.
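A tiny sketch of that kind of correction (dat and train are placeholder names, not the exam's actual objects):

```r
set.seed(10)
dat <- data.frame(y = rpois(200, 1), x = rnorm(200))
train <- dat[sample(seq_len(nrow(dat)), 140), ]
# as given in the Rmd (fit on the full data -- to be corrected):
#   glm(y ~ x, family = poisson, data = dat)
# corrected: fit on the training set only
glm_train <- glm(y ~ x, family = poisson, data = train)
```

The same one-word change (data = train instead of data = dat) applies to rpart() and the other fitting functions.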
Reply With Quote
  #398  
Old 06-17-2020, 05:46 PM
ambroselo ambroselo is offline
Member
SOA
 
Join Date: Sep 2018
Location: Iowa City
College: University of Iowa
Posts: 308
Default

Quote:
Originally Posted by elvi11 View Post
Hello,

What measures should we take if the mean of the target on the training data is considerably different from the mean of the target on the test data?
This will not happen: the createDataPartition() function performs stratified sampling and ensures that the distributions of the target variable on the training set and the test set are largely comparable.
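A base-R sketch of what stratified sampling on the target buys you — sample within each level of y, so the train/test means stay close (this only illustrates the idea; createDataPartition handles the grouping for you, including binning numeric targets):

```r
set.seed(42)
y <- rpois(1000, 0.3)
# draw 70% within each distinct value of y
idx <- unlist(lapply(split(seq_along(y), y), function(i)
  i[sample.int(length(i), size = max(1, round(0.7 * length(i))))]))
train_y <- y[idx]
test_y  <- y[-idx]
abs(mean(train_y) - mean(test_y))   # small by construction
```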
Reply With Quote
  #399  
Old 06-17-2020, 06:35 PM
Sader Sader is offline
Member
SOA
 
Join Date: Feb 2019
Posts: 36
Default binarize for cluster analysis?

Does anyone know if we need to binarize before cluster analysis? I know we need to for PCA, because that method can't handle factor variables, but I don't know whether we also need to for k-means/hierarchical cluster analysis.
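For the k-means case at least, the answer is yes: kmeans() needs an all-numeric matrix, so factor variables have to be binarized (one-hot encoded) first, e.g. with model.matrix(). A toy sketch (data made up for the illustration):

```r
df <- data.frame(size  = factor(c("S", "M", "L", "M", "S", "L")),
                 value = c(1.2, 3.4, 5.6, 3.1, 0.9, 6.0))
X  <- model.matrix(~ size + value, data = df)[, -1]   # drop the intercept
km <- kmeans(scale(X), centers = 2, nstart = 20)      # scale before clustering
km$cluster                                            # cluster label per row
```

Hierarchical clustering via hclust() likewise takes a numeric distance matrix, though for mixed data one can alternatively compute a Gower distance (e.g. with cluster::daisy) instead of binarizing.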
Reply With Quote
  #400  
Old 06-21-2020, 10:50 AM
eget_act's Avatar
eget_act eget_act is offline
Member
CAS SOA
 
Join Date: Jul 2015
Studying for PA
College: IIT
Favorite beer: Stella, Guinness, Corona
Posts: 80
Default

Hello Dr. Lo,

I would like to say thank you very much for all your help and guidance. I really enjoyed studying for this exam using your manual. Your manual is so easy to read and understand that I have, in fact, developed an interest in learning more about predictive analytics.

Thanks very much again!

Cheers!

PS: If you get a chance to look at last week's exam, could you please share your opinion about it?
I personally felt that the exam has matured and is testing the right concepts, and the mark distribution was pretty well balanced this time. Having at least 25 marks of questions on analyzing the data makes a lot of sense, since we spend so much time working on it.
Reply With Quote