Actuarial Outpost
 
Go Back   Actuarial Outpost > Exams - Please Limit Discussion to Exam-Related Topics > SoA/CAS Preliminary Exams > Exam PA: Predictive Analytics
#1
12-07-2019, 02:12 PM
windows7forever
Member, SOA
Join Date: Apr 2016
Posts: 538

When do we need to binarize factor variables?

In the Hospital Readmissions sample project, we had to binarize the factor variables before running stepAIC(). However, the June 2019 exam did not require binarizing variables, only reducing the number of levels of each factor variable; some variables could keep 3 levels instead of 2.

When do we need to binarize factor variables? Will this always be specified in the question?

stepAIC() eliminates a factor variable as a whole even if only some of its levels are insignificant. If we simply reduce every factor variable to two levels based on similar means/medians and frequency tables, is that effectively the same as binarization (a 0/1 indicator for whether or not an observation has a specific level of the factor variable)?
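For reference, here is a minimal sketch (with made-up data) of what binarization produces in R; once the factor is expanded this way, a selection routine can keep or drop each level's indicator individually rather than the factor as a whole:

```r
# Made-up factor with three levels, to illustrate binarization.
df <- data.frame(feature = factor(c("A", "B", "C", "B", "A", "C")))

# model.matrix() creates one 0/1 indicator column per non-baseline level;
# dropping column 1 removes the intercept.
binarized <- as.data.frame(model.matrix(~ feature, data = df)[, -1])
print(binarized)  # columns featureB and featureC; level A is the baseline
```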

Thanks.
#2
12-08-2019, 05:16 AM
jericc1
SOA
Join Date: Oct 2017
Studying for PA
College: UCI - Alumni
Posts: 15

Personally, I think it's good practice to explicitly binarize the factor variables in the data you plan to pass through modeling functions such as lm()/glm(), even though those functions binarize factors for you automatically. My thinking is that you proactively avoid errors from functions that can only handle numeric data types, e.g. prcomp().

Also, I think it's good to do so for the benefit (as you mentioned) of being able to identify statistically insignificant levels in the model output, which can hint at which levels would be good candidates for folding into other levels (assuming the variable has several levels). If your categorical variable only has two levels, explicitly binarizing it won't really help in this regard, but I don't see any harm in doing so beyond the extra time spent.

As for the actual exam, I don't think the exam authors expect us to explicitly binarize unless they actually give us the code. That is, it won't be entirely expected unless there is already a dummyVars() call somewhere in the provided code -- but that's just my 2 cents.
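As a quick illustration of the prcomp() point (made-up data; caret's dummyVars() would achieve the same conversion):

```r
# Sketch: prcomp() only accepts numeric input, so a factor column
# must be converted to indicator columns first.
df <- data.frame(
  x    = c(1.2, 3.4, 2.1, 4.8),
  kind = factor(c("low", "high", "low", "high"))
)

# prcomp(df) would fail on the factor column; binarize first:
num_df <- as.data.frame(model.matrix(~ ., data = df)[, -1])
pca <- prcomp(num_df, center = TRUE, scale. = TRUE)
summary(pca)  # two numeric columns in, two principal components out
```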
#3
12-08-2019, 12:33 PM
windows7forever
Member, SOA
Join Date: Apr 2016
Posts: 538

Quote:
Originally Posted by jericc1
Personally, I think it's good practice to explicitly binarize the factor variables in the data you plan to pass through modeling functions such as lm()/glm(), even though those functions binarize factors for you automatically. My thinking is that you proactively avoid errors from functions that can only handle numeric data types, e.g. prcomp().

Also, I think it's good to do so for the benefit (as you mentioned) of being able to identify statistically insignificant levels in the model output, which can hint at which levels would be good candidates for folding into other levels (assuming the variable has several levels). If your categorical variable only has two levels, explicitly binarizing it won't really help in this regard, but I don't see any harm in doing so beyond the extra time spent.

As for the actual exam, I don't think the exam authors expect us to explicitly binarize unless they actually give us the code. That is, it won't be entirely expected unless there is already a dummyVars() call somewhere in the provided code -- but that's just my 2 cents.
Thanks for your thoughts. Did you approach the June 2019 exam differently? I have since cleared up many misunderstandings I had during the June exam. I lost points early on due to a rank-deficiency issue that I now know how to solve.

For Task 2, I reduced the number of levels to 2 for every factor variable except gender. For simplicity, I merged levels with similar means and medians into two groups: the first group contained the one or two levels with the most observations, and the remaining levels were pooled into the other group. I did not lose any points for doing so.

Later on, I also arrived at the interaction term the solution used, Road_Feature:Traffic_Control, after running stepAIC() with forward selection under BIC; in my model it was statistically significant at one level of each of the original variables that formed the interaction.

Also, it's easier to compare two levels against two levels when identifying potential interaction effects between two predictors.

Do you think there is any set rule for when we should reduce some factor variables to 3 or 4 levels instead of 2?
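For reference, the kind of two-group merge described above can be sketched like this (hypothetical level names):

```r
# Hypothetical sketch: keep the most frequent level as one group and
# pool all remaining levels into "Other".
x <- factor(c("Signal", "Signal", "Stop sign", "Yield", "None", "Signal"))
grouped <- factor(ifelse(x == "Signal", "Signal", "Other"))
table(grouped)  # Other: 3, Signal: 3
```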
#4
12-08-2019, 01:06 PM
tbsmith20
SOA
Join Date: Aug 2018
College: Wisconsin - Madison
Posts: 19

Here is my game plan: show some graphs, probably justify a log transformation, comment on some interactions, and relevel in the prep step. As far as the exam is concerned, I think it is more important to relevel any factor variables. Then, once you run the models, binarize if needed to select just the important levels. If nothing else this will save some time. Use binarization as a model-refinement step rather than a model-prep step. Obviously, you also need to apply it to variables that lead to rank-deficient models. The December exam and the June exam both seem to have purposely led the modeling process into a rank-deficient model and required the candidate to do something about it (binarization). I would expect the same next week.
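A minimal sketch of the releveling step (hypothetical data; the idea is to make the most common level the baseline so glm()/lm() coefficients are measured against it):

```r
# Sketch: set the most frequent level as the reference level.
x <- factor(c("B", "C", "B", "A", "B"))
x <- relevel(x, ref = names(which.max(table(x))))
levels(x)[1]  # "B" is now the baseline
```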
__________________
P FM MFE C MLC PA IA FA APC ASA

Health FSA
Modules: Econ Foundations D&P PR&F
Exams: D&P F&V GHS
DMAC
FAC
#5
12-08-2019, 09:47 PM
windows7forever
Member, SOA
Join Date: Apr 2016
Posts: 538

Quote:
Originally Posted by tbsmith20
Here is my game plan: show some graphs, probably justify a log transformation, comment on some interactions, and relevel in the prep step. As far as the exam is concerned, I think it is more important to relevel any factor variables. Then, once you run the models, binarize if needed to select just the important levels. If nothing else this will save some time. Use binarization as a model-refinement step rather than a model-prep step. Obviously, you also need to apply it to variables that lead to rank-deficient models. The December exam and the June exam both seem to have purposely led the modeling process into a rank-deficient model and required the candidate to do something about it (binarization). I would expect the same next week.
I thought the rank deficiency in the June 19 and December 18 exams had causes unrelated to binarization. In December 18, the cause was collinearity among the PC_HRS_### variables given in the data, which was present before creating any new features with PCA or cluster analysis.

However, in the June 19 exam, if we had not created any new features or done Task 3 at all, we would not have hit a rank-deficiency issue later on. The cause there was that the original variables used to build the PCA-loadings feature were not removed.
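That matches how perfect collinearity surfaces in R; a small sketch with made-up data:

```r
# Sketch: x2 is an exact linear combination of x1, so the model matrix
# is rank deficient and lm() drops x2, reporting its coefficient as NA.
set.seed(1)
x1  <- rnorm(20)
x2  <- 2 * x1            # perfectly collinear with x1
y   <- x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                # coefficient for x2 is NA
```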

Can you think of any other causes of rank deficiency?