Actuarial Outpost: When do we need to binarize factor variables?

#1
12-07-2019, 02:12 PM
 windows7forever Member SOA Join Date: Apr 2016 Posts: 538
When do we need to binarize factor variables?

In the Hospital Readmissions project, we had to binarize factor variables for the stepAIC() process. However, the June 2019 exam did not require binarization; it only required reducing the number of levels of each factor variable. Some variables could have 3 levels instead of 2.

When do we need to binarize factor variables? Will this always be specified in the question?

stepAIC() eliminates a factor variable as a whole, even if only some levels of the factor variable are not significant. If we just reduce all factor variables to two levels based on similar medians/means and frequency tables, will that be similar to variable binarization (a 0/1 indicator for whether or not an observation has the specific level of the factor variable)?

Thanks.
#2
12-08-2019, 05:16 AM
 jericc1 SOA Join Date: Oct 2017 Studying for PA College: UCI - Alumni Posts: 15

Personally, I think it's good practice to explicitly binarize your factor variables in the data you're planning to pass through modeling functions, such as lm()/glm(), even if they automatically binarize factor variables for you. My thinking is that you're proactively avoiding any issues where a function spits out an error if it can only handle numeric data types, e.g. prcomp().

Also, I think it's good to do so for the benefit (as you mentioned) of being able to identify statistically insignificant levels in the model output, which can hint at which levels would be good candidates for folding into other levels (assuming there are several levels within the variable). If your categorical variable only has two levels, then explicitly binarizing that variable won't really help in this regard, but I don't see any harm in doing so other than the time it takes.

As for the actual exam, I don't think the exam authors expect us to explicitly binarize unless they actually give us the code. That is, it won't be entirely expected unless there already is a dummyVars() function somewhere in the provided code -- but that's just my 2 cents.
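For anyone who hasn't seen explicit binarization done by hand: a minimal base-R sketch with made-up toy data (the data frame and its columns are mine, not from the exam). model.matrix() produces the same 0/1 indicator columns that lm()/glm() build internally, and is the base-R counterpart of caret's dummyVars(); binding the indicators back lets each level enter or leave a model on its own.

```r
# Toy data frame with a 3-level factor (illustrative only)
df <- data.frame(
  charges = c(10, 12, 30, 28, 11, 29),
  region  = factor(c("A", "A", "B", "C", "A", "B"))
)

# One 0/1 indicator column per non-baseline level; drop the intercept column
X <- model.matrix(~ region, data = df)[, -1]

# Bind the indicators back so stepAIC() can drop a single level, not the
# whole factor
df_bin <- cbind(df["charges"], as.data.frame(X))
```

Passing df_bin (all-numeric) also sidesteps the numeric-only errors from functions like prcomp().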
#3
12-08-2019, 12:33 PM
 windows7forever Member SOA Join Date: Apr 2016 Posts: 538

Quote:
 Originally Posted by jericc1: Personally, I think it's good practice to explicitly binarize your factor variables in the data you're planning to pass through modeling functions ... but that's just my 2 cents.
Thanks for your thoughts. Did you approach the June 2019 exam differently? I cleared up many misunderstandings I had from the June exam. I lost points due to a rank-deficiency issue early on, but I now know how to solve it.

I did not lose points on Task 2 for reducing the number of levels to 2 for every single factor variable except gender. For simplicity, I merged the levels that had similar means and medians into two groups: the first group contained the level (or two levels) with the most observations, and the rest were grouped into another level. I did not lose any points for doing so.

Later on, I also arrived at the interaction term the solution used, Road_Feature:Traffic_Control, after the stepAIC process with BIC forward selection, though mine was statistically significant at only one level of each original variable that formed the interaction.

Also, it's easier to compare two levels against two levels when identifying potential interaction effects between two predictors.

Do you think there is any set rule for when we should reduce some factor variables to 3 or 4 levels instead of 2?
#4
12-08-2019, 01:06 PM
 tbsmith20 SOA Join Date: Aug 2018 College: Wisconsin - Madison Posts: 19

Here is my game plan: show some graphs, probably justify a log transformation, comment on some interactions, and relevel in the prep step. I think it is more important, as far as the exam is concerned, to relevel any factor variables. Then, once you run the models, binarize if needed to select just the important levels. If nothing else, this will save some time. Use it as a model refinement rather than a model prep step. Obviously, you also need to do it on variables that lead to rank-deficient models. The December exam and the June exam both seem to have purposefully led the modeling process into a rank-deficient model and required the candidate to do something about it (binarization). I would expect the same next week.
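The relevel-then-refine flow above can be sketched in a few lines of base R. The factor values here are toy data I made up for illustration, not exam data; the point is just the two steps: make the most common level the baseline, then keep only the one level that turned out to matter as a 0/1 indicator.

```r
# Prep step: make the most common level the baseline, so model coefficients
# are measured relative to it
traffic <- factor(c("Signal", "None", "None", "Stop", "None", "Signal"))
traffic <- relevel(traffic, ref = "None")

# Refinement step: if only one level is significant in the fitted model,
# replace the factor with a single 0/1 indicator for that level
is_signal <- as.integer(traffic == "Signal")
```

Doing this as refinement keeps the prep step fast while still letting you drop individual insignificant levels later.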
#5
12-08-2019, 09:47 PM
 windows7forever Member SOA Join Date: Apr 2016 Posts: 538

Quote:
 Originally Posted by tbsmith20: Here is my game plan. ... The December exam and the June exam both seem to have purposefully led the modeling process into a rank-deficient model and required the candidate to do something about it (binarization). I would expect the same next week.
I thought the reasons behind the rank deficiency in the June 2019 and December 2018 exams were different from binarization. The reason we had rank deficiency in December 2018 was the collinearity in the PC_HRS_### variables given in the data, which was present before creating any new features with PCA or cluster analysis.

However, in the June 2019 exam, if we did not create any new features or do Task 3 at all, we would not have hit the rank-deficiency issue later on. The cause of the rank deficiency was that the original variables used to create the PCA loadings feature were not removed.
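That PCA cause is easy to reproduce on synthetic data (the variables below are random toys, not the exam's). A principal component score is a linear combination of the original columns, so keeping both the originals and the score makes the design matrix rank-deficient, and lm() reports an NA coefficient for the aliased column:

```r
set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
y  <- rnorm(20)

# PC1 score is a linear combination of (centered, scaled) x1 and x2
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]

# Keeping the originals AND the PC score -> rank-deficient design matrix;
# lm() drops the aliased column and its coefficient comes back NA
fit <- lm(y ~ x1 + x2 + pc1)
coef(fit)
```

Dropping x1 and x2 (or dropping pc1) restores a full-rank fit, which is why the solution removed the original variables after creating the loadings feature.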

Can you think of any other causes of rank deficiency?