Actuarial Outpost
 
  #1  
Old 07-02-2018, 06:37 AM
mattcarp mattcarp is offline
Member
CAS
 
Join Date: Sep 2016
Studying for Exam 5
College: UC Berkeley
Posts: 298
Default Coding missing variables

I have a data set that I would like to use in a GLM, but for about half of the entries all or nearly all of the explanatory variables are missing. In this situation, does it make sense to code these variables as missing or just throw the records out entirely?
__________________
P FM VEE MFE C S OC1 5 OC2 6 7 8 9
  #2  
Old 07-02-2018, 08:01 AM
Marcie Marcie is online now
Member
CAS
 
Join Date: Feb 2015
Posts: 7,689
Default

Yes.
  #3  
Old 07-02-2018, 08:11 AM
JMO JMO is offline
Carol Marler
Non-Actuary
 
Join Date: Sep 2001
Location: Back home again in Indiana
Studying for Nothing actuarial.
Posts: 37,309
Default

I wonder, first, what can be done about fixing the missing data. Second, if that cannot be done, whether it's a mistake to even try to do GLM. But I don't know enough to give you advice.
I like what Marcie said, though.
__________________
Carol Marler, "Just My Opinion"

Pluto is no longer a planet and I am no longer an actuary. Please take my opinions as non-actuarial.


  #4  
Old 07-02-2018, 08:31 AM
mattcarp mattcarp is offline
Member
CAS
 
Join Date: Sep 2016
Studying for Exam 5
College: UC Berkeley
Posts: 298
Default

Quote:
Originally Posted by JMO View Post
I wonder, first, what can be done about fixing the missing data. Second, if that cannot be done, whether it's a mistake to even try to do GLM. But I don't know enough to give you advice.
I like what Marcie said, though.
The data came from an external source, so there's really nothing I can do about the missing values. I have about 250k records in total, but a little over 100k of those have all or nearly all the variables missing. Given that the records with missing values make up such a large proportion of the data, does it even make sense to code them as missing, or should I just run the GLM on the remaining 150k records?
  #5  
Old 07-02-2018, 08:39 AM
kevinykuo kevinykuo is offline
CAS
 
Join Date: Nov 2017
Posts: 18
Default

Do both and see which does better on validation? Do you expect new data to be scored to be missing a bunch of variables? Do you need to use GLM? Hard to say without knowing what problem you're solving, but these are some things to think about.
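A minimal sketch of the "do both and compare on validation" idea, using only the Python standard library. It relies on one fact: for a GLM with an intercept and a single categorical variable, the fitted mean of each level equals that level's observed mean, so no solver is needed here. Everything else is an assumption for illustration: the data is synthetic, the claim rates are made up, the `MISSING` label is hypothetical, and a real model with several variables would need a proper GLM fit.

```python
import math
import random

MISSING = "MISSING"  # hypothetical label for the explicit "missing" level


def simulate(n, seed):
    """Synthetic (level, claim) records; level None means the field is missing."""
    rng = random.Random(seed)
    rates = {"A": 0.05, "B": 0.10, None: 0.30}  # made-up claim rates
    rows = []
    for _ in range(n):
        level = rng.choice(["A", "B", "A", None, None])  # about 40% missing
        rows.append((level, 1 if rng.random() < rates[level] else 0))
    return rows


def fit_level_means(rows):
    """One-factor GLM fit: each level's fitted mean is its observed mean."""
    totals, counts = {}, {}
    for level, y in rows:
        totals[level] = totals.get(level, 0) + y
        counts[level] = counts.get(level, 0) + 1
    overall = sum(totals.values()) / sum(counts.values())
    means = {lvl: totals[lvl] / counts[lvl] for lvl in totals}
    return means, overall


def mean_poisson_deviance(model, rows):
    """Average Poisson deviance; unseen levels are scored with the overall mean."""
    means, overall = model
    total = 0.0
    for level, y in rows:
        mu = max(means.get(level, overall), 1e-9)  # guard against log(0)
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(rows)


def code_missing(rows):
    """Replace a missing level (None) with the explicit MISSING level."""
    return [(MISSING if lvl is None else lvl, y) for lvl, y in rows]


def compare_strategies(n=4000, seed=0):
    rows = simulate(n, seed)
    cut = int(0.75 * n)
    train, holdout = rows[:cut], rows[cut:]

    # Strategy 1: keep every record and code missing as its own level.
    coded_model = fit_level_means(code_missing(train))
    dev_coded = mean_poisson_deviance(coded_model, code_missing(holdout))

    # Strategy 2: fit on complete records only. Holdout records with a
    # missing level fall back to the overall complete-case mean.
    dropped_model = fit_level_means([r for r in train if r[0] is not None])
    dev_dropped = mean_poisson_deviance(dropped_model, holdout)
    return dev_coded, dev_dropped


dev_coded, dev_dropped = compare_strategies()
print(f"coded as missing: {dev_coded:.4f}   dropped: {dev_dropped:.4f}")
```

In this made-up setup the records with missing fields have a different claim rate from the rest, so the coded model should score better on the holdout. If the missing records looked like everyone else, the two deviances would be close, which is itself useful to know.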
  #6  
Old 07-02-2018, 08:50 AM
Colymbosathon ecplecticos Colymbosathon ecplecticos is offline
Member
 
Join Date: Dec 2003
Posts: 5,836
Default

Are any of your variables nested? Prostate size for females is very likely to be a (correctly) missing value.
__________________
"What do you mean I don't have the prerequisites for this class? I've failed it twice before!"
  #7  
Old 07-02-2018, 09:36 AM
mattcarp mattcarp is offline
Member
CAS
 
Join Date: Sep 2016
Studying for Exam 5
College: UC Berkeley
Posts: 298
Default

Quote:
Originally Posted by Colymbosathon ecplecticos View Post
Are any of your variables nested? Prostate size for females is very likely to be a (correctly) missing value.
No I don't think there is anything like this.

One thing I find interesting is according to the data dictionary the company provided, all of the variables have a code for "Unknown" value but they also use blank for "Default" value.
  #8  
Old 07-02-2018, 09:45 AM
Vorian Atreides Vorian Atreides is offline
Wiki/Note Contributor
CAS
 
Join Date: Apr 2005
Location: Hitler's Secret Bunker
Studying for ACAS
College: Hard Knocks
Favorite beer: Sam Adams Cherry Wheat
Posts: 61,988
Default

Quote:
Originally Posted by mattcarp View Post
No I don't think there is anything like this.

One thing I find interesting is according to the data dictionary the company provided, all of the variables have a code for "Unknown" value but they also use blank for "Default" value.
If a "null" value represents "default value" . . . then I would recode it to be something explicit representing "default" (using its definition).

If the metadata file doesn't give anything explicit as to what the default level is, I would ask. Very important to understand what the (qualitative) differences are between explicitly defined levels and the "default" level to assess appropriateness of the final model.
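A rough sketch of that recoding step, in plain Python. The column names, the `"U"` code for "Unknown", and the `DEFAULT` label are all hypothetical; the real codes would come from the vendor's data dictionary.

```python
DEFAULT = "DEFAULT"  # hypothetical explicit label for blank = "Default"


def recode_blanks(records, columns, default_label=DEFAULT):
    """Return copies of the records with blank or None values made explicit."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy, so the raw extract stays untouched
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                row[col] = default_label
        cleaned.append(row)
    return cleaned


records = [
    {"territory": "N", "vehicle_use": "U"},  # "U" = assumed "Unknown" code
    {"territory": "", "vehicle_use": "P"},
    {"territory": "S", "vehicle_use": None},
]
recoded = recode_blanks(records, ["territory", "vehicle_use"])
print(recoded)
```

Keeping "Unknown" and "Default" as separate levels lets the model show whether they actually behave differently, which is worth checking before merging them.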
__________________
I find your lack of faith disturbing

Why should I worry about dying? It's not going to happen in my lifetime!


Freedom of speech is not a license to discourtesy

#BLACKMATTERLIVES
  #9  
Old 07-02-2018, 11:24 AM
AMedActuary AMedActuary is offline
Member
SOA
 
Join Date: May 2007
College: UCLA Alumni
Posts: 389
Default

It would depend on why the data is missing. If it's missing completely at random, it should be okay to just remove those records. If certain types of people have more missing data than others, then removing them may affect your results. See the Wikipedia page below for a brief intro to missing data.

https://en.wikipedia.org/wiki/Missing_data
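One quick, partial check on why the data is missing is to compare the response between records with and without missing fields. The sketch below uses a pooled two-proportion z statistic on a 0/1 claim indicator; the counts are made up for illustration and `missingness_vs_response` is a hypothetical helper name.

```python
import math


def missingness_vs_response(rows):
    """rows: (is_missing, y) pairs with y in {0, 1}.

    Returns the response rate for records with missing fields, the rate
    for complete records, and a pooled two-proportion z statistic.
    """
    n = {True: 0, False: 0}
    s = {True: 0, False: 0}
    for is_missing, y in rows:
        n[is_missing] += 1
        s[is_missing] += y
    p_miss = s[True] / n[True]
    p_obs = s[False] / n[False]
    pooled = (s[True] + s[False]) / (n[True] + n[False])
    se = math.sqrt(pooled * (1 - pooled) * (1 / n[True] + 1 / n[False]))
    return p_miss, p_obs, (p_miss - p_obs) / se


# Made-up counts: 1,000 records with missing fields, 1,500 complete records.
rows = (
    [(True, 1)] * 300 + [(True, 0)] * 700
    + [(False, 1)] * 120 + [(False, 0)] * 1380
)
p_miss, p_obs, z = missingness_vs_response(rows)
print(f"missing: {p_miss:.3f}   complete: {p_obs:.3f}   z = {z:.2f}")
```

A large absolute z suggests the missingness is related to the response, so simply dropping those records is not harmless. A small one does not prove the data is missing completely at random; it only fails to show a problem on this one measure.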
  #10  
Old 07-02-2018, 12:09 PM
CuriousGeorge CuriousGeorge is online now
Member
CAS SOA
 
Join Date: Dec 2005
Posts: 1,155
Default

Whether you expect future data to also have significant missing fields may also be a consideration.
All times are GMT -4. The time now is 08:51 AM.

