Old 07-02-2018, 01:26 PM
whoanonstop
Join Date: Aug 2013
Location: Los Angeles, CA
Studying for Spark / Scala
College: College of William and Mary
Favorite beer: Orange Juice
Posts: 5,899
Blog Entries: 1

Lots of people are responding without any sound advice beyond "maybe your data isn't really missing"?

Except for this:

Originally Posted by AMedActuary View Post
It would depend on why the data is missing. If it's missing at random, it should be okay to just remove. If there are certain types of people who have more missing data than others, then it may have an effect on your results. See the below page on Wikipedia for a brief intro to missing data.
OP, each missing-data scenario calls for a different answer. Read through the Wikipedia page mentioned above to home in on which type of "missingness" you're dealing with.
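A quick first check along the lines AMedActuary describes is to see whether "certain types of people" have more missing data than others. Here's a rough sketch in plain Python. The records, the field names (age_band, income) and the helper missing_rate_by_group are all made up for illustration, so swap in your own columns.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Share of rows in each group where value_key is missing (None)."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Made-up records: "income" has gaps, "age_band" is fully observed.
rows = [
    {"age_band": "under_40", "income": 52000},
    {"age_band": "under_40", "income": 61000},
    {"age_band": "under_40", "income": None},
    {"age_band": "under_40", "income": 48000},
    {"age_band": "40_plus", "income": None},
    {"age_band": "40_plus", "income": None},
    {"age_band": "40_plus", "income": 75000},
    {"age_band": "40_plus", "income": None},
]

rates = missing_rate_by_group(rows, "age_band", "income")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} missing")
```

If the rates differ a lot between groups, the data probably isn't missing completely at random, and simply dropping rows may skew your results. Note what this can't tell you: whether the missingness depends on the unobserved value itself. No check on the observed data alone can settle that.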

Also, if you're working with linear models a lot, just buy "Regression Modeling Strategies" by Harrell. It has a chapter on missing data that would be of great benefit to you. Of course, the book is valuable beyond that chapter.

Originally Posted by kevinykuo View Post
Do both and see which does better on validation? Do you expect new data to be scored to be missing a bunch of variables? Do you need to use GLM? Hard to say without knowing what problem you're solving, but these are some things to think about.
After checking the first review, this might be a good book for you as well:

"If you want to move past the "just use cross validation" stage of your ML work and improve your model's generalization (and understand why and when to use techniques like bootstrapping) this is the book for you."
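For what it's worth, kevinykuo's "do both and see which does better on validation" could look something like the sketch below. It's a toy under stated assumptions, not a recipe: the data is simulated, there is a single predictor, the "model" is an ordinary least squares line rather than a GLM, and the helpers (fit_line, mse, make_data, impute) are names I made up. Rows with a missing predictor in the validation set are filled with the training mean for both models, since new data has to be scored somehow.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng, p_missing=0.3):
    """Simulated rows (x, y); x is set to None completely at random."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 2.0 + 3.0 * x + rng.gauss(0, 1)
        x_obs = None if rng.random() < p_missing else x
        data.append((x_obs, y))
    return data


rng = random.Random(0)
train = make_data(200, rng)
valid = make_data(100, rng)

# Mean of the observed training values, used for every imputation below.
observed = [x for x, _ in train if x is not None]
train_mean = sum(observed) / len(observed)


def impute(data):
    """Replace a missing x with the training mean."""
    return [(train_mean if x is None else x, y) for x, y in data]


# Option 1: drop training rows with a missing predictor.
complete = [(x, y) for x, y in train if x is not None]
model_drop = fit_line(*zip(*complete))

# Option 2: keep every training row and mean-impute the gaps.
model_impute = fit_line(*zip(*impute(train)))

# Score both on the same validation set.
valid_x, valid_y = zip(*impute(valid))
mse_drop = mse(model_drop, valid_x, valid_y)
mse_impute = mse(model_impute, valid_x, valid_y)
print(f"drop rows:   validation MSE = {mse_drop:.3f}")
print(f"mean impute: validation MSE = {mse_impute:.3f}")
```

With one predictor and mean imputation the two fits end up very close (the imputed rows sit at the mean of x, so they don't move the slope), so the point here is the workflow rather than the result. On real data with several variables and non-random missingness the gap can be much bigger.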
