#1
It's best to pose this issue by example:
Suppose you want to predict the performance of several hedge funds. You gather historical data: predictive variables and annual rates of return (your target). One of the variables you select is whether the members of the board left the annual fund meeting with smiles on their faces. Preliminary analysis suggests this variable is highly predictive; smiles are a good thing.
Some of the hedge funds hold their board meetings before Thanksgiving. Others wait until just before Christmas. Meeting dates may differ from year to year within the same fund. However, you will always use this model on November 30th to select funds for next year's portfolio. All funds require that you put in your order by November 30th or they won't accept your money. Your share in the fund takes effect on January 1st of the new year.
So, when training your predictive model on the historical data, do you include board-member smile information collected from annual meetings in December, even though, when applying the model to any such record, you would not know the smile information (your order is in by the time they hold their meeting...)?
Put another way: if a hedge fund observed 3 years ago had a board meeting in December at which all members were smiling, and the fund then went on to generate great returns, would you use that smile data in training in order to better understand how your model will react to similar data in the future?
Thanks for any insight.
#2
What do you know about the impact of meeting times, and the impact of smile times? Does a smile in February mean the same thing as a smile in December? Just because smiles appeared predictive doesn't mean all smiles are created equal, especially given that orders are locked in prior to the December meeting... (I'm assuming this is a metaphor for some other problem and other variables, but even if it's not I'd probably ask the same thing.) Even if you weren't considering the role of old December data, you should look at the significance of your fixed timing constraint.
It's rare that there is a good reason to ignore otherwise perfectly good data that you already have. Without anything else to go on, I would look at the full smile history of a fund at the point of observation to make the best prediction possible. Of course, if you are back-testing the model, I wouldn't use the 3-year-old December smile to make the prediction for the January following that smile, but you should use the smile data from all Decembers prior to that one when making that evaluation.
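To make the back-testing point concrete, here is a minimal Python sketch, assuming each fund's meetings are stored as (date, smiled) pairs. The function name `smiles_known_by` and the sample dates are made up for illustration; the only rule it encodes is that a smile counts only if its meeting fell on or before the November 30th cutoff of the year being evaluated.

```python
from datetime import date


def smiles_known_by(meetings, cutoff):
    """Return the (meeting_date, smiled) pairs observed on or before the cutoff."""
    return [(d, s) for d, s in meetings if d <= cutoff]


# Hypothetical meeting history for one fund: (meeting date, did the board smile?)
meetings = [
    (date(2019, 12, 18), True),
    (date(2020, 11, 20), False),
    (date(2021, 12, 15), True),
]

# Back-test the pick made on 2021-11-30 for the 2022 return.
cutoff = date(2021, 11, 30)
history = smiles_known_by(meetings, cutoff)

# The December 2019 smile is usable here; the December 2021 smile is not,
# because that meeting had not happened when the order went in.
print(history)
```

Running the same function with each past year's own cutoff gives a back-test that never sees a smile before it could have been observed.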
__________________
Do not reply to this post if you rely on red font.
#3
I'll go with "don't include December smiles for a model to be used in November". Best case is that December smiles are not predictive, worst case is that they are.
Worst case: to the degree that other predictors are correlated with December smiles, the fitted model will "adjust" for the information contained in December smiles. December smiles will not be available when the model is fired in November, so some of the predictive power that exists in the other predictors was "used up" by December smiles at training time and is thrown away when that predictor is missing. Your model is less useful than it otherwise would be. Best case: You spend a lot of time fretting over, and collecting information for, a data element you're not going to get any benefit from.
#4
Quote:
__________________
Do not reply to this post if you rely on red font.
#5
No doubt, those are questions a good statistician/actuary should ask. For my purposes, I'm only concerned with whether to use smile data from the current year's November and December. The issue is that December data is readily available (and predictive) for the historical records, but is not available when applying the model (some funds have their meetings after the order has been put in).
And yes, as cool as hedge fund espionage may be, this is a metaphor for an insurance-related predictive model. By chance, do you know of any literature to reference? I can find all kinds of textbooks and linear regression tutorials that comment on sources of bias, but they're all aimed at fitting a model for process discovery or explanation. Thus far, I can't find anything addressing this issue in the context of making predictions with suppressed/incomplete data. And thank you for your comments!
#7
No book recommendations, but one idea for a frame of reference. Consider limiting yourself only to the kinds of data available at the time the model is used. For example, the number of accidents in the next month is available in the historical data, and would probably be very predictive, but it is not available to an insurer or an insured at any given time. (If it's available to the insured, you may want to let your fraud folks know about it.)
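One way to enforce that rule mechanically is to audit each training record for fields observed after the decision date. The sketch below is only an illustration: the record layout (feature name mapped to an observation date and a value), the feature names, and the helper `leaked_features` are all assumptions, not anything from a real system.

```python
from datetime import date


def leaked_features(row, cutoff):
    """Return the names of features whose observation date is after the cutoff."""
    return sorted(
        name
        for name, (observed_on, _value) in row.items()
        if observed_on > cutoff
    )


# Hypothetical training record: feature name -> (date observed, value)
row = {
    "prior_year_return": (date(2020, 1, 5), 0.07),
    "board_smiled": (date(2020, 12, 14), 1),
    "accidents_next_month": (date(2021, 1, 31), 3),
}

cutoff = date(2020, 11, 30)  # the day the model would actually be used
late = leaked_features(row, cutoff)

# Anything listed here could not have been known when the decision was made.
print(late)
```

A check like this can run over the whole training set before fitting, so that any field flagged for a record is dropped or masked for that record.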
#8
You want the data to reflect the reality that you will face when using the resulting model. So, you need to "zero out" the smile indicator variable for the December smiles. In addition, there is probably a difference between a December smile being zeroed out and a November smile indicator being missing because you were never actually able to get someone to the meeting to find out. So, you should also create an "Unable to Observe the Smiles" indicator and test its significance. Usually, missing data indicates something bad (and that makes sense here, because if a company has good news, they usually want it to be widely known).
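As a rough sketch of that encoding in Python (the function name, the argument layout, and the use of `None` for "nobody could observe the meeting" are my assumptions for illustration):

```python
from datetime import date


def smile_features(meeting_date, smiled, cutoff):
    """Encode one record as (smile_flag, unable_to_observe_flag) as of the cutoff.

    smiled is True/False, or None if nobody was able to observe the meeting.
    """
    if meeting_date > cutoff:
        # Meeting held after the order deadline: zero out the smile,
        # because it would not be known when the model is applied.
        return 0, 0
    if smiled is None:
        # Meeting was held in time, but the smile could not be observed.
        return 0, 1
    return int(smiled), 0


cutoff = date(2020, 11, 30)

december = smile_features(date(2020, 12, 14), True, cutoff)    # zeroed out
unobserved = smile_features(date(2020, 11, 20), None, cutoff)  # missing flag set
november = smile_features(date(2020, 11, 20), True, cutoff)    # smile kept

print(december, unobserved, november)
```

Both flags would then go into the model as predictors, and the significance of the second one can be tested like any other indicator.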
__________________
Res ipsa loquitur, sed quid in infernos dicet?
#9
If the results of your predictive model for a given hedge fund depend heavily on whether that model was trained with December data included, then it seems to me that you may have bigger problems than whether to include December data, i.e., a rather fragile model.
If you are trying to draw conclusions about a single fund, then I don't know why you wouldn't want to include all the data possible, unless there is some particular reason why December smiles would have a different effect than November smiles. For example, as somebody else pointed out, maybe the fact that the meeting is in December itself carries information that you don't want to ignore. Of course, when you are evaluating what the real-world performance of your model will or should be, you should remember that you will not have the December data. For example, you want to think about what you will do for funds whose latest meeting is unknown.
#10
Can't you attempt to interact the smile data with the "how long ago" data, so that smiles last month and smiles 11 months ago can be distinguished by the model? Then you can find out directly if smiles from 11 months ago are predictive.
(I assume you aren't trying to use data that isn't available, like smiles from the following month.)
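A minimal sketch of what such an interaction feature could look like, assuming the most recent meeting known by the cutoff is available as a date plus a smile flag (the helper names and the whole-month age calculation are illustrative choices, not a prescribed method):

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from earlier to later, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def smile_recency_features(meeting_date, smiled, cutoff):
    """Build the smile flag, its age in months, and their interaction term."""
    age = months_between(meeting_date, cutoff)
    smile = int(smiled)
    return {
        "smile": smile,
        "months_ago": age,
        "smile_x_months_ago": smile * age,
    }


cutoff = date(2020, 11, 30)

# Last known meeting was the previous December: an 11-month-old smile.
old_smile = smile_recency_features(date(2019, 12, 18), True, cutoff)

# Last known meeting was in October: a 1-month-old smile.
fresh_smile = smile_recency_features(date(2020, 10, 22), True, cutoff)

print(old_smile)
print(fresh_smile)
```

In a linear model, the coefficient on `smile_x_months_ago` lets the effect of a smile vary with its age, so a stale smile can carry a different weight than a fresh one.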