Actuarial Outpost
 
  #1  
Old 02-19-2019, 03:25 PM
Actuarially Me Actuarially Me is offline
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Predictive Modeling Material

I get time allocated to research/learning. Typing things out and explaining them helps keep things fresh in my mind, so I'm counting this as learning time. Hopefully others can benefit from this as well. I'll probably update this with my research and to fix the formatting. The initial post is probably going to be a bird's-eye view, but that bird vomits on you with a ton of barely cohesive words. I'll clean it up over time and implement any suggestions that others may have.

Predictive Modeling In Insurance:

If at any point I am incorrect, please let me know and I'll adjust this post. I am self-taught (outside of the advice given here), so I may have made some incorrect assumptions along the way. I focused on learning in R because I had familiarity with it, so I can only speak to that. If I could turn back time (if I could find a way), I would probably learn Python. It is more flexible and plays nicely with other languages. It also has packages that do GPU processing for those of you with large data sets.



GLMs aren't used as often outside of insurance. Usually the goal there is predictive accuracy, so less interpretable "black box" methods are used. Most relationships aren't linear. One moment you're celebrating your one-year anniversary and the next you're getting yelled at for not folding the towels properly. There are extensions of GLMs such as GLMMs (Generalized Linear Mixed Models), which shrink coefficients in a way similar to Bühlmann-Straub credibility, and GAMs (Generalized Additive Models), which use smooth splines to relax the linearity assumption. With GLMMs and GAMs, you trade some interpretability for predictive power.
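To make the GAM idea concrete, here's a rough sketch using the mgcv package (mgcv isn't in my package list below, it's just one common way to fit a GAM, and the data set and column names are made up):

Code:
# Severity GAM: smooth spline on driver age instead of a straight line
library(mgcv)

fit_gam <- gam(
  loss ~ s(driver_age) + territory,   # s() = smooth spline term
  family = Gamma(link = "log"),
  data   = sev_dat                    # hypothetical severity data set
)

summary(fit_gam)   # effective degrees of freedom of the smooth
plot(fit_gam)      # visualize the fitted spline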

Models such as territory clustering, fraud detection, or claims triage usually use more advanced methods. Which method is best varies by problem, and new algorithms are coming out all the time. Checking out Kaggle competitions can give good insight into the models currently being used.

For pure premium modeling, interpretability and ease of deployment are valued, so GLMs are used since they can be reduced to a simple multiplicative model.

Either way, domain knowledge is key, and most of the time will be spent getting a better understanding of the underlying data and their relationships. It's Groundhog Day, but with model iterations. But hey, "running a model" is the new "compiling code".


Generalized Linear Models for Regression

GLMs take a linear model and, well... generalize it. You pick a distribution for the response variable (what you're trying to predict) from the exponential family, and a link function that connects the mean of the response to the linear combination of predictors. For insurance, we're usually dealing with claims data, so we want a positive domain. Commonly used distributions are Poisson, Gamma, Inverse Gaussian, and Tweedie. Common link functions are log and identity. A log link makes the predicted mean the exponential of the linear predictor, which creates a multiplicative model instead of an additive model. Multiplicative models are also easier to interpret: factors less than 1 decrease the rate, while factors greater than 1 increase it.
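Here's a bare-bones sketch of what that looks like in R (the data set and column names are made up; a Gamma severity model with a log link is just used as an example):

Code:
# Severity GLM with a log link: coefficients become multiplicative factors
dat$territory <- factor(dat$territory)

fit <- glm(
  avg_loss ~ driver_age + territory,
  family = Gamma(link = "log"),   # positive response, multiplicative structure
  data   = dat
)

# Exponentiated coefficients are the relativities:
# < 1 decreases the rate, > 1 increases it
exp(coef(fit))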

GLMs are very interpretable. Each variable and factor has a coefficient, so there is a closed-form equation you can apply. GLMs handle both continuous data and categorical data. When processing categorical data, the GLM creates a dummy binary variable for each category except the base category, so a categorical variable with n categories adds n-1 parameters to the model. For example, if you're fitting all 50 U.S. states (excluding DC), you're going to have 49 additional parameters to fit. High cardinality adds complexity, which can lead to overfitting, so it's best to reduce the number of categories. Sometimes that's out of your control (underwriter preference). Some categories may not be credible, so the standard error of the coefficient will be much higher and/or you'll get unrealistic coefficients. GLMs will attempt to find an estimate for every variable and category you introduce to the model.
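You can see the n-1 dummy columns R creates behind the scenes with model.matrix() (the state column and base level here are just for illustration):

Code:
# Pick the base category explicitly, then inspect the design matrix
dat$state <- relevel(factor(dat$state), ref = "CA")

X <- model.matrix(~ state, data = dat)
dim(X)             # intercept + 49 dummy columns for a 50-level factor
head(colnames(X))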

Luckily, for those who have no choice but to include high-cardinality variables, there's an extension of the GLM called elastic net regression. In a nutshell, this method penalizes non-credible coefficients. It does that by shrinking the coefficients (ridge regression) or setting some of them to zero (lasso regression). I won't get into the math behind them, but an elastic net has a couple of parameters: lambda (severity of the penalty) and alpha (type of shrinkage). If lambda is 0, we have a standard GLM. As lambda increases, so does the penalty for large coefficients. Alpha is in [0,1] and determines the type of shrinkage: 0 = ridge regression, 1 = lasso regression, and anything in between is a weighted blend of the two.
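A rough sketch with the glmnet package (glmnet wants a numeric matrix, so the categoricals get dummy-encoded first; the column names are made up):

Code:
library(glmnet)

# Dummy-encode predictors into a numeric matrix (drop the intercept column)
X <- model.matrix(claim_count ~ driver_age + state + vehicle_use, data = dat)[, -1]
y <- dat$claim_count

# alpha = 0 is ridge, 1 is lasso, in between is a blend;
# cv.glmnet picks lambda (the penalty strength) by cross-validation
cv_fit <- cv.glmnet(X, y, family = "poisson", alpha = 0.5)

coef(cv_fit, s = "lambda.min")   # shrunken (and possibly zeroed) coefficients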

Plain GLMs are better when you are able to build the model to your liking. Elastic net GLMs are better when the data isn't fully credible and you're forced to use the given variables.

An additional note: you want all of your predictors on the same scale as the link. If you use a log link, you'll generally want to log each continuous variable you include in the model.

Rating Modeling:

With my limited experience, this is the bread and butter of pricing. Ratemaking models need to be interpretable for multiple reasons: underwriters have professional intuition they want reflected, and there may be laws/regulations on what you can and cannot rate on. The most useful guides for this are CAS Monograph 5 and the Practitioner's Guide to GLMs.

You can model frequency and severity separately or build a single pure premium model. Separating frequency and severity gives you more insight into the data: some variables may be statistically significant for frequency but not severity, and vice versa. A single model gives up that flexibility for simplicity. Having two models also gives you two chances to overfit the data.

It's also common to use weight and offset parameters. Weights tell the model how much credibility each observation deserves: the variance of an observation is assumed to be inversely proportional to its weight, so observations with more weight (say, more exposure) pull the fit harder. Offsets let you fix part of the prediction in advance, for example adjusting for a base rate when you're updating the rating variables but not the base rate itself. The offset needs to be on the scale of the linear predictor, so with a log link you offset by log(exposure); weights stay on their natural scale.

Here are some guidelines for model types (a rough R sketch follows the list):

Claim Frequency:
Response: counts/exposure
Distribution: Poisson
Link: log
Weight: exposure
Offset: None

Claim Counts:
Response: counts
Distribution: Poisson/Negative Binomial
Link: log
Weight: 1
Offset: log(exposure)

Claim Severity:
Response: loss/claim count
Distribution: Gamma/Inverse Gaussian
Link: log
Weight: number of claims
Offset: None

Pure Premium:
Response: loss/exposure
Distribution: Tweedie, p = 1.5-1.65
Link: log
Weight: exposure
Offset: None
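And here's roughly what the first three specs look like as glm() calls (the columns claim_count, exposure, loss and the predictors are all hypothetical):

Code:
# Claim frequency: counts per exposure, weighted by exposure
# (R warns about non-integer "counts" here; quasipoisson() avoids the warning)
freq_fit <- glm(
  claim_count / exposure ~ driver_age + territory,
  family  = poisson(link = "log"),
  weights = exposure,
  data    = dat
)

# Claim counts: raw counts with log(exposure) as an offset
count_fit <- glm(
  claim_count ~ driver_age + territory + offset(log(exposure)),
  family = poisson(link = "log"),
  data   = dat
)

# Claim severity: average loss per claim, weighted by claim count
sev_dat <- subset(dat, claim_count > 0)
sev_fit <- glm(
  loss / claim_count ~ driver_age + territory,
  family  = Gamma(link = "log"),
  weights = claim_count,
  data    = sev_dat
)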

The Tweedie distribution is particularly useful for single-model pure premium modeling. It's part of the exponential family with an extra power parameter p. When p = 1 it is Poisson, p = 2 is Gamma, and p in (1, 2) is a compound Poisson-Gamma distribution. Commonly used values are between 1.5 and 1.65; it really depends on the data. This additional parameter is known as a hyperparameter, a constant that is set before the model is fit. There are methods to "tune" hyperparameters to find the best value. We can get into that later.

If you use a single pure premium Tweedie model, predictions on the link scale just need to be exponentiated to get pure premiums (or predict directly on the response scale).
If you use separate frequency and severity models, multiply the two predictions together.
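A sketch of the single-model version using the statmod package from my list below (var.power is the Tweedie p, link.power = 0 means a log link, and p = 1.6 is just a placeholder you'd tune):

Code:
library(statmod)

pp_fit <- glm(
  loss / exposure ~ driver_age + territory,
  family  = tweedie(var.power = 1.6, link.power = 0),
  weights = exposure,
  data    = dat
)

# type = "response" gives pure premiums directly;
# predictions on the link scale would need exp() applied
pure_prem <- predict(pp_fit, newdata = dat, type = "response")

# Two-model equivalent: multiply frequency and severity predictions
# pure_prem <- predict(freq_fit, dat, type = "response") *
#              predict(sev_fit,  dat, type = "response")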

How you implement the model will depend on where you work. You may directly deploy your model because actuaries are beings of superior intellect, or you may be second-class citizens to underwriters and your model gets an "oh... neat". Likely it will be somewhere in between: you present what the data tells you, and they'll tell you the trends they see in the industry.

Pure Premium Model Comparison

The goal of a pure premium model isn't to get the best point prediction; it's the ability to separate low and high risks and price accordingly. That means typical measures like R-squared, adjusted R-squared, mean absolute error, and mean squared error aren't the best yardsticks.

One common method is the Gini index, which comes from a Lorenz curve. Sort your risks by predicted pure premium (ascending), then plot the cumulative share of exposure (or premium) on the x-axis against the cumulative share of actual losses on the y-axis. The line y = x is the line of equality, where every slice of premium picks up an equal share of loss, i.e. no segmentation. Your sorted predictions trace out a Lorenz curve below that line, and the further it bows away, the better the model is at separating profitable from unprofitable risks. It's hard to eyeball graphs, so we look at the area between the Lorenz curve and the line of equality; that's the Gini coefficient. Higher is better, the maximum possible area is 0.50, and it's technically more correct to multiply by 2 so the index lives in [0,1]. It's a good way to compare models, but it doesn't really measure goodness of fit.
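Here's a rough sketch of that calculation (my own quick implementation rather than a package function, so sanity-check it before relying on it):

Code:
# Gini index from an exposure-weighted Lorenz curve
gini_index <- function(loss, exposure, predicted) {
  ord      <- order(predicted / exposure)              # best to worst predicted risks
  cum_expo <- cumsum(exposure[ord]) / sum(exposure)    # x-axis: cumulative exposure share
  cum_loss <- cumsum(loss[ord])     / sum(loss)        # y-axis: cumulative loss share

  # Area between the line of equality and the Lorenz curve (trapezoid rule), times 2
  dx    <- diff(c(0, cum_expo))
  mid_x <- (c(0, head(cum_expo, -1)) + cum_expo) / 2
  mid_y <- (c(0, head(cum_loss, -1)) + cum_loss) / 2
  2 * sum(dx * (mid_x - mid_y))
}

# gini_index(dat$loss, dat$exposure, pure_prem)   # higher = better segmentation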

In terms of dating: if you have attractiveness on the y-axis and craziness on the x-axis, and you plot the people you've dated in order of lowest to highest attractiveness-to-craziness ratio, the curve would tell you how good your personality is.

Another useful metric is AIC (Akaike Information Criterion). This penalizes more complex models if the added variables add little value; a smaller AIC is better. There are built-in functions that calculate it if it isn't in the model summary. It's good for comparing models of the same type when adding or removing variables, e.g. Tweedie with age compared to Tweedie without age. The formula for AIC is 2k - 2 ln(maximum likelihood), where k is the number of parameters in the model.
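For example (AIC needs a proper likelihood, which plain glm() with the statmod Tweedie family doesn't report, so I'm showing it on a Gamma severity model instead; columns are the same hypothetical ones as earlier):

Code:
# Same model with and without driver age; smaller AIC wins
sev_with_age    <- glm(loss / claim_count ~ driver_age + territory,
                       family  = Gamma(link = "log"),
                       weights = claim_count,
                       data    = subset(dat, claim_count > 0))
sev_without_age <- update(sev_with_age, . ~ . - driver_age)

AIC(sev_with_age, sev_without_age)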

Another measure of goodness of fit is the deviance. It's better than R-squared because it takes the shape of the assumed distribution into consideration. The formulas to calculate deviance are a bit involved, but it can easily be pulled from a model summary.
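For example, reusing the severity fit from the AIC sketch above:

Code:
deviance(sev_with_age)                  # residual deviance (lower = better fit)
summary(sev_with_age)$deviance          # same number, straight from the model summary
summary(sev_with_age)$null.deviance     # intercept-only deviance, for comparison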

Model Building Tips:

Quote:
Originally Posted by itGetsBetter View Post
Thanks for pulling this together.

Lift charts are really important in insurance because the point of modeling in pricing is to differentiate between high and low loss insureds. Base rate can be adjusted up and down to get a target loss ratio, but segmentation is what the rating factor models are here for. See 7.2.1 - 7.2.3 of the GLM Monograph. Lift charts are great for discussing with underwriters too, because once you have explained how the charts work, they can see how well your model identifies the best 10% and worst 10% of risks and also keeps the insureds in order in between. A double lift chart can be used to show how much better the new version of the model is compared to the old.
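A rough sketch of a simple decile lift chart along the lines of the quote above, using dplyr/ggplot2 from the package list (pure_prem is the vector of pure premium predictions from the Tweedie sketch earlier, and the columns are the same hypothetical ones):

Code:
library(dplyr)
library(ggplot2)

lift <- dat %>%
  mutate(pred   = pure_prem,
         decile = ntile(pred, 10)) %>%   # 1 = best predicted risks, 10 = worst
  group_by(decile) %>%
  summarise(avg_pred   = weighted.mean(pred, exposure),
            avg_actual = weighted.mean(loss / exposure, exposure))

ggplot(lift, aes(x = decile)) +
  geom_line(aes(y = avg_actual, colour = "Actual")) +
  geom_line(aes(y = avg_pred,   colour = "Predicted")) +
  labs(x = "Predicted pure premium decile", y = "Average pure premium", colour = "")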
Quote:
Originally Posted by redearedslider View Post
I can't stress enough how helpful it is to get your hands dirty ASAP. A math professor of mine liked to say "Math isn't a spectator sport" and it applies here as well.


Help me fill this out!

Resources:
General Predictive Modeling
Introduction to Statistical Learning. This book is a high-level look at machine learning, with supplementary R code and without getting into the mathematics behind the methods. Also available as a free PDF.

Applied Predictive Modeling. Written by Max Kuhn (author of caret package). This book does a good job of going over all the types of predictive models with accompanying R code. It does a good job of breaking everything down and showing the pros and cons of various models. Gets a little deeper into the math.

Elements of Statistical Learning. The machine learning bible. This book is DENSE. It goes into the mathematics and theory behind all of the methods. Brush up on your Linear Algebra and eigenvectors. Probably way more than you need to know for your job, but maybe you'll use this to come up with some new actuarial specific models.

R for Data Science. Written by arguably one of the most influential people in R. If you use dplyr or anything else tidyverse, Hadley Wickham is to thank. It's free at the link, but worth a buy for a quick reference or some inspiration.

Datacamp is a great way to learn R and keep up with the latest trends. You can usually get a 50% off subscription, and they're always adding new courses. There's even a course on life insurance pricing for those wanting to abandon the CAS.

The Coursera Data Science Track by Johns Hopkins is also good. Each class has homework and projects, which give you a good basis for personal projects. You can audit the classes for free or pay a monthly fee to get an official certificate that you can put on your LinkedIn or print and stick on your refrigerator. No matter how hard they push it, I don't think employers care about the certificate, so free is fine.

Insurance Specific Modeling
CAS Monograph 5. The main reference point for rating. I read it once and maintained 20% of it. Read it again and I'm up to probably 30%.

Practitioners' Guide to GLMs (Warning: clicking the link will download a PDF). Somehow this one slipped past me until yesterday. Another user on here suggested it, and after a 30-second skim, it does a better job of explaining the things I've covered above.

Predictive Modeling In Actuarial Science Vol 1. Similar to Applied Predictive Modeling, but designed around insurance.

Predictive Modeling In Actuarial Science Vol 2. Gives actual examples of various models used in the insurance industry.


GLMMs for Ratemaking. A paper going over GLMMs for those of you who want a deeper dive.


Minimum Bias, GLMs, and Credibility... Another way to look at modeling, using minimum bias instead of maximum likelihood estimation. This paper also does a good job of comparing methods.


Insurance Premium via Gradient Tree-Boosted Tweedie.... A paper going over the TDboost R package. Boosting is a different type of machine learning algorithm; it can be useful for finding interaction terms or measuring variable importance.

HDtweedie Package. An elastic net for the Tweedie distribution; the glmnet package in R does not support the Tweedie family.

Suggested R packages:


tidyverse - this package includes most of Wickham's packages:
ggplot2: Must have for visualizations
tibble: a better version of a data frame
tidyr: tools for reshaping data into tidy form (spreading/gathering columns), complementing dplyr
readr: better way to import data. Imports as a tibble
purrr: better version of the apply family
dplyr: must have. Pretty much SQL transformations in R
stringr: better ways to deal with strings
forcats: functions to deal with categorical variables (factors)

statmod: includes QoL statistical computations and allows the Tweedie family to be used in a glm() call
rpart: for decision trees. Can be useful for finding relationships to feed into GLMs
caret: great for tuning hyperparameters and using cross-validation (see the sketch after this list). Provides QoL calculations for comparing models. Also has the createDataPartition function, which is useful for creating training/validation sets.

Matrix: great package to have when creating sparse matrices for categorical variables.
corrplot: creates a matrix to visualize correlations between continuous variables.
MICE: for various ways to handle missing data
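
As mentioned in the caret bullet above, here's a rough sketch of tuning an elastic net count model by cross-validation (data and column names are hypothetical; the alpha/lambda grid is arbitrary):

Code:
library(caret)

set.seed(42)
train_idx <- createDataPartition(dat$claim_count, p = 0.8, list = FALSE)
train_dat <- dat[train_idx, ]
test_dat  <- dat[-train_idx, ]

grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10^seq(-4, 0, length.out = 20))

fit <- train(
  claim_count ~ driver_age + territory,
  data      = train_dat,
  method    = "glmnet",
  family    = "poisson",                               # passed through to glmnet
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid  = grid
)

fit$bestTune                                                 # chosen alpha/lambda
postResample(predict(fit, test_dat), test_dat$claim_count)   # holdout RMSE/R^2/MAE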

Last edited by Actuarially Me; 04-16-2019 at 02:04 PM..
  #2  
Old 02-19-2019, 05:21 PM
itGetsBetter itGetsBetter is online now
Member
CAS AAA
 
Join Date: Feb 2016
Location: Midwest
Studying for Awaiting Exam 9 Result
Favorite beer: Spruce Springsteen
Posts: 248
Default

Thanks for pulling this together.

Lift charts are really important in insurance because the point of modeling in pricing is to differentiate between high and low loss insureds. Base rate can be adjusted up and down to get a target loss ratio, but segmentation is what the rating factor models are here for. See 7.2.1 - 7.2.3 of the GLM Monograph. Lift charts are great for discussing with underwriters too, because once you have explained how the charts work, they can see how well your model identifies the best 10% and worst 10% of risks and also keeps the insureds in order in between. A double lift chart can be used to show how much better the new version of the model is compared to the old.
  #3  
Old 02-19-2019, 09:24 PM
redearedslider redearedslider is offline
Member
CAS
 
Join Date: Oct 2015
Posts: 13,804
Default

This is shaping up to be a great resource.

My two cents:

The bulk of the thread so far is in regards to predictive modeling in pricing but there is also application in reserving. Worth being familiar with the general idea as it's been picking up steam (e.g. A.M. Best moving to a stochastic BCAR framework).

I can't stress enough how helpful it is to get your hands dirty ASAP. A math professor of mine liked to say "Math isn't a spectator sport" and it applies here as well.
__________________
Quote:
Originally Posted by Abraham Weishaus View Post
ASM does not have a discussion of stimulation, but considering how boring the manual is, maybe it would be a good idea.

Last edited by redearedslider; 02-20-2019 at 03:19 AM..
  #4  
Old 02-20-2019, 08:58 AM
Actuarially Me Actuarially Me is offline
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Default

Quote:
Originally Posted by redearedslider View Post
This is shaping up to be a great resource.

My two cents:

The bulk of the thread so far is in regards to predictive modeling in pricing but there is also application in reserving. Worth being familiar with the general idea as it's been picking up steam (e.g. A.M. Best moving to a stochastic BCAR framework).

I can't stress enough how helpful it is to get your hands dirty ASAP. A math professor of mine liked to say "Math isn't a spectator sport" and it applies here as well.
Definitely agree on getting your hands dirty ASAP. Added it to the tips. I've taken a ton of DataCamp courses, but the material doesn't stick unless I go back and apply it somehow. There are tons of datasets on Kaggle to test out methods. RStudio has a free publishing platform, so you can write up an R Markdown document and publish it to rpubs.com. It's good for building a portfolio if you plan on interviewing.

I don't have much experience on the reserving side to write something up, but if someone else has something, I'd add it. My experience was with self-insured reserving, mostly WC and GL. One model we did was claims triage, which predicted high-risk claims and sent them to the TPA. We were starting to look into using it to predict ultimates, but I left before that got off the ground.
  #5  
Old 02-20-2019, 08:59 AM
Actuarially Me Actuarially Me is offline
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Default

Quote:
Originally Posted by itGetsBetter View Post
Thanks for pulling this together.

Lift charts are really important in insurance because the point of modeling in pricing is to differentiate between high and low loss insureds. Base rate can be adjusted up and down to get a target loss ratio, but segmentation is what the rating factor models are here for. See 7.2.1 - 7.2.3 of the GLM Monograph. Lift charts are great for discussing with underwriters too, because once you have explained how the charts work, they can see how well your model identifies the best 10% and worst 10% of risks and also keeps the insureds in order in between. A double lift chart can be used to show how much better the new version of the model is compared to the old.
Thanks! Added your quote to the tips! Going to print out the practitioners guide that you suggested in another thread. I can't believe I passed that up while looking for articles lol.
  #6  
Old 02-20-2019, 09:13 AM
ALivelySedative ALivelySedative is offline
Member
CAS
 
Join Date: Dec 2013
Location: Land of the Pine
College: UNC-Chapel Hill Alum
Favorite beer: Red Oak
Posts: 3,160
Default

I've been through some of these and some of them I've never heard of. Regardless, my biggest problem is currently (and will be going forward) convincing management/IT to record data at a level raw enough to actually implement a model appropriately (e.g. variables are currently captured at an already binned/resolved categorical value, and the raw figures are lost on an exposure base level). If there's a text source out there to help solve that problem, I'd love to hear about it.
__________________
Stuff | 6 | ACAS | FCAS stuff
  #7  
Old 02-20-2019, 09:35 AM
Actuarially Me Actuarially Me is offline
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Default

Quote:
Originally Posted by ALivelySedative View Post
I've been through some of these and some of them I've never heard of. Regardless, my biggest problem is currently (and will be going forward) convincing management/IT to record data at a level raw enough to actually implement a model appropriately (e.g. variables are currently captured at an already binned/resolved categorical value, and the raw figures are lost on an exposure base level). If there's a text source out there to help solve that problem, I'd love to hear about it.
Getting the data can be the hardest part. The actual model only accounts for a couple of lines of code, so most of your time is spent gathering and understanding data.

It really depends on what you're trying to record. Some things can be scraped from websites, but accuracy is only as good as the website.

Sometimes it's just better to go to a vendor. Verisk is one of the more respected vendors, so it might be easier to get management on board, but they can be expensive. There are a lot of other insurance tech startups with much better prices. There's one vendor that has a cache of fire hydrant locations; something like that is definitely cheaper to buy from a vendor than to build yourself.

Distance calculations can likely be done in house. The Google Maps API is pretty cheap, and you can use it to geocode addresses. There's publicly available data on fire station locations, police stations, schools, hospitals, etc. that could add predictive power for certain lines.
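For example, once addresses are geocoded to longitude/latitude, a nearest-fire-station distance only takes a few lines (the geosphere package isn't in my list above, it's just one option; the data frames and column names here are hypothetical):

Code:
library(geosphere)

risk_coords    <- as.matrix(risks[, c("lon", "lat")])          # one row per insured location
station_coords <- as.matrix(fire_stations[, c("lon", "lat")])  # public fire station data

# Haversine distance (in meters) from every risk to every station, keep the nearest
dist_matrix <- distm(risk_coords, station_coords, fun = distHaversine)
risks$dist_to_nearest_station <- apply(dist_matrix, 1, min)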
  #8  
Old 02-20-2019, 09:59 AM
ALivelySedative ALivelySedative is offline
Member
CAS
 
Join Date: Dec 2013
Location: Land of the Pine
College: UNC-Chapel Hill Alum
Favorite beer: Red Oak
Posts: 3,160
Default

Quote:
Originally Posted by Actuarially Me View Post
Getting the data can be the hardest part. The actual model only accounts for a couple of lines of code, so most of your time is spent gathering and understanding data.

It really depends on what you're trying to record. Some things can be scraped from websites, but accuracy is only as good as the website.

Sometimes it's just better to go to a vendor. Verisk is one of the more respected vendors, so it might be easier to get management on board, but they can be expensive. There are a lot of other insurance tech startups with much better prices. There's one vendor that has a cache of fire hydrant locations; something like that is definitely cheaper to buy from a vendor than to build yourself.

Distance calculations can likely be done in house. The Google Maps API is pretty cheap, and you can use it to geocode addresses. There's publicly available data on fire station locations, police stations, schools, hospitals, etc. that could add predictive power for certain lines.
No no, I mean basic stuff. Driving experience is not recorded as a raw value. The combination of drivers' experience on a policy is binned into some categorical value, which is then recorded for all risks on the policy. So not only is it not recorded for an individual, they're certainly not matched to an actual exposure (car-year). I'm just trying to convince people that this is a bad approach to our own data.
__________________
Stuff | 6 | ACAS | FCAS stuff
  #9  
Old 02-20-2019, 10:01 AM
Actuarially Me Actuarially Me is offline
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Default

Quote:
Originally Posted by ALivelySedative View Post
No no, I mean basic stuff. Driving experience is not recorded as a raw value. The combination of drivers' experience on a policy is binned into some categorical value, which is then recorded for all risks on the policy. So not only is it not recorded for an individual, they're certainly not matched to an actual exposure (car-year). I'm just trying to convince people that this is a bad approach to our own data.
Big oof!
  #10  
Old 02-20-2019, 10:07 AM
ALivelySedative ALivelySedative is offline
Member
CAS
 
Join Date: Dec 2013
Location: Land of the Pine
College: UNC-Chapel Hill Alum
Favorite beer: Red Oak
Posts: 3,160
Default

Quote:
Originally Posted by Actuarially Me View Post
Big oof!
Yup. Been that way for a couple decades at least. I've convinced my actuarial superiors that this is true, but I haven't convinced anyone to actually do anything about it from the IT side. Adverse selection is rampant!
__________________
Stuff | 6 | ACAS | FCAS stuff