

#1




Predictive Modeling Material
I get time allocated to research/learning. Typing things out and explaining them helps keep things fresh in my mind, so I'm counting this as learning time. Hopefully others can benefit from this as well. I'll probably update this with my research and to fix the formatting. The initial post is probably going to be a bird's-eye view, but that bird vomits on you with a ton of words that are barely cohesive. I'll clean it up over time and implement any suggestions that others may have.
Predictive Modeling In Insurance:

If at any point I am incorrect, please let me know and I'll adjust this post. I am self-taught (outside of the advice given here), so I may have made some incorrect assumptions along the way.

I focused on learning in R because I had familiarity with it, so I can only speak to that. If I could turn back time (if I could find a way), I would probably learn Python. It is more flexible and plays nicely with other languages. It also has packages that do GPU processing for those of you with large data sets.

GLMs aren't used often outside of insurance. Usually the goal is predictive accuracy, so less interpretable "black box" methods are used. Most relationships aren't linear. One moment you're celebrating your one year anniversary and the next you're getting yelled at for not folding the towels properly. There are extensions of GLMs such as GLMMs (Generalized Linear Mixed Models), which shrink coefficients in a way similar to Buhlmann-Strauss credibility, and GAMs (Generalized Additive Models), which use smoothing splines to break up the linearity. With GLMMs and GAMs, you trade some interpretability for predictive power.

Problems such as territory clustering, fraud detection, or claims triage usually call for more advanced models. The best method varies by problem, and new algorithms keep coming out. Checking out Kaggle competitions can give good insight into the models currently being used. For pure premium modeling, interpretability and ease of deployment are valued, so GLMs are used since they can be reduced to a simple multiplicative model. Either way, domain knowledge is key and most of the time will be spent getting a better understanding of the underlying data and its relationships. It's a Groundhog Day of model iteration. But hey, "running a model" is the new "compiling code".

Generalized Linear Models for Regression

GLMs take a linear model and, well... generalize it.
A GLM lets you apply a link function to the mean of the response variable (what you're trying to predict) and choose a distribution for it. For insurance, we're usually dealing with claims data, so we want a positive domain. Commonly used distributions are Poisson, Gamma, Inverse Gaussian, and Tweedie. Common link functions are log and identity. Using a log link makes the mean of the response the exponential of the linear predictor, which creates a multiplicative model instead of an additive one. Multiplicative models are also easier to interpret, as factors less than 1 decrease the rate while factors greater than 1 increase it.

GLMs are very interpretable. Each variable and factor has a coefficient, so there is a closed form equation you can apply.

GLMs handle both continuous data and categorical data. When processing categorical data, the GLM creates a dummy binary variable for each category (except the base category). For each categorical variable with n categories, the model has n-1 additional parameters. For example, if you're fitting 50 U.S. states (excl. DC), you're going to have 49 additional parameters to fit. High cardinality adds complexity, which can lead to overfitting. It's best to reduce the number of categories, but sometimes it's out of your control (underwriter preference). Some categories may not be credible, so the standard error of the coefficient will be much higher and/or you'll get unrealistic coefficients. GLMs will attempt to find an estimate for every variable and category you introduce to the model.

Luckily for those who have no choice but to include high cardinality variables, there's an extension of the GLM called Elastic Net Regression. In a nutshell, this method is a way of penalizing coefficients that aren't credible. It does that by shrinking the coefficients (Ridge Regression) or setting the coefficients to zero (Lasso Regression).
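To make the "shrink vs. set to zero" distinction concrete, here is a minimal sketch of what the penalty does to a single coefficient. It assumes the simplest possible setting (one standardized predictor, squared-error loss), where the penalized estimate has a closed form; a real penalized GLM fit repeats an update like this across all coefficients. The function names, the parameters lam (penalty strength) and alpha (0 = ridge, 1 = lasso), and the example coefficients are all made up for illustration.

```python
def soft_threshold(z, t):
    """Move z toward zero by t, stopping at zero (the lasso operator)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_shrink(b, lam, alpha):
    """Penalized estimate for one standardized predictor.

    b     : unpenalized coefficient estimate
    lam   : penalty strength (0 means no penalty)
    alpha : 0 = ridge, 1 = lasso, anything in between = a mix
    """
    return soft_threshold(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))


raw = [0.80, 0.05, -0.30]  # hypothetical unpenalized coefficients

# Ridge scales every coefficient toward zero but never reaches it.
ridge = [elastic_net_shrink(b, 0.1, 0.0) for b in raw]

# Lasso subtracts a fixed amount, so small coefficients become exactly zero.
lasso = [elastic_net_shrink(b, 0.1, 1.0) for b in raw]

print(ridge)
print(lasso)
```

In this toy case the small middle coefficient survives ridge (just smaller) but is dropped entirely by lasso, which is why lasso doubles as a variable selection tool.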
I won't get into the math behind them, but Elastic Net has a couple of parameters: Lambda (severity of the penalty) and Alpha (shrinkage type). If Lambda is 0, we have a standard GLM. As Lambda increases, so does the penalty for large coefficients. Alpha is in [0,1] and determines the type of shrinkage: 0 = Ridge Regression, 1 = Lasso Regression, and anything in between is a weighted average of the two.

GLMs are better when you are able to build the model to your liking. GLM Elastic Nets are better when your data isn't fully credible and you are forced to use the given variables.

An additional note: You want all of your data to be on the same scale. If you use a log link, you'll want to log every continuous variable you include in the model.

Rating Modeling:

With my limited experience, this is the bread and butter of pricing. Ratemaking models need to be interpretable for multiple reasons. Underwriters have professional intuition they want to include, or there may be laws/regulations on what you can/cannot rate on. The most useful guidelines for this are CAS Monograph 5 and the Practitioner's Guide to GLMs.

You can model frequency and severity separately or create a single pure premium model. Separating frequency and severity gives you more insight into the data. Some variables may be statistically significant in terms of frequency, but not severity, and vice versa. A single model gives up that flexibility for simplicity. Having two models also gives you two chances to overfit the data.

It's also common to use weight and offset parameters for pure premium modeling. Weights tell the model how much variance to expect from each observation: observations with a higher weight are treated as less noisy and given more credibility. Offsets act as a way for you to adjust for a base rate. They're useful when you're updating the rating variables, but not the base rate. The offset needs to be on the same scale as the linear predictor, so with a log link you pass log(exposure); weights are passed on their natural scale.
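Here's a small sketch of why the offset goes in as log(exposure) under a log link: the offset is just a term in the linear predictor with its coefficient fixed at 1, so after exponentiating it becomes a multiplier on the predicted rate. The function name, the coefficient values, and the "young driver" dummy variable are hypothetical and only there to make the arithmetic visible.

```python
import math


def expected_claim_count(exposure, intercept, coefs, x):
    """Predicted claim count from a log-link model with offset log(exposure)."""
    # Linear predictor on the log scale; the offset enters with coefficient 1.
    eta = intercept + sum(c * v for c, v in zip(coefs, x)) + math.log(exposure)
    # Inverse of the log link.
    return math.exp(eta)


# Hypothetical fitted values: base frequency of 0.10 claims per exposure unit
# and a 1.5 relativity for a "young driver" dummy variable.
intercept = math.log(0.10)
coefs = [math.log(1.5)]

# Two exposure units, young driver flag = 1.
mu = expected_claim_count(2.0, intercept, coefs, [1])

# Same number written as a multiplicative rating formula.
multiplicative = 2.0 * 0.10 * 1.5

print(mu, multiplicative)
```

The two results agree up to floating point error, which is the "GLM reduces to a simple multiplicative model" point in miniature: exponentiated coefficients are relativities, and exposure scales the result.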
Here are some guidelines for model types:

Claim Frequency:
Response: counts/exposure
Distribution: Poisson
Link: log
Weight: exposure
Offset: None

Claim Counts:
Response: counts
Distribution: Poisson/Negative Binomial
Link: log
Weight: 1
Offset: log(exposure)

Claim Severity:
Response: loss/claim count
Distribution: Gamma/Inverse Gaussian
Link: log
Weight: number of claims
Offset: None

Pure Premium:
Response: loss/exposure
Distribution: Tweedie, p = 1.5-1.65
Link: log
Weight: exposure
Offset: None

With a log link, the log is applied to the mean inside the model, so the response itself is not logged beforehand.

The Tweedie distribution is particularly useful in single model pure premium modeling. It's part of the exponential family with an extra power parameter p. When p=1 it is a Poisson, p=2 is a Gamma, and p in (1,2) is a compound Poisson-Gamma distribution. Commonly used values are between 1.5 and 1.65. It really depends on the data. This additional parameter is known as a hyperparameter, a value that is set before the model is fit. There are methods to "tune" hyperparameters to find the best value. We can get into that later.

If you use a single pure premium Tweedie model, your predictions on the log scale just need to be exponentiated to get real values. If using separate frequency and severity models, multiply the predictions together.

How you implement the model will depend on where you work. You may directly deploy your model because actuaries are beings of superior intellect, or you may be second class citizens to underwriters where your model is an "oh.. neat". Likely, it will be somewhere in between. You present what the data tells you and they'll tell you the trends they see in the industry.

Pure Premium Model Comparison

The goal of a pure premium model isn't to get the best prediction; it's to distinguish low risks from high risks and price accordingly. That means typical measures like R Squared, adjusted R Squared, Mean Absolute Error, and Mean Squared Error aren't the best measures.

One common method is the Gini Index, which is determined from a Lorenz Curve.
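As a rough sketch of how a Gini index can be computed: sort the risks from lowest to highest predicted pure premium, accumulate the share of exposure (x) and the share of actual losses (y) to trace out the Lorenz curve, then take twice the area between that curve and the line y=x. The function name and inputs are made up for illustration, and putting exposure rather than premium on the x-axis is an assumption (it is one common construction). The result is already doubled, so it sits on the [0,1] scale.

```python
def gini_index(predicted, losses, exposures):
    """Gini index from a Lorenz curve ordered by predicted pure premium."""
    # Order records from the lowest to the highest predicted pure premium.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])

    total_exp = sum(exposures)
    total_loss = sum(losses)

    cum_exp = cum_loss = 0.0
    prev_x = prev_y = 0.0
    area_under_lorenz = 0.0

    for i in order:
        cum_exp += exposures[i]
        cum_loss += losses[i]
        x = cum_exp / total_exp    # cumulative share of exposure
        y = cum_loss / total_loss  # cumulative share of actual losses
        # Trapezoid rule for the area under this segment of the Lorenz curve.
        area_under_lorenz += (x - prev_x) * (y + prev_y) / 2.0
        prev_x, prev_y = x, y

    # The area under y=x is 0.5, so twice the gap is 1 - 2 * (area under curve).
    return 1.0 - 2.0 * area_under_lorenz


# Toy data: the model ranks the only risk with a loss as the worst risk.
g_good = gini_index([1, 2, 3, 4], [0, 0, 0, 10], [1, 1, 1, 1])

# Toy data: every risk has the same loss, so ranking them achieves nothing.
g_flat = gini_index([1, 2, 3, 4], [5, 5, 5, 5], [1, 1, 1, 1])

print(g_good, g_flat)
```

With only four records the first case cannot reach 1 even though the ordering is perfect; the index is best used to compare models on the same data rather than read as an absolute score.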
If you plot premiums on the x-axis and losses on the y-axis, the line y=x is a "perfect model", where premiums are exactly equal to losses. If you sort and plot your pure premium predictions in ascending order, you will get a Lorenz curve. This shows the model's ability to identify profitable risks. It's hard to eyeball graphs, so we can look at the area between the Lorenz curve and the perfect model. This is known as the Gini coefficient. Higher is better. The best possible model would be the entire area under the perfect model, which would be a Gini coefficient of 0.50. It's technically more correct to multiply by 2 so the domain is in [0,1]. This is something you can use to compare models, but it doesn't really measure goodness of fit.

In terms of dating, if you have attractiveness on the y-axis and craziness on the x-axis, and you plot the people you dated in order of lowest to highest attractive/crazy ratio, the curve would tell you how good your personality is.

Another useful metric is AIC (Akaike Information Criterion). This penalizes more complex models if the added variables add little value. A smaller AIC is better. There are built-in functions that calculate this if it's not available in the model summary. This is good for comparing models of the same type when adding or removing variables, e.g. Tweedie with Age compared to Tweedie without Age. The formula for AIC is 2k - 2ln(L), where k is the number of parameters in the model and L is the maximized likelihood.

A measure of goodness-of-fit is the deviance. This is better than R Squared because it takes the shape of the distribution into consideration. The formulas to calculate deviance are a bit complicated, but it can easily be pulled from a model summary.

Model Building Tips:
Help me fill this out!

Resources:

General Predictive Modeling

Introduction to Statistical Learning. This book is a high level look at Machine Learning. Has supplementary R coding without getting into the mathematics behind the methods. Also available in pdf for free.

Applied Predictive Modeling. Written by Max Kuhn (author of the caret package). This book does a good job of going over all the types of predictive models with accompanying R code. It does a good job of breaking everything down and showing the pros and cons of various models. Gets a little deeper into the math.

Elements of Statistical Learning. The machine learning bible. This book is DENSE. It goes into the mathematics and theory behind all of the methods. Brush up on your Linear Algebra and eigenvectors. Probably way more than you need to know for your job, but maybe you'll use this to come up with some new actuarial specific models.

R for Data Science. Written by arguably one of the most influential people in R. If you use dplyr or anything else tidyverse, Hadley Wickham is to thank. It's free at the link, but worth a buy for a quick reference or some inspiration.

Datacamp is a great way to learn R and keep up with current trends. You can usually get a 50% off subscription. They're always adding new courses. There's even a course on Life Insurance Pricing for those wanting to abandon CAS.

Coursera Data Science Track by Johns Hopkins is also good. Each class has homework and projects, which give you a good basis for personal projects. You can audit the class for free or pay a monthly fee to get an official certificate that you can put on your LinkedIn or print and place on your refrigerator. No matter how hard they push it, I don't think employers care about the certificate, so free is fine.

Insurance Specific Modeling

CAS Monograph 5. The main reference point for rating. I read it once and retained 20% of it. Read it again and I'm up to probably 30%.
Practitioner's Guide to GLMs (Warning: Clicking link will download PDF). Somehow this one slipped past me until yesterday. Another user on here suggested it, and after a 30 second overview, it does a better job of explaining things than I have done above.

Predictive Modeling In Actuarial Science Vol 1. Similar to Applied Predictive Modeling, but designed around insurance.

Predictive Modeling In Actuarial Science Vol 2. Gives actual examples of various models used in the insurance industry.

GLMMs for Ratemaking. A paper going over GLMMs for those of you who want a deeper dive.

Minimum Bias, GLMs, and Credibility... Another way to look at modeling, using minimum bias instead of maximum likelihood estimation. This paper also does a good job of comparing methods.

Insurance Premium via Gradient Tree-Boosted Tweedie.... A paper going over the TDboost R package. Boosting is a different type of machine learning algorithm. It can be useful for determining interaction terms or measuring variable importance.

HDtweedie Package. An elastic net for the Tweedie distribution. The glmnet package in R does not support the Tweedie.

Suggested R packages:

tidyverse - this package includes most of Wickham's packages:
    ggplot2: must have for visualizations
    tibble: a better version of a data frame
    tidyr: I don't remember what this has in addition to dplyr
    readr: better way to import data. Imports as a tibble
    purrr: better version of the apply family
    dplyr: must have. Pretty much SQL transformations in R
    stringr: better ways to deal with strings
    forcats: functions to deal with categorical variables
statmod: includes QoL statistical computations and allows the Tweedie family to be used in a glm call
rpart: for decision trees. Can be useful to find relationships to apply to GLMs
caret: great for testing models with hyperparameters and using cross validation. Provides QoL calculations for comparing models. Also has the function createDataPartition, which is useful for creating training/validation sets.
Matrix: great package to have when creating sparse matrices for categorical variables.
corrplot: creates a matrix to visualize correlations between continuous variables.
mice: for various ways to handle missing data

Last edited by Actuarially Me; 04-16-2019 at 02:04 PM.
#2




Thanks for pulling this together.
Lift charts are really important in insurance because the point of modeling in pricing is to differentiate between high and low loss insureds. Base rate can be adjusted up and down to get a target loss ratio, but segmentation is what the rating factor models are here for. See 7.2.1 - 7.2.3 of the GLM Monograph. Lift charts are great for discussing with underwriters too, because once you have explained how the charts work, they can see how well your model identifies the best 10% and worst 10% of risks and also keeps the insureds in order in between. A double lift chart can be used to show how much better the new version of the model is compared to the old.
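For anyone who hasn't built one, here is a minimal sketch of the table behind a simple lift chart: sort the risks by predicted pure premium, split them into buckets of roughly equal exposure, and compare average predicted vs. average actual pure premium in each bucket. The function name, the bucketing rule (a record goes where the midpoint of its exposure falls), and the toy numbers are assumptions for illustration, not the Monograph's exact procedure.

```python
def lift_table(pred_pp, actual_loss, exposure, n_buckets=10):
    """Return (avg predicted, avg actual) pure premium per bucket, best risks first."""
    # Order records from lowest to highest predicted pure premium.
    order = sorted(range(len(pred_pp)), key=lambda i: pred_pp[i])
    total_exp = sum(exposure)

    buckets = [{"exposure": 0.0, "pred": 0.0, "actual": 0.0} for _ in range(n_buckets)]
    cum_exp = 0.0
    for i in order:
        # Assign the record to a bucket based on where its midpoint falls
        # in cumulative exposure, so buckets hold roughly equal exposure.
        mid = cum_exp + exposure[i] / 2.0
        b = min(int(mid / total_exp * n_buckets), n_buckets - 1)
        buckets[b]["exposure"] += exposure[i]
        buckets[b]["pred"] += pred_pp[i] * exposure[i]  # predicted loss dollars
        buckets[b]["actual"] += actual_loss[i]          # actual loss dollars
        cum_exp += exposure[i]

    # Convert dollars back to pure premiums, skipping any empty buckets.
    return [
        (b["pred"] / b["exposure"], b["actual"] / b["exposure"])
        for b in buckets
        if b["exposure"] > 0
    ]


# Toy data: four risks with one exposure unit each, split into two buckets.
table = lift_table(
    pred_pp=[100, 200, 300, 400],
    actual_loss=[90, 210, 280, 450],
    exposure=[1, 1, 1, 1],
    n_buckets=2,
)
print(table)
```

Plotting the two columns against bucket number gives the chart: you want actual to rise steadily from the first bucket to the last and to track predicted closely. For a double lift chart you would sort by the ratio of the two models' predictions instead.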
#3




This is shaping up to be a great resource.
My two cents: The bulk of the thread so far is in regards to predictive modeling in pricing but there is also application in reserving. Worth being familiar with the general idea as it's been picking up steam (e.g. A.M. Best moving to a stochastic BCAR framework). I can't stress enough how helpful it is to get your hands dirty ASAP. A math professor of mine liked to say "Math isn't a spectator sport" and it applies here as well.
Last edited by redearedslider; 02-20-2019 at 03:19 AM.
#4




I don't have much experience on the reserving side to type something up, but if someone else has something, I'd add it. My experience was for self insured reserving with mostly WC and GL. One model we did was claim triage, which predicted high risk claims and sent them to the TPA. We were starting to look into using it to predict Ultimates, but I left before that got off the ground. 
#5





#6




I've been through some of these and some of them I've never heard of. Regardless, my biggest problem is currently (and will be going forward) convincing management/IT to record data at a level raw enough to actually implement a model appropriately (e.g. variables are currently captured at an already binned/resolved categorical value, and the raw figures are lost on an exposure base level). If there's a text source out there to help solve that problem, I'd love to hear about it.

#7




It really depends on what you're trying to record. There are some things that can be scraped from websites, but accuracy is only as good as the website. Sometimes, it's just better to go to a vendor. Verisk is one of the more respected vendors, so it might be easier to get management on board. However, they can be expensive. There are a lot of other insurance tech startups that have much better prices. There's one vendor that has a cache of fire hydrant locations. Something like that is definitely cheaper to buy from a vendor than to build yourself. Distance calculations can likely be done in house. Google Maps API is pretty cheap and you can use it to geocode addresses. There's publicly available data on fire station locations, police locations, schools, hospitals, etc. that could add predictability to certain lines.
#8





#9





#10




Yup. Been that way for a couple decades at least. I've convinced my actuarial superiors that this is true, but I haven't convinced anyone to actually do anything about it from the IT side. Adverse selection is rampant!


