Actuarial Outpost
 
Actuarial Outpost > Actuarial Discussion Forum > Property - Casualty / General Insurance
  #1  
Old 04-11-2019, 11:15 AM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 134
Anyone help with by-peril GLMs?

Background:
I've been tasked with creating a rating model by Peril using GLMs. It's commercial lines property, so the data is pretty sparse. The carriers have been asking for premiums by Peril, so we're going with it regardless of whether it's a better model than a single pure premium model. We're also working under the assumption that Perils are independent. Excluding CAT, that assumption isn't too far off.

My data goes back 15 years, but only 2011-2017 have complete information on some variables, so I only use those years. After doing all the necessary scrubbing, I'm sitting at only 50,000 policies, with about 6.5% having an incurred claim.

Split by Peril:
Peril1 has 900 policies with claims. Of those, 850 have 1 claim, 45 have 2 claims, and 5 have 3 claims.

Peril2 and Peril3 have 1,500 and 800 claims respectively, with a similar claim count breakout as Peril1.

I split the data using a random partition of 70% train, 30% test, and leave 2017 as a validation set. The same breakout percentages are preserved.

For the frequency models, they're in the R format:
Code:
glm(formula = Count_Peril1 ~ Variables,
    family = poisson(link = "log"),
    offset = log(Exposure),
    data = data.train)
I log all of the continuous variables. My main measure of performance is the Gini. For frequency, that's Counts on the X axis and Exposure on the Y axis. The code I use to create the Gini is this:

Code:
o <- with(f.model1, order(prediction))
x <- with(data.test, cumsum(Count_Peril1[o]) / sum(Count_Peril1[o]))
y <- with(data.test, cumsum(Exposure[o]) / sum(Exposure[o]))
dx <- x[-1] - x[-length(x)]
h <- (y[-1] + y[-length(y)]) / 2
gini.peril1 <- 2*(.5 - sum(h*dx))
I also use AIC to compare models.
I've been using deviance ratio as a replacement for R^2, i.e. how much of the data is actually explained by the model.

I create the deviance ratio by:
Code:
deviance <- 1-(model.peril1$deviance / model.peril1$null.deviance)

I measure severity with a GLM as well: the target variable is Inc_Peril1, the family is Gamma, the offset is log(Count_Peril1), and it's fit on the subset where Count_Peril1 > 0.

The Gini is calculated similarly to the above, but the x axis is Count_Peril1 and the y axis is Inc_Peril1.

Problem:
I feel like my models are horrible and I don't know how/where to improve them first. The Q-Q plots suggest I'm using the wrong distribution. I've tried using negative binomial with various thetas, but that didn't seem to work.

Also, when I create the null model just looking at the intercept, the Gini is much higher than when I include any variables. The AIC and deviance are worse though. Not sure why that's the case.

While testing different variables, I check the summary to see if they're statistically significant. Then I'll look at the Gini, AIC, and deviance. I'll add/remove variables checking for improvements. Once I start only marginally increasing the Gini, I'll do an anova chi squared test to determine which models are best. Gini varies from .08-.22, which sounds pretty horrible, but I have no benchmark. When I run the summary plots, they all look pretty bad.

Here's an example of Peril1's frequency model (Gini of .12). [Attached images: model summary, Q-Q plot, residual plot, and Cook's distance.]

Where do I go from here to improve the model?
  #2  
Old 04-11-2019, 12:33 PM
Vorian Atreides
Wiki/Note Contributor
CAS
 
Join Date: Apr 2005
Location: As far as 3 cups of sugar will take you
Studying for ACAS
College: Hard Knocks
Favorite beer: Most German dark lagers
Posts: 63,806

For frequency, you'll want to calculate crunched residuals instead of "raw" residuals.

See page/slide 13 & 14 of this CAS document for more information.
__________________
I find your lack of faith disturbing

Why should I worry about dying? It's not going to happen in my lifetime!


Freedom of speech is not a license to discourtesy

#BLACKMATTERLIVES
  #3  
Old 04-11-2019, 12:36 PM
itGetsBetter
Member
CAS AAA
 
Join Date: Feb 2016
Location: Midwest
Studying for Awaiting Exam 9 Result
Favorite beer: Spruce Springsteen
Posts: 248

1. Try some capping to improve the fit of the distribution and the plots.
2. Check out how the models are performing at segmenting the risks by using a lift chart from "7.2.3. Loss Ratio Charts" of the GLM Monograph (https://www.casact.org/pubs/monograp...hare-Tevet.pdf). You can use PP, freq, or sev instead of LR in the lift chart.
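For instance, a decile version of that chart could be sketched like this — the column names pred.pp, actual.pp, and Exposure are placeholders, not fields from your actual data:

```r
library(dplyr)
library(ggplot2)

# Sketch of a decile lift chart: bucket the test set into 10 equal-count
# groups by predicted pure premium, then compare the exposure-weighted
# average prediction to the actual in each bucket.
lift <- data.test %>%
  arrange(pred.pp) %>%
  mutate(decile = ntile(pred.pp, 10)) %>%
  group_by(decile) %>%
  summarize(avg.pred   = weighted.mean(pred.pp, Exposure),
            avg.actual = weighted.mean(actual.pp, Exposure))

ggplot(lift, aes(x = decile)) +
  geom_line(aes(y = avg.pred,   colour = "Predicted")) +
  geom_line(aes(y = avg.actual, colour = "Actual"))
```

A model that segments well shows avg.actual increasing across deciles and roughly tracking avg.pred.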
__________________
P | FM | MFE | C | S | 5 | 6 | 7 | 8 | 9
VEEs | Course 1 | Course 2 |
  #4  
Old 04-11-2019, 12:42 PM
Vorian Atreides
Wiki/Note Contributor
CAS
 
Join Date: Apr 2005
Location: As far as 3 cups of sugar will take you
Studying for ACAS
College: Hard Knocks
Favorite beer: Most German dark lagers
Posts: 63,806

Quote:
Originally Posted by itGetsBetter View Post
1. Try some capping to improve the fit of the distribution and the plots.
2. Check out how the models are performing at segmenting the risks by using a lift chart from "7.2.3. Loss Ratio Charts" of the GLM Monograph (https://www.casact.org/pubs/monograp...hare-Tevet.pdf). You can use PP, freq, or sev instead of LR in the lift chart.
Based on the graphs shown, I can clearly tell which "observations" had claims and which didn't. In some cases, you can clearly identify which ones had zero, one, or two+ claims.

He'll want to address that first before trying to "improve" the model.
  #5  
Old 04-11-2019, 01:45 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 134

Quote:
Originally Posted by Vorian Atreides View Post
For frequency, you'll want to calculate crunched residuals instead of "raw" residuals.
Thanks! Hadn't come across that presentation yet. Very useful to see. The largest of my predicted frequency values is .22, so not one prediction even comes close to a claim. Guessing that's due to the zero inflation.

Because of this, the crunched residual plot shows a band for 0 claims, 1 claim, and 2 claims. (There were none larger than 2 in the test set).

Did I even calculate crunched residuals correctly?
Code:
data <- data %>%
  arrange(Count_Peril1) %>%
  mutate(res = Count_Peril1 - Predict_Peril1,
         bucket = cut(res, 500))

ggplot(data, aes(x = res, y = bucket)) + geom_point()
It produces this image

Assuming my code is correct, there's definitely something wrong with my model. The average claim amount is $50k, so would it be better to do a logistic model to predict whether or not there's a claim, THEN do a severity model offset by the number of claims?
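For concreteness, the two-part alternative I'm describing would look something like this — just a sketch, where HasClaim_Peril1 is a derived 0/1 indicator (not a field in my data) and Variables stands in for the predictor list as before:

```r
# Occurrence: logistic model for whether the policy has any Peril1 claim.
# HasClaim_Peril1 = as.integer(Count_Peril1 > 0) is derived, not raw data.
occ.peril1 <- glm(HasClaim_Peril1 ~ Variables,
                  family = binomial(link = "logit"),
                  data = data.train)

# Severity: same Gamma model as before, fit only on policies with claims.
sev.peril1 <- glm(Inc_Peril1 ~ Variables,
                  family = Gamma(link = "log"),
                  offset = log(Count_Peril1),
                  data = subset(data.train, Count_Peril1 > 0))
```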
  #6  
Old 04-11-2019, 02:08 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 134

Quote:
Originally Posted by itGetsBetter View Post
1. Try some capping to improve the fit of the distribution and the plots.
2. Check out how the models are performing at segmenting the risks by using a lift chart from "7.2.3. Loss Ratio Charts" of the GLM Monograph (https://www.casact.org/pubs/monograp...hare-Tevet.pdf). You can use PP, freq, or sev instead of LR in the lift chart.
The stuff in Monograph 5 Chapter 7 is for when you're comparing pure premiums. I can do that once I combine my frequency and severity models, but it'd be harder to tell which part of my model sucks more. Since it's mostly frequency driven, it will probably be that.
  #7  
Old 04-11-2019, 02:20 PM
Vorian Atreides
Wiki/Note Contributor
CAS
 
Join Date: Apr 2005
Location: As far as 3 cups of sugar will take you
Studying for ACAS
College: Hard Knocks
Favorite beer: Most German dark lagers
Posts: 63,806

Quote:
Originally Posted by Actuarially Me View Post
Thanks! Haven't come across that presentation yet. Very useful to see. Looking at my predicted values for my frequency is .22. So not one even close to predicts a claim. Guessing that's due to the zero inflated.

Because of this, the crunched residual plot shows a band for 0 claims, 1 claim, and 2 claims. (There were none larger than 2 in the test set).

Did I even calculate crunched residuals correctly?
Code:
data <- data %>%
  arrange(Count_Peril1) %>%
  mutate(res = Count_Peril1 - Predict_Peril1,
         bucket = cut(res, 500))

ggplot(data, aes(x = res, y = bucket)) + geom_point()
Don't bucket by "residual".
  1. sort data by predicted value (looks like that is what you did).

  2. Bucket the data into (approx.) equal counts based on the number of buckets you want to consider. You might try several options (50, 100, 250, 500, 1000, 10^4, 2.5*10^4, etc.; consider your dataset size when determining the largest number of buckets to use) and see how the resulting graphics look.

    For example, if you have a dataset with 50,000 observations and are looking at 250 buckets, then each bucket would have 200 obs.

  3. For each bucket, calculate the "avg" predicted value as well as the (actual) observed value for the bucket.

    "Avg" could be the median, mean, or weighted mean. Use what makes sense or look at both the median and the mean results.

  4. Calculate the residual* from previous step.

    *Might also consider what adjustments to make to the "raw" residual to get a more appropriate result for comparison. For example, calculating standardized and/or studentized residuals based on the data obtained from the previous step. Obviously, you get "better" results if there are a sufficiently large number of buckets; however, if each bucket has very little data (say, < 25 obs), the results may not be all that accurate.

  5. PROFIT!

    Graph results from prior step.


Quote:
Originally Posted by Actuarially Me View Post
Assuming my code is correct, definitely something wrong with my model. The average claim amount is $50k, so would it be better to do a logistic model to predict whether or not there's a claim, THEN do a severity model offset by the number of claims?
What is the goal/purpose of the model?

With a logistic model, you still run into the issue of "did a claim happen" since the result of a logistic is simply the probability that a claim happens. (You essentially get the same thing with the over-dispersed Poisson model.)

Last edited by Vorian Atreides; 04-11-2019 at 02:23 PM..
  #8  
Old 04-11-2019, 02:51 PM
MoralHazard
Member
CAS
 
Join Date: Jul 2011
Favorite beer: Sam Adams Rebel Rouser
Posts: 110

Quote:
I feel like my models are horrible and I don't know how/where to improve them first. The Q-Q plots suggest I'm using the wrong distribution.
What you're seeing in the Q-Q plots is actually normal for a model with discrete data (such as Poisson claim counts). See page 66 of the CAS GLM monograph, which discusses this.

Quote:
Also, when I create the null model just looking at the intercept, the Gini is much higher than when I include any variables.
That's odd. For a null model, all the predictions are the same, so the sort order is basically random and you shouldn't see a Gini much above 0. Are you sure you're calculating Gini correctly?

Quote:
Gini varies from .08-.22, which sounds pretty horrible
Not necessarily. Anything significantly above zero means your model has predictive power (and is better than no model). You should be comparing your Gini results to the rating plan that you're replacing.

Also -- try changing the "poisson" in your GLM code to "quasipoisson", and take another look at the GLM summary.
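That's a one-word change to your earlier call — sketched here with the same placeholder names from your post:

```r
# Quasipoisson fit: same coefficients as the Poisson model, but the summary
# estimates a dispersion parameter and rescales the standard errors by it.
freq.qp <- glm(Count_Peril1 ~ Variables,
               family = quasipoisson(link = "log"),
               offset = log(Exposure),
               data = data.train)

summary(freq.qp)$dispersion  # materially above 1 suggests overdispersion
```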
  #9  
Old 04-11-2019, 03:07 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 134

Quote:
Originally Posted by MoralHazard View Post
That's odd. For a null model, all the predictions are the same, so the sort order is basically random and you shouldn't see a Gini much above 0. Are you sure you're calculating Gini correctly?
Thanks, I'll give that a try too. I added my code for calculating the Gini. Is that how you would do it for frequency? One of my biggest fears is that I've been calculating it incorrectly, and I'm starting to think I am.
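One quick sanity check on my Gini code (a sketch, reusing my earlier variable names): score the test set with constant predictions — under a null model the sort order is arbitrary, so the Gini should come out near zero. Shuffling the rows first avoids ties preserving a meaningful row order.

```r
# Null-model check: constant predictions => Gini should be close to 0.
set.seed(42)
shuffled <- data.test[sample(nrow(data.test)), ]  # break any meaningful row order
o <- order(rep(1, nrow(shuffled)))                # all ties: order is just row order
x <- cumsum(shuffled$Count_Peril1[o]) / sum(shuffled$Count_Peril1)
y <- cumsum(shuffled$Exposure[o]) / sum(shuffled$Exposure)
dx <- diff(x)
h  <- (y[-1] + y[-length(y)]) / 2
gini.null <- 2 * (.5 - sum(h * dx))  # expect roughly 0
```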
  #10  
Old 04-11-2019, 03:21 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 134

Quote:
Originally Posted by Vorian Atreides View Post
Don't bucket by "residual".

What is the goal/purpose of the model?

With a logistic model, you still run into the issue of "did a claim happen" since the result of a logistic is simply the probability that a claim happens. (You essentially get the same thing with the over-dispersed Poisson model.)
Round 2!

Here's the revised code:
Code:
library(Hmisc)  # for cut2()

crunched <- data %>%
  arrange(predicted.peril1) %>%
  mutate(bucket = cut2(predicted.peril1, g = 100)) %>%
  group_by(bucket) %>%
  summarize(avg.pred = mean(predicted.peril1),
            actual = mean(Count_Peril1)) %>%
  mutate(crunch.res = actual - avg.pred)

ggplot(crunched, aes(x = bucket, y = crunch.res)) + geom_point()
Assuming I did it correctly this time, the output of the graph shows a decreasing trend. [Image attached.]
Purpose of the model is to have a rating plan by pure premium. Also, thanks for taking the time to explain things.
Tags
glm, peril, pure premium
