Actuarial Outpost
 
  #1  
Old 02-13-2019, 01:05 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 191
Default Single Model Pure Premium GLM: Any reason to not use Poisson over Tweedie?

I've been testing various single-model solutions for non-Auto pure premium modeling in R. My goal is to minimize deviance and maximize the Gini index. The Tweedie distribution usually performs slightly better, but the 'tweedie' and 'HDtweedie' packages have some limitations that make them difficult to work with.
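
For reference, here's roughly how I'm scoring the models. It's a sketch of my own helper function (not from any package), and the inputs are placeholders for my actual data:

Code:
# Gini: sort risks from best to worst by predicted pure premium, accumulate
# the share of actual losses, and compare against a random ordering.
gini <- function(actual, predicted) {
  ord <- order(predicted, decreasing = TRUE)
  cum_loss <- cumsum(actual[ord]) / sum(actual)
  n <- length(actual)
  (sum(cum_loss) - (n + 1) / 2) / n
}

# Normalize by the Gini of a "perfect" model that sorts by the actual losses.
normalized_gini <- function(actual, predicted) {
  gini(actual, predicted) / gini(actual, actual)
}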

CAS Monograph 5 suggests Poisson for frequency modeling, Gamma for severity modeling, and Tweedie for a single pure premium model, but I treat that as a suggestion rather than the be-all and end-all.

Is there any major disadvantage to using Poisson for pure premium modeling that I'm overlooking?

The 'HDtweedie' package is a wrapper of glmnet. However, glmnet accepts a sparse matrix in dgCMatrix format, which stores only the non-zero entries to reduce memory, while HDtweedie only accepts a dense matrix. When I build the design matrix with model.matrix() on the same variables, it is too large to fit in memory, even though memory.limit() shows 16 GB allocated to R.

There are also some useful metrics you can pull from the glmnet package that you can't get from HDtweedie.


I prefer glmnet because it produces reasonable coefficients and penalizes non-credible categories; as a result, it outperforms the unpenalized glm by about 20%.
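
For context, this is roughly the glmnet setup I'm using. It's a sketch: 'dat', 'losses', 'exposure', and the predictor names are stand-ins for my actual data.

Code:
library(Matrix)
library(glmnet)

# Sparse design matrix (dgCMatrix) keeps only the non-zero dummy entries,
# which is what makes the high-cardinality factors fit in memory.
x <- sparse.model.matrix(~ territory + class + construction - 1, data = dat)
y <- dat$losses

# Elastic-net Poisson GLM on losses with log(exposure) as an offset,
# so it is effectively a rate (pure premium) model; lambda chosen by CV.
cv_fit <- cv.glmnet(x, y, family = "poisson",
                    offset = log(dat$exposure), alpha = 0.5)

pred <- predict(cv_fit, newx = x, newoffset = log(dat$exposure),
                s = "lambda.min", type = "response")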

Edit for people who discover this later and don't want to read the whole thread:

This thread got a bit distracted. Here's what I've gathered from the responses and other reading I've done.

Per the "Practitioners Guide to GLMS (I'd link it, but it links to a direct pdf):

Page 3: Log-link Poisson GLMs are equivalent to the multiplicative balance principle of minimum bias estimation (minimum bias methods go back to the 1960s and were used when computing power was limited).

Page 19: Log-link Poisson is commonly used for frequency because the log link makes it a multiplicative model (much easier to implement and to compare factors) and because it is invariant to the measure of time: modeling frequency per year yields the same results as per month.

Page 20: Log-link Gamma is commonly used for severity because the log link makes it multiplicative and because Gamma is invariant to the measure of currency: measuring severity in dollars or in cents yields the same results.

The log-link Tweedie distribution with power parameter p in (1, 2) is a compound Poisson-Gamma distribution. The closer p is to 1, the more it behaves like Poisson; the closer to 2, the more like Gamma. Common values are around 1.5 to 1.65. It also assumes that frequency and severity move together (each predictor pushes both in the same direction). The Tweedie likelihood has no closed form, so some metrics are a bit harder to extract.
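
For anyone curious, here's a sketch of how the Tweedie version can be fit with base glm(): the statmod package supplies the tweedie() family and the tweedie package can profile the power parameter. Again, 'dat' and the column names are placeholders for my data.

Code:
library(statmod)   # tweedie() family for glm()
library(tweedie)   # tweedie.profile() to estimate the power parameter p

dat$pure_prem <- dat$losses / dat$exposure

# Profile likelihood over p on (1, 2); link.power = 0 means log link.
prof <- tweedie.profile(pure_prem ~ territory + class + construction,
                        data = dat, weights = dat$exposure,
                        link.power = 0, p.vec = seq(1.1, 1.9, by = 0.1),
                        do.plot = FALSE)
p_hat <- prof$p.max

# Log-link Tweedie GLM on pure premium, weighted by exposure.
fit_tw <- glm(pure_prem ~ territory + class + construction,
              family = tweedie(var.power = p_hat, link.power = 0),
              weights = dat$exposure, data = dat)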


Whether it's appropriate really depends on your data. When you select a family, you're choosing the mean-variance relationship: for a Poisson GLM the variance equals the mean, V(mu) = mu. Despite the warnings most statistical software gives about non-integer responses, it's completely reasonable to model continuous data where the relationship is linear on the log scale and the variance increases in proportion to the mean.

If you look at the residuals, you can determine whether the Poisson mean-variance relationship is reasonable. If not, it may be better to use Gamma, whose variance function is V(mu) = mu^2.
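
A quick way to eyeball that mean-variance relationship (a sketch, reusing the hypothetical 'dat' from above; quasipoisson just suppresses the non-integer warnings):

Code:
fit_pois <- glm(losses ~ territory + class + construction,
                family = quasipoisson(link = "log"),
                offset = log(exposure), data = dat)

# Bin the fitted values and compare the average squared residual (a rough
# variance estimate) against the average fitted mean within each bin.
bins <- cut(fitted(fit_pois),
            breaks = unique(quantile(fitted(fit_pois), probs = seq(0, 1, 0.1))),
            include.lowest = TRUE)
bin_mean <- tapply(fitted(fit_pois), bins, mean)
bin_var  <- tapply(residuals(fit_pois, type = "response")^2, bins, mean)

# Slope near 1 on the log-log scale suggests Poisson (V = mu);
# slope near 2 suggests Gamma (V = mu^2).
plot(log(bin_mean), log(bin_var), xlab = "log fitted mean", ylab = "log variance")
abline(lm(log(bin_var) ~ log(bin_mean)))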

Back to my original question: are there any major disadvantages to using Poisson over Tweedie? No, but it's worth checking Gamma as well.

Last edited by Actuarially Me; 02-21-2019 at 10:57 AM..
  #2  
Old 02-13-2019, 01:14 PM
kevinykuo
Member
CAS
 
Join Date: Nov 2017
Posts: 48
Default

FYI both H2O and Spark/sparklyr support GLM with Tweedie.

Have you tried validating both approaches (pure premium vs. freq/sev) on a validation set?
  #3  
Old 02-13-2019, 01:14 PM
ShundayBloodyShunday
Member
CAS
 
Join Date: Apr 2013
Posts: 2,917
Default

Note that one of the parameters you can select for a Tweedie model is the "power parameter". Let's call it p.

Note that if p = 1, the model is the same as a Poisson-distributed model, and if p = 2, the model is the same as a Gamma-distributed model.

So using a Poisson model is essentially saying that frequency is driving pure premium (which might very well be the case if severity is pretty homogeneous regardless of the risk profile on the books).

One thing you might do to assess whether a Poisson model is appropriate is to build a severity model and see if there is significant variation in severity in your data. If not (very few variables are statistically significant, the parameter estimates show very little spread, and the lift chart is fairly flat), then you have support for modeling pure premium with Poisson.
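
Something along these lines would do it (a rough sketch; 'dat', 'claim_count', and the predictors are hypothetical names):

Code:
# Gamma severity GLM on records with claims only. If few predictors are
# significant and the relativities are close to 1, severity is roughly
# homogeneous and a Poisson pure premium model is defensible.
claims <- subset(dat, claim_count > 0)
sev_fit <- glm(losses / claim_count ~ territory + class + construction,
               family = Gamma(link = "log"),
               weights = claim_count, data = claims)

summary(sev_fit)    # which predictors are significant?
exp(coef(sev_fit))  # multiplicative severity relativities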
  #4  
Old 02-13-2019, 01:22 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 191
Default

Quote:
Originally Posted by ShundayBloodyShunday View Post
Note that one of the parameters
So using a Poisson model is essentially saying that frequency is driving pure premium (which might very well be the case if severity is pretty homogeneous regardless of the risk profile on the books).
That's true. It's probably the case that the statistically significant variables are frequency-driven. I haven't tested many models that separate frequency and severity, since I've been trying to stick to a single model, but that gives me some direction.
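
If I do test the two-part approach, it would look roughly like this (a sketch with a hypothetical holdout split and column names; normalized_gini is the helper from my first post):

Code:
set.seed(1)
idx   <- sample(nrow(dat), 0.7 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Poisson frequency model (offset in the formula so predict() handles it).
freq_fit <- glm(claim_count ~ territory + class + offset(log(exposure)),
                family = poisson(link = "log"), data = train)

# Gamma severity model on claims only.
sev_fit <- glm(losses / claim_count ~ territory + class,
               family = Gamma(link = "log"), weights = claim_count,
               data = subset(train, claim_count > 0))

# Predicted pure premium per exposure = expected frequency x expected severity.
freq_hat <- predict(freq_fit, newdata = test, type = "response") / test$exposure
sev_hat  <- predict(sev_fit,  newdata = test, type = "response")
pp_two_part <- freq_hat * sev_hat

normalized_gini(test$losses / test$exposure, pp_two_part)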
  #5  
Old 02-13-2019, 01:27 PM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 191
Default

Quote:
Originally Posted by kevinykuo View Post
FYI both H2O and Spark/sparklyr support GLM with Tweedie.

Have you tried validating both approaches (pure premium vs. freq/sev) on a validation set?

I remember seeing that H2O supports Tweedie. I haven't dealt with it much yet, but it's something worth looking into.

It's not really the file size that's large, it's the design matrix that gets created. There are only about 75,000 observations, but dummy-encoding the categorical variables blows up the size. So I'd rather use H2O than invest time in getting Spark up and running.
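
If I go that route, I'd expect it to look something like this (a sketch based on my reading of the h2o docs; 'dat' and the column names are placeholders):

Code:
library(h2o)
h2o.init()

hf <- as.h2o(dat)
predictors <- c("territory", "class", "construction")

# Tweedie GLM: variance power 1.5, log link (tweedie_link_power = 0),
# exposure as the weights column, regularization path via lambda_search.
fit <- h2o.glm(x = predictors, y = "pure_prem", training_frame = hf,
               family = "tweedie",
               tweedie_variance_power = 1.5, tweedie_link_power = 0,
               weights_column = "exposure",
               lambda_search = TRUE, nfolds = 5)

h2o.coef(fit)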

Have you noticed much of a difference between H2O and Spark?
  #6  
Old 02-13-2019, 04:24 PM
kevinykuo
Member
CAS
 
Join Date: Nov 2017
Posts: 48
Default

It's pretty quick/easy to set up sparklyr in local mode on your laptop. However, for 75k records I'm pretty sure H2O will be faster.
  #7  
Old 02-13-2019, 09:21 PM
FactuarialStatement
Member
CAS AAA
 
Join Date: Oct 2012
Studying for 5
Favorite beer: Beer
Posts: 2,107
Default

For 75k records you can use your cellphone. I would not be considering H2O when there are much better R packages.

It sounds like you have some high-cardinality categorical features. Why do you think the shrinkage priors of a ridge/lasso regression are giving you better coefficient estimates? Think about it for a while.

Do you think you don’t need to reduce the number of levels of those factors or add some hierarchical structure just because you’re using ________ software/package/fancy machine learning method?

Also, think about why you may not want to (or be able to) use a counting distribution to model a random process X with support on [0, Inf).
Last edited by FactuarialStatement; 02-13-2019 at 09:27 PM..
  #8  
Old 02-13-2019, 09:22 PM
Whiskey
Member
CAS
 
Join Date: Jul 2008
Studying for nothing at all
Posts: 40,956
Default

Quote:
Originally Posted by FactuarialStatement View Post
For 75k records you can use your cellphone.
lol
  #9  
Old 02-14-2019, 11:24 AM
Actuarially Me
Member
CAS
 
Join Date: Jun 2013
Posts: 191
Default

Quote:
Originally Posted by FactuarialStatement View Post
For 75k records you can use your cellphone. I would not be considering h20 when there are much better R packages.

It sounds like you have some high cardinality categorical features. Why do you think it is the shrinkage priors of a ridge/lasso regression are giving you better coefficient estimates? Think about it for a while.

Do you think you don't need to reduce the number of levels of those factors or add some hierarchical structure just because you're using ________ software/package/fancy machine learning method?

Also, think about why you may not want to (or be able to) use a counting distribution to model a random process X with support on [0, Inf).

How about being helpful instead of being insufferable all the time? In every post of yours I see, you're putting someone down, giving no helpful advice, and ending up in a dick-measuring contest with other users. I'm sorry mommy and daddy didn't pay attention to you or put your pictures on the refrigerator. It's pretty clear you're the reason they're divorced. You'll probably retort with something doubling down on your arrogance, but please leave the discussion to people who actually want to be helpful.

I'm new to actuarial modeling: I worked 5 years in reserving, then switched to a non-actuarial predictive modeling role. I don't know all of the pricing actuarial best practices and am just trying to do better at my job. Every pricing example I've seen uses personal lines Auto data, which is much higher quality than the data in the lines I'm working with. I don't work for a large company, so I have to build models from the ground up with no one to bounce ideas off of. None of the actuaries I work with have a background in predictive modeling. This is the only actuarial forum I know of, so I come here when I have actuary-specific questions. I guess I'm used to the data science community, which is generally collaborative.

This is the first GLM I'm implementing for this company. Their rating plan is set up in SQL and run through a web service, so all I can do is update the rating factors. Doing anything else would mean overhauling the SQL and the web service, which is my long-term goal. I'm not currently concerned about feature selection. The goal of the model isn't to find the best point estimate, but to segment risks better than the current rating plan, hence the focus on Gini and deviance. Is this ideal? Of course not, but it's the situation I'm in.

So instead of assuming everyone here is an idiot, realize that not everyone works under ideal circumstances. Or just work on being a better person! You'll feel better long term, and people will like you more if you share your knowledge rather than chasing the small shots of dopamine you get from your sardonic comments.

I understand the difference between frequency and severity

I understand the theoretical differences between Poisson, Gamma, and Tweedie

I understand high cardinality is bad

I understand the importance of feature engineering

I understand how Lasso/Ridge regression work


I only know of the 'tweedie', 'HDtweedie', and 'HDBoost' packages that handle Tweedie distributions. I've only known about these packages for a couple of months and, in my experience, they don't play nicely with the tidyverse and haven't been updated to work with it. That leads me to rely on the more fleshed-out packages. I need a multiplicative model since that's what underwriters are used to, so as far as I know, I'm limited to log-link Poisson, Gamma, and Tweedie. I'm able to create better visualizations with Poisson given the package support, and I'm willing to give up some predictive power for that.
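
To illustrate why the multiplicative form matters for me: the factors that go into the rating plan come straight from exponentiating the log-link coefficients. A sketch, reusing the hypothetical cv_fit from my first post:

Code:
# With a log link, exp(beta) gives multiplicative rating factors that
# underwriters can apply directly to a base rate.
betas <- coef(cv_fit, s = "lambda.min")     # sparse coefficient vector
relativities <- exp(as.numeric(betas))
names(relativities) <- rownames(betas)

head(sort(relativities, decreasing = TRUE))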

I came here looking for advice and appreciate everyone who has shared some. I hope these threads prove useful for future predictive modelers so they don't make the same mistakes as me. As one user pointed out, H2O has built-in Tweedie support; that was useful information. Is it overkill for the small amount of data I'm training on? Yes, but it's another option if I want to keep using the Tweedie distribution. Another user pointed out that the data I'm working with is likely frequency-driven, which would explain why Poisson is performing better. Solid advice.
  #10  
Old 02-14-2019, 11:51 AM
MoralHazard
Member
CAS
 
Join Date: Jul 2011
Favorite beer: Sam Adams Rebel Rouser
Posts: 110
Default

Relevant paper: https://www.casact.org/pubs/forum/17...ross-Evans.pdf

The authors present a case study of a claims severity model, which is commonly fit with a Gamma GLM. They compare that approach to a minimum bias model, which is equivalent to a Poisson GLM, and conclude that the minimum bias (i.e., Poisson) model validated similarly to, or maybe slightly better than, the Gamma model. So at least on that basis I'd say yes, you're probably OK going with the Poisson model for pure premium.
Tags
glm, poisson, tweedie
