Case Study in Insurance (comments welcome)

I first discussed submitting this case study for loss reserving in insurance about six months ago and got some feedback from Bob Carpenter which I have largely applied.

For those of you interested in looking at the output directly, the rendered HTML file is available here:

It was suggested I reduce the amount of plots etc regarding fit diagnostics and try to do more PPCs. I’m thinking about other PPCs at the moment but all other comments are welcome too. I have also restructured the workbook a little to ensure the data is discussed and explored a little right at the start.

In terms of logistics, I have this in its own GitHub repo, but should I move it into the Stan example_models repo and do a pull request, or is it better to have it standalone? (either way is fine for me, just not sure what the preferences of the core team are for this)


Thanks for posting this! It’s cool to see what people are up to. I’m not in command of the case studies page or anything, so someone else will haveta handle that, but I had a read through. Here are my comments. I didn’t read the last thread, so if I contradict what someone else said earlier feel free to ignore me.

## # A tibble: 10 x 12
##    acc_year premium   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
##  *    <chr>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     1988     957   133   333   431   570   615   615   615   614   614   614
##  2     1989    3695   934  1746  2365  2579  2763  2966  2940  2978  2978    NA
##  3     1990    6138  2030  4864  6880  8087  8595  8743  8763  8762    NA    NA
##  4     1991   17533  4537 11527 15123 16656 17321 18076 18308    NA    NA    NA

Could you explain this table more?

Is it that each row in this table corresponds to accounts opened in that year? And columns are claims made each year since the account has been open?

What I don’t get is why the numbers at the bottom of the table are so big. Are numbers at the bottom somehow accumulations of all the previous years?

As each cohort year has different volumes of business, we scale the losses by the total premium received for that cohort

Is there a way to include this as another parameter? I get that scaling all the curves so they look alike is good, but you end up with the problem that to make predictions you need to know a total volume of business for something in the future. If I’m interpreting this right… Maybe there isn’t a way around this.

It was suggested I reduce the amount of plots etc regarding fit diagnostics

You have quite a lot of diagnostic plots, haha. As a Case Study, that’s fine but for a general audience I’d definitely shrink ’em a bit.

We do not include the variance around the mean so it is not surprising to observe data outside this cone of uncertainty.

I think it’s a good idea to include the noise in the plots. The noise is part of the model, and sometimes it can do crazy things that you don’t expect.
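Here is a rough Python sketch of what I mean by including the noise. Everything in it is made up for illustration (the draws for `mu_draws` and `sigma_draws`, the normal noise model, the 90% interval), and it is not the model from the case study. It only shows that an interval on the posterior mean is narrower than an interval on the posterior predictive draws, which also carry the observation noise.

```python
import random

random.seed(1)

# Hypothetical posterior draws for the mean of one cell and for the noise scale.
# In practice these would come from the fitted model, not from random.gauss.
n_draws = 4000
mu_draws = [random.gauss(100.0, 5.0) for _ in range(n_draws)]
sigma_draws = [abs(random.gauss(10.0, 1.0)) for _ in range(n_draws)]

# Posterior predictive draws: one simulated observation per posterior draw,
# so the observation noise is included (normal noise is an assumption here).
y_rep = [random.gauss(mu, sigma) for mu, sigma in zip(mu_draws, sigma_draws)]


def interval(draws, lower=0.05, upper=0.95):
    """Return empirical quantiles of the draws as a (low, high) pair."""
    ordered = sorted(draws)
    return ordered[int(lower * len(ordered))], ordered[int(upper * len(ordered))]


mean_lo, mean_hi = interval(mu_draws)
pred_lo, pred_hi = interval(y_rep)

print("interval on the mean:      ", round(mean_lo, 1), round(mean_hi, 1))
print("interval on the predictive:", round(pred_lo, 1), round(pred_hi, 1))
```

Data falling outside the first interval is expected; data falling outside the second one is the more interesting signal.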

There’s also a couple plots at the end where you compare posterior predictives for loglogistic and Weibull. The distributions on the predicted quantities seem significantly different. Do you have an idea what’s causing that? Is it possible that it’s because the noise is missing in the posterior predictives?

Thanks for the feedback Ben!

I did not explain this very clearly, so that is one thing I need to expand on.

Each row represents one year of business, so all the policies that are sold in 1988 are put into the 1988 cohort. The premium value is the volume of premium received for that business. The reason it is increasing down the years is that this particular insurer sold more and more as the years passed.

The ten columns after that are the cumulative amount of claims paid out on the policies of that cohort: in insurance, depending on the types of risks, it can take many years for an insurer to learn exactly how much liability was incurred on policies written in any particular year. This may be because the claim was not reported or known about, or simply that the claim was working through the legal system and so the final amount due was not determined.

As a result, for a given year, the cumulative amount up to that year is shown and that is why those numbers tend to increase and then taper off as more and more is learned about the claims on a given year.

This is also why the data is represented like an upper triangle - we only have five years’ worth of development on claims for policies written five years ago.
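To make the layout concrete, here is a minimal Python sketch using only the four cohorts shown in the table above (the case study itself uses R and Stan; Python is just for illustration). The names `triangle` and `incremental` are my own and hypothetical. It converts the cumulative amounts into the amount paid in each development year and counts how many development years have been observed per cohort.

```python
# Cumulative paid claims for the four cohorts shown in the table above.
# None marks development years that have not been observed yet.
triangle = {
    1988: [133, 333, 431, 570, 615, 615, 615, 614, 614, 614],
    1989: [934, 1746, 2365, 2579, 2763, 2966, 2940, 2978, 2978, None],
    1990: [2030, 4864, 6880, 8087, 8595, 8743, 8763, 8762, None, None],
    1991: [4537, 11527, 15123, 16656, 17321, 18076, 18308, None, None, None],
}


def incremental(cumulative):
    """Convert cumulative claims into the amount paid in each observed year."""
    observed = [c for c in cumulative if c is not None]
    return [observed[0]] + [b - a for a, b in zip(observed, observed[1:])]


for year, cumulative in triangle.items():
    n_observed = sum(c is not None for c in cumulative)
    print(year, n_observed, incremental(cumulative))
```

The increments shrink towards zero as a cohort matures, which is the tapering off described above, and later cohorts have fewer observed columns, which gives the triangle shape.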

I need to explain this much better in the case study!

It would not be a huge deal for future years as most businesses will focus more on the loss ratio and then scale that to the amount of premium. In most cases, insurance business will budget for a certain amount of GWP (Gross Written Premium) as part of their planning.
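As a rough Python illustration of that scaling, using the premium and the latest observed cumulative claims from the table above: the names `budgeted_gwp` and `assumed_loss_ratio` and their values are hypothetical, and the ratios computed here are paid-to-date ratios, not the ultimate loss ratios the model estimates.

```python
# Premium and latest observed cumulative claims, taken from the table above.
premium = {1988: 957, 1989: 3695, 1990: 6138, 1991: 17533}
latest_cumulative = {1988: 614, 1989: 2978, 1990: 8762, 1991: 18308}

# Scaling by premium puts cohorts of very different size on a common footing.
loss_ratio_to_date = {
    year: latest_cumulative[year] / premium[year] for year in premium
}

for year, ratio in loss_ratio_to_date.items():
    print(year, round(ratio, 3))

# For a future year, a loss ratio is scaled back up by the budgeted premium.
budgeted_gwp = 20000        # hypothetical planning figure
assumed_loss_ratio = 0.9    # hypothetical; in practice a posterior estimate
expected_loss = assumed_loss_ratio * budgeted_gwp
print(expected_loss)
```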

Yeah, I’ll get rid of a lot of them. I had just discovered bayesplot so it was more about me playing with the new toys in that section I think. :)

Yes, that is a good point, I’ll look into adding that.

I don’t believe so. I’m not fully knowledgeable on the actuarial theory here, but I do remember reading that the Weibull tends to give fatter tails on the estimates and as a result tends to yield higher estimates of loss ratios. That was one of the things I was planning to investigate a little further in future case studies as I was conscious there is already a reasonable amount of content in this and I think it makes more sense to break it all up.

Thanks for your help Ben, I’ll make some edits today and tomorrow and repost.

In most cases, insurance business will budget for a certain amount of GWP (Gross Written Premium) as part of their planning.

Yeah, I see, that makes sense.

Yeah, I’ll get rid of a lot of them.

You shouldn’t have to for the case study, cause we should all be looking at diagnostics as a matter of habit :D.

I think the diagnostics stuff can mislead people a little though, especially if they aren’t ready for it. At least you have to be careful to not accidentally get them thinking All Is Hopeless And Lost when you get into this (unless your message is All Is Hopeless And Lost, ofc).

I’ve had it happen to me twice that I’m explaining a bit of a model and I’ve gotten a response like “well I guess that means this doesn’t work and we shouldn’t be wasting our time with it?” And then I have to say, “nonono, these are important issues, but the ship isn’t sunk.”

But I don’t think the diagnostics should totally be hidden. It’s one of the things that sets the analysis apart :/.

Okay, I’m persuaded. I’ll trim them back a little and perhaps add a bit more discussion, but I won’t take the big axe to them like I was intending.

Sure you are! It’s just another repo on stan-dev. We don’t have any kind of official governance in place. If you think it’s worth posting, we’ll post it. It just requires a pull request.

Keep it in your own repo and we can link to it. Those of us with permissions on stan-dev tend to use example-models.

The pull request is for the web pages repo; I always struggle to remember where to put the descriptions and the actual HTML for the case studies.

No problem.

I made a few minor edits where I expanded the description on the phenomenon of claims development as per our discussions and I removed the ACF diagnostic plots, but kept everything else.

I’m happy enough with it as it is, but if anyone else has some suggestions or edits before we add it, please let me know. Otherwise, let me know if you want me to make a pull request or if one of you is okay with doing it.

Sorry—just catching up with web-related things.

Would you mind adding an open-source license that explicitly states the copyright holder for the work?

We’ve been licensing code via BSD and text by CC BY-NC. You can see the most recent case study from Milad, for example, where he added it at the end.

Here are some comments about implementing models like these going forward if you care about efficiency and numerical stability. You don’t need to change anything in the case study. I like the diagnostics, by the way, for just this reason!

Given that omega and theta are fixed, it’d be more efficient to implement this all at once and save intermediate terms like -theta^-omega.

gf[i] = growthmodel_id == 1 ?
            growth_factor_weibull    (t_value[i], omega, theta) :
            growth_factor_loglogistic(t_value[i], omega, theta);
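To illustrate the "all at once" idea outside Stan, here is a small Python sketch. The function name `growth_factors` and its signature are hypothetical. The Weibull form is the one written out in this thread; the loglogistic form t^omega / (t^omega + theta^omega) is my assumption of the usual parameterization. The point is only that `theta ** -omega` is computed once and reused for every `t`.

```python
import math


def growth_factors(t_values, omega, theta, model="weibull"):
    """Growth factors for all development times, sharing one cached term."""
    # omega and theta are the same for every t, so compute this once.
    theta_pow_neg_omega = theta ** -omega
    factors = []
    for t in t_values:
        scaled = t ** omega * theta_pow_neg_omega  # equals (t / theta) ** omega
        if model == "weibull":
            factors.append(1.0 - math.exp(-scaled))
        else:
            # Assumed loglogistic form: t^omega / (t^omega + theta^omega).
            factors.append(scaled / (1.0 + scaled))
    return factors


print(growth_factors(range(1, 11), omega=1.5, theta=2.0))
print(growth_factors(range(1, 11), omega=1.5, theta=2.0, model="loglogistic"))
```

In Stan the same idea would mean one function that takes the whole vector of times, instead of calling a per-element function inside the loop.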

Also, it might be helpful to keep everything on the log scale for as long as possible (I may have said this before), but that’s not actually very long here. For instance, the log of 1 - exp(-(t/theta)^omega) is log1m_exp(-(t / theta)^omega). But then you multiply again and unless the multiplications offset any potential underflows/overflows, it won’t help much to stay on the log scale.

Subtractions like 1 - exp(-(t/theta)^omega) are very unstable for two reasons: the first is bad behavior of subtraction where it’ll too easily round to one or zero and the second is potential overflow or rounding to zero of the exponentiation and power operations.
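A small Python sketch of the cancellation problem, for anyone who wants to see it. The function names are hypothetical, and `log1m_exp` here is my own Python stand-in for the Stan function of the same name, written with `math.expm1` and `math.log1p`; it is not Stan's implementation.

```python
import math


def weibull_growth_naive(t, omega, theta):
    """1 - exp(-(t/theta)^omega), computed with a plain subtraction."""
    return 1.0 - math.exp(-((t / theta) ** omega))


def weibull_growth_stable(t, omega, theta):
    """Same quantity via expm1, which avoids cancellation near zero."""
    return -math.expm1(-((t / theta) ** omega))


def log1m_exp(a):
    """log(1 - exp(a)) for a < 0, switching formulas at -log(2)."""
    if a >= 0:
        raise ValueError("log1m_exp requires a negative argument")
    if a > -math.log(2):
        return math.log(-math.expm1(a))   # exp(a) is close to 1
    return math.log1p(-math.exp(a))       # exp(a) is close to 0


def log_weibull_growth(t, omega, theta):
    """Log of the Weibull growth factor, kept on the log scale."""
    return log1m_exp(-((t / theta) ** omega))


# Very early development time: the true value is close to (t/theta)^omega.
print(weibull_growth_naive(1e-6, 2.0, 1.0))
print(weibull_growth_stable(1e-6, 2.0, 1.0))
print(log_weibull_growth(1e-6, 2.0, 1.0))
```

The first two printed values agree only to a few digits, because the naive subtraction has lost most of its precision by the time `exp` returns something within rounding distance of one.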

There are also a few places you could vectorize, but it won’t help in most cases with efficiency or clarity, so I wouldn’t bother.

Done and I have pushed the changes up to the master branch in the repo.

Thanks for that - I wasn’t focusing too much on efficiency as it is a small model for now and I wanted to make it as accessible as possible to insurance-industry types who have heard about Bayesian techniques. My concern is the large learning curve so I decided to dial back a little on trying to squeeze efficiencies out of the code.

That said, I have a few related expansions of this model in mind for case studies and I expect efficiency will be important there so I will try to implement those as much as possible.

I had not considered the ability to cache computations - I will look into that for further iterations.

At this point I am pretty much happy with the case study as it stands, so feel free to add it to the site whenever you want.

Thanks for making it open source! It’s now linked from the case studies page on the Stan web page:
