Hi All, I’m just getting started with Stan and I’m having a little trouble. The thing I’m interested in doing is basically hierarchical distribution fitting of insurance claim amounts by state. I’ve simulated (from a lognormal distribution) 50 claim amounts for the state of IL, 20 in the state of WI, and 5 in the state of MN, with each state having different lognormal parameters mu and sigma. From what I’ve read, the hierarchical approach should get me to a credibility weighting between a pooled (all states combined) and unpooled (each state individually) estimate for each state. So for IL I would expect the parameter estimates to be closer to the unpooled estimates because it has 50 claims and MN should be closer to the pooled estimate because it has only 5 claims.
I’ve included the stan code for 3 different models below. I first run a pooled version, then each state individually, then an unpooled version, and then the partial pooling version. I ran each state individually and then compared that to the unpooled estimates to help me understand if I was creating the unpooled model correctly but also to verify that I got the same parameter estimates. They are pretty close, but don’t match exactly. So that is my first question: why don’t the unpooled estimates for each state match the individual state estimates?
My second question stems from the fact that I don’t observe the hierarchical estimates being between the pooled and unpooled estimates. Is there a problem with my hierarchical model which would cause this exercise I’m trying to go through to fail, or am I misunderstanding this exercise from the start? I realize that it’s ending up with different hyperparameter estimates for my gamma priors than what I’m using in my pooled and unpooled versions so I did re-run it all hard coding in the mean estimates from the hierarchical model into the gamma priors in the pooled and unpooled models but that didn’t have a big impact and didn’t solve the problem.
Like I said, I’m just getting started with stan and appreciate any help here. All code is included below.
First up, the pooled model (also used to do each state individually):
//lognormal_pooled.stan
data {
int<lower=1> N; //number of records
real<lower=0> y[N]; //outcomes (loss amounts)
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
//priors
mu ~ gamma(9, 1);
sigma ~ gamma(2, 1);
//likelihood
y ~ lognormal(mu, sigma);
}
}
Next, the unpooled model:
//lognormal_unpooled.stan
data {
int<lower=1> N; //number of records
int<lower=1> K; //number of groups (states)
int<lower=1,upper=K> state[N]; //vector of state values
real<lower=0> y[N]; //outcomes (loss amounts)
}
parameters {
real mu[K];
real<lower=0> sigma[K];
}
model {
//priors
mu ~ gamma(9, 1);
sigma ~ gamma(2, 1);
//likelihood
for (n in 1:N){
y[n] ~ lognormal(mu[state[n]], sigma[state[n]]);
}
}
Lastly, the hierarchical model (or my attempt at it):
//lognormal_partial_pooling.stan
data {
int<lower=1> N; //number of records
int<lower=1> K; //number of groups (states)
int<lower=1,upper=K> state[N]; //vector of state values
real<lower=0> y[N]; //outcomes (loss amounts)
}
parameters {
real<lower=0> mu_alpha;
real<lower=0> mu_beta;
real<lower=0> sigma_alpha;
real<lower=0> sigma_beta;
real mu[K];
real<lower=0> sigma[K];
}
model {
//priors
mu_alpha ~ gamma(9, 1);
mu_beta ~ gamma(1, 1);
sigma_alpha ~ gamma(2, 1);
sigma_beta ~ gamma(1, 1);
mu ~ gamma(mu_alpha, mu_beta);
sigma ~ gamma(sigma_alpha, sigma_beta);
//likelihood
for (n in 1:N){
y[n] ~ lognormal(mu[state[n]], sigma[state[n]]);
}
}
Below is the R code I’m using:
library(rstan)
library(data.table)
# pkgbuild::has_build_tools(debug = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
set.seed(20200402)
dt <- data.table(rbind(data.frame(state = "IL", loss = rlnorm(50, 10, 3)),
data.frame(state = "WI", loss = rlnorm(20, 9, 2)),
data.frame(state = "MN", loss = rlnorm(5, 8, 1.5))))
dt[, state_id := .GRP, by = state]
pooled_data <- list(N = nrow(dt),
y = dt[, loss])
pooled_data_IL <- list(N = nrow(dt[state == "IL"]),
y = dt[state == "IL", loss])
pooled_data_WI <- list(N = nrow(dt[state == "WI"]),
y = dt[state == "WI", loss])
pooled_data_MN <- list(N = nrow(dt[state == "MN"]),
y = dt[state == "MN", loss])
unpooled_data <- list(N = nrow(dt),
K = dt[, max(state_id)],
state = dt[, state_id],
y = dt[, loss])
pooled_fit <- stan(file = 'lognormal_pooled.stan', data = pooled_data)
pooled_fit_IL <- stan(file = 'lognormal_pooled.stan', data = pooled_data_IL)
pooled_fit_WI <- stan(file = 'lognormal_pooled.stan', data = pooled_data_WI)
pooled_fit_MN <- stan(file = 'lognormal_pooled.stan', data = pooled_data_MN)
unpooled_fit <- stan(file = 'lognormal_unpooled.stan', data = unpooled_data)
partial_pooling_fit <- stan(file = 'lognormal_partial_pooling.stan', data = unpooled_data)
summary(pooled_fit_IL)$summary[, "mean"]
summary(pooled_fit_WI)$summary[, "mean"]
summary(pooled_fit_MN)$summary[, "mean"]
summary(unpooled_fit)$summary[, "mean"]
summary(pooled_fit)$summary[, "mean"]
summary(partial_pooling_fit)$summary[, "mean"]