Very different estimates every time the model is fit

I’m trying to model the bicycle count data from the table in the exercises of Chapter 3 of Bayesian Data Analysis by Gelman et. al. using a hierarchical binomial model with population effects for street type and the presence of a bike route.

Every time I sample the posterior (using rstan 2.18.2) I get very different results for the final probabilities of seeing bikes in each city block. I’ve tried extracting a huge number of samples each run (about 1.1 million) and the results are still just as variable.

Is there something pathological in the posterior that’s causing this? Or perhaps the model itself is ill-posed? How would I go about diagnosing this situation?

Below is my Stan code as well as self-contained R code to generate a plot of the raw proportions of bikes to total traffic versus the estimates from the model. I’ve also included two examples of those plots at the end showing the variation in results.

data{
    int<lower=1> N;
    int bikes[N];
    int total[N];
    int is_fairlybusy[N];
    int is_busy[N];
    int bike_route[N];
    int block_id[N];
}
parameters{
    vector[N] a_bl;
    real a;
    real a_fair;
    real a_busy;
    real b;
    real<lower=0> sigma;
}
model{
    vector[N] p;
    
    // fixed priors
    sigma ~ gamma(2, 2);
    b ~ normal(1, 10);
    a ~ normal(-2, 10);
    a_fair ~ normal(0, 10);
    a_busy ~ normal(0, 10);
    
    // adaptive prior
    a_bl ~ normal(0, sigma);
    
    // model
    for ( i in 1:N ) {
        p[i] = a + a_bl[block_id[i]] + a_fair*is_fairlybusy[i] + a_busy*is_busy[i] + b*bike_route[i];
    }
    
    // likelihood
    bikes ~ binomial_logit(total, p);
}
library(rstan)

bike_data <- data.frame(
    bikes = as.integer(c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55, 12, 1, 2, 4, 9, 7, 9, 8, 8, 35, 31, 19,
                         38, 47, 44, 44, 29, 18, 10, 43, 5, 14, 58, 15, 0, 47, 51, 32, 60, 51, 58, 59,
                         53, 68, 68, 60, 71, 63, 8, 9, 6, 9, 19, 61, 31, 75, 14, 25)),
    other = as.integer(c(58, 90, 48, 57, 103, 57, 86, 112, 273, 64, 113, 18, 14, 44, 208, 67, 29, 154,
                         29, 415, 425, 42, 180, 675, 620, 437, 47, 462, 557, 1258, 499, 601, 1163, 700,
                         90, 1093, 1459, 1086, 1545, 1499, 1598, 503, 407, 1494, 1558, 1706, 476, 752,
                         1248, 1246, 1596, 1765, 1290, 2498, 2346, 3101, 1918, 2318)),
    is_fairlybusy = as.integer(c(rep(0, 18), rep(1, 20), rep(0, 20))),
    is_busy = as.integer(c(rep(0, 38), rep(1, 20))),
    bike_route = as.integer(c(rep(1, 10), rep(0, 8), rep(1, 10), rep(0, 10), rep(1, 10), rep(0, 10)))
)
bike_data$total <- bike_data$bikes + bike_data$other
bike_data$block_id <- 1:nrow(bike_data)

fit <- stan(
    model_code = model_code,
    data = list(
        N = nrow(bike_data),
        bikes = bike_data$bikes,
        total = bike_data$total,
        is_fairlybusy = bike_data$is_fairlybusy,
        is_busy = bike_data$is_busy,
        bike_route = bike_data$bike_route,
        block_id = bike_data$block_id
    ),
    iter = 1e4,
    warmup = 2e3,
    chains = 4,
    cores = 4,
    control = list(max_treedepth = 15)
)

samples <- extract(fit)
logit_block <- sapply(1:58, function(i) samples$a + samples$a_bl[,i])
logit_residential <- sapply(
    1:18,
    function(i) logit_block[,i] + ifelse(i <= 10, samples$b, 0)
)
logit_fairlybusy <- sapply(
    19:38,
    function(i) logit_block[,i] + samples$a_fair + ifelse(i <= 28, samples$b, 0)
)
logit_busy <- sapply(
    39:58,
    function(i) logit_block[,i] + samples$a_busy + ifelse(i <= 48, samples$b, 0)
)
# any inverse logit function will do
residential <- brms::inv_logit_scaled(logit_residential)
fairlybusy <- brms::inv_logit_scaled(logit_fairlybusy)
busy <- brms::inv_logit_scaled(logit_busy)
residential_mu <- apply(residential, 2, mean)
fairlybusy_mu <- apply(fairlybusy, 2, mean)
busy_mu <- apply(busy, 2, mean)

plot(
    1:58, bike_data$bikes/bike_data$total,
    cex = 1.2,
    ylim = c(0, 0.47),
    xlab = "block number", ylab = "proportion of bikes in traffic",
    main = "Raw proportions versus adjusted estimates",
)
points(1:58, c(residential_mu, fairlybusy_mu, busy_mu), pch = 16, col = "blue")
lines(c(18.5, 18.5), c(-0.02, 0.49))
lines(c(38.5, 38.5), c(-0.02, 0.49))
lines(c(10.5, 10.5), c(-0.02, 0.49), lty = 2)
lines(c(28.5, 28.5), c(-0.02, 0.49), lty = 2)
lines(c(48.5, 48.5), c(-0.02, 0.49), lty = 2)
text(5, 0.46, "residential")
text(28.5, 0.46, "fairly busy")
text(48.5, 0.46, "busy")

In these plots the open circles correspond to raw proportions (bikes/total) and the blue dots correspond to model estimates. Street types are separated with solid vertical lines, and within each street type the estimates corresponding to bike routes and non bike routes are separated by dashed vertical lines (with bike routes on the left of the dashed lines).

1 Like

I think the ifelse statements are messing you up.

For instance:

> ifelse(1 < 10, samples$b, 0)
[1] 1.46124

And I think you’d expect that to be the vector samples$b, not just the first sample.

Maybe try something like:

logit_block <- sapply(1:58, function(i) { samples$a + samples$a_bl[,i] })
logit_residential <- sapply(
  1:18,
  function(i) { logit_block[,i] + (if(i <= 10) samples$b else 0) }
)
logit_fairlybusy <- sapply(
  19:38,
  function(i) { logit_block[,i] + samples$a_fair + (if(i <= 28) samples$b else 0) }
)
logit_busy <- sapply(
  39:58,
  function(i) { logit_block[,i] + samples$a_busy + (if(i <= 48) samples$b else 0) }
)
2 Likes

Yes! You’re exactly right, I did expect the ifelses to return a vector, but I guess they just return the first entry? In any case, your suggested fix is perfect and the results are now essentially identical every run.

Thanks so much for spotting that!