Binomial with over 700 parameters to be estimated

Seeking posterior predictive distributions for 727 parameters (lakes) that are destinations for seaplane pilots but where a portion of the flights originates from source lakes harboring aquatic invasive species. I am modeling this like trials using a hierarchical binomial model just like the rats tumor example to keep it simple. Have data for over 25000 flights but some of the lakes only receive a handfull of flights.
Model runs fine but:

  1. max tree depth reached each time I increase, most recently to 15, but still same warning messages
  2. cannot open pairs() plot as suggested Error in figure margins too large
  3. half of the lakes have Rhat larger than 1.1
    Some of the Rhat >1.1 lakes have chains that run parallel horizontally, not mixing
    Increased iterations from 2000 to 10000 and most recently 20000 (took 4 days to run with multiple processors etc.) same warnings
  4. bulk ESS too low
  5. tail ESS too low
    Used launch_shinystan but found no issues with divergence. So I am wondering given the very large number of parameters/predictive distributions whether the diagnostics are to be expected and as good/simple as we can get here. The distributions will inform detection efforts to survey lakes that show high median introduction rates (tumor successes) first the ones with low variance then the ones with higher variance.
    Here is the model:
J <- nrow(data)
n <- data$flights
y <- data$eflights  # flights from invader source lakes
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
options(mc.cores = parallel::detectCores())
datafit3 <- stan(file="STAN/flights_model/flights_model.stan",data=c("J","y","n"), iter = 20000, chains = 4,control=list(adapt_delta=0.8, max_treedepth =15))


// Following Carpenter et al. 2016 p. 16
data {
  int<lower=0> J; // number of lakes
  int<lower=0> y[J]; // number of flights from elodea-invaded sources into lake j
  int<lower=0> n[J]; // number of total flights into lake j (number of trials)
parameters {
  real<lower=0,upper=1> theta[J]; // chances of success
  real<lower=0,upper=1> lambda; // prior mean chance of success
  real<lower=0.1> kappa; // prior count
transformed parameters {
  real<lower=0> alpha; // prior success count
  real<lower=0> beta; // prior failure count
  alpha = lambda* kappa;   //alpha parameter for the prior
  beta =   (1 - lambda)* kappa;  //beta parameter for the prior
model {
  lambda ~ uniform(0,1); // hyperprior
  kappa ~ pareto(0.1,1.5); // hyperprior was uniform(0.1,5) for results as of 04/30/20 creating datafit1, pareto(0.1,1.5) creating datafit2, uniform(0,1) in Carpenter et al.
  theta ~ beta(alpha,beta); // prior for the probability p that lake j is invaded
  y ~ binomial(n,theta); // likelihood for flights
generated quantities {
  real<lower=0,upper=1> avg; // avg success
  int<lower=0,upper=1> above_avg[J]; // true if theta[j] > mean(theta), meaning the posterior prob. that a given destination is above-average in terms of the elodea introduction rate
  int<lower=1,upper=J> rnk[J]; // rank of j
  int<lower=0,upper=1> highest[J]; // true if j is highest rank
  avg = mean(theta);
    for (j in 1:J)
    above_avg[j] = (theta[j] > avg);
    for (j in 1:J) {
    rnk[j] = rank(theta,j) + 1;
    highest[j] = rnk[j] == 1;

The automatic ESS messages will break (yielding a false warning) if you have any variables that are constant in the generated quantities, which might be happening with some of those int quantities. If the model doesn’t take too long to run, try running without any generated quantities to check if you still get ESS warnings.

When you say that you’re seeing rhat’s >1.1 for some lakes, do you mean for the theta for some lakes?

1 Like

Hi Mike, yes there are constants, indicating the thetas that are above average. Yes, I mean Rhat for thetas.

Maybe try sampling using a small subset of the lakes so that the subsequent pairs plot isn’t so overwhelming? That might uncover something useful.

1 Like

It probably won’t make much of a difference, but you could also try replacing the beta and binomial priors with a single beta_binomial prior:

y ~ beta_binomial(n, alpha, beta);

These compound distributions can be more efficient and stable in how they sample


You can then recover the theta values in the generated quantities block:

generated quantities {
  real<lower=0,upper=1> theta[J] = beta_rng(y + alpha, n - y + beta);

This is because, given the binomial distribution:

Y|\theta \sim \bf{Binom}(n,\theta)

With beta prior:

\theta \sim \bf{Beta}(\alpha, \beta)

The posterior for \theta is given by:

\theta | Y \sim \bf{Beta}(Y + \alpha, n - Y + \beta)

background here


The pairs plot worked out, just by limiting the number of plots to be displayed.

The neff/N ratios are super small for most lakes. I am unfamiliar with reparameterization but would it perhaps help here? If so, how would the model need to be reparameterized? Thanks so much for your help.

Andrew, I switched the code following your example but the code file shows a warning on the line I inserted it in the generated quantities block (white x on round red background):
“No matches for: Available argument signatures for operator+: Expression is ill formed.”

I’m guessing that the error included a message like 'no signature for real + int[]', or something similar?


real<lower=0,upper=1> theta[J] = beta_rng(to_vector(y) + alpha,
                                          to_vector(n) - to_vector(y) + beta);
1 Like

Hi Mike, sorry for the very delayed response and thanks so much again for your help.

I get the same errors with a smaller dataset. But I wonder if the error messages are actually not a large validity concern. Note, when I increased iterations from 10,000 to 20,000 after warm up, I had considerable better convergence. With 20,000 I got 200 (28% of the 727 lakes) with Rhat ≤ 1.05 instead of 41 lakes with 10,000 iterations, 457 (63%) instead of 224 with Rhat ≤ 1.1, and 270 (37%) instead of 503 lakes with Rhat > 1.1.

Since there are so many thetas and the data is somewhat sparse for some of these lakes with only a few flights going in, I believe the model may be as good as it gets. Is there reason to throw out the model entirely? Or could we interpret the results by saying the 200 lakes with Rhat ≤ 1.05 are “trust worthy” results, the estimates for the 457 lakes are “more or less trustworthy” and the 270 remaining lakes with Rhat >1.1 are potentially problematic.

Btw, here is the pairsplot with just one of the 727 lakes. Are there any issues you see? I never interpreted these and don’t know what to look for. But, I noticed, with 10,000 iterations that some of the pairs had some bimodality that went away with 20,000 iterations.

Unfortunately I really don’t think it’s safe to ignore the Rhat warnings generally. They indicate that the different chains ended up exploring measurably different areas of the parameter space, so all computations on the resulting samples have no guarantee of accuracy. There’s even been talk of decreasing the Rhat value that triggers the warning to something even smaller like 1.01 to really ensure folks only trust samples from truly converged chains.

Maybe you need to step back and reconsider this model for the data. You should check out @betanalpha’s workflow case study and do those same steps yourself.

There are some visible “loops” indicating very slow mixing. Max tree depth warning also indicates very small step size and thus many many steps are needed before u-turn. The figure you attached is too small to see anything else (like axis labels). Can you post a bigger figure?

1 Like

Thanks so much Aki for looking at this. Hopefully this image will work.

I don’t still see the problem, but this are rare type of example with very slow mixing.
To better see the geometry can you plot the pairs plot in the unconstrained space (where the sampling happens), that is for kappa plot log(kappa-0.1), for theta and lambda plot logit(theta) and logit(lambda)?

Not sure if this is correct to get the transformations you want
pairs_plot2 <- mcmc_pairs(posterior_cp, np = np_cp, pars = c("alpha","beta","logit(lambda)","log(kappa,-0.1)","logit(theta[1])"), off_diag_args = list(size = 0.75))

How does the plot with that call look like? My current guess is that with the current priors you have way too much prior mass on very small theta values, which in logit space means that the tail is thick and it takes forever to explore that tail. We could see that better when looking the posterior in the unconstrained space and the way to fix it would be to use slightly more informative prior.

The pairs plots of theta[1] against the other parameters indicate that something in the model is being terribly identified. See for recommendations on how to access the unconstrained values as well as some explanation for why extremely degenerate models like this will appear to be multimodal for if the Markov chains aren’t long enough.

1 Like

Does that give you an error? Unless you’ve created variables with those names, you need pars to be the actual parameter names and to then use the transformations argument to provide the functions. The way you currently have it it will look for a parameter named “logit(lambda)” rather than applying the logit transformation to “lambda”. Something like this should work:

pairs_plot2 <-
    np = np_cp,
    pars = c("alpha", "beta", "lambda", "kappa", "theta[1]"),
    transformations = list(
      "lambda" = "qlogis",  # or define a function called logit
      "kappa" = function(x) log(x - 0.1),
      "theta[1]" = "qlogis"
    off_diag_args = list(size = 0.75)

Alternatively, you can do what @betanalpha suggested and just extract and plot the already unconstrained draws themselves.

The above code for the pairs_plot gave me an error, so I used @jonah’s suggestion specifying the transformation argument. The resulting pairs_plot is exactly the same. Hmm?
If I undersand correctly, following @betanalpha suggestion would mean looking at the diagnostic_file which unfortunately, I did not specify. The model ran for four days on 8 cores. Happy to rerun it though or just use fewer iterations.

If the model takes that long to run then it will be much faster to create the pairs scatter plots one by one, and imposing the unconstraining transformations, by hand.