Hierarchical logistic regression non-centered reparameterization

Hi all,

I am performing a hierarchical logistic regression on a relatively sparse dataset: 278 observations grouped into 18 events (a few events have fewer than 5 data points). Per recommendations on this forum I have implemented the "Matt's trick" non-centered reparameterization. My code is below (it should be correct; the model runs fine and produces posterior distributions close to what I expect).


data {
  int<lower=1> N;                    // number of data points
  int<lower=1> E;                    // number of events
  int<lower=1, upper=E> event[N];    // event index for each observation
  real x1[N];                        // predictor 1
  real x2[N];                        // predictor 2
  int<lower=0, upper=1> y[N];        // binary outcome
}
parameters {
  vector[E] b0_raw;          // event-level adjustments (raw, non-centered scale)
  vector[E] b1_raw;
  vector[E] b2_raw;
  real<lower=0> sigma_b0;    // event-level sds
  real<lower=0> sigma_b1;
  real<lower=0> sigma_b2;
  real mu_b0;                // population means
  real mu_b1;
  real mu_b2;
}
transformed parameters{
  vector[E] b0;
  vector[E] b1;
  vector[E] b2;
  b0 = mu_b0 + sigma_b0 * b0_raw;
  b1 = mu_b1 + sigma_b1 * b1_raw;
  b2 = mu_b2 + sigma_b2 * b2_raw;
}
model {
  mu_b0 ~ normal(0, 100);    // priors for population-level coefficients
  mu_b1 ~ normal(0, 100);    // diffuse normals
  mu_b2 ~ normal(0, 100);
  sigma_b0 ~ normal(0, 25);  // half-normals via the <lower=0> constraints
  sigma_b1 ~ normal(0, 25);
  sigma_b2 ~ normal(0, 25);
  b0_raw ~ normal(0, 1);     // standard normals on the raw (non-centered) scale
  b1_raw ~ normal(0, 1);
  b2_raw ~ normal(0, 1);
  for (i in 1:N) {
    y[i] ~ bernoulli_logit(b0[event[i]] + b1[event[i]] * x1[i] + b2[event[i]] * x2[i]);
  }
}

My understanding from the Betancourt and Girolami paper is that the reparameterization follows from the equivalence of these two statements:

y_i \sim N(\theta_i, \sigma_i) with \theta_i \sim N(\mu, \tau)
and
y_i \sim N(\theta^\star_i \tau + \mu, \sigma_i) with \theta^\star_i \sim N(0, 1).

My questions are as follows: what would the equivalent statements be for a hierarchical logistic regression, and why are they valid?

If they aren’t, have I made some faulty assumption and coded my model incorrectly?

Thanks!

I haven't looked at your Stan program yet, but if you have a varying parameter with a level for each observation (e.g., the eight_schools example, just not Gaussian), then mathematically you can write something like y_i \sim {\rm Bernoulli}({\rm logit}^{-1}(\theta_i^\star \tau + \mu)).

They are valid for the same reason as when you are using a Gaussian likelihood. The relevant Gaussian for the non-centered parameterization is the Gaussian prior on \theta. In other words, it’s always true that

\theta \sim N(\mu, \tau)

is equivalent to

\theta = \theta^\star * \tau + \mu, \theta^\star \sim N(0,1)

regardless of the distribution of y. Hope that helps, and welcome to the Stan forums.
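To make that concrete, here is a minimal eight_schools-style sketch with a Bernoulli likelihood and a non-centered group effect. It is only illustrative: the names (J, group, theta_raw) and the prior scales are placeholders, not anything specific to your model.

data {
  int<lower=1> N;                   // observations
  int<lower=1> J;                   // groups
  int<lower=1, upper=J> group[N];   // group index for each observation
  int<lower=0, upper=1> y[N];       // binary outcome
}
parameters {
  real mu;                          // population mean (logit scale)
  real<lower=0> tau;                // population sd
  vector[J] theta_raw;              // non-centered group effects
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);                        // placeholder prior scales
  tau ~ normal(0, 2.5);
  theta_raw ~ normal(0, 1);                 // standard normal on the raw scale
  y ~ bernoulli_logit(theta[group]);        // likelihood sees theta, not theta_raw
}

The only Gaussian left in the program is the prior on theta_raw, which is why the trick carries over unchanged no matter what the likelihood is.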

I didn’t go through it thoroughly but here are a few things you can tweak to make your Stan program more concise and also run faster:

data {
...
  // vectors not arrays so you can do elementwise vector multiplication in model block 
  vector[N] x1; 
  vector[N] x2; 
...
}
transformed parameters{
  // same but declare and define on a single line (just cosmetic)
  vector[E] b0 = mu_b0 + sigma_b0 * b0_raw;
  vector[E] b1 = mu_b1 + sigma_b1 * b1_raw;
  vector[E] b2 = mu_b2 + sigma_b2 * b2_raw;
}
model {
  // vectorize the log likelihood (will be faster than loop)
  y ~ bernoulli_logit(b0[event] + b1[event] .* x1 + b2[event] .* x2);
}

I would also suggest tighter priors than normal(0, 100) (that's an sd of 100, i.e., a variance of 10,000) unless values like -250 and 250 really are plausible. This is especially important to think about when your data are sparse.
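For example, assuming your predictors are on roughly unit scale, something in this spirit would still be weakly informative; the particular scales (5 and 2.5) are placeholders to illustrate the idea, not a recommendation for your data:

model {
  mu_b0 ~ normal(0, 5);        // intercept on the logit scale
  mu_b1 ~ normal(0, 2.5);      // slopes, assuming x1 and x2 are roughly unit scale
  mu_b2 ~ normal(0, 2.5);
  sigma_b0 ~ normal(0, 2.5);   // half-normals via the <lower=0> constraints
  sigma_b1 ~ normal(0, 2.5);
  sigma_b2 ~ normal(0, 2.5);
  ...
}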

Ok, thanks for your advice! I've actually been struggling a bit with the priors, mostly because they're going to be something the peer reviewers (who aren't Bayesian modelers) will be picky about. I've been trying to reason through how best to determine an anticipated scale, similar to how Gelman et al. (2008) talk about creating a prior that assigns low probability to anything greater than a change of 10 on the logistic scale for a unit change in a predictor variable. Do you have any suggestions for this?

Maybe this has something

https://mc-stan.org/users/documentation/case-studies/weakly_informative_shapes.html

(and it’s a good read even if it doesn’t!)

I also suggest simulating data like yours in relevant ways and then fitting the model to that data using priors of different strengths, so you can get a sense for what effect that has.
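One way to do the simulation step entirely in Stan is a prior-predictive program: drop the model block, draw the parameters from a candidate prior in generated quantities, and run it with the fixed_param algorithm to see what kinds of outcomes each prior implies. A rough sketch, where the prior scales are placeholders you would vary:

data {
  int<lower=1> N;
  int<lower=1> E;
  int<lower=1, upper=E> event[N];
  vector[N] x1;
  vector[N] x2;
}
generated quantities {
  real mu_b0 = normal_rng(0, 5);              // candidate prior scales to compare
  real mu_b1 = normal_rng(0, 2.5);
  real mu_b2 = normal_rng(0, 2.5);
  real sigma_b0 = fabs(normal_rng(0, 2.5));   // half-normal draws
  real sigma_b1 = fabs(normal_rng(0, 2.5));
  real sigma_b2 = fabs(normal_rng(0, 2.5));
  vector[E] b0;
  vector[E] b1;
  vector[E] b2;
  int y_sim[N];
  for (e in 1:E) {
    b0[e] = normal_rng(mu_b0, sigma_b0);
    b1[e] = normal_rng(mu_b1, sigma_b1);
    b2[e] = normal_rng(mu_b2, sigma_b2);
  }
  for (i in 1:N) {
    y_sim[i] = bernoulli_logit_rng(b0[event[i]] + b1[event[i]] * x1[i] + b2[event[i]] * x2[i]);
  }
}

Comparing the implied event-level outcome patterns under normal(0, 100) and under something tighter usually makes the case for the tighter prior fairly quickly, and it gives you something concrete to show the reviewers.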