My first Stan model - hierarchical logistic regression


#1

Hello!

I am a psychologist working on data from a psychophysical experiment. I analyze the data with a hierarchical logistic regression with 7 predictors: three main predictors (manipulated in the experiment, within each subject) and their two- and three-way multiplicative interactions. One of the main predictors is “continuous” with levels 1:3 (normalized by subtracting the mean and dividing by two standard deviations), and the rest are binary.
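For concreteness, this is the normalization I mean (a minimal Python sketch; the level counts here are invented, only the 1:3 coding is from my design):

```python
import numpy as np

def standardize_2sd(x):
    """Center a predictor and divide by two standard deviations,
    which puts it on roughly the same scale as a 0/1 binary predictor."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std())

# hypothetical predictor taking the three levels 1, 2, 3
raw = np.repeat([1, 2, 3], 10)
z = standardize_2sd(raw)
# z now has mean 0 and standard deviation 0.5
```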

Up to now, I used JAGS. For each of my subjects (N=34) I estimate individual coefficients (beta0 plus beta(1:7)), and all subjects are modeled as belonging to the same group, with group parameters betamu0, betasigma0, betamu(1:7), betasigma(1:7), which I use to infer whether the psychological effect is present or not.
The problem is that my predictors are inherently correlated, and my JAGS model yields a very small ESS (I tried increasing the iterations, thinning, etc.; it didn’t help).

I built a Stan model, which ran on my computer, but it took hours to complete the first 5000 iterations of the first of three chains (even when I simplified the model to 2 of the 7 beta parameters). I suspect I am doing something wrong, and my model is certainly completely non-vectorized, as in JAGS, because I was unsure about vectorized operations in Stan (even after working through your manual).

Do you have any suggestions on how to correct and optimize my model? And how much running time would count as pathologically slow?

This help is much appreciated, and it would allow me to reanalyze data from 6 related experiments which I ran!

data structure: rows are trials, one column for subject, 7 columns for predictors, one column for a binary response.

the model:
modelString = "
data {
int<lower=1> NSubj ;
int<lower=1> Ntotal ;
int<lower=1> Nx ;
real<lower=0> w[Ntotal] ;
int<lower=1> s[Ntotal] ;
matrix [Ntotal, Nx] x ;
int<lower=0,upper=1> y[Ntotal] ;
}
parameters {
real beta0mu ;
real<lower=0> beta0sigma ;
real betamu[Nx] ;
real<lower=0> betasigma[Nx] ;
real beta0[NSubj] ;
real beta[NSubj,Nx] ; // was beta[Nx,Nsubj]: NSubj misspelled and dimensions reversed relative to the indexing below
}
model {
//prior on group params
//first, intercept:
beta0mu ~ normal( 0, 10 );
beta0sigma ~ uniform( 1.0E-3 , 1.0E+3 );
//multiple predictors:
for ( predictor in 1:Nx ) {
betamu[predictor] ~ normal( 0 , 10 );
betasigma[predictor] ~ uniform( 1.0E-3 , 1.0E+3 );
} // predictors loop

// subject level params (drawn from group params)
for ( subject in 1:NSubj ) {
beta0[subject] ~ normal( beta0mu , beta0sigma ) ;
for ( predictor in 1:Nx ) {
beta[subject, predictor] ~ normal( betamu[predictor] , betasigma[predictor] );
} // predictors loop
} // subjects loop

// likelihood
for ( i in 1:Ntotal ) {
// In Stan, bernoulli_logit is the logit-link Bernoulli:
y[i] ~ bernoulli_logit( beta0[s[i]] + sum( beta[s[i],1:Nx].*x[i,1:Nx]));
} // all rows loop
}"


#2

First off, you shouldn’t need 5K iterations for a logistic regression. Stan usually mixes much more efficiently than BUGS and you can get the same n_eff with many fewer iterations. So try running fewer iterations to start.

Then make sure you have the right answer.

It’ll certainly speed up if you vectorize. E.g.,

beta0 ~ normal(beta0mu, beta0sigma);

and

for (predictor in 1:Nx)
  beta[ , predictor] ~ normal(betamu[predictor] , betasigma[predictor]);

instead of the double loop.

I had a hard time understanding what the sum’s doing here:

y[i] ~ bernoulli_logit( beta0[s[i]] + sum( beta[s[i],1:Nx].*x[i,1:Nx])); } // all rows loop

You can probably turn that into some kind of matrix operation, which will be much much faster than indexing like this.
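For example, here is the same arithmetic in an illustrative NumPy sketch (not Stan code; the array names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
Ntotal, Nx = 5, 7
beta_s = rng.normal(size=Nx)       # one subject's coefficient vector
X = rng.normal(size=(Ntotal, Nx))  # that subject's predictor rows

# sum of an elementwise product, row by row (what the loop does) ...
slow = np.array([np.sum(beta_s * X[i]) for i in range(Ntotal)])
# ... is just a matrix-vector product, done in a single call
fast = X @ beta_s
```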

The priors you’re using can be improved. That interval prior says a value above 1 is roughly 1000 times as likely as a value below 1, and that values below 1/1000 or above 1000 are impossible. We recommend softer priors that concentrate more of the prior mass where you think the answer is. JAGS and BUGS gave people very bad Bayesian habits!
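You can check that arithmetic directly (a quick sketch):

```python
# Prior mass implied by uniform(1.0E-3, 1.0E+3) on a scale parameter
lo, hi = 1.0e-3, 1.0e3
p_above_1 = (hi - 1.0) / (hi - lo)  # mass on (1, 1000)
p_below_1 = (1.0 - lo) / (hi - lo)  # mass on (0.001, 1)
odds = p_above_1 / p_below_1        # about 1000:1 in favour of sigma > 1
```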

The usual problem with slow mixing is that there are violated constraints or that there are unidentified parameters. For instance, you should be declaring

real<lower=1e-4, upper=1e4> betasigma[Nx];

given your uniform prior, but we’d recommend just using <lower=0> and including a softer prior on beta sigma, like normal(0, 5) or something if you don’t expect values greater than 15.
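To see how much softer that is, here is a quick sketch comparing tail masses (the erfc identity is just the half-normal survival function):

```python
import math

# Tail mass of a half-normal(0, 5) prior beyond 15 (three scale units):
# for half-normal(0, s), P(X > t) = erfc(t / (s * sqrt(2)))
scale, t = 5.0, 15.0
p_tail = math.erfc(t / (scale * math.sqrt(2.0)))   # roughly 0.3%

# under uniform(1e-3, 1e3), the mass above 15 would be about 98.5%
p_tail_uniform = (1.0e3 - t) / (1.0e3 - 1.0e-3)
```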


#3

It is tough to fit a model like that with only 34 observations, but then again people keep fitting the eight schools model with only eight schools. I personally would have started with stan_glmer in the rstanarm package to get an idea of what effective sample size is possible:

library(rstanarm)
options(mc.cores = parallel::detectCores())
post <- stan_glmer(y ~ (continuous * binary_1 * binary_2) + 
                       (continuous * binary_1 * binary_2 || subject), 
                        data = dataset, family = binomial(), QR = TRUE)
summary(post) # look at n_eff

That will hopefully give you a sense as to whether estimating all that structure is even feasible from this dataset and then you might be able to go back and tweak things.


#4

Bob and Ben, thanks a lot for taking the time to help.

Ben - sorry for being unclear, I don’t have 34 observations, but rather 34 subjects with 600-700 observations each.

Bob, I adjusted my model to a more vectorized one, as suggested in the hierarchical logistic regression section of the manual (p. 139). I changed my previous uniform prior on sigma to a normal(0, 5) with a lower bound at zero. I also used matrix-by-column-vector multiplication instead of the weird sum.

I ran the model with 2 of the 7 group parameters and found small ESS and high autocorrelation. I added the

control = list(adapt_delta = 0.99)

argument to the sampling function.
I then moved on to run my full model with 10K iterations and no thinning. It ran for around eight hours (is that normal?) and ended with the warning “There were 3 chains where the estimated Bayesian Fraction of Missing Information was low”. I got ESS < 1000 for some of my crucial group-level predictors.

As you can see, my predictors are correlated:

CORRELATION MATRIX OF PREDICTORS:
                Distance Context  Style ConXDist DistXStyle ConXStyle ConXStyleXDist
Distance          1.000   0.003  0.003    0.707      0.701     0.003          0.495
Context           0.003   1.000 -0.001    0.003      0.002     0.570          0.003
Style             0.003  -0.001  1.000    0.001      0.003     0.581          0.003
ConXDist          0.707   0.003  0.001    1.000      0.495     0.003          0.700
DistXStyle        0.701   0.002  0.003    0.495      1.000     0.003          0.707
ConXStyle         0.003   0.570  0.581    0.003      0.003     1.000          0.005
ConXStyleXDist    0.495   0.003  0.003    0.700      0.707     0.005          1.000
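That ~0.7 correlation between a main effect and its interaction is what 0/1 coding produces mechanically; a quick simulation (independent draws, not my actual data) reproduces it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)          # centered continuous predictor ("Distance")
z = rng.integers(0, 2, size=n)  # independent 0/1 binary predictor

r = np.corrcoef(x, x * z)[0, 1]
# r comes out near 1/sqrt(2) ~ 0.707, matching the 0.70 entries above,
# even though x and z are statistically independent
```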

This is the vectorized version of my model:

modelString = "
data {
  int<lower=1> NSubj ;  // L
  int<lower=1> Ntotal ; // N
  int<lower=1> Nx ;     // D
  int<lower=1> s[Ntotal] ;          // ll[N]
  row_vector[Nx] x[Ntotal] ;        // row_vector[D] x[N]
  int<lower=0,upper=1> y[Ntotal] ;  // y[N]
}
parameters {
  real beta0[NSubj] ;
  vector[Nx] beta[NSubj] ;
  real betamu[Nx] ;
  real beta0mu ;
  real<lower=0> beta0sigma ;
  real<lower=0> betasigma[Nx] ;
}
model {
  // priors on group params
  // first, intercept:
  beta0mu ~ normal( 0, 10 );
  beta0sigma ~ normal( 0, 5 );  // was uniform( 1.0E-3 , 1.0E+3 )
  betamu ~ normal( 0, 10 );
  betasigma ~ normal( 0, 5 );   // was uniform( 1.0E-3 , 1.0E+3 )

  // subject-level params (drawn from group params)
  for (sub in 1:NSubj) {
    beta0[sub] ~ normal( beta0mu , beta0sigma );
    beta[sub] ~ normal( betamu , betasigma );
  } // sub

  // likelihood
  {
    vector[Ntotal] x_beta_sub;
    for (n in 1:Ntotal)
      x_beta_sub[n] = beta0[s[n]] + x[n] * beta[s[n]];
    y ~ bernoulli_logit(x_beta_sub);
  }
}
" # close quote for modelString

Do you have any suggestions on how to proceed?

Thanks again for the guidance,

Dan


#5

Try non-centered parameterizations for beta0 and beta if those are the parameters giving you trouble.
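The idea, sketched numerically in Python rather than Stan (the mu and sigma values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 1.5, 0.3, 200_000  # made-up group-level values

centered = rng.normal(mu, sigma, size=n)  # beta ~ normal(mu, sigma)
beta_raw = rng.normal(size=n)             # beta_raw ~ normal(0, 1)
non_centered = mu + sigma * beta_raw      # beta = mu + sigma * beta_raw

# Both constructions target the same distribution; the sampler just
# explores the better-behaved beta_raw space in the non-centered version.
```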

You can use three back-ticks to start/end code blocks and dumps. I’ll edit the one you included to make it easier to read.


#6

You also want to vectorize beta0 ~ normal(beta0mu, beta0sigma) and same for beta.

And we recommend not including the code as a string as it means you can’t use print statements and you can’t dereference line numbers in error messages easily.


#7

Bob, thanks a lot.

I am now working on your suggestions and will follow-up soon.


#8

Hi Bob! Hope all is well on your end.

As you suggested, I tried a non-centered reparametrization (as on p. 332 of the reference manual). This time I couldn’t run the sampling at all.
As a newcomer to Stan, I am having some difficulty understanding the vectorized operations.
I couldn’t implement the vectorization of beta and beta0 for the same reason.

This is the error I got:

error occurred during calling the sampler; sampling not done
Error in 1:ncol(stanFit) : NA/NaN argument

and

[581] "Rejecting initial value:"
[582] "  Error evaluating the log probability at the initial value."
[583] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"
[584] "Exception thrown at line 41: multiply: B[1] is nan, but must not be nan!"

This is the model:

data {
  int<lower=1> NSubj ;  // L
  int<lower=1> Ntotal ; // N
  int<lower=1> Nx ;     // D
  int<lower=1> s[Ntotal] ;          // ll[N]
  row_vector[Nx] x[Ntotal] ;        // row_vector[D] x[N]
  int<lower=0,upper=1> y[Ntotal] ;  // y[N]
}
parameters {
  vector[Nx] beta_raw[NSubj] ;
  real beta0_raw[NSubj] ;
  //real beta0[NSubj] ;
  vector[Nx] betamu ;  // mu[D]
  real beta0mu ;
  real<lower=0> beta0sigma ;
  vector<lower=0>[Nx] betasigma ;
}
transformed parameters {
  vector[Nx] beta[NSubj] ;
  vector[NSubj] beta0 ;
  // implies: beta ~ normal(betamu, betasigma), and same for beta0
  for (i in 1:NSubj) {
    beta[NSubj] = betamu + betasigma .* beta_raw[NSubj];
    beta0[NSubj] = beta0mu + beta0sigma .* beta0_raw[NSubj];
  }
}
model {
  // priors on group params
  // first, intercept:
  beta0mu ~ normal( 0, 10 );
  beta0sigma ~ normal( 0, 5 );  // was uniform( 1.0E-3 , 1.0E+3 )
  betamu ~ normal( 0, 10 );
  betasigma ~ normal( 0, 5 );   // was uniform( 1.0E-3 , 1.0E+3 )
  for (i in 1:NSubj) {
    beta_raw[NSubj] ~ normal( 0 , 1 );
  }
  beta0_raw ~ normal( 0 , 1 );
  {
    vector[Ntotal] x_beta_sub;
    for (n in 1:Ntotal)
      x_beta_sub[n] = beta0[s[n]] + x[n] * beta[s[n]];
    y ~ bernoulli_logit(x_beta_sub);
  }
}

#9

Just looking over this model (without reading the backlog too carefully), these lines look suspicious:

for (i in 1:NSubj) {
    beta[NSubj] = betamu + betasigma .* beta_raw[NSubj];
    beta0[NSubj] = beta0mu + beta0sigma .* beta0_raw[NSubj];
}

And:

for (i in 1:NSubj) {
  beta_raw[NSubj] ~ normal( 0 , 1 );
}

I suspect you want to index over i in both cases:

for (i in 1:NSubj) {
    beta[i] = betamu + betasigma .* beta_raw[i]; // Each element here is a vector
    beta0[i] = beta0mu + beta0sigma * beta0_raw[i]; // Don't need the .* since we're dealing with scalars here
}

...

for (i in 1:NSubj) {
  beta_raw[i] ~ normal( 0 , 1 );
}

Does that make sense?
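Here’s the same constant-index bug in NumPy terms (hypothetical shapes), which shows why the sampler saw nan at initialization:

```python
import numpy as np

NSubj, Nx = 4, 3
betamu = np.zeros(Nx)
betasigma = np.ones(Nx)
beta_raw = np.arange(NSubj * Nx, dtype=float).reshape(NSubj, Nx)

buggy = np.full((NSubj, Nx), np.nan)
for i in range(NSubj):
    # constant index: always writes/reads the last row, like beta[NSubj]
    buggy[NSubj - 1] = betamu + betasigma * beta_raw[NSubj - 1]

fixed = np.full((NSubj, Nx), np.nan)
for i in range(NSubj):
    fixed[i] = betamu + betasigma * beta_raw[i]  # loop index, like beta[i]

# buggy fills only the last row; the rest stay NaN, which is exactly
# why the log probability evaluated to nan at the initial value
```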


#10

This is definitely correct. Thanks for the quick wake-up call! I don’t know what happened to me there.

Now the model is running; let’s see if the reparametrization works…


#11

Do you have any suggestions on how to vectorize instead of looping here? I couldn’t do it with
beta = betamu + betasigma .* beta_raw
because the variables have different sizes.


#12

Hmm, I’m not too familiar with the vectorization stuff. It looks pretty good to me (I don’t see an obvious way to make it faster), but someone else might know for sure. The operations you’re using are mostly high-level vector operations already.


#13

Thanks, Ben.
Is it normal that I’m still at 19% [warmup] in the first of three chains (saving 10k steps, 1k adaptation and 2k burn-in)? It has been more than 90 minutes since it started running.


#14

is it normal that i’m still in 19% [warmup]

A lot of times when models run unusually slow in Stan it’s a good hint that there’s some sort of weirdness in the model itself. Maybe there’s a non-identifiable parameter or something and NUTS is just having to do a lot of work to explore the posterior.

Given how many parameters you have (at least Nx * NSubj – how big are all these Ns?), it might be good to just run a chain that’ll finish in a reasonable amount of time and see if there are any super-correlated parameters in the posterior (do pair plots, or ShinyStan is pretty easy if you’re using R). This’ll help you identify whether the parameterization needs to be changed somehow.


#15

You only need to run a hundred or so iterations to see if a model is working. If the step size is crashing, don’t just keep running it. This is not a coding-efficiency issue; it’s a parameterization issue.


#16

Thanks. That’s a lesson to learn.
I will do so and will follow up tomorrow.


#17

There’s not an easy way to vectorize what you have, because you’re essentially broadcasting a copy of betamu to each row of a matrix and similarly multiplying each row elementwise by betasigma. These could all be pushed up into matrices so you could write a one-liner, but it wouldn’t be any more efficient than what you’re doing.
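Here’s what I mean in NumPy terms (illustrative shapes only):

```python
import numpy as np

rng = np.random.default_rng(3)
NSubj, Nx = 34, 7
betamu = rng.normal(size=Nx)
betasigma = np.abs(rng.normal(size=Nx))
beta_raw = rng.normal(size=(NSubj, Nx))

# the per-subject loop ...
looped = np.empty((NSubj, Nx))
for i in range(NSubj):
    looped[i] = betamu + betasigma * beta_raw[i]

# ... collapses to one broadcasted line, but the arithmetic underneath
# is identical, so there is nothing to gain in speed
broadcast = betamu + betasigma * beta_raw
```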


#18

Try to simulate a small data set according to the model and fit it to make sure everything’s programmed right before setting off long runs. It will be faster to develop/debug the model and make you more confident in the final results.
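Something like this sketch is enough to generate fake data for that check (in Python; all the “true” parameter values here are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
NSubj, Nx, n_per = 5, 3, 50  # small, fast to fit

# made-up "true" group-level parameters to try to recover
beta0mu, beta0sigma = 0.5, 0.5
betamu = np.array([1.0, -0.5, 0.2])
betasigma = 0.3

# subject-level intercepts and slopes drawn from the group
beta0 = rng.normal(beta0mu, beta0sigma, size=NSubj)
beta = betamu + betasigma * rng.normal(size=(NSubj, Nx))

s = np.repeat(np.arange(NSubj), n_per)    # subject index per trial
X = rng.normal(size=(NSubj * n_per, Nx))  # standardized predictors
eta = beta0[s] + np.einsum('nd,nd->n', X, beta[s])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # binary responses
# feed (X, s, y) to the Stan model and check the posteriors cover
# the true values before running on real data
```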


#19

First, Stan has only a warmup phase (there is no separate adaptation and burn-in), and by default it is half the iterations. Usually we can get by with many fewer than 10K iterations using HMC.

And since Discourse is nagging me to multi-answer and I’m too lazy to link: I’m also wondering whether you really want a different sigma for each beta, because that doesn’t look like what you intended from the math. If you don’t, then you can easily vectorize the expensive part of your definition.


#20

Hi Bob, Krzysztof and Ben, thanks for helping me with my first model. I am learning a lot here.

I was abroad and therefore didn’t answer for a while.

I took my simplest data set (5 subjects, 3 predictors [a subset of a data set that I modeled successfully in JAGS]) and ran it for 300 iterations with 150 warmup. It took ~200 seconds to run.

Bob, following your suggestion, I switched to one sigma for all betas instead of one per beta, and tried to simulate a small data set. See the attached model.

  1. All of my mu parameters for beta (betamu[1:3]) yielded ESS < 10% of the total sample size and Rhat > 1.1.
    *the group parameters show high autocorrelation.

  2. I can’t do the PP check directly in ShinyStan because the number of columns of my y_tilde variable is somehow different from the length of the data (I have Ntotal vectors/parameters [3639, the number of observations in the original data], each of length 150 [= the number of saved iterations]), so I transposed it in RStudio and then did the check. Is that OK? Is there a way to do it without the transpose? I also get the following error in RStudio when I work in ShinyStan: “Warning: Error in data.frame: arguments imply differing number of rows: 3639, 150”, and this, I guess, is what prevents me from looking at scatter plots.

  3. When doing the posterior predictive check:
    a. Distributions of observed data and a random sample of replications: the binary responses show a slightly higher probability of 1 responses vs. 0 responses, and the simulated data seem to show this as well. Just a clarification: does the plot show data simulated with the parameter values from each saved iteration of the model?
    b. Distributions of test statistics: the observed values sit in the midst of the simulated values; sd(yrep) is skewed slightly left relative to the observed value.

I remind you that this is with a non-centered reparametrization for beta and beta0.

Here’s the model:

  data {
    int<lower=1> NSubj ;  // L
    int<lower=1> Ntotal ; // N
    int<lower=1> Nx ;     // D
    int<lower=1> s[Ntotal] ;          // ll[N]
    row_vector[Nx] x[Ntotal] ;        // row_vector[D] x[N]
    int<lower=0,upper=1> y[Ntotal] ;  // y[N]
  }
  parameters {
    vector[Nx] beta_raw[NSubj] ;
    real beta0_raw[NSubj] ;
    //real beta0[NSubj] ;
    vector[Nx] betamu ;  // mu[D]
    real beta0mu ;
    real<lower=0> beta0sigma ;
    //vector<lower=0>[Nx] betasigma ;
    real<lower=0> betasigma ;
  }
  transformed parameters {
    vector[Nx] beta[NSubj] ;
    vector[NSubj] beta0 ;
    // implies: beta ~ normal(betamu, betasigma), and same for beta0
    for (i in 1:NSubj) {
      beta[i] = betamu + betasigma * beta_raw[i];
      beta0[i] = beta0mu + beta0sigma * beta0_raw[i];
    }
  }
  model {
    // priors on group params
    // first, intercept:
    beta0mu ~ normal( 0, 5 );
    beta0sigma ~ normal( 0, 5 );  // was uniform( 1.0E-3 , 1.0E+3 )
    betamu ~ normal( 0, 5 );
    betasigma ~ normal( 0, 5 );   // was uniform( 1.0E-3 , 1.0E+3 )
    for (i in 1:NSubj) {
      beta_raw[i] ~ normal( 0 , 1 );
    }
    beta0_raw ~ normal( 0 , 1 );
    {
      vector[Ntotal] x_beta_sub;
      for (n in 1:Ntotal)
        x_beta_sub[n] = beta0[s[n]] + x[n] * beta[s[n]];
      y ~ bernoulli_logit(x_beta_sub);
    }
  }
  generated quantities {
    int y_tilde[Ntotal];  // integer draws, one replicated response per trial
    for (n in 1:Ntotal)
      y_tilde[n] = bernoulli_logit_rng(beta0[s[n]] + x[n] * beta[s[n]]);
  }

Thanks a lot again!