 # Different results between regular binomial model and one that uses sufficient stats

I was trying to speed up a hierarchical binomial model that I’m using to run some pre-experiment simulations to plan my sample sizes, and I noticed different posteriors between my regular version and my optimized version. I reduced things down to a minimal reproducible example that I’ve included below (inc. R code for data generation). In `binom_mod_long.stan` I use `bernoulli_logit()` while in `binom_mod_matrix.stan` I try to use the sufficient statistics trick from the manual. The latter runs much faster but yields a lower n_eff, and even when I boost the requested number of samples to make n_eff match, the posteriors for the `pop_mean` parameter are discernibly different. Presumably I’ve misunderstood something or made an error somewhere, but haven’t been able to figure it out myself yet. Help?

`binom_mod_matrix.stan`

``````data {
// nSubj: number of subjects
int nSubj ;
// nK: number of observations per participant
int nK ;
// Y: matrix of outcomes (0/1)
int Y[nSubj,nK] ;
// prior_mean_for_pop_mean
real prior_mean_for_pop_mean ;
}
transformed data{
// Y_sum: sum of observations for each participant
int Y_sum[nSubj] ;
for(s in 1:nSubj){
Y_sum[s] = sum(Y[s,]);
}
}
parameters {
// pop_mean: population mean
real pop_mean ;
// pop_sd: population sd
real<lower=0> pop_sd ;
// subj_coefs: coefficients for each subject
vector[nSubj] subj_coefs ;
}
model {
pop_mean ~ normal(prior_mean_for_pop_mean,1) a;
pop_sd ~ normal(0,1) ;
subj_coefs ~ normal(0,1) ; //non-centered parameterization
Y_sum ~ binomial(nK,inv_logit(pop_mean + subj_coefs*pop_sd)) ;
}
``````

`binom_mod_long.stan`

``````data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// Y: vector of outcomes (0/1)
int Y[nY] ;
// S: integer label of subject associated with each entry in Y
int S[nY] ;
// prior_mean_for_pop_mean
real prior_mean_for_pop_mean ;
}
parameters {
// pop_mean: population mean
real pop_mean ;
// pop_sd: population sd
real<lower=0> pop_sd ;
// subj_coefs: coefficients for each subject
vector[nSubj] subj_coefs ;
}
model {
pop_mean ~ normal(prior_mean_for_pop_mean,1) ;
pop_sd ~ normal(0,1) ;
subj_coefs ~ normal(0,1) ; //non-centered parameterization
Y ~ bernoulli_logit( pop_mean + subj_coefs[S]*pop_sd ) ;
}
``````

`binom_sim.r`

``````#load packages
library(tidyverse)
library(rstan)

#compile the models
binom_mod_matrix = rstan::stan_model(file='binom_mod_matrix.stan')
binom_mod_long = rstan::stan_model(file='binom_mod_long.stan')

#simulation parameters
N = 100
K = 100
SD = 1
prop_mean = .8

#set the random seed for reproducibility
set.seed(1)

#combine to long data format
subj_true_coefs = data.frame(
subj = 1:N
, value = rnorm(N,qlogis(prop_mean),SD)
)

#generate observed binomial data
a_mat = matrix(NA,nrow=N,ncol=K)
for(i in 1:nrow(subj_true_coefs)){
a_mat[i,] = rbinom(K,1,plogis(subj_true_coefs\$value[i]))
}

#sample the matrix-formatted data
fit_matrix = rstan::sampling(
object = binom_mod_matrix
, data = list(
nSubj = N
, nK = K
, Y = a_mat
, prior_mean_for_pop_mean = qlogis(prop_mean)
)
, chains = 4
, iter = 2e3
, refresh = 0
)

#reshape for the long model
a_long = tidyr::gather(as.data.frame(a_mat))

#sample the long-formatted data
fit_long = rstan::sampling(
object = binom_mod_long
, data = list(
nSubj = N
, nY = N*K
, Y = a_long\$value
, S = as.numeric(as.factor(a_long\$key))
, prior_mean_for_pop_mean = qlogis(prop_mean)
)
, chains = 4
, iter = 2e3
, refresh = 0
)

#How long did each take
mean(rowSums(get_elapsed_time(fit_matrix)))
# <1s
mean(rowSums(get_elapsed_time(fit_long)))
# ~25s

#check whether the samples for the pop_mean parameter roughly match
summary(fit_matrix,par=c('pop_mean'))\$summary
#             mean     se_mean         sd    2.5%      25%      50%     75%    97.5%    n_eff     Rhat
#pop_mean 1.465358 0.003453002 0.08865914 1.28569 1.408115 1.465362 1.52566 1.638436 659.2549 1.002375

summary(fit_long,par=c('pop_mean'))\$summary
#             mean     se_mean         sd     2.5%      25%      50%     75%    97.5%    n_eff      Rhat
#pop_mean 1.294175 0.000279221 0.02465094 1.246923 1.277222 1.293975 1.31059 1.342536 7794.187 0.9995637
``````

Hi,
I am not 100% sure what is happening. I am able to reproduce your discrepancy with the original model, but rewriting `binom_mod_matrix.stan` to have exactly the same input as `binom_mod_long.stan` makes the discrepancy go away so I’d guess this has to do something with your code to reformat the data:

`binom_mod_long_sufficient.stan`

``````data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// Y: vector of outcomes (0/1)
int Y[nY] ;
// S: integer label of subject associated with each entry in Y
int S[nY] ;
// prior_mean_for_pop_mean
real prior_mean_for_pop_mean ;
}
transformed data{
// Y_sum: sum of observations for each participant
int Y_sum[nSubj] = rep_array(0, nSubj);
int nK[nSubj] = rep_array(0, nSubj);
for(n in 1:nY) {
Y_sum[S[n]] = Y_sum[S[n]] + Y[n];
nK[S[n]] = nK[S[n]] + 1;
}
}
parameters {
// pop_mean: population mean
real pop_mean ;
// pop_sd: population sd
real<lower=0> pop_sd ;
// subj_coefs: coefficients for each subject
vector[nSubj] subj_coefs ;
}
model {
pop_mean ~ normal(prior_mean_for_pop_mean,1);
pop_sd ~ normal(0,1) ;
subj_coefs ~ normal(0,1) ; //non-centered parameterization
Y_sum ~ binomial(nK,inv_logit(pop_mean + subj_coefs*pop_sd));
}
``````

Hope that helps!

3 Likes

At the risk of violating social norms, I’m going to bump this thread and renew my request for help tracking down what’s going on here. I presume @martinmodrak is correct that something is erroneous in how the data is either being packed in r or indexed in Stan but i’ve hit a wall in my attempts at debugging. Any help would be greatly appreciated!

Hi Mike,
I played with your example and got the same results. But then I tried to reduce the number of subjects and it seems that it is indeed the way the long format of the data is generated that is off. If you transpose the a_mat matrix before applying the gather function, the long model gives the same results as the matrix version

``````a_long = tidyr::gather(as.data.frame( t(a_mat) ))
``````
3 Likes

@lott999, good find and welcome! I was looking at this, but you beat me to it :)

`tidyr::gather` stacks the columns of `a_mat` so the index of `a_long` is going across subjects first. Each subject’s first trial, and then each subject’s second trial, and so on. Transposing `a_mat` before `gather`ing flips the index to be within subject first.

Alternatively, you could define `S` to deal with the indexing across subjects

``````#sample the long-formatted data
fit_long = rstan::sampling(
...
, Y = a_long\$value # without t()
, S = as.numeric(gl(100, 1, N*K))
...
)
``````
2 Likes