Help with understanding my results

performance
fitting-issues
specification

#1


I am fitting a multilevel 2PL IRT model to some simulated data with rstan. I have tweaked the parameters for a week and find that I cannot really understand what's going on with rstan and the results.
My data looks like this: 20_10_0.3_TwoPL_x1.txt (797.1 KB)

My Stan program looks like this:


data {
  int<lower=1> N;                        // number of students
  int<lower=1> NFam;                     // number of item families
  int<lower=1> NObs;                     // number of observations
  int<lower=1,upper=N> ID_Stu[NObs];     // student id
  int<lower=1,upper=NFam> ID_Fam[NObs];  // item family id
  int<lower=0,upper=1> Resp[NObs];       // response
  //real x[NObs];                        // design matrix
  int NIC;                               // number of item clones, fixed in this syntax
  int<lower=1,upper=NIC> ID_Clone[NObs]; // item clone id
}
parameters {
  real theta[N];                         // ability
  real mu[NFam];                         // difficulty
  real<lower=0, upper=5> alpha[NFam];    // discrimination
  //real lambda[I];                      // covariate
  real<lower=0> sigma[NFam];             // random variation
  vector[NIC] beta[NFam];                // item clone difficulties
}
model {
  row_vector[NObs] alpha_local;          // temporary variables
  row_vector[NObs] diff_local;
  int fam_n;
  theta ~ normal(0, 1);                  // priors
  mu ~ normal(0, 10);
  //lambda ~ normal(0, 10);
  sigma ~ cauchy(0, 0.3);
  for (i in 1:NFam) {
    beta[i] ~ normal(mu[i], sigma[i]);
  }
  for (n in 1:NObs) {
    fam_n = ID_Fam[n];
    alpha_local[n] = alpha[fam_n];
    diff_local[n] = theta[ID_Stu[n]] - beta[fam_n, ID_Clone[n]];
  }
  Resp ~ bernoulli_logit(alpha_local .* diff_local);
}


The rstan model fitting looks like this:

data <- list(
  N = N,                                        # number of students
  NFam = NFam,                                  # number of item families
  NObs = N * NFam * NItemClone,                 # number of observations
  Resp = as.vector(t(irv)),                     # person-by-person response vector
  ID_Stu = seq(0, length(Resp) - 1) %/% (NFam * NItemClone) + 1,       # student ID
  ID_Fam = rep(seq(0, (NFam * NItemClone - 1)) %/% NItemClone + 1, N), # item family ID
  NIC = NItemClone,                             # number of item clones
  ID_Clone = rep(rep(1:NItemClone), length(Resp) / NItemClone)         # item clone ID
)

fit1 <- stan(
  file = paste0(path, "TwoPL.stan"),  # Stan program
  data = data,                        # named list of data
  chains = 2,                         # number of Markov chains
  warmup = 200,                       # number of warmup iterations per chain
  iter = 1200,                        # total number of iterations per chain
  thin = 10,
  cores = 4,                          # number of cores
  refresh = 1,                        # show progress every 'refresh' iterations
  control = list(max_treedepth = 20, adapt_delta = 0.90)
)

Sometimes I have divergent transitions, so the warning message recommends increasing the acceptance rate, which is already 0.90. I thought that if the acceptance rate is too high, it would introduce more noise into the estimates, because we are pretty much accepting whatever is in the posterior. When there is no divergence problem, I may still have chains with really bad mixing, such as the following: 10_10_0.3_TwoPL_Mu_Traceplot_1.pdf (18.0 KB). For mu[2] and mu[7], the band is really wide, while the other parameter estimates look OK. Why would that happen?

For the autocorrelation (AC) plots, some look really bad too. I am not sure why, because I kept every 10th draw, so I thought the AC should be low for all the estimates. 20_10_0.3_TwoPL_Sigma_ACplot_1.pdf (9.8 KB)

I also spent a couple of days just playing with the iterations and the number of chains. I did read in the literature that an effective sample size of 10,000 is desirable for social science research, but my effective sample sizes across the 2 chains are around 100 for each parameter. When there are divergent transitions, some may end up with an effective sample size of 20 to 30, which is also reflected in a very high SE of the estimate. This is one output I had. If my chain is 1000 steps long after a burn-in of 200, and I thinned every 10th draw, with 2 chains, should I ideally have about 200 samples if no autocorrelation is present?

So I would appreciate it if anybody would help me take a look at my code and give me some suggestions on how to specify the number of chains, the number of iterations for IRT models, the step size (which I thought was reflected in max_treedepth), and the acceptance rate. Would 0.95 be too high? How big or small can max_treedepth be? My actual treedepth is less than 10 (for example: 10_10_0.3_TwoPL_Diagnostics_1.txt (1.9 KB)), so I specified 20. If I make Stan take big steps and accept proposals at a high rate like 0.95, would it mean it actually converges fast? But actually it does not. For the calibration I attached here, it took about 4 hours. :-( I just learned a bit about Bayesian computation, so I have so many dumb questions. Any feedback would be highly appreciated!


#2

Maybe non-center this. So do:

for (i in 1:NFam) {
  beta[i] = mu[i] + beta_z[i] * sigma[i];
}

Where beta is now defined as a local variable (you gotta put these at the top of the model block) or in the transformed parameters block, and beta_z is now a parameter:

parameters {
  vector[NIC] beta_z[NFam];
  ...
}

With prior:

model {
  for(i in 1:NFam) {
    beta_z[i] ~ normal(0, 1);
  }
}

This is usually something to try when you have divergences. Check here for a description of why (and for another example of non-centering, since what I wrote above probably isn't clear): https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html
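Putting the pieces together, a sketch of how the non-centered blocks could fit into your model (only the parts that change are shown; beta_z is the new parameter and beta moves to transformed parameters):

parameters {
  vector[NIC] beta_z[NFam];   // standardized item clone offsets
  // ... theta, mu, alpha, sigma as before ...
}
transformed parameters {
  vector[NIC] beta[NFam];     // item clone difficulties
  for (i in 1:NFam) {
    beta[i] = mu[i] + beta_z[i] * sigma[i];
  }
}
model {
  for (i in 1:NFam) {
    beta_z[i] ~ normal(0, 1); // implies beta[i] ~ normal(mu[i], sigma[i])
  }
  // ... rest of the model as before ...
}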

That intuition about acceptance rates applies to a Metropolis sampler. HMC samplers (Stan uses a variant of HMC) are not like this: it's perfectly normal to operate them near 100% acceptance rates. I'm not sure the current Stan sampler actually does rejections, haha. The short story is adapt_delta = 0.9999 is fine if that gets rid of your divergences. Divergences are the thing to worry about with HMC.

You shouldn’t need to thin your Stan samples. Use the Neff estimates to gauge the sampler efficiency and for any standard error calculations you need to do.
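For example, in rstan you can pull the Neff and Monte Carlo standard error estimates straight out of the fit summary (a sketch, assuming fit1 is your stanfit object):

# n_eff and se_mean columns of the summary, one row per parameter
s <- summary(fit1)$summary
s[, c("n_eff", "se_mean")]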

That max_treedepth is an adaptation limit, not a step size. If you need to make it larger than the default 10, there's something interesting going on; hopefully you don't have to. You shouldn't need to make it lower either: if the sampler doesn't need to use the whole tree, it automatically won't.

After non-centering, I'd replace the Cauchy prior on sigma with a normal and see if that helps. Cauchy priors are super non-informative, and that can make the sampling really slow.


#3

Thank you for your response. I will try to get convergence first and then play with the priors.


#4

Hi bbbales2, after delving into the non-convergence issue, I learned that Stan can be inefficient at estimating hierarchical models in the centered parameterization. So I tried non-centering one of the variables I am estimating, just as you suggested. It converged and is relatively efficient. I wonder if you could help me take a look at my code and my results, because I still see some caterpillar-like traceplots, and I do not understand what that means.

My stan model goes as follows:
data {
  int<lower=1> N;                        // number of students
  int<lower=1> NFam;                     // number of item families
  int<lower=1> NObs;                     // number of observations
  int<lower=1,upper=N> ID_Stu[NObs];     // student id
  int<lower=1,upper=NFam> ID_Fam[NObs];  // item family id
  int<lower=0,upper=1> Resp[NObs];       // response
  int NIC;                               // number of item clones, fixed in this syntax
  int<lower=1,upper=NIC> ID_Clone[NObs]; // item clone id
}
parameters {
  real theta[N];                         // ability
  real<lower=0, upper=5> alpha[NFam];    // discrimination
  real mu[NFam];                         // difficulty
  real<lower=0> sigma[NFam];             // random variation
  vector[NIC] beta_z[NFam];              // beta reparameterization
}

transformed parameters {
  vector[NIC] beta[NFam];                // item clone difficulties
  for (i in 1:NFam) {
    beta[i] = mu[i] + beta_z[i] * sigma[i];
  }
}

model {
  row_vector[NObs] alpha_local;          // temporary variables
  row_vector[NObs] diff_local;
  int fam_n;
  theta ~ normal(0, 1);                  // priors
  mu ~ normal(0, 10);
  sigma ~ normal(0, 1);
  for (i in 1:NFam) {
    beta_z[i] ~ normal(0, 1);
    //beta[i] ~ normal(mu[i], sigma[i]);
  }
  for (n in 1:NObs) {
    fam_n = ID_Fam[n];
    alpha_local[n] = alpha[fam_n];
    diff_local[n] = theta[ID_Stu[n]] - beta[fam_n, ID_Clone[n]];
  }
  Resp ~ bernoulli_logit(alpha_local .* diff_local);
}

The model I fitted in rstan goes as follows:
fit1 <- stan(
  file = paste0(path, "TwoPL1.stan"),  # Stan program
  data = data,                         # named list of data
  seed = 4562627,
  chains = 4,                          # number of Markov chains
  warmup = 2500,                       # number of warmup iterations per chain
  iter = 5000,                         # total number of iterations per chain
  #thin = 10,
  cores = 4,                           # number of cores
  refresh = 100,                       # show progress every 'refresh' iterations
  control = list(max_treedepth = 10, adapt_delta = 0.80)
)

The concerning trace plot looks like as follows: 10_10_0.15_TwoPL_Mu_Traceplot_1.pdf (820.1 KB)
10_10_0.15_TwoPL_Sigma_Traceplot_1.pdf (868.0 KB)

The same problem with mu[4] is reflected in the Neff too, as it is the lowest (partial results: 10_10_0.15_TwoPL_Par_Ests_1.pdf (85.4 KB)).
I ran 4 chains with 2500 post-warmup iterations each, so it should be around 10,000, but it is only around 1000 for this parameter estimate. Why are some parameter estimates so unstable? I can partly understand it: the distributions of items and people do not exactly overlap, so when items are extremely low in terms of alpha, there is not much information (reflected in people) down there to estimate those item parameters. What do you think? Should I be concerned about the unstable estimates after I get a converged solution?

I also have a question about the number of iterations (and, relatedly, Neff). How large should Neff be to get stable parameter estimates? I did read that it differs from field to field. Is there a rough rule of thumb?

I am running a simulation study. In my field, we always run multiple replications and then aggregate the results. How many replications would you recommend, based on the rstan model I fit? Since I already ran 4 chains of 2500 iterations, I think of the chains as replications as well. I guess I am trying to save time while getting the most reliable results. :-) Thank you for your attention.


#5

Neff of 1000 should be pretty good.

Check the MC_error column there. That’s an estimate of the Monte Carlo error: https://web.northeastern.edu/afeiguin/phys5870/phys5870/node71.html.

The posterior mean of mu[4] is -2.8. The MC_error estimate is like 0.05. So do you need to be able to identify the mean of mu[4] more accurately than within a range of like -2.9 to -2.7? If so, you’ll need more samples. If not, you’re done.
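To make that arithmetic concrete, here is a minimal sketch in Python (the draws and Neff value are made up for illustration, not from your fit) of how the MC error relates the posterior sd to Neff:

```python
import numpy as np

# Hypothetical posterior draws for one parameter (e.g., mu[4]),
# centered near the posterior mean quoted above.
rng = np.random.default_rng(0)
draws = rng.normal(loc=-2.8, scale=0.5, size=1000)

# Effective sample size as reported by Stan (hypothetical value).
n_eff = 100

# Monte Carlo standard error of the posterior mean:
# the uncertainty shrinks with sqrt(Neff), not with the raw number of draws.
mc_error = np.std(draws, ddof=1) / np.sqrt(n_eff)

print(float(np.mean(draws)))  # posterior mean estimate, near -2.8
print(float(mc_error))        # roughly sd / 10 here, i.e. about 0.05
```

Note that the posterior sd stays fixed no matter how long you run; only the MC error of summaries like the mean shrinks with more effective samples.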

We’re usually interested in more complicated things than the mean of one parameter. If you want to avoid going crazy checking Monte Carlo diagnostics on your posterior samples, work on checking your model calibration with posterior predictives and such once the Rhat/Neff diagnostics seem okay.

I’ll worm out of this and say it is up to you to convince yourself and your audience :D.

The generic ways to check a model are:

  1. Make sure your prior predictions are sane
  2. Convince yourself that you can recover parameters with simulated data
  3. Make sure your diagnostics are good (4 chains, Rhats = 1.0, no divergences, enough Neffs)
  4. Check the calibration of the model with posterior predictions

So if you’ve done all those things and you’re happy with your model, awesome, make a nice presentation and convince other folks of your stuff.


#6

Thank you very much, Ben. All the comments are very helpful. :-)