Convergence issues with a hierarchical model

JohnDoe · February 10, 2022, 9:12am

Hello!

For some time now I am struggling with estimation of a hierarchical model. I will briefly describe the model here and then provide my Stan code.

Let’s start with the manifest variables in the model. One of the manifest variables is a binary variable Y, representing whether a person has responded correctly to an item. The other manifest variable is a continuous variable t, representing the response time of each person to each item. Besides these two variables, we also have the auxiliary variable X, which represents the ‘time scale’ of each item. This variable X is calculated as (i-1)/k, where i is the position of the current item and k is the total number of items.

The latent variable part of the model rests upon the item response theory models for responses and response times. Usually what you would have is the latent variable theta, which represents respondents’ latent ability (accuracy) and the latent variable tau, which represents the person’s latent speed. Theta is defined so that P(Y = 1) = (e^(theta - b))/(1 + e^(theta - b)), where b is the difficulty of the item. This is the so-called Rasch model. The latent variable tau is defined so that the logarithm of t (the response times) is normally distributed with mean beta - tau and variance alpha, beta being the time intensity parameter of the item, analogous to the item difficulty parameters b.

We extend this model to a latent growth model by replacing theta and tau with their respective growth terms, so instead of theta we would have theta0 + theta1X (where theta0 is the intercept, theta1 is the slope and X is the time scale variable described in the beginning). We do the same for tau, and so tau is replaced by tau0 + tau1X.

Therefore, we have four latent variables (theta0, theta1, tau0 and tau1) for which a multivariate normal distribution is assumed.

To identify the model, we have set all the means of the latent variables to 0 and we have constrained the alpha (variance of the response times) to be the same across all items. Other item parameters (b and beta) vary across items.

For several months already I have been trying to fit this model, but I am constantly running into convergence issues. Specifically, what seems to be the problem are the variance parameters of the latent variables and the correlation matrix of the latent variables. Below I will post my Stan code, my R code that I use to generate simulated data and some traceplots which demonstrate bad convergence. I would really appreciate any help with improving the model. I have already tried most of the suggestions on the reparametrization of the model, so I am all out of ideas and I don’t know what else I could try. Currently I am trying to run the model with really large sample sizes (10000 respondents and 50 items), but I don’t have a feeling that that will actually help.

EDIT: Also what would be maybe important to note is that I use the following arguments for the estimation in Stan:

max_treedepth = 20
adapt_delta = .99
stepsize = .01

Stan code:

data {
  int<lower=0> n_person; // number of examinees
  int<lower=0> n_item;  // number of items
  int<lower = 1> Dim; // number of dimensions 
  int<lower = 1> N; // number of data points
  int<lower = 1> jj[N]; // item id
  int<lower = 1> ii[N]; // person id
  int<lower=0, upper = 1> Y[N];
  vector<lower=0>[N] t; // response time data
  vector<lower=0, upper=1>[N] X; // time scale variable
  vector[Dim] Zero; // vector of 0s for person parameter means
}

parameters {
  matrix[n_person, Dim] PersParStar;      // provisional variable used in reparametrization of the multivariate normal
  cholesky_factor_corr[Dim] choleskyP;   // Cholesky decomposition of correlation of person parameters
  vector<lower=0>[Dim] sigmaPStar; // provisional variable used in reparametrization of variances
  vector[n_item] b;          // item difficulty parameters
  vector[n_item] beta;       // item time intensity parameters
  real<lower=0> alphaStar;      // common variance of RTs
  real<lower=0> Prec1;          // provisional precision variable used for the reparametrization of t-distribution for variances
  real<lower=0> Prec2;          // same as above
}

transformed parameters {
  vector<lower=0>[Dim] sigmaP = sigmaPStar/sqrt(Prec1); //variance of person parameters
  real<lower=0> alpha = alphaStar/sqrt(Prec2);         // variance of response times
  matrix[n_person, Dim] PersPar = PersParStar * diag_pre_multiply(sigmaP, choleskyP);  //person parameters
}

model {
  to_vector(PersParStar) ~ std_normal();
  Prec1 ~ gamma(1.5, 1.5);
  Prec2 ~ gamma(1.5, 1.5);
  sigmaPStar ~ std_normal();
  alphaStar ~ std_normal();
  choleskyP ~ lkj_corr_cholesky(1);
  b ~ normal(0, 2);
  beta ~ normal(0, 2);
  target += bernoulli_logit_lpmf(Y | (PersPar[ii,1] + PersPar[ii,2] .* X) - b[jj]);
  target += lognormal_lpdf(t | beta[jj] - (PersPar[ii,3] + PersPar[ii,4] .* X), alpha);
}

R code for data generation:


library(tidyverse)
library(faux)
library(rethinking)
library(extraDistr)

set.seed(172)

n_item <- 20
n_person <- 500

sigmaP <- c(1, .4, 1, .5)


correlP <- c(1, .2, .5, -.2,
             .2, 1, -.1, -.3,
             .5, -.1, 1, -.15,
             -.2, -.3, -.15, 1)

person_pars <- rnorm_multi(n = n_person, 
                           mu = rep(0, 4), 
                           sd = sigmaP, 
                           r = correlP, 
                           varnames = c("Theta0", "Theta1", "Tau0", "Tau1"))

theta0 <- person_pars$Theta0

theta1 <- person_pars$Theta1

tau0 <- person_pars$Tau0

tau1 <- person_pars$Tau1

b <- rep(seq(-1.5, 1.5, .75), 4)

beta <- rep(seq(-1, 1, .5), each = 4)

alpha <- .3

y <- list()
t <- list()
x <- list()

for (i in 1:n_item) {
  x[[i]] <- (i - 1)/n_item
  t[[i]] <- rlnorm(n_person, (beta[[i]] - (tau0 + (tau1*x[[i]]))), alpha)
  y[[i]] <- rbinom(n_person, 1, ((exp((theta0 + (theta1*x[[i]])) - b[[i]]))/(1 + exp((theta0 + (theta1*x[[i]])) - b[[i]]))))
}

jj <- rep(1:n_item, each = n_person)
ii <- rep(1:n_person, n_item)

data <- list("n_person" = n_person,
             "n_item" = n_item,
             "N" = n_item*n_person,
             "Dim" = 4,
             "Y" = unlist(y),
             "t" = unlist(t),
             "jj" = jj,
             "ii" = ii,
             "X" = rep(unlist(x), each = n_person),
             "Zero" = rep(0, 4))

saveRDS(file = "data_linspeedabil_1.rds", object = data)

Bad traceplots:

mike-lawrence · February 16, 2022, 8:14pm

Hoping I can help here. I’m pretty familiar with hierarchical models, inc simultaneous inference from RT/accuracy data, and I have a bit of experience with IRT stuff. I’m not 100% confident on this, but I think you’ve over-constrained things; as presented, your model does not have an intercept for either manifest variables (or more precisely you have constrained the intercept to precisely zero). Try adding intercepts for both (with an appropriate prior on each of course), and report back if that helps at all.

JohnDoe · March 12, 2022, 1:36pm

Thank you for your suggestion and sorry for taking a long time to reply! Could I please ask you for more help with this - I don’t understand how I would implement your suggestion.

For the sake of simplicity, let’s assume that this is not a growth model and so that the variable log(t) is distributed with the mean β - τ and the standard deviation σ. When looking at this specification, it is not clear to me what is the intercept and what is the slope. This is because the usual regression formulation takes on a form of ax + b, and I don’t see what the x would be in my case.

Are β and τ actually x1 and x2? But in this case they would have no slope, and thus the model wouldn’t exist. Are β and τ slopes? But slopes of what? They are not multiplied by anything. I just can’t seem to differentiate what is what here. Would appreciate it if someone could clear this up for me, thanks.

andre.pfeuffer · March 14, 2022, 1:27pm

If I’m not mistaken the cholesky matrix part needs to be transposed. See ’ at the end.

matrix[n_person, Dim] PersPar = PersParStar * diag_pre_multiply(sigmaP, choleskyP)';

ckoenig · March 15, 2022, 12:21pm

I was wondering, is there a (theoretical) reason why the timescale should also work on \theta_{i} in the response model? If I understand Fox et al. (2021) correctly, the timescale would only be relevant for the response-time model.

When I run the model without \theta_{1i} (i.e., only the Rasch-Model as response model), it converges without issues:

Summary

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.

Rhat and ESS for the respective variance components and the Cholesky-factor look fine (although they could be better, admittedly; two chains with 500 iterations):

           variable        mean         sd         q5         q95     rhat ess_bulk ess_tail
1502 choleskyP[1,1]  1.00000000 0.00000000  1.0000000  1.00000000       NA       NA       NA
1503 choleskyP[2,1]  0.50449072 0.03917249  0.4391273  0.56802350 1.005844 167.5161 252.5995
1504 choleskyP[3,1] -0.19483793 0.05219148 -0.2790557 -0.10986470 1.002983 351.8496 581.0875
1505 choleskyP[1,2]  0.00000000 0.00000000  0.0000000  0.00000000       NA       NA       NA
1506 choleskyP[2,2]  0.86222636 0.02285642  0.8230124  0.89842515 1.005843 167.5180 252.5995
1507 choleskyP[3,2] -0.04410442 0.04899902 -0.1233592  0.03847848 1.001978 286.3059 439.8277
1508 choleskyP[1,3]  0.00000000 0.00000000  0.0000000  0.00000000       NA       NA       NA
1509 choleskyP[2,3]  0.00000000 0.00000000  0.0000000  0.00000000       NA       NA       NA
1510 choleskyP[3,3]  0.97716976 0.01061358  0.9580923  0.99217615 1.002872 408.3171 592.4018
1511  sigmaPStar[1]  0.99974172 0.30586399  0.5334621  1.51872700 1.006515 633.9277 446.6684
1512  sigmaPStar[2]  1.01187953 0.30921176  0.5408461  1.53960600 1.003791 630.9543 388.4636
1513  sigmaPStar[3]  0.51641683 0.15809822  0.2745193  0.78778335 1.002635 619.2730 424.1621

Please note that I switched the dimensions of the auxiliary person parameter matrix in the parameters-Block and changed the corresponding calculation of the person parameter matrix in the transformed parameters-block (similar to Andre’s suggestion):

matrix[Dim, n_person] PersParStar; 
[...]
matrix[n_person, Dim] PersPar = (diag_pre_multiply(sigmaP, choleskyP) * PersParStar)';

Best wishes,
Chris

JohnDoe · March 15, 2022, 1:22pm

@andre.pfeuffer Thank you very much for spotting the typo! I am now re-running the models - the estimation is much faster. Initially there wasn’t much improvement with the standard sample sizes (500 participants, 20 items), but now I am running it with larger sample sizes to see what will happen. Even without the increased sample sizes, there were just some parameters that showed improvement.

@ckoenig Yes, we are not trying to recreate Fox’s model, but to extend it with also a slope on the ability scale. Thank you for your efforts. With no slope for the ability scale, the model works fine on my side too.

Topic		Replies	Views
Identification in latent variable model Modeling	5	1600	April 17, 2018
Convergence issues in simple latent variable model Modeling fitting-issues	1	844	February 7, 2020
Non-convergence on 2PL IRT model Modeling	11	3066	May 25, 2017
Multiple Chains do not converge Modeling	1	2093	July 4, 2019
Why do I find chains running in parallel and not converging General stanc	3	2309	April 16, 2018

Convergence issues with a hierarchical model

Related topics