Hello! I’m in need of some help.
Using Stan, I’m trying to implement the simplified version of the Signal Detection Theory (SDT) mixture model described on pg. 711 of DeCarlo (2002):
$$P(Y \leq k \mid X) = \theta \cdot \Phi(c_k - d_A X) + (1 - \theta) \cdot \Phi(c_k)$$
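For anyone following along, the per-category probabilities implied by this CDF can be computed by simple differencing (a minimal R sketch; the function and argument names are my own, and `pnorm` plays the role of Φ):

```r
# P(Y = k | X) = P(Y <= k | X) - P(Y <= k-1 | X), padding the CDF with 0 and 1
sdt_mix_probs <- function(crit, d, theta, X) {
  cdf <- theta * pnorm(crit - d * X) + (1 - theta) * pnorm(crit)
  diff(c(0, cdf, 1))
}

# five thresholds -> six response categories; probabilities sum to 1
p <- sdt_mix_probs(crit = c(-1.2, -0.6, 0.5, 1.0, 1.5),
                   d = 1.2, theta = 0.6, X = 0.5)
sum(p)
```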
Applied to recognition (confidence-rating) data, this model assumes that there are two latent classes of trials for “old” items (which could be mapped to, e.g., attended trials and unattended trials) and estimates the proportion of trials in each class (theta) as well as sensitivity (d’) and the thresholds (c). I started off with a non-hierarchical version of the model to make sure I had translated the likelihood correctly. That version worked and did a good job recovering true parameter values from simulated data, so I have been incrementally adding to it. Currently, the model has subject-level random intercepts on each of the three parameters (i.e., d’, c, and theta). Here is the version of the model I am testing:
data {
  int<lower=1> N;
  int<lower=1> N_subj;
  array[N] int<lower=1,upper=6> Y;
  array[N] int<lower=1,upper=N_subj> subj;
  vector[N] stimulus;
  int<lower=2> nthres;
}
parameters {
  // d prime
  real d;
  // theta
  real theta_logit;
  // Thresholds
  ordered[nthres] mu_c;
  vector[N_subj] d_subj_z;
  vector[N_subj] theta_subj_z;
  array[N_subj] vector[nthres] c_z;
  real<lower=0> sigma_d;
  real<lower=0> sigma_theta;
  vector<lower=0>[nthres] sigma_c;
}
transformed parameters {
  vector[N_subj] d_subj = sigma_d * d_subj_z;
  vector[N_subj] theta_subj = sigma_theta * theta_subj_z;
  array[N_subj] ordered[nthres] c_subj;
  for (s in 1:N_subj) {
    c_subj[s] = mu_c + sigma_c .* c_z[s];
    for (k in 2:nthres) {
      // to prevent thresholds from collapsing
      c_subj[s, k] = fmax(c_subj[s, k], c_subj[s, k - 1] + 0.01);
    }
  }
}
model {
  d ~ normal(1, 1);
  theta_logit ~ normal(0, 1);
  mu_c ~ normal(0, 1.5);
  sigma_d ~ normal(0, 0.5);
  sigma_theta ~ normal(0, 0.5);
  sigma_c ~ normal(0, 0.5);
  d_subj_z ~ std_normal();
  theta_subj_z ~ std_normal();
  for (s in 1:N_subj) {
    c_z[s] ~ std_normal();
  }
  for (n in 1:N) {
    int s = subj[n];
    real theta = inv_logit(theta_logit + theta_subj[s]);
    real curr_d = d + d_subj[s];
    target += log_mix(
      theta,
      ordered_probit_lpmf(Y[n] | stimulus[n] * curr_d, c_subj[s]),
      ordered_probit_lpmf(Y[n] | 0, c_subj[s])
    );
  }
}
This version of the model only sort of works. The model runs, does not diverge, and even does a good job with parameter recovery - except for sigma_theta. That parameter seems to be very poorly identified: the estimates are far off the simulated true values, the CIs are incredibly wide, ESS and R-hats are abysmal, etc. The individual-level theta_subj parameters are, of course, similarly poorly identified. Additionally, including this parameter in the model seems to hinder sampling efficiency and recovery of the other parameters. The model works fairly well when I include my other hierarchical effects but exclude subject-level mixing proportions. However, if I fit such a model to data simulated with subject-level variance in theta, the population-level estimate of theta is severely biased. In the real world, I think it’s reasonable to expect that mixture proportions would vary systematically at the subject level (e.g., people probably vary in how much attention they’re paying), so I don’t think this assumption could be safely ignored for real data.
Thus far, I have not been able to find a way to improve the estimation of the subject-level theta parameters. I’m aware that a model like this poses substantial identifiability challenges - it’s difficult for the model to separate d’ and theta because of the trade-offs that can be made between those parameters. I’m thinking that perhaps this model is simply doomed by how difficult it is to learn subject-level mixing proportions from the data. However, I’m also wondering if there’s anything I should try before throwing in the towel - alternative parameterizations, constraints that could be defensible in the real world, etc.? Or perhaps I’m making mistakes somewhere that I’m not aware of? I’m totally unsure - if anyone has any ideas at all, I’d love to know! Here is R code for a reproducible example, if anyone wants to take it that far. Note that this model is very, very slow to run.
# design parameters
N_subj <- 100
N_trials <- 100
nthres <- 5

# true thresholds and variance
true_mu_c <- c(-1.2, -0.6, 0.5, 1.0, 1.5)
true_sigma_c <- c(0.35, 0.4, 0.6, 0.5, 0.28)

# true d' and variance
true_d <- 1.2
true_sigma_d <- 0.4

# true theta and variance
true_theta_logit <- 0.5
true_sigma_theta <- 0.4

sim_data <- list()
for (s in 1:N_subj) {
  subj_d <- rnorm(1, 0, true_sigma_d)
  subj_theta <- rnorm(1, 0, true_sigma_theta)
  subj_c <- rnorm(nthres, 0, true_sigma_c)
  c_subj <- true_mu_c + subj_c
  for (k in 2:nthres) {
    c_subj[k] <- max(c_subj[k], c_subj[k - 1] + 0.01)
  }
  for (t in 1:N_trials) {
    stimulus <- ifelse(t <= N_trials / 2, 0.5, -0.5)
    curr_d <- true_d + subj_d
    curr_theta <- plogis(true_theta_logit + subj_theta)
    if (runif(1) < curr_theta) {
      latent <- rnorm(1, mean = stimulus * curr_d, sd = 1)
    } else {
      latent <- rnorm(1, mean = 0, sd = 1)
    }
    Y <- sum(latent > c_subj) + 1
    sim_data[[(s - 1) * N_trials + t]] <- data.frame(
      subj = s,
      Y = Y,
      stimulus = stimulus
    )
  }
}
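As a quick standalone check that this generative scheme matches the mixture likelihood (self-contained, not part of the fit; the fixed parameter values here are arbitrary and the names are mine):

```r
# Simulate one fixed condition and compare empirical category frequencies
# to the analytic mixture probabilities from the equation above.
set.seed(123)
crit <- c(-1.2, -0.6, 0.5, 1.0, 1.5)
d <- 1.2; theta <- 0.6; X <- 0.5
n <- 2e5
attended <- runif(n) < theta  # latent class indicator
latent <- rnorm(n, mean = ifelse(attended, d * X, 0), sd = 1)
Y_sim <- findInterval(latent, crit) + 1  # same as sum(latent > crit) + 1 per trial
emp <- tabulate(Y_sim, nbins = 6) / n
ana <- diff(c(0, theta * pnorm(crit - d * X) + (1 - theta) * pnorm(crit), 1))
max(abs(emp - ana))  # should be on the order of Monte Carlo error
```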
stan_data <- list(
  N = N_subj * N_trials,
  N_subj = N_subj,
  Y = sapply(sim_data, function(x) x$Y),
  subj = sapply(sim_data, function(x) x$subj),
  stimulus = sapply(sim_data, function(x) x$stimulus),
  nthres = nthres
)
library(cmdstanr)

mix_basic <- cmdstan_model('sdt-mix-basic.stan')
fit_mix <- mix_basic$sample(
  data = stan_data,
  chains = 12,
  parallel_chains = 12,
  iter_warmup = 1000,
  iter_sampling = 1000
)