Dear Stan Community,
I am trying to understand a certain behavior of the sampler and I would greatly appreciate your insights.
I noticed that for the Generalized Partial Credit Model (GPCM), the convergence of the sampler depends on the parameterization, which can be either (full Stan code below)
y[n] ~ categorical_logit(append_row(0, cumulative_sum(alpha * (theta[pers[n]] - beta[item[n]]))));
or
y[n] ~ categorical_logit(append_row(0, cumulative_sum(alpha * theta[pers[n]] - alpha_times_beta[item[n]])));
In the model, I set a standard normal prior on the person parameters, which should identify the model (the \beta parameters via the mean and the \alpha parameter via the standard deviation).
Now, the following behavior seems to be dependent on data, though I have not yet managed to figure out which property of the data causes this. The more persons I have in my data, the less likely it is to see this behavior. I have fitted the model now 3 times with the same data and get similar results every time (the data are in the data generation block for replication as are the seeds used).
The behavior is the following: In the first parameterization, the \alpha parameter converges to 0 (in 3 out of 4 chains) thus letting the other parameters diverge,
mean se_mean sd n_eff Rhat
alpha 1.900000e-01 2.400000e-01 3.400000e-01 2 7.25
beta[1,1] -1.334264e+14 5.823576e+13 1.273290e+14 5 1.89
beta[1,2] -7.652890e+13 3.142264e+13 6.818061e+13 5 1.86
beta[1,3] 6.540592e+13 2.839202e+13 5.587164e+13 4 1.97
beta[1,4] -3.771799e+13 1.374357e+13 4.378537e+13 10 1.60
beta[1,5] 1.891218e+14 1.031652e+14 1.781346e+14 3 2.25
beta[2,1] -5.936002e+13 6.019235e+13 1.148266e+14 4 2.36
beta[2,2] -1.629996e+14 7.454210e+13 1.479430e+14 4 2.17
beta[2,3] 1.051952e+13 1.078627e+13 3.600888e+13 11 1.50
beta[2,4] 2.416356e+14 9.918299e+13 2.038615e+14 4 2.70
beta[2,5] 6.715247e+13 5.131381e+13 1.216214e+14 6 2.34
beta[3,1] -2.006120e+14 7.855115e+13 1.732332e+14 5 2.51
beta[3,2] -3.018620e+13 2.019872e+13 4.874629e+13 6 2.92
beta[3,3] -1.492348e+13 1.052963e+13 2.445115e+13 5 1.81
beta[3,4] 6.001989e+13 2.614826e+13 5.185520e+13 4 2.10
beta[3,5] 1.964036e+14 9.137762e+13 1.753262e+14 4 2.70
beta[4,1] -3.878628e+13 1.924098e+13 5.617727e+13 9 1.35
beta[4,2] -9.922681e+13 4.443476e+13 8.936020e+13 4 2.61
beta[4,3] 9.233981e+13 4.643471e+13 8.703491e+13 4 2.80
beta[4,4] 1.404745e+13 1.205617e+13 2.825254e+13 5 2.31
beta[4,5] 6.539573e+13 3.860926e+13 8.434015e+13 5 2.38
beta[5,1] -2.220818e+14 1.125310e+14 2.344032e+14 4 2.77
beta[5,2] -1.862638e+14 8.958434e+13 1.797510e+14 4 2.33
beta[5,3] 8.174792e+13 4.467253e+13 9.246698e+13 4 2.26
beta[5,4] 2.442712e+13 4.558670e+13 7.986051e+13 3 2.78
beta[5,5] 3.305617e+14 1.824615e+14 3.492551e+14 4 2.64
mean-chain:1 mean-chain:2 mean-chain:3 mean-chain:4
alpha 0.8024847 6.412734e-15 6.656451e-15 3.694287e-15
whereas the second parameterization works fine.
mean se_mean sd n_eff Rhat
alpha 1.15 0.00 0.13 1303 1
beta[1,1] -1.87 0.01 0.48 4225 1
beta[1,2] -0.97 0.01 0.32 3447 1
beta[1,3] 0.30 0.00 0.29 3518 1
beta[1,4] 0.43 0.01 0.31 3796 1
beta[1,5] 2.26 0.01 0.49 3938 1
beta[2,1] -1.50 0.01 0.45 5172 1
beta[2,2] -1.26 0.01 0.35 3071 1
beta[2,3] -0.10 0.00 0.26 3599 1
beta[2,4] 1.92 0.01 0.40 3358 1
beta[2,5] 2.03 0.01 0.63 4620 1
beta[3,1] -2.01 0.01 0.45 3874 1
beta[3,2] -0.55 0.00 0.31 4049 1
beta[3,3] -0.16 0.00 0.28 3316 1
beta[3,4] 1.24 0.01 0.34 3969 1
beta[3,5] 1.92 0.01 0.51 4149 1
beta[4,1] -1.18 0.01 0.38 4630 1
beta[4,2] -0.97 0.01 0.32 3744 1
beta[4,3] 0.55 0.01 0.30 3468 1
beta[4,4] 0.65 0.01 0.33 4336 1
beta[4,5] 1.92 0.01 0.47 4439 1
beta[5,1] -2.28 0.01 0.58 4459 1
beta[5,2] -1.55 0.01 0.35 3393 1
beta[5,3] 0.46 0.00 0.28 3505 1
beta[5,4] 0.76 0.00 0.32 4116 1
beta[5,5] 2.96 0.01 0.70 4070 1
Of course I can make the first parameterization work just as fine by setting a prior.
alpha ~ lognormal_lpdf(1, 1);
However, I would be really curious why this happens in the first place. My thoughts on this so far:
As I understand it, setting up a constrained parameter \alpha>0 will make stan convert it to an unconstrained parameter, adding a jacobian adjustment (which in this case, since \alpha=\exp(\alpha_{\mathrm{unconstrained}}), should be 1/\alpha). But should this not prevent stan from setting \alpha to zero since this would basically mean blowing up the model? Also, setting \alpha to 0 should result in a constant posterior, no matter the other parameters, is this expected behavior?
I have a very basic understanding of the HMC sampler (correct me if I’m wrong, but the sampler sets up a Hamiltonian dynamical system with the posterior as the energy and solves it in forward time via the “leapfrog steps” to get new parameters to propose, right?), so I thought that maybe that relates to the constant posterior, since the sampler cruises around on the trajectories, which are constant under the first integral that is given by the Hamiltonian. But then the question is: Why only in 3 out of 4 chains? And why does Stan not let the other parameterization “collapse”, too (by setting alpha to zero and alpha_times_beta=0 for all k)?
Maybe I’m greatly mistaken on the conceptual foundations, so please correct me if I’m wrong and maybe I’m just making some mistakes in the code. Any thoughts would be appreciated.
Thank you.
Full Stan code and data generation:
Parameterization 1:
data {
// Counts
int<lower=0> N; // No. observations
int<lower=0> I; // No. persons
int<lower=0> K; // No. items
int<lower=0> J; // No. parameters per item
// Data
int<lower=1, upper=J+1> y[N]; // data
int<lower=1, upper=I> pers[N]; // persons
int<lower=1, upper=K> item[N]; // items
}
parameters {
real<lower=0> alpha;
vector[J] beta[K];
vector[I] theta;
}
model {
theta ~ std_normal();
for (n in 1:N) {
y[n] ~ categorical_logit(
append_row(0, cumulative_sum(alpha * (theta[pers[n]] - beta[item[n]])))
);
}
}
Parameterization 2:
data {
// Counts
int<lower=0> N; // No. observations
int<lower=0> I; // No. persons
int<lower=0> K; // No. items
int<lower=0> J; // No. parameters per item
// Data
int<lower=1, upper=J+1> y[N]; // data
int<lower=1, upper=I> pers[N]; // persons
int<lower=1, upper=K> item[N]; // items
}
parameters {
real<lower=0> alpha;
vector[J] alpha_times_beta[K];
vector[I] theta;
}
model {
theta ~ std_normal();
for (n in 1:N) {
y[n] ~ categorical_logit(
append_row(0, cumulative_sum(alpha * theta[pers[n]] - alpha_times_beta[item[n]]))
);
}
}
generated quantities {
vector[J] beta[K];
for (k in 1:K) beta[k] = alpha_times_beta[k]/alpha;
}
Data generation and execution:
make_data <- FALSE
if (make_data) {
gpcm_cat_prob <- function(theta, alpha, beta, normalized = FALSE) {
num <- cbind(
rep(1, length(theta)),
exp(
alpha * sweep(theta %o% seq_along(beta), 2, cumsum(beta), "-")
)
)
if (normalized)
sweep(num, 1, rowSums(num), "/")
else
num
}
alpha <- 1
betas <- seq(-2, 2, by = 1)
thetas <- rnorm(100)
gpcm_data <- replicate(
5,
extraDistr::rcat(
n = length(thetas),
prob = gpcm_cat_prob(thetas, alpha, betas)
)
)
} else {
raw_data <- c(3, 4, 2, 5, 5, 4, 2, 1, 5, 3, 4, 5, 5, 1, 1, 5, 4, 3, 5, 5, 3, 2, 4, 3, 3,
3, 4, 4, 2, 2, 1, 3, 5, 2, 3, 6, 5, 3, 2, 2, 5, 1, 3, 2, 3, 4, 3, 6, 2, 3,
4, 4, 5, 5, 3, 5, 3, 4, 3, 3, 4, 3, 4, 2, 6, 1, 5, 4, 5, 4, 3, 4, 6, 4, 5,
3, 2, 4, 6, 4, 2, 2, 5, 5, 5, 3, 3, 2, 5, 1, 3, 2, 3, 5, 3, 3, 3, 6, 2, 4,
4, 3, 1, 5, 3, 5, 4, 3, 4, 2, 4, 4, 4, 1, 1, 4, 3, 2, 4, 5, 4, 2, 5, 4, 2,
3, 1, 5, 2, 4, 3, 2, 3, 1, 4, 6, 6, 3, 1, 2, 5, 2, 3, 3, 4, 3, 3, 4, 5, 3,
3, 3, 4, 3, 3, 4, 3, 4, 5, 3, 4, 3, 4, 3, 3, 2, 4, 2, 6, 4, 2, 3, 4, 5, 6,
3, 4, 3, 4, 4, 3, 2, 3, 4, 3, 4, 2, 4, 4, 1, 1, 4, 4, 4, 3, 4, 3, 4, 4, 4,
2, 5, 4, 4, 4, 4, 2, 3, 3, 3, 4, 4, 5, 1, 1, 6, 5, 1, 4, 5, 2, 3, 3, 4, 3,
4, 3, 5, 1, 4, 4, 4, 4, 2, 3, 4, 5, 4, 2, 3, 4, 2, 3, 2, 3, 3, 1, 6, 3, 6,
1, 6, 5, 5, 2, 4, 3, 2, 5, 3, 4, 2, 4, 2, 5, 3, 4, 4, 5, 2, 1, 4, 5, 6, 2,
2, 3, 3, 6, 2, 2, 2, 3, 4, 4, 2, 2, 2, 4, 3, 3, 4, 5, 5, 3, 3, 2, 4, 4, 4,
2, 1, 3, 6, 5, 4, 2, 1, 6, 3, 5, 5, 3, 1, 1, 4, 2, 1, 2, 6, 2, 1, 3, 5, 1,
3, 3, 5, 2, 3, 3, 3, 3, 1, 3, 4, 6, 2, 2, 3, 5, 4, 2, 1, 4, 3, 4, 6, 3, 3,
5, 5, 5, 6, 3, 5, 4, 3, 5, 3, 4, 2, 4, 1, 5, 2, 5, 3, 6, 4, 1, 3, 3, 4, 4,
4, 3, 1, 5, 4, 2, 4, 3, 5, 3, 4, 2, 2, 5, 2, 3, 3, 5, 4, 3, 3, 2, 4, 3, 3,
3, 4, 4, 3, 4, 3, 2, 2, 3, 5, 3, 4, 3, 3, 2, 5, 4, 3, 5, 4, 5, 2, 2, 5, 1,
3, 4, 4, 3, 3, 5, 3, 4, 1, 4, 6, 6, 1, 4, 3, 3, 3, 4, 3, 2, 5, 4, 5, 3, 3,
3, 4, 5, 4, 3, 3, 2, 3, 5, 2, 3, 3, 5, 3, 5, 3, 5, 5, 6, 4, 3, 5, 4, 3, 4,
3, 3, 5, 5, 3, 3, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 3, 4, 5, 1, 5, 3, 4)
gpcm_data <- matrix(raw_data, nrow = 100)
}
gpcm_par1 <- stan(
file = "gpcm_par1.stan",
data = list(
N = length(gpcm_data),
I = nrow(gpcm_data),
K = ncol(gpcm_data),
J = max(gpcm_data) - 1,
y = c(gpcm_data),
pers = rep(1:nrow(gpcm_data), times = ncol(gpcm_data)),
item = rep(1:ncol(gpcm_data), each = nrow(gpcm_data))
),
seed = 1713178356
)
gpcm_par2 <- stan(
file = "gpcm_par2.stan",
data = list(
N = length(gpcm_data),
I = nrow(gpcm_data),
K = ncol(gpcm_data),
J = max(gpcm_data) - 1,
y = c(gpcm_data),
pers = rep(1:nrow(gpcm_data), times = ncol(gpcm_data)),
item = rep(1:ncol(gpcm_data), each = nrow(gpcm_data))
),
seed = 567626326
)