Convergence depending on parameterization of discrimination parameter in the GPCM

Dear Stan Community,

I am trying to understand a certain behavior of the sampler and I would greatly appreciate your insights.

I noticed that for the Generalized Partial Credit Model (GPCM), the convergence of the sampler depends on how the discrimination parameter is parameterized. The likelihood can be written either as (full Stan code below)

y[n] ~ categorical_logit(append_row(0, cumulative_sum(alpha * (theta[pers[n]] - beta[item[n]]))));

or

y[n] ~ categorical_logit(append_row(0, cumulative_sum(alpha * theta[pers[n]] - alpha_times_beta[item[n]])));
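
Spelled out, both statements implement the same GPCM category probabilities, with the leading 0 in append_row playing the role of the empty sum for the first category:

P(y = k \mid \theta) = \frac{\exp\left(\sum_{j=1}^{k-1} \alpha\,(\theta - \beta_j)\right)}{\sum_{m=1}^{J+1} \exp\left(\sum_{j=1}^{m-1} \alpha\,(\theta - \beta_j)\right)}, \qquad k = 1, \ldots, J+1.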

In the model, I set a standard normal prior on the person parameters, which should identify the model (the \beta parameters via its mean and the \alpha parameter via its standard deviation).
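
To make the identification argument explicit: the category probabilities are invariant under the rescaling (\alpha, \theta, \beta) \to (\alpha/c,\; c\theta,\; c\beta) for any c > 0, since

\alpha\,(\theta - \beta_j) = \frac{\alpha}{c}\,(c\,\theta - c\,\beta_j),

and likewise under a common shift of \theta and \beta; fixing \theta \sim N(0, 1) removes both the shift and the scale freedom.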

Now, the behavior seems to be data-dependent, though I have not yet managed to figure out which property of the data causes it. The more persons there are in the data, the less likely the behavior is to occur. I have now fitted the model 3 times with the same data and get similar results every time (the data and the seeds used are included in the code below for replication).

The behavior is the following: in the first parameterization, the \alpha parameter collapses to 0 (in 3 out of 4 chains), which lets the other parameters diverge:

                   mean      se_mean           sd n_eff Rhat
alpha      1.900000e-01 2.400000e-01 3.400000e-01     2 7.25
beta[1,1] -1.334264e+14 5.823576e+13 1.273290e+14     5 1.89
beta[1,2] -7.652890e+13 3.142264e+13 6.818061e+13     5 1.86
beta[1,3]  6.540592e+13 2.839202e+13 5.587164e+13     4 1.97
beta[1,4] -3.771799e+13 1.374357e+13 4.378537e+13    10 1.60
beta[1,5]  1.891218e+14 1.031652e+14 1.781346e+14     3 2.25
beta[2,1] -5.936002e+13 6.019235e+13 1.148266e+14     4 2.36
beta[2,2] -1.629996e+14 7.454210e+13 1.479430e+14     4 2.17
beta[2,3]  1.051952e+13 1.078627e+13 3.600888e+13    11 1.50
beta[2,4]  2.416356e+14 9.918299e+13 2.038615e+14     4 2.70
beta[2,5]  6.715247e+13 5.131381e+13 1.216214e+14     6 2.34
beta[3,1] -2.006120e+14 7.855115e+13 1.732332e+14     5 2.51
beta[3,2] -3.018620e+13 2.019872e+13 4.874629e+13     6 2.92
beta[3,3] -1.492348e+13 1.052963e+13 2.445115e+13     5 1.81
beta[3,4]  6.001989e+13 2.614826e+13 5.185520e+13     4 2.10
beta[3,5]  1.964036e+14 9.137762e+13 1.753262e+14     4 2.70
beta[4,1] -3.878628e+13 1.924098e+13 5.617727e+13     9 1.35
beta[4,2] -9.922681e+13 4.443476e+13 8.936020e+13     4 2.61
beta[4,3]  9.233981e+13 4.643471e+13 8.703491e+13     4 2.80
beta[4,4]  1.404745e+13 1.205617e+13 2.825254e+13     5 2.31
beta[4,5]  6.539573e+13 3.860926e+13 8.434015e+13     5 2.38
beta[5,1] -2.220818e+14 1.125310e+14 2.344032e+14     4 2.77
beta[5,2] -1.862638e+14 8.958434e+13 1.797510e+14     4 2.33
beta[5,3]  8.174792e+13 4.467253e+13 9.246698e+13     4 2.26
beta[5,4]  2.442712e+13 4.558670e+13 7.986051e+13     3 2.78
beta[5,5]  3.305617e+14 1.824615e+14 3.492551e+14     4 2.64

      mean-chain:1 mean-chain:2 mean-chain:3 mean-chain:4
alpha    0.8024847 6.412734e-15 6.656451e-15 3.694287e-15
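
(The per-chain means can be extracted with rstan's per-chain summary — a minimal sketch, assuming the gpcm_par1 fit object created by the R code at the end of the post:)

# c_summary is a parameters x statistics x chains array; this pulls
# out the posterior mean of alpha separately for each chain.
summary(gpcm_par1, pars = "alpha")$c_summary[, "mean", ]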

whereas the second parameterization works fine:

           mean se_mean   sd n_eff Rhat
alpha      1.15    0.00 0.13  1303    1
beta[1,1] -1.87    0.01 0.48  4225    1
beta[1,2] -0.97    0.01 0.32  3447    1
beta[1,3]  0.30    0.00 0.29  3518    1
beta[1,4]  0.43    0.01 0.31  3796    1
beta[1,5]  2.26    0.01 0.49  3938    1
beta[2,1] -1.50    0.01 0.45  5172    1
beta[2,2] -1.26    0.01 0.35  3071    1
beta[2,3] -0.10    0.00 0.26  3599    1
beta[2,4]  1.92    0.01 0.40  3358    1
beta[2,5]  2.03    0.01 0.63  4620    1
beta[3,1] -2.01    0.01 0.45  3874    1
beta[3,2] -0.55    0.00 0.31  4049    1
beta[3,3] -0.16    0.00 0.28  3316    1
beta[3,4]  1.24    0.01 0.34  3969    1
beta[3,5]  1.92    0.01 0.51  4149    1
beta[4,1] -1.18    0.01 0.38  4630    1
beta[4,2] -0.97    0.01 0.32  3744    1
beta[4,3]  0.55    0.01 0.30  3468    1
beta[4,4]  0.65    0.01 0.33  4336    1
beta[4,5]  1.92    0.01 0.47  4439    1
beta[5,1] -2.28    0.01 0.58  4459    1
beta[5,2] -1.55    0.01 0.35  3393    1
beta[5,3]  0.46    0.00 0.28  3505    1
beta[5,4]  0.76    0.00 0.32  4116    1
beta[5,5]  2.96    0.01 0.70  4070    1

Of course, I can make the first parameterization work just as well by setting a prior that keeps \alpha away from zero:

alpha ~ lognormal(1, 1);

However, I would be really curious why this happens in the first place. My thoughts on this so far:

As I understand it, declaring a constrained parameter \alpha>0 makes Stan sample an unconstrained parameter internally and add a Jacobian adjustment to the log density (which in this case, since \alpha=\exp(\alpha_{\mathrm{unconstrained}}), should be \log\alpha). But should this not prevent Stan from collapsing to \alpha = 0, since the adjustment tends to -\infty there? Also, setting \alpha to 0 makes the likelihood constant in all the other parameters, which would explain the absurd \beta estimates; is this expected behavior?
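
Spelling out the change of variables for the lower-bound transform (textbook algebra, nothing specific to this model):

\alpha = \exp(\alpha_u), \qquad \log p(\alpha_u) = \log p(\alpha) + \log\left|\frac{\mathrm{d}\alpha}{\mathrm{d}\alpha_u}\right| = \log p(\alpha) + \alpha_u,

so the log-scale adjustment is \log\alpha, which penalizes \alpha \to 0 only linearly in the unconstrained coordinate \alpha_u.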

I have a very basic understanding of the HMC sampler (correct me if I’m wrong, but the sampler sets up a Hamiltonian dynamical system with the negative log posterior as the potential energy, adds a kinetic-energy term for auxiliary momenta, and integrates it forward in time via the “leapfrog steps” to get new parameters to propose, right?), so I thought that maybe that relates to the constant posterior, since the sampler cruises along trajectories on which the Hamiltonian is conserved. But then the question is: why only in 3 out of 4 chains? And why does Stan not let the other parameterization “collapse”, too (by setting alpha to zero and alpha_times_beta[k] = 0 for all k)?
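
(For anyone reproducing this, the collapse is easy to see in a trace plot — a minimal sketch, assuming the gpcm_par1 fit object created by the R code below:)

library(rstan)

# Trace of alpha across all four chains; in the runs above, three
# chains sit at numerical zero while one explores values near 0.8.
traceplot(gpcm_par1, pars = "alpha", inc_warmup = TRUE)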

Maybe I’m greatly mistaken about the conceptual foundations, so please correct me if I’m wrong; or maybe I’m just making some mistake in the code. Any thoughts would be appreciated.

Thank you.

Full Stan code and data generation:

Parameterization 1:

data {
  // Counts
  int<lower=0> N; // No. observations
  int<lower=0> I; // No. persons
  int<lower=0> K; // No. items
  int<lower=0> J; // No. parameters per item
  
  // Data
  int<lower=1, upper=J+1> y[N];  // responses (categories 1..J+1)
  int<lower=1, upper=I> pers[N]; // persons
  int<lower=1, upper=K> item[N]; // items
}

parameters {
  real<lower=0> alpha; // common discrimination
  vector[J] beta[K];   // step difficulties per item
  vector[I] theta;     // person abilities
}

model {
  theta ~ std_normal();

  for (n in 1:N) {
    y[n] ~ categorical_logit(
      append_row(0, cumulative_sum(alpha * (theta[pers[n]] - beta[item[n]])))
    );
  }
}

Parameterization 2:

data {
  // Counts
  int<lower=0> N; // No. observations
  int<lower=0> I; // No. persons
  int<lower=0> K; // No. items
  int<lower=0> J; // No. parameters per item
  
  // Data
  int<lower=1, upper=J+1> y[N];  // responses (categories 1..J+1)
  int<lower=1, upper=I> pers[N]; // persons
  int<lower=1, upper=K> item[N]; // items
}

parameters {
  real<lower=0> alpha;           // common discrimination
  vector[J] alpha_times_beta[K]; // discrimination times step difficulties
  vector[I] theta;               // person abilities
}

model {
  theta ~ std_normal();

  for (n in 1:N) {
    y[n] ~ categorical_logit(
      append_row(0, cumulative_sum(alpha * theta[pers[n]] - alpha_times_beta[item[n]]))
    );
  }
}

generated quantities {
  // Recover the step difficulties on their original scale
  vector[J] beta[K];

  for (k in 1:K) beta[k] = alpha_times_beta[k] / alpha;
}

Data generation and execution:

library(rstan) # provides stan()

make_data <- FALSE

if (make_data) {
  gpcm_cat_prob <- function(theta, alpha, beta, normalized = FALSE) {
    # Column k+1 holds exp(alpha * (k * theta - cumsum(beta)[k]))
    # = exp(sum_{j<=k} alpha * (theta - beta_j)); column 1 is the
    # baseline category with weight exp(0) = 1.
    num <- cbind(
      rep(1, length(theta)),
      exp(
        alpha * sweep(theta %o% seq_along(beta), 2, cumsum(beta), "-")
      )
    )
    if (normalized)
      sweep(num, 1, rowSums(num), "/")
    else
      num
  }
  
  alpha <- 1
  betas <- seq(-2, 2, by = 1)
  thetas <- rnorm(100)
  
  gpcm_data <- replicate(
    5,
    extraDistr::rcat(
      n = length(thetas),
      prob = gpcm_cat_prob(thetas, alpha, betas)
    )
  )
} else {
  raw_data <- c(3, 4, 2, 5, 5, 4, 2, 1, 5, 3, 4, 5, 5, 1, 1, 5, 4, 3, 5, 5, 3, 2, 4, 3, 3,
                3, 4, 4, 2, 2, 1, 3, 5, 2, 3, 6, 5, 3, 2, 2, 5, 1, 3, 2, 3, 4, 3, 6, 2, 3,
                4, 4, 5, 5, 3, 5, 3, 4, 3, 3, 4, 3, 4, 2, 6, 1, 5, 4, 5, 4, 3, 4, 6, 4, 5,
                3, 2, 4, 6, 4, 2, 2, 5, 5, 5, 3, 3, 2, 5, 1, 3, 2, 3, 5, 3, 3, 3, 6, 2, 4,
                4, 3, 1, 5, 3, 5, 4, 3, 4, 2, 4, 4, 4, 1, 1, 4, 3, 2, 4, 5, 4, 2, 5, 4, 2,
                3, 1, 5, 2, 4, 3, 2, 3, 1, 4, 6, 6, 3, 1, 2, 5, 2, 3, 3, 4, 3, 3, 4, 5, 3,
                3, 3, 4, 3, 3, 4, 3, 4, 5, 3, 4, 3, 4, 3, 3, 2, 4, 2, 6, 4, 2, 3, 4, 5, 6,
                3, 4, 3, 4, 4, 3, 2, 3, 4, 3, 4, 2, 4, 4, 1, 1, 4, 4, 4, 3, 4, 3, 4, 4, 4,
                2, 5, 4, 4, 4, 4, 2, 3, 3, 3, 4, 4, 5, 1, 1, 6, 5, 1, 4, 5, 2, 3, 3, 4, 3,
                4, 3, 5, 1, 4, 4, 4, 4, 2, 3, 4, 5, 4, 2, 3, 4, 2, 3, 2, 3, 3, 1, 6, 3, 6,
                1, 6, 5, 5, 2, 4, 3, 2, 5, 3, 4, 2, 4, 2, 5, 3, 4, 4, 5, 2, 1, 4, 5, 6, 2,
                2, 3, 3, 6, 2, 2, 2, 3, 4, 4, 2, 2, 2, 4, 3, 3, 4, 5, 5, 3, 3, 2, 4, 4, 4,
                2, 1, 3, 6, 5, 4, 2, 1, 6, 3, 5, 5, 3, 1, 1, 4, 2, 1, 2, 6, 2, 1, 3, 5, 1,
                3, 3, 5, 2, 3, 3, 3, 3, 1, 3, 4, 6, 2, 2, 3, 5, 4, 2, 1, 4, 3, 4, 6, 3, 3,
                5, 5, 5, 6, 3, 5, 4, 3, 5, 3, 4, 2, 4, 1, 5, 2, 5, 3, 6, 4, 1, 3, 3, 4, 4,
                4, 3, 1, 5, 4, 2, 4, 3, 5, 3, 4, 2, 2, 5, 2, 3, 3, 5, 4, 3, 3, 2, 4, 3, 3,
                3, 4, 4, 3, 4, 3, 2, 2, 3, 5, 3, 4, 3, 3, 2, 5, 4, 3, 5, 4, 5, 2, 2, 5, 1,
                3, 4, 4, 3, 3, 5, 3, 4, 1, 4, 6, 6, 1, 4, 3, 3, 3, 4, 3, 2, 5, 4, 5, 3, 3,
                3, 4, 5, 4, 3, 3, 2, 3, 5, 2, 3, 3, 5, 3, 5, 3, 5, 5, 6, 4, 3, 5, 4, 3, 4,
                3, 3, 5, 5, 3, 3, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 3, 4, 5, 1, 5, 3, 4)
  
  gpcm_data <- matrix(raw_data, nrow = 100)
}

gpcm_par1 <- stan(
  file = "gpcm_par1.stan",
  data = list(
    N = length(gpcm_data),
    I = nrow(gpcm_data),
    K = ncol(gpcm_data),
    J = max(gpcm_data) - 1,
    y = c(gpcm_data),
    pers = rep(1:nrow(gpcm_data), times = ncol(gpcm_data)),
    item = rep(1:ncol(gpcm_data), each = nrow(gpcm_data))
  ),
  seed = 1713178356
)

gpcm_par2 <- stan(
  file = "gpcm_par2.stan",
  data = list(
    N = length(gpcm_data),
    I = nrow(gpcm_data),
    K = ncol(gpcm_data),
    J = max(gpcm_data) - 1,
    y = c(gpcm_data),
    pers = rep(1:nrow(gpcm_data), times = ncol(gpcm_data)),
    item = rep(1:ncol(gpcm_data), each = nrow(gpcm_data))
  ),
  seed = 567626326
)

Sorry, it seems your inquiry fell through the cracks a bit, despite being relevant and well written.

I would guess that the problem really is that with a small dataset, the data do not inform alpha and beta separately, but only the product alpha_times_beta.

If you want to dig deeper, I would recommend Mike Betancourt’s post on non-identifiability (https://betanalpha.github.io/assets/case_studies/identifiability.html) and possibly also on underdetermined regression (https://betanalpha.github.io/assets/case_studies/underdetermined_linear_regression.html) as this should give you enough background to be able to investigate yourself.
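
One concrete starting point (a sketch, assuming your gpcm_par1 fit object from above): a pairs plot of alpha against a few of the beta parameters; if the data only constrain the product, the joint draws should trace out a ridge rather than a well-concentrated blob.

# Ridge-shaped joint draws would indicate that alpha and beta are
# identified only through their product.
pairs(gpcm_par1, pars = c("alpha", "beta[1,1]", "beta[1,2]"))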

If you have further questions, feel free to ask!
