 # Sampling issues: Ordinal probit model with fixed threshold extremes

I’m trying to implement an ordinal probit model with fixed threshold extremes (Liddell & Kruschke, 2018; Kruschke, 2014). The lower and upper limits of the thresholds are fixed to match the scale of the response. Rather than using the standard normal distribution, the \mu and \sigma of the underlying distribution are to be estimated.

I ran into two issues during sampling:

1. There will be a lot of rejected initial values before sampling starts. Many times, sampling just won’t start because initial values all failed.
2. I have to raise adapt_delta to some unusually large values such as 0.9999 to eliminate divergent transitions.

How can I improve my model and/or the sampling options to solve these problems?

## JAGS implementation

This kind of models is employed in Liddell & Kruschke (2018) as well as in Chap 23 of Doing Bayesian Data Analysis 2nd ed (Kruschke, 2014). Below is Kruschke’s (2014) original JAGS implementation of a basic model for ordinal data from a single group.

Original JAGS model
model {
for ( i in 1:Ntotal ) {
y[i] ~ dcat( pr[i,1:nYlevels] )
pr[i,1] <- pnorm( thresh , mu , 1/sigma^2 )
for ( k in 2:(nYlevels-1) ) {
pr[i,k] <- max( 0 ,  pnorm( thresh[ k ] , mu , 1/sigma^2 )
- pnorm( thresh[k-1] , mu , 1/sigma^2 ) )
}
pr[i,nYlevels] <- 1 - pnorm( thresh[nYlevels-1] , mu , 1/sigma^2 )
}
mu ~ dnorm( (1+nYlevels)/2 , 1/(nYlevels)^2 )
sigma ~ dunif( nYlevels/1000 , nYlevels*10 )
for ( k in 2:(nYlevels-2) ) {  # 1 and nYlevels-1 are fixed, not stochastic
thresh[k] ~ dnorm( k+0.5 , 1/2^2 )
}
}


The two extremes of the thresholds are fixed to constant values:

thresh=1+0.5
thresh[nYlevels-1] = nYlevels - 0.5


## Stan implementation

I followed Gregory’s advice on the parameterization of the thresholds, and this is my Stan model:

data {
int<lower=0> N;  // Ntotal
int<lower=2> K;  // nYlevels
int<lower=1,upper=K> y[N];
}

transformed data {
real c_lower;
real c_upper;

c_lower = 1 + 0.5;
c_upper = K - 0.5;
}

parameters {
real mu;
real<lower=0> sigma;

vector<lower=c_lower,upper=c_upper>[K-3] c_;
}

transformed parameters {
ordered[K-1] c;  // thresholds
simplex[K] theta;  // probabilities

c = c_lower;
c[K-1] = c_upper;

for (i in 2:(K-2)) {
c[i] = c_[i-1];
}

theta = Phi((c - mu) / sigma);
theta[K] = 1 - Phi((c[K-1] - mu) / sigma);

for (i in 2:(K-1)) {
theta[i] = Phi((c[i] - mu) / sigma) - Phi((c[i-1] - mu) / sigma);
}
}

model {
mu ~ normal((1 + K) / 2.0, K);

sigma ~ uniform(K / 1000.0, K * 10);

for (i in 1:(K-3)) {
c_[i] ~ normal(i + 1 + 0.5, 2);
}

y ~ categorical(theta);
}


## Sampling issues of the Stan model

I fitted the model with rstan. The example data is from Kruschke (2014): OrdinalProbitData-1grp-1.csv (305 Bytes).

library(rstan)

S <- list(
N = length(df$Y), K = 7, y = df$Y
)

model <- "model.stan"
fit <- stan(model, data = S,
cores = 4, chains = 4, iter = 3000, warmup = 1000,


I ran into two issues:

### 1. Rejected initial values

There will be a lot of rejected initial values before sampling starts.
Many times, sampling just won’t start because initial values all failed.

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: validate transformed params: c is not a valid ordered
vector. The element at 3 is 4.43762, but should be greater than the previous
element, 5.87994 (in 'model416d74ee2c8c_ordinal_1group' at line 27)


### 2. Divergent transitions

Even adapt_delta = 0.99 is not enough to remove the divergency.
I have to raise adapt_delta to some unusually large values such as 0.9999 to eliminate divergent transitions.

Warning messages:
1: There were 2 divergent transitions after warmup. Increasing adapt_delta
above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems


How can I improve my model and/or the sampling options to solve these problems?

Here’s how to properly construct an ordered vector with fixed upper and lower bounds

parameters {
...
simplex[K-2] c_;
}
transformed parameters {
ordered[K-1] c;  // thresholds
c = c_lower;
for (i in 2:(K-1)) {
c[i] = c[i-1] + c_[i-1]*(c_upper-c_lower);
}
...
}


Second, that uniform prior on sigma is going to cause problems. I’d say try normal(0, K) instead (it is automatically truncated to positive only).

And finally, don’t compute theta manually. Stan has a builtin ordered probit distribution.

Thank you for this neat solution!

Re-implementing the thresholds with your code and applying a more proper prior to sigma have solved most of my problems. Now sampling is done efficiently with default options.

The only issue is that there can be a few rejected initial values before sampling, complaining Log probability evaluates to log(0). However, the warnings can be completely eliminated by using smaller init_r values, e.g. init_r = 0.1. I guess this isn’t a big problem.

Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.


Below is my revised model:

data {
int<lower=0> N;  // Ntotal
int<lower=2> K;  // nYlevels
int<lower=1,upper=K> y[N];
real c_lower;  // 1 + 0.5
real c_upper;  // K - 0.5
}

transformed data {
real c_span = c_upper - c_lower;
}

parameters {
real mu;
real<lower=0> sigma;
simplex[K-2] c_prop;
}

transformed parameters {
ordered[K-1] c;  // thresholds
simplex[K] theta;  // probabilites

c = c_lower;

for (i in 2:(K-1)) {
c[i] = c[i-1] + c_prop[i-1] * c_span;
}

theta = Phi((c - mu) / sigma);
theta[K] = 1 - Phi((c[K-1] - mu) / sigma);

for (i in 2:(K-1)) {
theta[i] = Phi((c[i] - mu) / sigma) - Phi((c[i-1] - mu) / sigma);
}
}

model {
mu ~ normal((1 + K) / 2.0, K);

sigma ~ exponential(1.0 / K);

y ~ categorical(theta);
}



I have also tried the built-in ordered probit distribution:

// codes related to theta were removed
model {
vector[N] mu_vector;

mu ~ normal((1 + K) / 2.0, K);

for (i in 1:N) {
mu_vector[i] = mu;
}

sigma ~ exponential(1.0 / K);

y ~ ordered_probit(mu_vector/sigma, c/sigma);
}


But it seems that using manually coded theta has better performance over the ordered_probit approach. On my PC, the average total time for a chain to complete was about 0.12 sec for the manual approach vs 0.32 sec for the ordered_probit one. So it might still be worthwhile to compute theta manually for performance considerations. Or is it possible to improve the code above to achieve better performance?

Also interestingly, the model using the built-in ordered probit distribution didn’t have any Rejecting initial value issue with default init_r.