Sequential ordinal models

Is there any plan to include sequential ordinal models like the one detailed here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.110
or any other variations of ordinal regression?
I hear there’s an ordered_probit in the pipeline; is there any estimate of when it will be released?

edit: from what I can gather, this model does not impose ordering constraints on the thresholds, which I think would be a significant advantage.

From glancing at the paper, I believe that the sequential ordinal model can be easily implemented in Stan, so I would encourage you to give it a go if you are interested.

If you prefer a ready-made solution, the brms package, which is based on Stan, has recently added support for (cumulative) ordinal models, and they seem to work great (see the paper Ordinal Regression Models in Psychological Research: A Tutorial for details). The ordering constraints are easy to enforce in Stan using the ordered data type (see the Stan manual for more details).
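
For illustration, a minimal sketch of that constraint (assuming K response categories, which gives K - 1 thresholds; the names K and gamma are just placeholders here):

parameters {
   ordered[K - 1] gamma; // thresholds, constrained to be increasing
}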

I have no knowledge of whether and when an ordered_probit is planned.

I am already using the cumulative link via ordered_logit, but the sequential regression seems to have a more straightforward latent representation (and interpretation), as well as being a member of the exponential family.
I agree that it probably wouldn’t be difficult to code in Stan, but I’m wondering whether it can be efficient enough when coded that way. To vectorize it, we could consider it as being generated from sequences of binary decisions, but ideally we would need to handle the almost inevitable sparsity of these binary vectors.

I am a bit out of my depth here, but I don’t see why you would model the problem as a sequence of binary decisions (which might be terrible for Stan). The direct representation (as in equation 2 in the paper) looks like it can be implemented straightforwardly and efficiently: it literally tells you what to add to the target log probability. This reduces to a simple sum, e.g. (just a sketch, not really tested):

for (n in 1:N) {
   real val = X[n] * beta; // the linear predictor (X a matrix of covariates, beta a coefficient vector)
   for (option in 1:(observed_option[n] - 1)) {
      target += log1m(normal_cdf(gamma[option] - val, 0, 1)); // equals log(1 - normal_cdf(...)) but is more stable
   }
   // note the lcdf here, not cdf as above; the last category has no threshold, hence the guard
   if (observed_option[n] < num_options)
      target += normal_lcdf(gamma[observed_option[n]] - val | 0, 1);
}

This IMHO could be reasonably efficient, and you can obviously do a bunch of optimizations (precompute things, vectorize, …).

It might be me who is out of my depth, since I am not quite used to the performance trade-offs in Stan. Is it not costly to loop over N, especially compared to the often much lower K (the number of options)?
Of course, your approach does implicitly exploit the sparsity of the binary latents, as you only loop over the relevant variables, disregarding the later ones.
(Or does your final note hint at possible ways this could be vectorized?)

Yeah, the code could be made more efficient. One more sketch to illustrate some compression (I hope all of these work):

transformed data {
   matrix[N, num_options] skipped_options = rep_matrix(0, N, num_options);
   vector[N] is_not_last; // 0 for observations in the last category (which has no threshold)
   for (n in 1:N) {
      for (option in 1:(observed_option[n] - 1)) {
         skipped_options[n, option] = 1;
      }
      is_not_last[n] = (observed_option[n] < num_options) ? 1 : 0;
   }
}

model {
   vector[N] val = X * beta;
   // pad gamma with a dummy threshold so indexing by observed_option is always valid
   vector[num_options] gamma_padded = append_row(gamma, rep_vector(0, 1));
   matrix[N, num_options] gamma_rep = skipped_options .* rep_matrix(to_row_vector(gamma_padded), N);
   matrix[N, num_options] val_rep = skipped_options .* rep_matrix(val, num_options);
   // Phi is applied element-wise (normal_cdf is not vectorized over matrices)
   matrix[N, num_options] skipped_options_lp = skipped_options .* log1m(Phi(gamma_rep - val_rep));
   target += sum(skipped_options_lp);
   // stopping term, zeroed out for last-category observations
   target += dot_product(is_not_last, log(Phi(gamma_padded[observed_option] - val)));
}

I guess I see what you meant by the “latent binaries” (the skipped_options matrix), but those can be computed in transformed data, so they should not cause problems. There will be some wasted computation in the matrix operations, so you should test whether this is an improvement, but I guess it could be worth it. I also multiplied gamma_rep and val_rep by skipped_options in the hope that it reduces the autodiff stack (the elements multiplied by zero won’t propagate), but this is also just a guess and might not be an improvement in practice.

Another approach would be to group the data by the value of observed_option, which leaves you with an iteration over the options; the matrices involved at each step then contain no zeroes, so no computation is wasted. A sketch of this follows below.
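
A rough sketch of that grouped variant, assuming the observations have been sorted by observed_option beforehand, and with start and end as hypothetical data arrays giving each category’s index range (these helpers are not part of the code above):

model {
   vector[N] val = X * beta;
   for (k in 1:num_options) {
      // start[k], end[k]: hypothetical index ranges after sorting by observed_option
      vector[end[k] - start[k] + 1] val_k = val[start[k]:end[k]];
      // every category-k observation passes thresholds 1..(k - 1)
      for (j in 1:(k - 1))
         target += sum(log1m(Phi(gamma[j] - val_k)));
      // the stopping term only exists below the last category
      if (k < num_options)
         target += normal_lcdf(gamma[k] - val_k | 0, 1); // sums over the whole vector
   }
}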

There’s a chapter of the manual on efficiency that discusses this from a user perspective.

Almost everything can be made faster in C++, but that’s because we include analytic derivatives and remove redundant calculations, not because the loops are faster.

I am slowly digging my way through the manual. :)

I am not sure I caught the take-home message here; is this arguing for or against loops, increased vectorization, or direct coding in C++ as opposed to Stan?
My main point of confusion is loops vs. vectorization. The latter is often argued to be the gold standard, yet I have also seen arguments that could be paraphrased as “loops are not that slow”. And a lot of the functions do not seem designed for vectorization to the same degree you’d see in e.g. NumPy (distributions taking matrix or vector arguments, for instance).
For instance, would you recommend @martinmodrak’s first or second implementation? Or is that impossible to decide without more info?

brms can handle the sequential model as well (see https://psyarxiv.com/x8swp/). Basically, it can handle all three main families of ordinal models.

Thanks for the reference, it looks quite handy!
I keep hearing good things about your package, but unfortunately I am not really an R user, and I would prefer to have more direct access to the model components. Did you implement the sequential model directly in Stan’s specification language? Do you have any tips/pointers on how to make it run efficiently?

You don’t need to use my package necessarily; you can just let it generate the relevant Stan code, in particular the self-defined functions for ordinal models. For instance, run in R

library(brms)
dat <- data.frame(y = 1:3)
make_stancode(y ~ 1, data = dat, family = sratio())

to get the functions for (stopping) sequential models.

Got brms up and running, and for the record this is what it returned:

functions { 
  /* sratio-logit log-PDF for a single response 
   * Args: 
   *   y: response category 
   *   mu: linear predictor 
   *   thres: ordinal thresholds 
   *   disc: discrimination parameter 
   * Returns: 
   *   a scalar to be added to the log posterior 
   */ 
   real sratio_logit_lpmf(int y, real mu, vector thres, real disc) { 
     int ncat = num_elements(thres) + 1; 
     vector[ncat] p; 
     vector[ncat - 1] q; 
     for (k in 1:(ncat - 1)) { 
       q[k] = 1 - inv_logit(disc * (thres[k] - mu)); 
       p[k] = 1 - q[k]; 
       for (kk in 1:(k - 1)) p[k] = p[k] * q[kk]; 
     } 
     p[ncat] = prod(q); 
     return categorical_lpmf(y | p); 
   } 
  /* sratio-logit log-PDF for a single response 
   * including category specific effects 
   * Args: 
   *   y: response category 
   *   mu: linear predictor 
   *   mucs: predictor for category specific effects 
   *   thres: ordinal thresholds 
   *   disc: discrimination parameter 
   * Returns: 
   *   a scalar to be added to the log posterior 
   */ 
   real sratio_logit_cs_lpmf(int y, real mu, row_vector mucs, vector thres, real disc) { 
     int ncat = num_elements(thres) + 1; 
     vector[ncat] p; 
     vector[ncat - 1] q; 
     for (k in 1:(ncat - 1)) { 
       q[k] = 1 - inv_logit(disc * (thres[k] - mucs[k] - mu)); 
       p[k] = 1 - q[k]; 
       for (kk in 1:(k - 1)) p[k] = p[k] * q[kk]; 
     } 
     p[ncat] = prod(q); 
     return categorical_lpmf(y | p); 
   } 
} 
data { 
  int<lower=1> N;  // total number of observations 
  int Y[N];  // response variable 
  int<lower=2> ncat;  // number of categories 
  real<lower=0> disc;  // discrimination parameters 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
} 
parameters { 
  vector[ncat-1] temp_Intercept;  // temporary thresholds 
} 
transformed parameters { 
} 
model { 
  vector[N] mu = rep_vector(0, N); 
  // priors including all constants 
  target += student_t_lpdf(temp_Intercept | 3, 0, 10); 
  // likelihood including all constants 
  if (!prior_only) { 
    for (n in 1:N) { 
      target += sratio_logit_lpmf(Y[n] | mu[n], temp_Intercept, disc); 
    } 
  } 
} 
generated quantities { 
  // compute actual thresholds 
  vector[ncat - 1] b_Intercept = temp_Intercept; 
}

Very impressive package!
This seems to be identical to strategy #1 by @martinmodrak, except that the whole vector of category probabilities is constructed, instead of only accumulating the likelihood at the observed value.

For the record, I also tried strategy #1 in my program, and the sampler ran in half the time compared to ordered_logistic, with none of the warnings observed in my other post.
