Multivariate Cumulative Probit with mixed ordinal and continuous data

Hi everyone,

I am seeking help with a series of two connected problems. First, how can I extend @bgoodri's multivariate probit model here to cases where each ordinal response variable is polychotomous and the number of categories may differ across variables (e.g., one variable has 3 categories, another 14)? Second, and connected to this, the response vector y is a mix of continuous and ordinal values. How can I model this mixed data vector (if it is possible at all)?

I have fit a multivariate normal (and Student t) model to up to 6 continuous response variables (see model below). The next step is to combine these continuous variables with the ordinal variables into a single response vector. Let x be a scalar independent variable and Y a vector of response variables, with j indexing the ordinal variables and k the continuous ones, so Y has length J + K. I assume the overall response vector is distributed as y_n ~ N(f(g, h), \sigma(x)), where h is the mean function for a continuous variable, g is the mean function for an ordinal variable, and \sigma(x) is the standard deviation function. The continuous mean function is h_k(x) = a_k x^{r_k} + b_k and the ordinal one is g_j(x) = x^{r_j}, with a = 1 and b = 0 fixed for identifiability. For the sd I assume linearity, \sigma(x) = s_k (1 + \kappa_k x). For each ordinal variable j, only m categories are observable, and where I need help is mapping this ordinal response to a latent continuous response z via threshold parameters \tau_1 < ... < \tau_{m-1}, with \tau_0 = -\infty and \tau_m = \infty.
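Written out (my own rendering of the above, so treat the notation as a sketch rather than the final model), the per-variable pieces are

  y_{n,k} ~ N(h_k(x_n), \sigma_k(x_n)),   h_k(x) = a_k x^{r_k} + b_k          (continuous variable k)
  z_{n,j} ~ N(g_j(x_n), \sigma_j(x_n)),   g_j(x) = x^{r_j}                    (latent variable for ordinal j)
  y_{n,j} = c  iff  \tau_{j,c-1} < z_{n,j} <= \tau_{j,c},   \tau_{j,0} = -\infty,  \tau_{j,m_j} = \infty
  \sigma(x) = s (1 + \kappa x)            (both variable types)

with the continuous responses and the ordinal latents taken to be jointly multivariate normal, i.e. the correlation structure sits on the stacked (y_k, z_j) vector.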

Apologies if this is a bit confusing. Essentially, I would like to combine continuous and ordinal response variables in a principled manner. My domain knowledge suggests the mean follows an offset power law and that the noise is heteroskedastic, scaling linearly with x. I provide the continuous model below, which has been tested on simulated and real data and returns good results. I am seeking help to extend this to the ordinal + continuous setting. I am thinking it may be something along these lines, but I could be totally wrong.

Thank you for all the help!

data{
  int N; // # of individuals
  int K; // # of responses
  vector[K] y[N]; // vector of growth responses per individual
  real X[N]; //age
}
parameters{
  vector<lower=0>[K] a; // multiplicative constant
  vector<lower=0>[K] r; // scaling parameter
  vector[K] b; // offset
  vector<lower=0>[K] s_scale; //constant noise
  vector<lower=0>[K] kappa; // slope/gradient of linear noise function
  cholesky_factor_corr[K] L_Omega; // cholesky factor of the corr matrix for efficiency
}
transformed parameters{
  vector[K] mu[N]; //mean array of size N containing vectors of length K
  vector<lower=0>[K] s[N]; //sd array of size N containing vectors of length K
  matrix[K,K] Sigma[N]; // array of N Cholesky factors of the K x K covariance matrices
  for(i in 1:N){
    for(k in 1:K){
      mu[i,k] = a[k]*X[i]^(r[k])+b[k]; // mean function
      s[i,k] = s_scale[k]*(1 + kappa[k]*X[i]); // sd function
      }
  }
  for(i in 1:N){
    Sigma[i] = diag_pre_multiply(s[i], L_Omega); // Cholesky factor of the covariance: diag(s) * L_Omega
      }
}
model{
  //priors
  a ~ normal(0,10); // half-normal
  r ~ normal(0,1); // half-normal
  b ~ normal(0,10);
  kappa ~ normal(0,1); // half-normal
  s_scale ~ cauchy(0,5); // half-cauchy
  L_Omega ~ lkj_corr_cholesky(0.5);
  for(i in 1:N){
    y[i] ~ multi_normal_cholesky(mu[i], Sigma[i]); // likelihood
  }
}
generated quantities{
  vector[K] Y_mean[N]; //posterior mean
  vector[K] Y_pred[N]; //posterior predictive check
  corr_matrix[K] Omega; //re-factor corr matrix
  cov_matrix[K] Sigma_new[N];//put cov matrix back together
  vector[N] log_lik; // log-likelihood for model comparison
  vector[K] s_new[N]; // posterior sd
  Omega = multiply_lower_tri_self_transpose(L_Omega); //corr matrix
  for(i in 1:N){
    for(k in 1:K){
      Y_mean[i,k] = a[k]*X[i]^r[k] +b[k]; //posterior mean
      s_new[i,k] = s_scale[k]*(1+kappa[k]*X[i]);
    }
     Sigma_new[i] = quad_form_diag(Omega, s_new[i]); //posterior cov matrix
    Y_pred[i] = multi_normal_rng(Y_mean[i],Sigma_new[i]); //posterior predictive
  }
  for(n in 1:N){
    log_lik[n] = multi_normal_cholesky_lpdf(y[n]|mu[n],Sigma[n]); // log lik of each obs
  }
} 

Hi,
bgoodri's model can be extended relatively straightforwardly. Here is code that uses the same number of levels for each variable (N_thres is the number of thresholds), but it can definitely be extended to a different number of levels per variable with some index juggling (you need "ragged arrays", which Stan does not support out of the box, so you have to implement the indexing yourself; a rough sketch of this follows after the helper functions below).

       for(n in 1:N) { //iterate over rows
            real mus[N_dims]; // fill in: linear predictors for this row
            int Ys[N_dims];   // fill in: observed values for this row

            vector[N_dims] z;
            real prev;
            prev = 0;
            for (d in 1:N_dims) {
              real t; // threshold at which utility = 0
              if (Ys[d] == 1){
                real ub = approx_Phi((thresholds[d, 1] - (mus[d] + prev)) / L_rescor[d,d]);
                t = ub * u[n,d];
                target += log(ub);  // Jacobian adjustment
              } else if (Ys[d] == N_thres + 1) {
                real lb = approx_Phi((thresholds[d, N_thres] - (mus[d] + prev)) / L_rescor[d,d]);
                t = lb + (1 - lb) * u[n,d];
                target += log1m(lb);  // Jacobian adjustment
              } else {
                real lb = approx_Phi((thresholds[d, Ys[d] - 1] - (mus[d] + prev)) / L_rescor[d,d]);
                real ub = approx_Phi((thresholds[d, Ys[d]    ] - (mus[d] + prev)) / L_rescor[d,d]);
                t = lb + (ub - lb) * u[n,d];
                target += log(ub - lb);
              }
              z[d] = approx_inv_Phi(t);
              if (d < N_dims) prev = L_rescor[d+1,1:d] * head(z, d);
              // Jacobian adjustments imply z is truncated standard normal
              // thus utility --- mu + L_rescor * z --- is truncated multivariate normal
            }
       }

Here u is a matrix of nuisance parameters and mus are the linear predictors.
Here we use a simple logistic approximation to Phi and inv_Phi that is more stable than the built-in functions:

functions {
      // logistic approximation to the standard normal CDF
      real approx_Phi(real x) {
        return inv_logit(x * 1.702);
      }

      // logistic approximation to the standard normal quantile function
      real approx_inv_Phi(real x) {
        return logit(x) / 1.702;
      }
}
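
As a rough illustration of the index juggling needed for a different number of categories per variable (my own untested sketch, not part of the model above): flatten all thresholds into a single vector, keep per-variable start positions, and enforce the ordering within each variable via positive increments, e.g.

data {
  int<lower=1> N_dims;          // number of ordinal variables
  int<lower=2> n_cat[N_dims];   // number of categories per variable
}
transformed data {
  int n_thres[N_dims];          // thresholds per variable: n_cat - 1
  int start[N_dims];            // position of each variable's first threshold
  int step_start[N_dims];       // position of each variable's first increment
  int total_thres = 0;
  int total_steps = 0;
  for (d in 1:N_dims) {
    n_thres[d] = n_cat[d] - 1;
    start[d] = total_thres + 1;
    step_start[d] = total_steps + 1;
    total_thres += n_thres[d];
    total_steps += n_thres[d] - 1;
  }
}
parameters {
  vector[N_dims] first_thres;               // first threshold of each variable
  vector<lower=0>[total_steps] thres_step;  // positive gaps between consecutive thresholds
}
transformed parameters {
  vector[total_thres] thres_flat;           // all thresholds, ordered within each variable
  for (d in 1:N_dims) {
    thres_flat[start[d]] = first_thres[d];
    for (i in 1:(n_thres[d] - 1))
      thres_flat[start[d] + i] = thres_flat[start[d] + i - 1] + thres_step[step_start[d] + i - 1];
  }
}

In the likelihood, thresholds[d, c] then becomes thres_flat[start[d] + c - 1] and the check for the last category becomes Ys[d] == n_cat[d].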

I am indebted to @CerulloE who helped with getting the code together.

Combining this with continuous outcomes would be a bit more tricky. I guess you could adapt the code to compute z directly: if I get it right, mus[d] + prev holds the mean conditional on what was "observed" in the previous dimensions and L_rescor[d,d] the relevant standard deviation, so you would somehow need to use these to obtain z as a standard normal variate representing the deviation of the observed value from the expected one. I however fail to see clearly at the moment how exactly this could be done or whether it actually makes sense.
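
If it does make sense, a very rough and untested sketch of that idea could look like the following, assuming the continuous responses are standardized so that their residual scale is captured by L_rescor (otherwise you would work with a covariance Cholesky rather than a correlation one); y_cont is a hypothetical array holding the continuous observations:

              // inside the loop over d, for a dimension d that is continuous rather than ordinal:
              // conditional on the earlier dimensions, the observation is normal with
              // mean mus[d] + prev and sd L_rescor[d,d], so the observed value pins z[d] down exactly
              target += normal_lpdf(y_cont[n, d] | mus[d] + prev, L_rescor[d, d]);
              z[d] = (y_cont[n, d] - (mus[d] + prev)) / L_rescor[d, d];

No Jacobian term would be needed for such a dimension because nothing is sampled there, but whether this decomposition is actually right should be double-checked against the GHK literature.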

Also note that the code by Ben Goodrich is based on the GHK algorithm (see GHK algorithm - Wikipedia), so studying that might provide further insight.

Best of luck!


Thank you so much! This is wonderful - I have spent the last few days going over the GHK algorithm, Albert and Chib 1993, and even stumbled upon @CerulloE's paper. As for the code above, I think it all makes sense to me. A few questions:

  1. Could mus be written as vector[K] mus[N], so that it equates to my mean function g(x) = x^r? In turn, as you have it written, Ys would be my observed ordinal response variables (0, 1, 2, 3, etc.), declared as vector[K] Ys[N]? Is this correct?

  2. Would thresholds be defined in the parameters block and N_thres in the data block? If so, I'd have to read N_thres in via the data block and define vector[N_thres] thresholds[N,K]? Something like this?

  3. I could then put a prior on thresholds such as Betancourt's induced Dirichlet? If I wanted to return the latent values of z, would I move those to the generated quantities block?

I suppose my big question is where/how do I declare N_thres and thresholds. I've been looking at some of @CerulloE's code as well and I'm trying to figure out how to modify it. Let's say I have categories labelled 0, 1, 2, so 3 ordered categories. How would this translate?

Thank you!! Also, I’ll start to play around with this code and then see what I can do about the mixed setting.


Hi,
currently, my code is made as a “plugin” for brms (which is a bit annoying to do, but gives me some additional flexibility). So to have a concrete example, I generate some simple data as:

# Simulate ordinal responses from a cumulative logit model: draw a latent
# logistic variable around `mean` and count how many thresholds it exceeds
# (categories are labelled 1, 2, ..., length(thresholds) + 1)
rcumulative_logit <- function(N, mean, thresholds) {
  raw <- rlogis(N) + mean
  res <- rep(1, N)
  for(t in thresholds) {
    res <- res + (t < raw)
  }
  res
}

library(tidyverse) # for %>%, crossing() and mutate() used below

set.seed(58224522)
N_patients <- 7
simple_test <- data.frame(patient = factor(1:N_patients)) %>%
  crossing(time_from_diagnosis = c(0, 3, 5)) %>%
  mutate(q1 = rcumulative_logit(n(), rnorm(N_patients)[as.integer(patient)] + time_from_diagnosis * 0.5, thresholds = c(-1,-0.3,0,1)),
         q2 = rcumulative_logit(n(), rnorm(N_patients, sd = 0.5)[as.integer(patient)] + time_from_diagnosis * 0.1, thresholds = c(-0.4,-0.1,0.5,1.5)),
         q3 = rcumulative_logit(n(), rnorm(N_patients, sd = 2)[as.integer(patient)] + time_from_diagnosis * 0.3, thresholds = c(-2,-0.8,-0.2,1.1)),
         obs_id = 1:n()
  )

simple_test
# A tibble: 21 x 6
   patient time_from_diagnosis    q1    q2    q3 obs_id
   <fct>                 <dbl> <dbl> <dbl> <dbl>  <int>
 1 1                         0     1     1     5      1
 2 1                         3     5     5     4      2
 3 1                         5     5     5     5      3
 4 2                         0     4     1     1      4
 5 2                         3     4     3     4      5
 6 2                         5     5     1     4      6
 7 3                         0     1     5     1      7
 8 3                         3     1     4     3      8
 9 3                         5     4     4     2      9
10 4                         0     4     2     4     10

I then run the hacked-around brms model with formula bf(mvbind(q1, q2, q3) ~ 1 + time_from_diagnosis), giving the following data for Stan and the full Stan code:

$N_q1
[1] 21

$Y_q1
 [1] 1 5 5 4 4 5 1 1 4 4 5 5 4 5 5 5 5 5 5 5 5

$nthres_q1
[1] 4

$N_q2
[1] 21

$Y_q2
 [1] 1 5 5 1 3 1 5 4 4 2 5 1 1 5 5 1 4 5 4 1 2

$nthres_q2
[1] 4

$N_q3
[1] 21

$Y_q3
 [1] 5 4 5 1 4 4 1 3 2 4 5 4 5 5 4 5 5 5 1 1 2

$nthres_q3
[1] 4

$N
[1] 21

$K_q1
[1] 1

$X_q1
   time_from_diagnosis
1                    0
2                    3
3                    5
4                    0
5                    3
6                    5
7                    0
8                    3
9                    5
10                   0
11                   3
12                   5
13                   0
14                   3
15                   5
16                   0
17                   3
18                   5
19                   0
20                   3
21                   5

$K_q2
[1] 1

$X_q2
   time_from_diagnosis
1                    0
2                    3
3                    5
4                    0
5                    3
6                    5
7                    0
8                    3
9                    5
10                   0
11                   3
12                   5
13                   0
14                   3
15                   5
16                   0
17                   3
18                   5
19                   0
20                   3
21                   5

$K_q3
[1] 1

$X_q3
   time_from_diagnosis
1                    0
2                    3
3                    5
4                    0
5                    3
6                    5
7                    0
8                    3
9                    5
10                   0
11                   3
12                   5
13                   0
14                   3
15                   5
16                   0
17                   3
18                   5
19                   0
20                   3
21                   5

$prior_only
[1] 0
// generated with brms 2.15.7
functions {
  
      real empty_cumulative_lpmf(int y, real mu, vector intercept) {
        return 0;
      }

    
  
      real approx_Phi(real x) {
        return inv_logit(x * 1.702);
      }

      real approx_inv_Phi(real x) {
        return logit(x) / 1.702;
      }
    
}
data {
  int<lower=1> N;  // total number of observations
  int<lower=1> N_q1;  // number of observations
  int Y_q1[N_q1];  // response variable
  int<lower=2> nthres_q1;  // number of thresholds
  int<lower=1> K_q1;  // number of population-level effects
  matrix[N_q1, K_q1] X_q1;  // population-level design matrix
  int<lower=1> N_q2;  // number of observations
  int Y_q2[N_q2];  // response variable
  int<lower=2> nthres_q2;  // number of thresholds
  int<lower=1> K_q2;  // number of population-level effects
  matrix[N_q2, K_q2] X_q2;  // population-level design matrix
  int<lower=1> N_q3;  // number of observations
  int Y_q3[N_q3];  // response variable
  int<lower=2> nthres_q3;  // number of thresholds
  int<lower=1> K_q3;  // number of population-level effects
  matrix[N_q3, K_q3] X_q3;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc_q1 = K_q1;
  matrix[N_q1, Kc_q1] Xc_q1;  // centered version of X_q1
  vector[Kc_q1] means_X_q1;  // column means of X_q1 before centering
  int Kc_q2 = K_q2;
  matrix[N_q2, Kc_q2] Xc_q2;  // centered version of X_q2
  vector[Kc_q2] means_X_q2;  // column means of X_q2 before centering
  int Kc_q3 = K_q3;
  matrix[N_q3, Kc_q3] Xc_q3;  // centered version of X_q3
  vector[Kc_q3] means_X_q3;  // column means of X_q3 before centering
  
           int N_thres = nthres_q1;
         if(N != N_q1) { reject("Requiring equal sample size in all dimensions."); }

         if(N != N_q2) { reject("Requiring equal sample size in all dimensions."); }

         if(N != N_q3) { reject("Requiring equal sample size in all dimensions."); }
  
         if(nthres_q1 != nthres_q2) { reject("Requiring equal number of thresholds in all dimensions."); }

         if(nthres_q1 != nthres_q3) { reject("Requiring equal number of thresholds in all dimensions."); }
  for (i in 1:K_q1) {
    means_X_q1[i] = mean(X_q1[, i]);
    Xc_q1[, i] = X_q1[, i] - means_X_q1[i];
  }
  for (i in 1:K_q2) {
    means_X_q2[i] = mean(X_q2[, i]);
    Xc_q2[, i] = X_q2[, i] - means_X_q2[i];
  }
  for (i in 1:K_q3) {
    means_X_q3[i] = mean(X_q3[, i]);
    Xc_q3[, i] = X_q3[, i] - means_X_q3[i];
  }
}
parameters {
  vector[Kc_q1] b_q1;  // population-level effects
  ordered[nthres_q1] Intercept_q1;  // temporary thresholds for centered predictors
  vector[Kc_q2] b_q2;  // population-level effects
  ordered[nthres_q2] Intercept_q2;  // temporary thresholds for centered predictors
  vector[Kc_q3] b_q3;  // population-level effects
  ordered[nthres_q3] Intercept_q3;  // temporary thresholds for centered predictors
  
      cholesky_factor_corr[3] L_rescor;
    
  
      real<lower=0, upper=1> u[N, 3]; // raw residuals
    
}
transformed parameters {
}
model {
  
         vector[N_thres] thresholds[3];
         thresholds[1] = Intercept_q1;
         thresholds[2] = Intercept_q2;
         thresholds[3] = Intercept_q3;
  
      target += lkj_corr_cholesky_lpdf(L_rescor | 1);
    
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N_q1] mu_q1 = Xc_q1 * b_q1;
    // initialize linear predictor term
    vector[N_q2] mu_q2 = Xc_q2 * b_q2;
    // initialize linear predictor term
    vector[N_q3] mu_q3 = Xc_q3 * b_q3;
    for (n in 1:N_q1) {
      target += empty_cumulative_lpmf(Y_q1[n] | mu_q1[n], Intercept_q1);
    }
    for (n in 1:N_q2) {
      target += empty_cumulative_lpmf(Y_q2[n] | mu_q2[n], Intercept_q2);
    }
    for (n in 1:N_q3) {
      target += empty_cumulative_lpmf(Y_q3[n] | mu_q3[n], Intercept_q3);
    }
    
         for(n in 1:N) {
              real mus[3] = {mu_q1[n], mu_q2[n], mu_q3[n]};
              int Ys[3] = {Y_q1[n], Y_q2[n], Y_q3[n]};

              vector[3] z;
              real prev;
              prev = 0;
              for (d in 1:3) {
                real t; // threshold at which utility = 0
                if (Ys[d] == 1){
                  real ub = approx_Phi((thresholds[d, 1] -(mus[d] + prev)) / L_rescor[d,d]);
                  t = ub * u[n,d];
                  target += log(ub);  // Jacobian adjustment
                } else if (Ys[d] == N_thres + 1) {
                  real lb = approx_Phi((thresholds[d, N_thres] -(mus[d] + prev)) / L_rescor[d,d]);
                  t = lb + (1 - lb) * u[n,d];
                  target += log1m(lb);  // Jacobian adjustment
                } else {
                  real lb = approx_Phi((thresholds[d, Ys[d] - 1] -(mus[d] + prev)) / L_rescor[d,d]);
                  real ub = approx_Phi((thresholds[d, Ys[d]    ] -(mus[d] + prev)) / L_rescor[d,d]);
                  t = lb + (ub - lb) * u[n,d];
                  target += log(ub - lb);
                }
                z[d] = approx_inv_Phi(t);
                if (d < 3) prev = L_rescor[d+1,1:d] * head(z, d);
                // Jacobian adjustments imply z is truncated standard normal
                // thus utility --- mu + L_rescor * z --- is truncated multivariate normal
              }
          }
      
  }
  // priors including constants
  target += student_t_lpdf(Intercept_q1 | 3, 0, 2.5);
  target += student_t_lpdf(Intercept_q2 | 3, 0, 2.5);
  target += student_t_lpdf(Intercept_q3 | 3, 0, 2.5);
}
generated quantities {
  // compute actual thresholds
  vector[nthres_q1] b_q1_Intercept = Intercept_q1 + dot_product(means_X_q1, b_q1);
  // compute actual thresholds
  vector[nthres_q2] b_q2_Intercept = Intercept_q2 + dot_product(means_X_q2, b_q2);
  // compute actual thresholds
  vector[nthres_q3] b_q3_Intercept = Intercept_q3 + dot_product(means_X_q3, b_q3);
  
     corr_matrix[3] Rescor = multiply_lower_tri_self_transpose(L_rescor);
     vector<lower=-1,upper=1>[3] rescor;
     // extract upper diagonal of rescor matrix
     for (k in 1:3) {
        for (j in 1:(k - 1)) {
          rescor[choose(k - 1, 2) + j] = Rescor[j, k];
        }
      }
    
}

Hope that makes it a bit clearer. In particular, you can do anything you want to compute mus, the ordinal variables start at 1, and you can definitely put any prior you like on the thresholds. The empty_cumulative_lpmf is just a hack to force brms to ignore its default likelihood and do what I want. Also, instead of using N_dims, 3 is hardcoded (because the code is completely generated).
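
Regarding the induced Dirichlet prior: for reference, the function from Betancourt's ordinal regression case study (logit link; reproduced here from memory, so double-check it against the case study before relying on it) can be added to the functions block and applied to each threshold vector:

      real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
        int K = num_elements(c) + 1;
        vector[K - 1] sigma = inv_logit(phi - c);
        vector[K] p;
        matrix[K, K] J = rep_matrix(0, K, K);
        // ordinal probabilities induced by thresholds c around the anchor point phi
        p[1] = 1 - sigma[1];
        for (k in 2:(K - 1))
          p[k] = sigma[k - 1] - sigma[k];
        p[K] = sigma[K - 1];
        // Jacobian of the map from thresholds to probabilities
        for (k in 1:K) J[k, 1] = 1;
        for (k in 2:K) {
          real rho = sigma[k - 1] * (1 - sigma[k - 1]);
          J[k, k] = -rho;
          J[k - 1, k] = rho;
        }
        return dirichlet_lpdf(p | alpha) + log_determinant(J);
      }

It would then replace the student_t priors on the Intercepts, e.g. target += induced_dirichlet_lpdf(Intercept_q1 | rep_vector(1, nthres_q1 + 1), 0);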

I also realized that the best way to combine this with continuous variables might be to go the other way around. If the continuous variables were just multivariate normal, you would:

  • Take the marginal means and correlation matrix for the continuous variables and compute their likelihood.
  • Compute the conditional means and correlation matrix for the probit variables given the observed continuous values (e.g. as in Multivariate normal distribution - Wikipedia; a small sketch of this step follows after the list). I don't doubt the computations can be rewritten to take advantage of the fact that you have a Cholesky decomposition, I just don't have enough skill in linear algebra to see it quickly.
  • Use the conditional means and correlation in the GHK algorithm.
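
To make the conditioning step concrete, here is a small sketch (my own, not from the thread) using the standard conditional multivariate normal formulas, assuming the joint covariance Sigma is ordered with the C continuous dimensions first and the latent probit dimensions after them:

functions {
  // conditional mean of the latent probit block given the observed continuous values
  vector cond_mean(vector mu, matrix Sigma, vector y_cont, int C) {
    int D = rows(Sigma);
    return mu[(C + 1):D]
           + Sigma[(C + 1):D, 1:C] * mdivide_left_spd(Sigma[1:C, 1:C], y_cont - mu[1:C]);
  }
  // conditional covariance of the latent probit block given the continuous block
  matrix cond_cov(matrix Sigma, int C) {
    int D = rows(Sigma);
    return Sigma[(C + 1):D, (C + 1):D]
           - Sigma[(C + 1):D, 1:C] * mdivide_left_spd(Sigma[1:C, 1:C], Sigma[1:C, (C + 1):D]);
  }
}

The continuous block itself would contribute multi_normal_lpdf(y_cont | mu[1:C], Sigma[1:C, 1:C]), and the conditional mean together with the Cholesky factor of the conditional covariance would take the place of mus and L_rescor in the GHK loop above.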

Generalizing to non-normal continuous distributions should then be possible by converting with the appropriate CDF and then the normal inverse CDF (this is something I am less sure about, so please double-check).

It is quite possible that you can do this all in one go by just modifying the GHK algorithm - once again, I am not good enough in linear algebra to see this quickly.

It looks like a tough model, so I hope you can succeed in building it.


All this is pretty clear. I have enough here to combine both binary and polytomous variables for sure. In terms of mixed data, I'm thinking of something like "A Bayesian method for analyzing combinations of continuous" by Zhang et al. 2015 in the Journal of Multivariate Analysis (see attached). It appears they combine the truncated normal of the probit with a standard multivariate normal for the continuous part. Now, how to translate that to Stan (or even R for that matter)? That is the question!

Zhang_etal_2015.pdf (1.5 MB)

@martinmodrak A question as I work through coding this mixed model: is it possible to allow the correlation terms to vary as a function of x (age in my case)? In my code above, when I combine terms in diag_pre_multiply, the sd varies as a function of age, but the correlation terms are constant across age for each pair of responses. Can these also be coded to depend on x? Does this make sense, or does it become way too complex?

In principle, this seems possible. The easiest way would probably be to put predictors on the unconstrained representation of the correlation matrix (see 10.9 Correlation matrices | Stan Reference Manual) and then apply the transformation yourself. @spinkney recently shared a Stan implementation of the transform at Correlation matrix with positively constrained off-diagonals - #3 by spinkney, although since you would not be putting priors on the transformed matrix (but presumably rather on the predictors), you must not add the log Jacobian correction.
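
To illustrate what applying the transformation yourself could look like, here is a rough sketch (my own, not spinkney's implementation, and untested) of building a correlation Cholesky factor from canonical partial correlations in (-1, 1), which you could in turn model as, say, tanh of a linear function of age:

// z holds the K * (K - 1) / 2 canonical partial correlations in (-1, 1),
// filled row by row over the lower triangle (e.g. z = tanh(alpha + beta * age))
matrix corr_chol_from_cpc(vector z, int K) {
  matrix[K, K] L = rep_matrix(0, K, K);
  int idx = 1;
  L[1, 1] = 1;
  for (i in 2:K) {
    real sum_sq;
    L[i, 1] = z[idx];
    idx += 1;
    sum_sq = square(L[i, 1]);
    for (j in 2:(i - 1)) {
      L[i, j] = z[idx] * sqrt(1 - sum_sq);
      idx += 1;
      sum_sq += square(L[i, j]);
    }
    L[i, i] = sqrt(1 - sum_sq); // each row has unit norm, so L * L' is a correlation matrix
  }
  return L;
}

The resulting factor would then be used wherever L_Omega / L_rescor appears now, and, as noted above, no log Jacobian term is added because the priors sit on the predictors rather than on the matrix itself.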

However, I would expect this to work quite badly in practice. In all models where I used correlations, a huge amount of data was necessary to learn the correlations with any precision. And you would need substantially more data to be able to decompose the correlations into several terms.

I also have no intuition as to whether putting predictors on the canonical partial correlations is a sensible thing to do from a theoretical perspective (i.e. whether you would assume the CPCs to change monotonically with your predictors).
