Translating (ordered-probit) model to statistical notation - where to look?

Hi all,

I’m wondering if anyone can point me towards some resources for translating models into accurate statistical notation.

I’m working on translating an ordered-probit model (discussed in this thread) into statistical notation, but I’m not at all confident that what I have is correct.

What I have so far:

Y_{sm} = \gamma_{sm} \Leftrightarrow \theta_{m,\gamma_{sm}-1} < \widetilde Y_{sm} \leq \theta_{m,\gamma_{sm}}, \quad \gamma_{sm} \in \{1, \dots, K_{m}\}

where \gamma_{sm} is a category out of K_{m} ordered categories and \theta_{m} is a vector of threshold parameters for outcome m.
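For the boundary categories I am assuming the usual convention of fixing the outermost thresholds:

\theta_{m,0} \equiv -\infty, \qquad \theta_{m,K_{m}} \equiv +\infty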

\begin{align*}
p(y_{m} = k_{m} \mid \mu_{m}, \sigma_{m}, \theta_{m}) &= \Phi\!\left(\frac{\theta_{m,k_{m}} - \mu_{m}}{\sigma_{s} + \sigma_{m} + \sigma_{t}}\right) - \Phi\!\left(\frac{\theta_{m,k_{m}-1} - \mu_{m}}{\sigma_{s} + \sigma_{m} + \sigma_{t}}\right) \\
\mu_{smt} &= \beta_{0} + \beta_{\text{Intensity}} \cdot \text{Intensity}_{smt} \\
&\quad + \beta_{\text{Condition}} \cdot \text{Condition}_{smt} \\
&\quad + \beta_{\text{Category}} \cdot \text{Category}_{smt} \\
&\quad + \beta_{\text{Condition.Category}} \cdot \text{Condition.Category}_{smt} \\
\beta_{0} &= 0 \\
\beta_{\text{Intensity}} &\sim \operatorname{Student-}t(6, 1, 1.5) \\
\beta_{\text{Condition}} &\sim \operatorname{Student-}t(6, 1, 1.5) \\
\beta_{\text{Category}} &\sim \operatorname{Student-}t(6, 1, 1.5) \\
\beta_{\text{Condition.Category}} &\sim \operatorname{Student-}t(6, 1, 1.5) \\
\sigma_{s} &\sim \operatorname{Exponential}(3) \\
\sigma_{t} &\sim \operatorname{Exponential}(2) \\
\sigma_{m} &= 1 / \mathrm{disc} \\
\log(\mathrm{disc}) &= \gamma_{\text{Category}} \\
\gamma_{\text{Category}=\text{Unpleasant}} &= 0 \\
\gamma_{\text{Category}=\text{Neutral}} &\sim \operatorname{Normal}(0, 2) \\
\gamma_{\text{Category}=\text{Pleasant}} &\sim \operatorname{Normal}(0, 2) \\
\theta_{m,1}, \dots, \theta_{m,K_{m}-1} &\sim \operatorname{Normal}(0, 6) \\
\mathrm{Corr} &\sim \operatorname{LKJCholesky}(1)
\end{align*}

My brms model is:

Mod_fmla <- bf(mvbind(Valence, Arousal) | resp_thres(8, gr = Category) ~ Category +
                                            Condition +
                                            Category:Condition +
                                            Intensity.c + 
                                            (1|o|Token) +
                                            (1|p|Condition) +
                                            (1|q|Condition:Subject)) +
        lf(disc ~ 0 + Category, resp = "Valence", cmc = FALSE) +
        lf(disc ~ 0 + Category, resp = "Arousal", cmc = FALSE)
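As a sanity check, the prior slots brms defines for this formula can be listed with get_prior() before setting anything - a minimal sketch, assuming the data frame is called dat:

## List the prior structure brms builds for this formula
## (dat is a placeholder name for the actual data frame)
get_prior(Mod_fmla, data = dat, family = cumulative("probit"))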

## priors
priors <- c(
        set_prior("normal(0,6)",  
                  class = "Intercept",
                  coef = "",
                  group = "Neutral", 
                  resp = ""),
        
        set_prior("normal(0,6)",  
                  class = "Intercept",
                  coef = "",
                  group = "Pleasant",
                  resp = ""),
        
        set_prior("student_t(6,1,1.5)",
                  class = "b",
                  coef = "",
                  resp = ""),
       
        set_prior("normal(0,2)",
                  class = "b",
                  coef = "CategoryNeutral",
                  resp = "",
                  dpar = "disc"),
        
        set_prior("normal(0,2)",
                  class = "b",
                  coef = "CategoryPleasant",
                  resp = "",
                  dpar = "disc"),
        
        set_prior("exponential(2)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition",
                  resp = ""),
        set_prior("exponential(3)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition:Subject",
                  resp = ""),
        set_prior("exponential(2)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Token",
                  resp = "")      
       )
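The Stan code below can be regenerated from the formula and priors with make_stancode() - a sketch, again with dat standing in for the data frame:

## Print the Stan program brms compiles for this model
make_stancode(Mod_fmla, data = dat, family = cumulative("probit"), prior = priors)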

The Stan code is:

parameters {
  vector[K_Valence] b_Valence;  // population-level effects
  ordered[nthres_Valence[1]] Intercept_Valence_1;  // temporary thresholds for centered predictors
  ordered[nthres_Valence[2]] Intercept_Valence_2;  // temporary thresholds for centered predictors
  ordered[nthres_Valence[3]] Intercept_Valence_3;  // temporary thresholds for centered predictors
  vector[K_disc_Valence] b_disc_Valence;  // population-level effects
  vector[K_Arousal] b_Arousal;  // population-level effects
  ordered[nthres_Arousal[1]] Intercept_Arousal_1;  // temporary thresholds for centered predictors
  ordered[nthres_Arousal[2]] Intercept_Arousal_2;  // temporary thresholds for centered predictors
  ordered[nthres_Arousal[3]] Intercept_Arousal_3;  // temporary thresholds for centered predictors
  vector[K_disc_Arousal] b_disc_Arousal;  // population-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  matrix[M_2, N_2] z_2;  // standardized group-level effects
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix
  vector<lower=0>[M_3] sd_3;  // group-level standard deviations
  matrix[M_3, N_3] z_3;  // standardized group-level effects
  cholesky_factor_corr[M_3] L_3;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_Valence_1;
  vector[N_1] r_1_Arousal_2;
  matrix[N_2, M_2] r_2;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_2] r_2_Valence_1;
  vector[N_2] r_2_Arousal_2;
  matrix[N_3, M_3] r_3;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_3] r_3_Valence_1;
  vector[N_3] r_3_Arousal_2;
  // compute actual group-level effects
  r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  r_1_Valence_1 = r_1[, 1];
  r_1_Arousal_2 = r_1[, 2];
  // compute actual group-level effects
  r_2 = (diag_pre_multiply(sd_2, L_2) * z_2)';
  r_2_Valence_1 = r_2[, 1];
  r_2_Arousal_2 = r_2[, 2];
  // compute actual group-level effects
  r_3 = (diag_pre_multiply(sd_3, L_3) * z_3)';
  r_3_Valence_1 = r_3[, 1];
  r_3_Arousal_2 = r_3[, 2];
}
model {
  vector[nmthres_Valence] merged_Intercept_Valence;  // merged thresholds
  // initialize linear predictor term
  vector[N_Valence] mu_Valence = X_Valence * b_Valence;
  // initialize linear predictor term
  vector[N_Valence] disc_Valence = X_disc_Valence * b_disc_Valence;
  vector[nmthres_Arousal] merged_Intercept_Arousal;  // merged thresholds
  // initialize linear predictor term
  vector[N_Arousal] mu_Arousal = X_Arousal * b_Arousal;
  // initialize linear predictor term
  vector[N_Arousal] disc_Arousal = X_disc_Arousal * b_disc_Arousal;
  merged_Intercept_Valence[Kthres_start_Valence[1]:Kthres_end_Valence[1]] = Intercept_Valence_1;
  merged_Intercept_Valence[Kthres_start_Valence[2]:Kthres_end_Valence[2]] = Intercept_Valence_2;
  merged_Intercept_Valence[Kthres_start_Valence[3]:Kthres_end_Valence[3]] = Intercept_Valence_3;
  merged_Intercept_Arousal[Kthres_start_Arousal[1]:Kthres_end_Arousal[1]] = Intercept_Arousal_1;
  merged_Intercept_Arousal[Kthres_start_Arousal[2]:Kthres_end_Arousal[2]] = Intercept_Arousal_2;
  merged_Intercept_Arousal[Kthres_start_Arousal[3]:Kthres_end_Arousal[3]] = Intercept_Arousal_3;
  for (n in 1:N_Valence) {
    // add more terms to the linear predictor
    mu_Valence[n] += r_1_Valence_1[J_1_Valence[n]] * Z_1_Valence_1[n] + r_2_Valence_1[J_2_Valence[n]] * Z_2_Valence_1[n] + r_3_Valence_1[J_3_Valence[n]] * Z_3_Valence_1[n];
  }
  for (n in 1:N_Arousal) {
    // add more terms to the linear predictor
    mu_Arousal[n] += r_1_Arousal_2[J_1_Arousal[n]] * Z_1_Arousal_2[n] + r_2_Arousal_2[J_2_Arousal[n]] * Z_2_Arousal_2[n] + r_3_Arousal_2[J_3_Arousal[n]] * Z_3_Arousal_2[n];
  }
  for (n in 1:N_Valence) {
    // apply the inverse link function
    disc_Valence[n] = exp(disc_Valence[n]);
  }
  for (n in 1:N_Arousal) {
    // apply the inverse link function
    disc_Arousal[n] = exp(disc_Arousal[n]);
  }
  // priors including all constants
  target += student_t_lpdf(b_Valence | 6,1,1.5);
  target += normal_lpdf(Intercept_Valence_2 | 0,6);
  target += normal_lpdf(Intercept_Valence_3 | 0,6);
  target += normal_lpdf(b_disc_Valence[1] | 0,2);
  target += normal_lpdf(b_disc_Valence[2] | 0,2);
  target += student_t_lpdf(b_Arousal | 6,1,1.5);
  target += normal_lpdf(Intercept_Arousal_2 | 0,6);
  target += normal_lpdf(Intercept_Arousal_3 | 0,6);
  target += normal_lpdf(b_disc_Arousal[1] | 0,2);
  target += normal_lpdf(b_disc_Arousal[2] | 0,2);
  target += exponential_lpdf(sd_1[1] | 2);
  target += exponential_lpdf(sd_1[2] | 2);
  target += std_normal_lpdf(to_vector(z_1));
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  target += exponential_lpdf(sd_2[1] | 3);
  target += exponential_lpdf(sd_2[2] | 3);
  target += std_normal_lpdf(to_vector(z_2));
  target += lkj_corr_cholesky_lpdf(L_2 | 1);
  target += exponential_lpdf(sd_3[1] | 2);
  target += exponential_lpdf(sd_3[2] | 2);
  target += std_normal_lpdf(to_vector(z_3));
  target += lkj_corr_cholesky_lpdf(L_3 | 1);
  // likelihood including all constants
  if (!prior_only) {
    for (n in 1:N_Valence) {
      target += cumulative_probit_merged_lpmf(Y_Valence[n] | mu_Valence[n], disc_Valence[n], merged_Intercept_Valence, Jthres_Valence[n]);
    }
    for (n in 1:N_Arousal) {
      target += cumulative_probit_merged_lpmf(Y_Arousal[n] | mu_Arousal[n], disc_Arousal[n], merged_Intercept_Arousal, Jthres_Arousal[n]);
    }
  }
}


First, I would mention that a growing part of the academic readership would, IMHO, understand your model faster from the R formula than from the math notation, so - depending on your target audience - doing away with the math notation completely might be feasible, though a bit risky, as it is not (yet) commonly done.

Definitely also check out the threads Understanding the Mathematical Model and Mathematical description of a brms model, where a similar problem is discussed.

I see a few weird things:

  • I thought you were using the cumulative("logit") family, but the use of \Phi suggests you are using cumulative("probit").
  • Categorical variables would be encoded via dummy coding with a reference level, so you would have a coefficient for each level of Category, Condition, and Category:Condition, except for the reference level (see the sketch after this list).
  • The \sigma_s, \sigma_m, \sigma_t should probably be the standard deviations of the varying intercepts - those are, however, currently missing from the model, so you would need to write them out explicitly. An example of what that might look like is on page 8 of my recent publication.
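To illustrate the dummy-coding point, here is a minimal sketch with a made-up three-level Category factor (not your data):

## Hypothetical three-level factor; Unpleasant is the reference level
d <- data.frame(Category = factor(c("Unpleasant", "Neutral", "Pleasant"),
                                  levels = c("Unpleasant", "Neutral", "Pleasant")))
## The design matrix has an intercept plus K - 1 = 2 Category columns;
## the reference level gets no coefficient of its own
model.matrix(~ Category, data = d)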

And there are potentially other issues.

You might also want to simplify and just state the model in matrix notation (https://en.wikipedia.org/wiki/Mixed_model) and then rely on the brms formula for the details.
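For the latent-variable part, that would look something like (a generic sketch, not your exact model):

\widetilde{Y} = X\beta + Zu + \varepsilon, \qquad u \sim \operatorname{Normal}(0, \Sigma), \qquad \varepsilon \sim \operatorname{Normal}(0, 1)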

But if you need to have the full math notation, feel free to get back for additional clarification :-)

Thank you, Martin!

I am using cumulative("probit") to align with Paul’s tutorial and a recent paper by Liddell and Kruschke.

I attempted to capture the dummy coding and priors with the betas for Intensity, Category, Condition, and Condition.Category, which I’m trying to reflect in the linear part describing \mu, but I’m not sure how to incorporate the sigmas into that part of the model description. I have them summed inside the \Phi terms, but I don’t think that’s correct.

You are right, summing them inside \Phi is incorrect. I am not quite sure where you are stuck, but it seems some introductory material looking at hierarchical models from a more mathematical standpoint might help. I think the treatment in Statistical Rethinking is quite accessible while detailed enough to help you move forward, but I would expect other books to cover this well too.

If I understand your question, the sigmas refer to the sd parameters for Token etc. Those are the standard deviations of the varying intercepts. Mapping this to the example from my recent publication linked above, it is similar to how e.g. the (1 || phenotype) term is coded via the \beta^1_{p_i} and \sigma_1 coefficients (but you also have a covariance matrix to take care of). A sketch of what that looks like is below.
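As a rough sketch (simplified, with generic symbols, and ignoring the correlation structure), the varying intercepts enter the mean, and the sigmas are the scales of their distributions - not terms in the denominator of \Phi:

\begin{align*}
\mu_{smt} &= \beta x_{smt} + \alpha_{s} + \alpha_{t} \\
\alpha_{s} &\sim \operatorname{Normal}(0, \sigma_{s}) \\
\alpha_{t} &\sim \operatorname{Normal}(0, \sigma_{t}) \\
p(y = k \mid \mu_{smt}, \theta, \sigma_{m}) &= \Phi\!\left(\frac{\theta_{k} - \mu_{smt}}{\sigma_{m}}\right) - \Phi\!\left(\frac{\theta_{k-1} - \mu_{smt}}{\sigma_{m}}\right)
\end{align*}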