Exception: S_n (cov_matrix) not symmetric for large value elements

Hi,

I’m working on a program that tries to infer the prior distribution (the distribution over means and covariance matrices) of a number of multivariate Normal categories, given categorization responses to stimuli in the multi-dimensional space (program attached below). The program works just fine as long as the relevant covariance matrices are ‘sufficiently small’, but for covariance matrices with larger values I’m getting exceptions that the covariance matrices are not symmetric, even though they should be / are symmetric up to numerical accuracy. I’d appreciate any pointers, e.g., whether I could use different operations to avoid these issues (which I suspect stem from limited numerical precision?).

Here is an example error message (all errors involve large values):

Chain 3: Exception: model_mvg_conj_sufficient_stats_lapse_namespace::write_array: S_n is not symmetric. S_n[1,2] = -5.36654e+06, but S_n[2,1] = -5.36654e+06 (in 'mvg_conj_sufficient_stats_lapse', line 73, column 2 to column 25)

The Stan model uses the Normal-Inverse-Wishart (NIW) prior described in Murphy (2012) and updates it with the sufficient statistics (sample means, covariances, and counts of each category that subjects observed during ‘exposure’). The objective function is the likelihood of subjects’ categorization responses to test stimuli during test. This likelihood is a multinomial density based on the predicted probabilities of each category. That predicted probability results from applying Bayes’ theorem over prior category probabilities p(cat) [assumed to be uniform for now] and the category likelihood of the stimulus, \mathcal{T}(stimulus | m_{n,cat}, \frac{\kappa_{n,cat}+1}{\kappa_{n,cat}(\nu_{n,cat} - D + 1)}S_{n,cat}, \nu_{n,cat} - D + 1)—the posterior predictive of the updated NIW after having seen n observations of a category during ‘exposure’, NIW(\mu_{cat}, \Sigma_{cat} | m_n, S_n, \kappa_n, \nu_n) (following Murphy). D is the dimensionality of the data (and of the multivariate Normal categories).
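Written out (this just restates the above, with a uniform p(cat) for now):

p(cat | stimulus) = p(cat) p(stimulus | cat) / \sum_{c} p(c) p(stimulus | c), where
p(stimulus | cat) = \mathcal{T}(stimulus | m_{n,cat}, \frac{\kappa_{n,cat}+1}{\kappa_{n,cat}(\nu_{n,cat} - D + 1)}S_{n,cat}, \nu_{n,cat} - D + 1)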

Given that the error messages I’m getting (only sometimes for some chains and only for entries > 10^6) are about S_n, my best guess is that the culprit is this line:

S_n[cat,group] = S_0[cat] +
          x_ss[cat,group] +
          kappa_0 * m_0[cat] * m_0[cat]' -
                        kappa_n[cat,group] * m_n[cat,group] * m_n[cat,group]';

I realize it is a big ask to wade through my code. But perhaps someone here sees something obvious or has some guesses / pointers I could explore. Thank you in advance!

The Stan program:

/*
* Fit multinomial response data using a belief-updating model to infer prior
* parameters.  A normal-inverse-Wishart prior is used, and it's assumed
* that the subject knows the true labels of all the input stimuli (i.e., it
* doesn't model any uncertainty in classification).
*
* This version has a lapse rate parameter (probability of random guessing)
*
* Input is in the form of raw data points for observations.
*
*
*/

data {
  int M;                        // number of categories
  int L;                        // number of exposure groups (e.g. subjects)
  int K;                        // number of features

  /* Efficiency considerations: declaring n as matrix rather than array of int is supposedly faster
  (https://mc-stan.org/docs/2_18/stan-users-guide/indexing-efficiency-section.html). It also means
  that standard arithmetic functions can be directly applied to that matrix (unlike to more-than-1-
  dimensional arrays). */
  matrix[M,L] N;                // number of observations per category (M) and exposure group (L)
  vector[K] x_mean[M,L];        // means for each category (M) and exposure group (L)
  cov_matrix[K] x_ss[M,L];      // sum of uncentered squares matrix for each category (M) and group (L)

  int N_test;                   // number of unique combinations of test locations & exposure groups
  vector[K] x_test[N_test];     // locations (in cue space) of test trials
  int y_test[N_test];           // exposure group indicator of test trials
  int z_test_counts[N_test,M];  // responses for test trials (for each category M)

  int<lower=0, upper=1> lapse_rate_known;
  int<lower=0, upper=1> mu_0_known;
  int<lower=0, upper=1> Sigma_0_known;
  real lapse_rate_data[lapse_rate_known ? 1 : 0];                         // optional: user provided lapse_rate
  vector[mu_0_known ? K : 0] mu_0_data[mu_0_known ? M : 0];               // optional: user provided expected mu_0 (prior category means)
  cov_matrix[Sigma_0_known ? K : 0] Sigma_0_data[Sigma_0_known ? M : 0];  // optional: user provided expected Sigma_0 (prior category covariance matrices)

  /* For now, this script assumes that the observations (cue vectors) are centered. The prior
     mean of m_0 is set to 0. In the future, one could automatically adjust the location based
     on the input data (e.g., by placing the mean of the Normal prior in the center of the
     exposure data).
  */
  vector<lower=0>[mu_0_known ? 0 : K] tau_scale;  // separate taus for each of the K features to capture that features can be on separate scales
  // real<lower=0> tau_scale;      // scale of cauchy prior for variances of m_0
  real<lower=0> L_omega_scale;                    // scale of LKJ prior for correlation of variance of m_0

  int<lower=0, upper=1> split_loglik_per_observation;
}

transformed data {
  real sigma_kappanu;

  /* Scale for the prior of kappa/nu_0. In order to deal with input that does not contain observations
     (in which case n_each == 0), we set the minimum value for SD to 10.
  */
  sigma_kappanu = max(N) > 0 ? max(N) * 4 : 10;
}

parameters {
  // these are all shared across groups (same prior beliefs):
    real<lower=K> kappa_0;                  // prior pseudocount for category mu
    real<lower=K + 1> nu_0;                 // prior pseudocount for category Sigma

    real<lower=0, upper=1> lapse_rate_param[lapse_rate_known ? 0 : 1];

    vector[K] m_0_param[mu_0_known ? 0 : M];                // prior mean of means
    vector<lower=0>[mu_0_known ? 0 : K] m_0_tau;            // prior variances of m_0
    cholesky_factor_corr[mu_0_known ? 0 : K] m_0_L_omega;   // prior correlations of variances of m_0 (in cholesky form)

    vector<lower=0>[K] tau_0_param[Sigma_0_known ? 0 : M];          // standard deviations of prior scatter matrix S_0
    cholesky_factor_corr[K] L_omega_0_param[Sigma_0_known ? 0 : M]; // correlation matrix of prior scatter matrix S_0 (in cholesky form)
}

transformed parameters {
  real lapse_rate;
  vector[K] m_0[M];                    // prior mean of means m_0
  cov_matrix[K] S_0[M];                // prior scatter matrix S_0

  // updated beliefs depend on input and group
  real<lower=K> kappa_n[M,L];          // updated mean pseudocount
  real<lower=K> nu_n[M,L];             // updated sd pseudocount
  vector[K] m_n[M,L];                  // updated expected mean
  cov_matrix[K] S_n[M,L];              // updated expected scatter matrix
  cov_matrix[K] t_scale[M,L];          // scale matrix of predictive t distribution

  simplex[M] p_test_conj[N_test];
  vector[M] log_p_test_conj[N_test];

  if (lapse_rate_known) {
    lapse_rate = lapse_rate_data[1];
  } else {
    lapse_rate = lapse_rate_param[1];
  }
  if (mu_0_known) {
    m_0 = mu_0_data;
  } else {
    m_0 = m_0_param;
  }
  if (Sigma_0_known) {
   // Get S_0 from expected Sigma given nu_0
   for (cat in 1:M) {
      S_0[cat] = Sigma_0_data[cat] * (nu_0 - K - 1);
   }
  }

  // update NIW parameters according to the conjugate updating rules taken from
  // Murphy (2007, p. 136)
  for (cat in 1:M) {
    if (!Sigma_0_known) {
      // Get S_0 from its components: correlation matrix and vector of standard deviations
      S_0[cat] = quad_form_diag(multiply_lower_tri_self_transpose(L_omega_0_param[cat]), tau_0_param[cat]);
    }
    for (group in 1:L) {
      if (N[cat,group] > 0 ) {
        kappa_n[cat,group] = kappa_0 + N[cat,group];
        nu_n[cat,group] = nu_0 + N[cat,group];
        m_n[cat,group] = (kappa_0 * m_0[cat] + N[cat,group] * x_mean[cat,group]) /
          kappa_n[cat,group];
        S_n[cat,group] = S_0[cat] +
          x_ss[cat,group] +
          kappa_0 * m_0[cat] * m_0[cat]' -
                        kappa_n[cat,group] * m_n[cat,group] * m_n[cat,group]';
      } else {
        kappa_n[cat,group] = kappa_0;
        nu_n[cat,group] = nu_0;
        m_n[cat,group] = m_0[cat];
        S_n[cat,group] = S_0[cat];
      }

      t_scale[cat,group] = S_n[cat,group] * (kappa_n[cat,group] + 1) /
        (kappa_n[cat,group] * (nu_n[cat,group] - K + 1));
    }
  }

  // compute posterior probabilities of all categories for each of the test stimuli
  for (j in 1:N_test) {
    int group;
    group = y_test[j];
    /* Calculate the un-normalized log posterior probability for each category. Under the
       assumption of uniform prior probabilities for each category, these are identical to the
       log likelihoods. If we were ever to change this assumption, we'd have to add
       the log prior probabilities of the categories here. */
    for (cat in 1:M) {
      // log likelihood of the test stimulus given the category
      log_p_test_conj[j,cat] = multi_student_t_lpdf(x_test[j] |
                                                    nu_n[cat,group] - K + 1,
                                                    m_n[cat,group],
                                                    t_scale[cat,group]);
    }
    // normalize to get actual posterior probs in simplex
    p_test_conj[j] = exp(log_p_test_conj[j] - log_sum_exp(log_p_test_conj[j]));
  }
}

model {
  vector[M] lapsing_probs;

  /* Assuming a uniform bias during lapses, so that the probability of each category response
     under a lapse (i.e., before taking the stimulus into account) is 1/M */
  lapsing_probs = rep_vector(lapse_rate / M, M);

  kappa_0 ~ normal(0, sigma_kappanu);
  nu_0 ~ normal(0, sigma_kappanu);

  if (!mu_0_known) {
      m_0_tau ~ cauchy(0, tau_scale);
      m_0_L_omega ~ lkj_corr_cholesky(L_omega_scale);
      m_0_param ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_multiply(m_0_tau, m_0_L_omega));
  }

  // Specifying prior for components of S_0:
  if (!Sigma_0_known) {
    for (cat in 1:M) {
      tau_0_param[cat] ~ cauchy(0, tau_scale);
      L_omega_0_param[cat] ~ lkj_corr_cholesky(L_omega_scale);
    }
  }

  for (i in 1:N_test) {
    z_test_counts[i] ~ multinomial(p_test_conj[i] * (1 - lapse_rate) + lapsing_probs);
  }

}

I haven’t tried to wade through your code, but I notice that the magnitude of the covariance matrix entry that’s causing problems is very large (e+06) and that the two numbers match one another to the number of significant digits shown. You don’t mention the scale of your data, but would it be possible to rescale it? I can’t guarantee that rescaling will help, but it might be worth a try, and it would cost only a bit of pre- and post-processing.

Thanks for including it. It’s even harder to wade through code that’s not there :-).

If possible, work with Cholesky factors. Those won’t run into symmetry issues. I’m assuming that’s not feasible here. But then I see you are computing full covariance matrices like S_0 out of the Cholesky factors. It’s far better to leave those as Cholesky factors.

We have a built-in that will enforce symmetry from an almost-symmetric matrix:

matrix symmetrize_from_lower_tri (matrix A)

Or if that disrupts derivatives due to how you’ve constructed the matrix, you can just use

A = (A + A') / 2;
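Applied to the assignment you quoted, that would look something like this (untested):

S_n[cat,group] = symmetrize_from_lower_tri(S_0[cat] + x_ss[cat,group]
                                           + kappa_0 * m_0[cat] * m_0[cat]'
                                           - kappa_n[cat,group] * m_n[cat,group] * m_n[cat,group]');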

Stan doesn’t exploit conjugacy so you don’t need to use Wishart priors. We tend to favor an LKJ prior on correlation and an independent prior on the scales.

+1 to that. Stan assumes by default that all variables are on unit scale, and that governs things like initialization. We can adapt to non-unit scale, but it can be difficult in some cases. It’s always better to rescale to roughly unit scale where possible.

P.S. Shouldn’t this be an array? If the values are required to be non-negative, I would suggest a <lower=0> constraint for doc and error checking of inputs.

matrix[M, L] N;                // number of observations per category (M)

You can declare and define a variable in the same line, e.g., (I’ve added the constraint for doc and error checking):

real<lower=0> sigma_kappa_nu = fmax(max(N) * 4, 10);

I also rewrote it so it doesn’t have to calculate the max of an array twice, not that it really matters in transformed data. This assumes the minimum value of N is 0 and there aren’t any NaN values.

P.P.S. I’d recommend updating Stan. The array syntax you’re using is deprecated and only available in our older interfaces, so you must be on an older version of Stan. If you’re using RStan or PyStan, I’d suggest switching to cmdstanr or cmdstanpy instead, as they stay more up to date with Stan.

Wow, thank you so much @Bob_Carpenter and @kholsinger! I’ll go through these comments carefully tomorrow. For now, here are some initial clarifications and answers:

  • @Bob_Carpenter Why I care about conjugacy in this case: it’s not for Stan =). My program is meant to model learning that subjects are assumed to undergo when they are exposed to example draws from shifted categories (e.g., hearing someone pronounce the “t” in “tin” with an acoustic feature value that is shifted relative to listeners’ prior expectations). I’m using the NIW model to describe how subjects’ unknown prior beliefs change with exposure to shifted realizations of categories.

    Conjugacy makes that learning / belief-updating process tractable (because it only depends on the sufficient statistics of categories during exposure: the sample means, sum of squares matrices, and observation count of each category). And being able to estimate the learning process in turn allows me to infer subjects’ unknown prior beliefs about the multivariate normal categories (the beliefs prior to any exposure), and their posterior beliefs following exposure, as well as the strength of the prior beliefs.

    In case it’s of interest, here’s the relevant part from Murphy (2012, p 134), showing the posterior beliefs about each category mean \mu and covariance \Sigma (determined by m_N, S_N, \kappa_N, and \nu_N), as a function of the prior beliefs (determined by m_0, S_0, \kappa_0, and \nu_0) and the sufficient statistics of the categories experienced during exposure (N, \bar{x}, S):
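In case the equation image doesn’t come through, here are the same updates written out the way I’ve implemented them in the code (with \bar{x} a category’s sample mean and S = \sum_i x_i x_i^\top its sum of uncentered squares during exposure):

\kappa_n = \kappa_0 + N
\nu_n = \nu_0 + N
m_n = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_n}
S_n = S_0 + S + \kappa_0 m_0 m_0^\top - \kappa_n m_n m_n^\top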

  • @Bob_Carpenter, all that said, I think that, in line with your suggestion, I am already using an LKJ prior and an independent (Cauchy) prior on the scales, both for the prior distribution of m_0 (= m_0_param in the program) and for the prior distribution of S_0 (= tau_0_param and L_omega_0_param):
 if (!mu_0_known) {
      m_0_tau ~ cauchy(0, tau_scale);
      m_0_L_omega ~ lkj_corr_cholesky(L_omega_scale);
      m_0_param ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_multiply(m_0_tau, m_0_L_omega));
  }

  // Specifying prior for components of S_0:
  if (!Sigma_0_known) {
    for (cat in 1:M) {
      tau_0_param[cat] ~ cauchy(0, tau_scale);
      L_omega_0_param[cat] ~ lkj_corr_cholesky(L_omega_scale);
    }
  }
  • @kholsinger Scaling. I originally scaled inputs but then became worried that this might affect the solution the program arrives at (but perhaps my thinking is wrong). I figured whatever transformation I apply to my exposure and test data has to maintain the ratios of the category likelihoods (the multivariate \mathcal{T} densities that are the posterior predictive of the NIW posterior beliefs), since these ratios determine the predicted posterior distribution over categories (which is used to determine the likelihood of the observed categorization responses from listeners).

    I figured that scaling each input dimension does not change the correlation matrix of the categories that sit in the scaled space. But I wasn’t confident that it doesn’t change the density of those multivariate Normals (or \mathcal{T}s). To simplify my question, let’s say I have 3 bivariate Normals and one test token x that sits in that 2D space. I first thought that scaling both the Normals and x by D^{-1}, with D = diag(\hat{\sigma}_1, \hat{\sigma}_2), would suffice to keep the density of the scaled Normals at the scaled x identical to the density of the Normals at the original x. But when I tried to implement it, that didn’t seem to work out (perhaps I need to correct for the Jacobian?).

Thank you in any case for the pointers you’ve already provided. I will update Stan and my syntax, and also think about the other suggestions!

I’m confused. There are no tractable updates for LKJ priors and Cauchy priors on scales (I’d recommend much tighter priors on scales because those Cauchys don’t do much due to their very wide tails).

Fun! I used to work on language modeling for speech recognition (and classification and spelling correction and clustering) and did a lot of theoretical linguistics on the computational syntax, semantics, and phonology side.

It’s common to standardize covariates (subtract mean then divide by standard deviation) and you can then convert back to the original scale. This is what brms and rstanarm do, for example.
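To spell out the back-conversion for your case (it’s just the usual linear change of variables): if you standardize as z = D^{-1}(x - c), with c the vector of sample means and D = diag(\hat{\sigma}) the sample standard deviations, then a mean and covariance estimated on the z scale map back as \mu_x = D\mu_z + c and \Sigma_x = D\Sigma_z D.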

If you’re just talking about multiplier and offset transforms, those do not change the posterior density being targeted, they just attempt to make the transformed density look more standard normal.

If you have a parameter theta and its unconstrained posterior is roughly normal(mu, sigma), then declaring <offset=mu, multiplier=sigma> will make the unconstrained posterior roughly standard normal.
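In code that’s just (sketching; here mu and sigma would be constants you pass in or compute in transformed data):

parameters {
  real<offset=mu, multiplier=sigma> theta;
}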

If you scale by fixed amounts, you don’t need a Jacobian, but we take into account the Jacobian with the offset and multiplier in cases where they are also parameters (if they’re constants, then the Jacobian is constant, so we can avoid computing it).

@Bob_Carpenter Ha, I didn’t know you had a background in language / comp ling! Well, perhaps the following is helpful context then: I’m trying to model how listeners adapt to changing pronunciations. We have experimental evidence that listeners can quite quickly—within a handful of observations—adapt their categorization of the same acoustic input based on recent evidence that the realization of phonetic categories has changed (e.g., after listening to a previously unfamiliar talker who produces the phonetic categories in a way that is shifted relative to a ‘typical’ talker).

Specifically, I’m updating the model described in this figure, which depicts the updating of beliefs about multivariate Normal category representations (modified from “What we do (not) know about the mechanisms underlying adaptive speech perception: A computational framework and review”):

As that figure shows, I observe i) listeners’ categorization responses during test (e.g., that they heard “t”) to various test stimuli, ii) the cues of those test stimuli (their location in the multivariate space), and iii) the cue statistics of the categories that listeners heard during exposure (e.g., from hearing various shifted instances of “t” and “d”). I assume that all listeners have unknown prior beliefs about the categories (“t” and “d”), which they update based on exposure (using the updating process described by Murphy; see the graphical model below, modified from the same paper linked above) and then apply during test to categorize the test stimuli (using Bayes’ theorem to get the posterior probability of each category given the posterior beliefs about that category’s mean and covariance matrix).

I guess what makes this problem extra-confusing to describe is that it involves priors and posteriors with regard to the learning model that I assume describes listeners’ adaptation (the model that describes how the m, S, \kappa, and \nu parameters in the above figure change with exposure) and it also involves Stan’s priors on the parameters of that model (e.g., priors on the prior m_0).

I don’t want to wear out your patience—you’ve already been very helpful! So no worries, of course, if this is all becoming too time-consuming. Below, I’m summarizing what I’ve done so far in response to your comments.

  • I will have another look at whether I can’t ‘just’ standardize the input without changing the results (perhaps I made a mistake when I first tried to center & scale and then unscale & uncenter after fitting; my rough plan is sketched after this list). It certainly would make things more stable.

  • You also commented “I’m confused. There are no tractable updates for LKJ priors and Cauchy priors on scales (I’d recommend much tighter priors on scales because those Cauchys don’t do much due to their very wide tails).” I wasn’t quite sure what you meant by the first part. I think my initial response might have been confusing: for the part of my model that is meant to capture learning/adaptation, I use the NIW because I care about conjugacy, since it gives me a closed-form way to calculate how listeners’ beliefs about the multivariate categories change with exposure. But the parameters of the NIW that describes the prior beliefs themselves have priors (here I mean the Stan priors, not listeners’ priors). E.g., I’m assuming that the m_0s are drawn from a multivariate Normal, and I describe the covariance matrix of that Normal in terms of Cauchy priors over the vector of standard deviations and LKJ priors over the correlation matrix, i.e., this part of the code:

      m_0_tau ~ cauchy(0, tau_scale);
      m_0_L_omega ~ lkj_corr_cholesky(L_omega_scale);
      m_0_param ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_multiply(m_0_tau, m_0_L_omega));

Does that clarify things? Or perhaps what I’m doing makes no sense =). In any case, I take it your comment also meant that you’re recommending something other than a Cauchy for, e.g., the prior over m_0_tau?
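For reference, here’s the rough plan I have in mind for the standardization (the sketch mentioned above; untested, and cue_scale would be a new data vector of per-feature scales, e.g., pooled SDs computed in R; the script already assumes centered cues, so only the scaling shows up here, and everything would be mapped back to the original scale after fitting):

data {
  // ... as before, plus:
  vector<lower=0>[K] cue_scale;                 // per-feature scales (e.g., pooled SDs)
}

transformed data {
  array[M,L] vector[K] x_mean_std;              // standardized category means
  array[M,L] matrix[K,K] x_ss_std;              // standardized sums of uncentered squares
  array[N_test] vector[K] x_test_std;           // standardized test locations

  for (cat in 1:M) {
    for (group in 1:L) {
      x_mean_std[cat,group] = x_mean[cat,group] ./ cue_scale;
      // x_ss is a sum of x * x' terms, so dividing each cue by cue_scale amounts to
      // pre- and post-multiplying by diag(1 ./ cue_scale):
      x_ss_std[cat,group] = quad_form_diag(x_ss[cat,group], inv(cue_scale));
    }
  }
  for (j in 1:N_test) {
    x_test_std[j] = x_test[j] ./ cue_scale;
  }
}

After fitting, the inferred m_0, S_0, m_n, and S_n would then be mapped back in R (multiplying the means by cue_scale and pre-/post-multiplying the scatter matrices by diag(cue_scale)).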

In the meantime, I’ve tried to follow the advice you kindly provided yesterday:

  • Updated Stan
  • Changed to the up-to-date array syntax
  • Declared constraints on data (lower, upper bounds) where I had missed them
  • Tried to use the Cholesky form … that was mostly straightforward, I think, and the updated code is pasted below. But one place where this seems awkward is the belief-updating step for the posterior scatter matrix of the NIW, S_n (old computation of S_n commented out; new computation of the Cholesky factor of S_n, L_S_n, pasted):
        // S_n[cat,group] = S_0[cat] +
        //   x_ss[cat,group] +
        //   kappa_0 * m_0[cat] * m_0[cat]' -
        //                 kappa_n[cat,group] * m_n[cat,group] * m_n[cat,group]';
        L_S_n[cat, group] = cholesky_decompose(
          symmetrize_from_lower_tri(
            multiply_lower_tri_self_transpose(L_S_0[cat]) +
            x_ss[cat, group] +
            kappa_0 * m_0[cat] * m_0[cat]' -
            kappa_n[cat, group] * m_n[cat, group] * m_n[cat, group]'
          )
        );
  • Given my ignorance, I had an extended chat with my pal GPT, which led to the suggestion to use rank-one updates to achieve this updating step without ever leaving the Cholesky form. Chatty that it is, GPT even suggested a function for the rank-one updates (also attached below), and I think it does the right thing for the last two components of the sum (kappa_0 * m_0[cat] * m_0[cat]’ and - kappa_n[cat, group] * m_n[cat, group] * m_n[cat, group]'). But I also need to add the sum of squares matrix x_ss. Either I’m missing something or I’d have to further decompose x_ss into a sum of vectors in order to update the Cholesky with rank-1 updates, if I’d like to avoid leaving the Cholesky form?
/*
* Fit multinomial response data using a belief-updating model to infer prior
* parameters.  A normal-inverse-Wishart prior is used, and it's assumed
* that the subject knows the true labels of all the input stimuli (i.e., it
* doesn't model any uncertainty in classification).
*
* This version has a lapse rate parameter (probability of random guessing)
*
* Input is in the form of raw data points for observations.
*/

data {
  int M;                                                   // number of categories
  int L;                                                   // number of exposure groups (e.g. subjects)
  int K;                                                   // number of features

  array[M,L] int<lower=0> N;                               // number of observations per category (M) and exposure group (L)
  array[M,L] vector[K] x_mean;                             // means for each category (M) and exposure group (L)
  array[M,L] cov_matrix[K] x_ss;                           // sum of uncentered squares matrix for each category (M) and group (L)

  int N_test;                                              // number of unique combinations of test locations & exposure groups
  array[N_test] vector[K] x_test;                          // locations (in cue space) of test trials
  array[N_test] int y_test;                                // exposure group indicator of test trials
  array[N_test,M] int z_test_counts;                       // responses for test trials (for each category M)

  int<lower=0, upper=1> lapse_rate_known;
  int<lower=0, upper=1> mu_0_known;
  int<lower=0, upper=1> Sigma_0_known;
  array[lapse_rate_known ? 1 : 0] real<lower=0, upper=1> lapse_rate_data;       // optional: user provided lapse_rate
  array[mu_0_known ? M : 0] vector[mu_0_known ? K : 0] mu_0_data;               // optional: user provided expected mu_0 (prior category means)
  array[Sigma_0_known ? M : 0] cov_matrix[Sigma_0_known ? K : 0] Sigma_0_data;  // optional: user provided expected Sigma_0 (prior category covariance matrices)

  /* For now, this script assumes that the observations (cue vectors) are centered. The prior
     mean of m_0 is set to 0. In the future, one could automatically adjust the location based
     on the input data (e.g., by placing the mean of the Normal prior in the center of the
     exposure data).
  */
  vector<lower=0>[mu_0_known ? 0 : K] tau_scale;           // separate taus for each of the K features to capture that features can be on separate scales
  real<lower=0> L_omega_eta;                             // scale of LKJ prior for correlation of variance of m_0

}

transformed data {
  /* Scale for the prior of kappa/nu_0. In order to deal with input that does not contain observations
     (in which case n_each == 0), we set the minimum value for SD to 10.
  */
  real<lower=0> sigma_kappanu = max(max(to_array_1d(N)) * 4, 10);
}

parameters {
  // these are all shared across groups (same prior beliefs):
  real<lower=K> kappa_0;                                   // prior pseudocount for category mu
  real<lower=K+1> nu_0;                                    // prior pseudocount for category Sigma

  array[lapse_rate_known ? 0 : 1] real<lower=0, upper=1> lapse_rate_param;

  array[mu_0_known ? 0 : M] vector[K] m_0_param;           // prior mean of means
  vector<lower=0>[mu_0_known ? 0 : K] m_0_tau;             // prior variances of m_0
  cholesky_factor_corr[mu_0_known ? 0 : K] m_0_L_omega;    // prior correlations of variances of m_0 (in cholesky form)

  array[Sigma_0_known ? 0 : M] vector<lower=0>[K] tau_0_param;          // standard deviations of prior scatter matrix S_0
  array[Sigma_0_known ? 0 : M] cholesky_factor_corr[K] L_omega_0_param; // correlation matrix of prior scatter matrix S_0 (in cholesky form)
}

transformed parameters {
  real lapse_rate = lapse_rate_known ? lapse_rate_data[1] : lapse_rate_param[1];  // lapse rate
  array[M] vector[K] m_0 = mu_0_known ? mu_0_data : m_0_param;                    // prior mean of means m_0
  array[M] cholesky_factor_cov[K] L_S_0;                                          // Cholesky factor of prior scatter matrix S_0

  // updated beliefs depend on input and group
  array[M,L] real<lower=K> kappa_n;             // updated mean pseudocount
  array[M,L] real<lower=K> nu_n;                // updated sd pseudocount
  array[M,L] vector[K] m_n;                     // updated expected mean
  array[M,L] cholesky_factor_cov[K] L_S_n;      // Cholesky factor of updated expected scatter matrix
  array[M,L] cholesky_factor_cov[K] L_t_scale;  // Cholesky factor of scale matrix of predictive t distribution

  array[N_test] simplex[M] p_test_conj;
  array[N_test] vector[M] log_p_test_conj;

  // update NIW parameters according to the conjugate updating rules taken from
  // Murphy (2007, p. 136)
  for (cat in 1:M) {
    /* Get S_0 from expected Sigma given nu_0

       Using Cholesky factor representation of S_0 in order to avoid issues with numerical
       instability that would otherwise arise for S_0s with large values (leading to exceptions
       caused by non-symmetrical S_0). The use of Cholesky factors should also make the
       computation faster.

       Following advice by Bob Carpenter:
       https://discourse.mc-stan.org/t/exception-s-n-cov-matrix-not-symmetric-for-large-value-elements/39184/2
    */
    L_S_0[cat] = Sigma_0_known ? cholesky_decompose(Sigma_0_data[cat] * (nu_0 - K - 1)) :
                            diag_pre_multiply(tau_0_param[cat], L_omega_0_param[cat]);
    for (group in 1:L) {
      if (N[cat,group] > 0 ) {
        kappa_n[cat,group] = kappa_0 + N[cat,group];
        nu_n[cat,group] = nu_0 + N[cat,group];
        m_n[cat,group] = (kappa_0 * m_0[cat] + N[cat,group] * x_mean[cat,group]) /
          kappa_n[cat,group];
        // S_n[cat,group] = S_0[cat] +
        //   x_ss[cat,group] +
        //   kappa_0 * m_0[cat] * m_0[cat]' -
        //                 kappa_n[cat,group] * m_n[cat,group] * m_n[cat,group]';
        L_S_n[cat, group] = cholesky_decompose(
          symmetrize_from_lower_tri(
            multiply_lower_tri_self_transpose(L_S_0[cat]) +
            x_ss[cat, group] +
            kappa_0 * m_0[cat] * m_0[cat]' -
            kappa_n[cat, group] * m_n[cat, group] * m_n[cat, group]'
          )
        );

      } else {
        kappa_n[cat,group] = kappa_0;
        nu_n[cat,group] = nu_0;
        m_n[cat,group] = m_0[cat];
        L_S_n[cat,group] = L_S_0[cat];
      }

      // Instead of computing t_scale as in Murphy, we calculate the Cholesky factor of that
      // scale, and then use the Cholesky-based form of the multivariate T-density below:
      L_t_scale[cat, group] = L_S_n[cat, group] * sqrt((kappa_n[cat, group] + 1) /
                               (kappa_n[cat, group] * (nu_n[cat, group] - K + 1)));
    }
  }

  // compute posterior probabilities of all categories for each of the test stimuli
  for (j in 1:N_test) {
    int group;
    group = y_test[j];
    /* Calculate the un-normalized log posterior probability for each category. Under the
       assumption of uniform prior probabilities for each category, these are identical to the
       log likelihoods. If we were ever to change this assumption, we'd have to add
       the log prior probabilities of the categories here. */
    for (cat in 1:M) {
      // Use Cholesky form of multivariate T-density
      log_p_test_conj[j,cat] =
          multi_student_t_cholesky_lpdf(
              x_test[j] |
              nu_n[cat,group] - K + 1,
              m_n[cat,group],
              L_t_scale[cat,group]);
    }
    // normalize to get actual posterior probs in simplex
    p_test_conj[j] = exp(log_p_test_conj[j] - log_sum_exp(log_p_test_conj[j]));
  }
}

model {
  vector[M] lapsing_probs;

  /* Assuming a uniform bias during lapses, so that the probability of each category response
     under a lapse (i.e., before taking the stimulus into account) is 1/M */
  lapsing_probs = rep_vector(lapse_rate / M, M);

  kappa_0 ~ normal(0, sigma_kappanu);
  nu_0 ~ normal(0, sigma_kappanu);

  if (!mu_0_known) {
      m_0_tau ~ cauchy(0, tau_scale);
      m_0_L_omega ~ lkj_corr_cholesky(L_omega_eta);
      m_0_param ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_multiply(m_0_tau, m_0_L_omega));
  }

  // Specifying prior for components of S_0:
  if (!Sigma_0_known) {
    for (cat in 1:M) {
      tau_0_param[cat] ~ cauchy(0, tau_scale);
      L_omega_0_param[cat] ~ lkj_corr_cholesky(L_omega_eta);
    }
  }

  for (i in 1:N_test) {
    z_test_counts[i] ~ multinomial(p_test_conj[i] * (1 - lapse_rate) + lapsing_probs);
  }

}

generated quantities {
  /* Compute and store pointwise log-likelihood, in order to allow computation of LOOIC.
     Doing so in generated quantities block, following help(rstan::loo). Note that currently,
     each unique combination of test location and exposure group is treated as an observation
     (rather than each individual response).
  */
  vector[M] lapsing_probs = rep_vector(lapse_rate / M, M);
  real log_lik_sum = 0;
  vector[N_test] log_lik;

  for (n in 1:N_test) {
    log_lik[n] = multinomial_lpmf(z_test_counts[n] | p_test_conj[n] * (1-lapse_rate) + lapsing_probs);
    log_lik_sum += log_lik[n];
  }

  // Get S_0 and S_n from their Cholesky factors for post-processing in R
  array[M] cov_matrix[K] S_0;         // prior scatter matrix S_0
  array[M,L] cov_matrix[K] S_n;              // Updated expected scatter matrix
  for (cat in 1:M) {
    S_0[cat] = multiply_lower_tri_self_transpose(L_S_0[cat]);
    for (group in 1:L) {
      S_n[cat,group] = multiply_lower_tri_self_transpose(L_S_n[cat,group]);
    }
  }
}

ChatGPT proposed function for cholesky_rank_update:

functions {
  /* Rank-one update of a Cholesky factor: returns the lower-triangular factor of
     L * L' + alpha * v * v'. For alpha < 0 (a downdate), the result must still be
     positive definite, or the sqrt below will fail. */
  matrix cholesky_rank_update(matrix L, vector v, real alpha) {
    int K = rows(L);
    matrix[K, K] L_copy = L;  // work on a copy so the input matrix is left unchanged
    vector[K] w = v;          // working copy of the update vector

    for (k in 1:K) {
      real L_kk = L_copy[k, k];
      real r = sqrt(L_kk^2 + alpha * w[k]^2);  // updated diagonal element
      real c = r / L_kk;
      real s = w[k] / L_kk;
      L_copy[k, k] = r;
      // Rotate the remainder of column k and carry w forward for the next columns
      for (j in (k + 1):K) {
        L_copy[j, k] = (L_copy[j, k] + alpha * s * w[j]) / c;
        w[j] = c * w[j] - s * L_copy[j, k];
      }
    }

    return L_copy;  // updated Cholesky factor
  }
}

I wish that had worked better when I was in grad school in Edinburgh. Scots English could sound more like Norwegian than like American English when they got going.

I’m one of the few people who does not find DAG notation useful. Maybe it’s because everyone’s is different. I’m not seeing plates in that figure 19, and there’s some distinction between diagonal boxes, circles, filled-in circles, and simple points, boxes, and triangles. I just never have the patience to try to figure out what all the notation means. Similarly, in figure S5 there are boxes, now with odd universal quantifiers, and I can’t parse what the solid squares mean—they seem to be pointing off to the paper somehow? This also has circles, filled circles, bolded circles, bolded circles within tilted boxes, etc.

I’m not sure what a “Bayesian belief-updating model” is. Do you just mean Bayes as usual, so that when you get new data you can update your estimates?

It’s fine theoretically, but those Cauchy priors on scales can be super painful in practice because of the long tails. If you don’t expect 10,000 to be a reasonable value, don’t use a Cauchy. You can also move rep_vector(0, K) to transformed data to not have to allocate each time it’s evaluated.
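For instance (just a sketch of what I mean, not a specific recommendation for your scales; zeros_K is just a name for the moved constant):

transformed data {
  vector[K] zeros_K = rep_vector(0, K);  // allocate once instead of on every log density evaluation
}
model {
  // a half-normal keeps the scales in a plausible range, unlike the heavy-tailed Cauchy
  m_0_tau ~ normal(0, tau_scale);
  m_0_L_omega ~ lkj_corr_cholesky(L_omega_eta);
  m_0_param ~ multi_normal_cholesky(zeros_K, diag_pre_multiply(m_0_tau, m_0_L_omega));
}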

I don’t have a lot of time to go through the details, but why are you trying to update Cholesky factors within the model code? The whole reason we have things like Stan is to avoid having to do things like that manually. Is there some reason you need to do that computation within Stan rather than letting Stan do it on the outside?

Oops, I seem to have forgotten to post my reply!

Ok, point taken about DAGs (fwiw, white circles = latent, gray circles = observed, diamond = deterministic operation; angle of lines does not matter) and about the struggles of adapting to Scottish English ;).

By Bayesian belief-updating model, I just meant Bayes as usual (but as model of the listener’s learning).

As to why I am doing the updating of listeners’ beliefs within the Stan model: it’s that update that links the observed variables during test—the test inputs (x, summarized in x_test) and listeners’ responses to those inputs (r, summarized in z_test_counts)—to the latent variables of interest—listeners’ prior beliefs, as characterized by the latent NIW parameters m_0[cat], S_0[cat], kappa_0, and nu_0—via the belief-updating that is hypothesized to occur during exposure (given the category means x_mean[cat,group], sums-of-squares x_ss[cat,group], and counts N[cat,group] observed for category cat in exposure condition group).

In short,

  • listeners’ observed responses during test depend on →

  • listeners’ updated posterior beliefs after exposure in a given exposure group (m_n[cat, group], S_n[cat, group], kappa_n[cat,group], and nu_n[cat, group]), which depend on →

  • listeners’ unknown prior beliefs (m_0[cat], S_0[cat], kappa_0, and nu_0) and the observed statistics in the listeners’ exposure condition (x_mean[cat,group], x_ss[cat,group], and N[cat,group]).

So unless I’m mistaken, I have no choice but to update listeners’ beliefs within the Stan code?

Anyway, the changes you proposed on day 1 already improved stability, and got rid of most exceptions. Next, I’ll turn to scaling inputs, as you suggested.

I was more wondering if you could do it using Stan’s sampling. This isn’t always possible if you need a reductive estimate like a mean rather than a sample, which will, by construction, propagate uncertainty in whatever uses it.

FWIW, I love Scottish English. Especially on the East Coast—the English spoken between St. Andrews and Aberdeen is my absolute fave dialect of English.