Dealing with dependencies between parameters?

Hello,

I am making progress with my mixture model: it now fits the data pretty nicely:

(the histogram shows the data, while the lines show samples from the posterior)

I still have some warnings from the sampler though:

Warning: 83 of 8000 (1.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

Warning: 2008 of 8000 (25.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.

Warning: 4 of 4 chains had an E-BFMI less than 0.3.
See https://mc-stan.org/misc/warnings for details.

My current guess is that this is due to the dependency between two of my parameters:

  • theta - the weight of each component (i.e., I am running a mixture model)
  • kappa - von Mises concentration.

As can be seen in the plot below, if kappa is very small, then theta is relatively unconstrained, and vice versa.

Are there any “standard tricks” I should know about for dealing with such situations?

For now, I am experimenting with different priors to see if that will help.

As always, any hints, tips and advice are very welcome.

Alasdair


Model below:

data {
  int<lower = 0> N;
  int <lower = 1> K; // number of experimental conditions  
  array[N] int <lower = 1, upper = K> X; //  which condition are we in
  array[N] real phi;
  int<lower = 0> L; // number of participants
  array[N] int<lower = 1, upper = L> Z;
  real prior_theta;
  real prior_sigma_u;
}

transformed data {

  array[8] real mu;
  
  mu[1] = 0 * pi()/4;
  mu[2] = 1 * pi()/4;
  mu[3] = 2 * pi()/4;
  mu[4] = 3 * pi()/4;
  mu[5] = 4 * pi()/4;
  mu[6] = 5 * pi()/4;
  mu[7] = 6 * pi()/4;
  mu[8] = 7 * pi()/4;
}

parameters {
  // mixing proportions
  // 8 von Mises + 1 uniform 

  // fixed effects
  array[K] simplex[9] theta; 
  array[K] vector[8] log_kappa;

  // random effects
  array[K, L] vector[9] u; 
  array[K, 9] real<lower = 0> sigma_u;
}

transformed parameters {
  
  // create a vector for each participant's theta values
  array[K, L] vector[9] log_theta_u;
  // transform kappa
  array[K] vector[8] kappa = exp(log_kappa);

  for (k in 1:K) {

    for (l in 1:L) {

      log_theta_u[k, l] = log_softmax(log(theta[k]) + u[k, l]);

    }
  }  
}

model {

  //////////////////////////////////////////////
  // setting priors/////////////////////////////
  //////////////////////////////////////////////

  // priors for theta

  for (kk in 1:K) {

    theta[kk] ~ dirichlet(rep_vector(prior_theta, 9));
    log_kappa[kk] ~ normal(4, 0.5);

    for (psi in 1:9) {

      sigma_u[kk, psi] ~ exponential(prior_sigma_u);

      for (ll in 1:L) {

        u[kk, ll][psi] ~ normal(0, sigma_u[kk, psi]);

      }
    }
  }

  //////////////////////////////////////////////
  // fitting model /////////////////////////////
  //////////////////////////////////////////////
  
  int kk; // which condition are we in
  int ll; // which person is it
 
  for (n in 1:N) {

    kk = X[n];
    ll = Z[n];

    vector[9] lps;
    //print(log_theta_u[kk, ll]);

    for (psi in 1:8) {
      //lps[psi] = log(theta[kk, psi]) + von_mises_lpdf(phi[n] | mu[psi], kappa);
      lps[psi] = log_theta_u[kk, ll][psi] + von_mises_lpdf(phi[n] | mu[psi], kappa[kk, psi]);
    }

    lps[9] = log_theta_u[kk, ll][9] + uniform_lpdf(phi[n] | -pi(), pi());

    target += log_sum_exp(lps);
  }
  
}

The likely culprit is the centered parameterization of u:

It’s generally better to define a u_std ~ normal(0, 1) and then define u as a transformed parameter that multiplies by the relevant scale.
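
Roughly, a sketch of the non-centred version, adapting the names from your model (and assuming sigma_u is re-declared as an array of positive-constrained vectors so the elementwise multiply works):

parameters {
  array[K, L] vector[9] u_std;              // standard normal draws
  array[K] vector<lower = 0>[9] sigma_u;    // per-condition scales
}

transformed parameters {
  array[K, L] vector[9] u;
  for (k in 1:K) {
    for (l in 1:L) {
      u[k, l] = sigma_u[k] .* u_std[k, l];  // scale the standard normal draws
    }
  }
}

model {
  for (k in 1:K) {
    sigma_u[k] ~ exponential(prior_sigma_u);
    for (l in 1:L) {
      u_std[k, l] ~ std_normal();           // implies u[k, l][i] ~ normal(0, sigma_u[k][i])
    }
  }
}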

You can clean up a lot of the code with vectorization to improve speed.

theta ~ dirichlet(rep_vector(prior_theta, 9));
log_kappa ~ normal(4, 0.5);  

Then you can make it faster by moving the rep_vector to define a variable in transformed data that only gets computed once.
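
For example, something along these lines (a sketch, reusing prior_theta from your data block):

transformed data {
  vector[9] theta_prior_alpha = rep_vector(prior_theta, 9);  // computed once, not on every leapfrog step
}

model {
  theta ~ dirichlet(theta_prior_alpha);
}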

You could define kappa with a lower bound of 0 and give it a lognormal prior to avoid having to manually transform.
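
Something like this, keeping the same scale as your normal(4, 0.5) prior on the log (a sketch):

parameters {
  array[K] vector<lower = 0>[8] kappa;
}

model {
  for (k in 1:K) {
    kappa[k] ~ lognormal(4, 0.5);  // equivalent to normal(4, 0.5) on log(kappa), with the Jacobian handled for you
  }
}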

It’s not easy to find, but you can code that evenly spaced array as a vector,

vector[8] mu = linspaced_vector(8, 0, 7) * pi() / 4;

or

vector[8] mu  = [0, 1, 2, 3, 4, 5, 6, 7]' * pi() / 4;

Thanks a lot!

That has helped a lot - I continually undervalue the power of centred vs. non-centred parameterizations (and find the terminology confusing, as I think of centring as being about the mean, while scaling is about the standard deviation). Anyway, my bad for forgetting about this!

Most of the warnings have evaporated and I now have just two transitions ending with a divergence, which I suspect will be easy enough to tackle. I have plotted the posterior below.

Overall, things seem to work pretty well but it would be nice to somehow build in the knowledge that theta → 0 implies that kappa → 0. I suppose there might be a way to do this with a multivariate prior?

Or define kappa as b x theta?

I don’t think it’s a big deal for what I am trying to do with this model, but it would be quite satisfying to learn the best way to implement this.

Anyway, thanks a huge amount.

Glad it helped.

If you really have \kappa = b \times \theta, then by all means build that in. If it’s a soft relationship and you have a reasonable prior on \theta, you can define a bivariate prior conditionally as something like p(\theta, \kappa) = p(\theta) \cdot p(\kappa \mid \theta) where p(\kappa \mid \theta) is something like a lognormal with location \log \theta (for median \theta) and scale \sigma \ll 1. That’ll keep the scales in line because the error on lognormal is multiplicative.
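
A rough sketch of that conditional prior for a single component, with placeholder names (theta_w for one component's weight, and a beta prior standing in for whatever marginal prior you already use), not the full mixture:

parameters {
  real<lower = 0, upper = 1> theta_w;       // weight of one von Mises component (illustrative scalar version)
  real<lower = 0> kappa;                    // its concentration
}

model {
  theta_w ~ beta(2, 2);                     // stand-in for the marginal prior p(theta)
  kappa ~ lognormal(log(theta_w), 0.1);     // p(kappa | theta): median theta_w, small multiplicative error
}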