Issues when expressing parameters as factors

Hi all,

I’ve recently run into an issue with the way model parameters are expressed; maybe someone can shed some light on it.

Basically, I have data nested in groups and two instances of measurement (T1, T2), so the data form a 3-dimensional structure: N rows (groups), M columns (measurements per group) and 2 slices (time points). I want to fit a hierarchical exponential regression model y = a * e^{bx}, but include a change effect of time. No problem so far; the following model converged without any flaws:

model{
  for (n in 1:N) {   //groups
    for (m in 1:M) {   //measurements per group
      for (t in 1:2) {   //time points
        
        target += normal_lpdf(y[n, m, t] |  (a[n] + (delta_a[n] * (t-1))) * exp((b[n] + (delta_b[n] * (t-1))) * x[n, m, t]), sigma);
        
      }
    }
  }
}

Here, the “delta” effects are the time effects. The priors are fairly standard: half-Cauchy for the scales and normal for the hypermeans. The group-level parameters are sampled using a covariance matrix to account for correlation between them.

Now here’s the issue: if I reformulate the model slightly, so that the time effects are expressed as factor changes, the model no longer converges, regardless of whether I adapt the respective hyperpriors or not:

target += normal_lpdf(y[n, m, t] |  (a[n] * (1 + delta_a[n] * (t-1))) * exp((b[n] * (1 + delta_b[n] * (t-1))) * x[n, m, t]), sigma);
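
To make the difference explicit (just writing out what the two likelihoods imply, with t-1 as the dummy): at t = 1 both formulations reduce to y = a[n] * e^{b[n] * x}. At t = 2, the original (additive) version has mean

(a[n] + delta_a[n]) * e^{(b[n] + delta_b[n]) * x}

whereas the factor version has mean

a[n] * (1 + delta_a[n]) * e^{b[n] * (1 + delta_b[n]) * x}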

Any idea what might cause this issue?


Hi,

I think we need to see the whole model here. But a possible problem is that, after the change, the role of b[n] becomes partially interchangeable with delta_b[n] (both represent a multiplicative effect of x), i.e. increasing b[n] and decreasing delta_b[n] will (if I read the model correctly) produce exactly the same likelihood, making the model non-identified - which can then cause convergence problems.

Does that make sense?


Hi Martin,

Thank you for your help! Here is the full model with the likelihood expression that works perfectly:

data {
  int<lower=1> N;
  int<lower=1> M;
  real<lower=0> x[N, M, 2];
  real<lower=0> y[N, M, 2];
}


parameters {
  // group-level parameters
  real<lower=0> sigma;
  real alpha;
  real beta;
  real delta_alpha;
  real delta_beta;
  
  // cholesky factorization (2 matrices)
  matrix[2, N] z_u;
  cholesky_factor_corr[2] L_u;
  vector<lower=0>[2] tau_u;
  
  matrix[2, N] z_v;
  cholesky_factor_corr[2] L_v;
  vector<lower=0>[2] tau_v;
}


transformed parameters {
  // matrix of correlated parameters (Cholesky factorization)
  matrix[N, 2] u = (diag_pre_multiply(tau_u, L_u) * z_u)';
  matrix[N, 2] v = (diag_pre_multiply(tau_v, L_v) * z_v)';
  
  // subject level parameters; 
  vector[N] a = alpha + u[,1];
  vector[N] b = beta + u[,2];
  vector[N] delta_a = delta_alpha + v[,1];
  vector[N] delta_b = delta_beta + v[,2];
} 


model {
  // group-level prior
  target += cauchy_lpdf(sigma | 0, 10);

  target += normal_lpdf(alpha | 100, 1); // informative prior
  target += normal_lpdf(beta | -0.05, 0.005); // informative prior
  target += normal_lpdf(delta_alpha | 0, 5);
  target += normal_lpdf(delta_beta | 0, 5);
  
  // generate uncorrelated vectors for Cholesky factorization
  target += std_normal_lpdf(to_vector(z_u));
  target += std_normal_lpdf(to_vector(z_v));
  
  // prior for Cholesky factorization
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += lkj_corr_cholesky_lpdf(L_v | 2);
  
  // priors for scaling of correlated parameters
  target += cauchy_lpdf(tau_u[1] | 0, 10); 
  target += cauchy_lpdf(tau_u[2] | 0, 10); 
  target += cauchy_lpdf(tau_v[1] | 0, 10); 
  target += cauchy_lpdf(tau_v[2] | 0, 10); 
  
  
  // likelihood
  for (n in 1:N) {
    
    for (m in 1:M) {
      
      for (t in 1:2) {
        
          target += normal_lpdf(y[n, m, t] |  (a[n] + (delta_a[n] * (t-1))) * exp((b[n] + (delta_b[n] * (t-1))) * x[n, m, t]), sigma);
        
      }
    }
  }
  
}

The idea was to use “t” in the model as a binary dummy to address the two instances of measurement, where the delta effects capture the change in the parameters between the two instances.


Concerning your suggested reason: wouldn’t the interdependence between b and delta_b also affect the sampling process when delta_b is expressed not as a factor of b but as an additive effect, as in the model above? I’m asking because that version converges well. The issues only seem to arise when I change the likelihood as described in my initial post (expressing the delta effects as factors of the parameters a and b).

Thank you again for your help,

Beni


So I was wrong - you are correct that both models actually allow for a constant and a multiplicative factor, and the multiplicative model is still, at least in theory, identified; I misread the code. What could happen, however, is that the change introduces a correlation between the parameters that causes sampling problems. I would look at the pairs plot of the underlying representations of the b and delta_b variables (i.e. u[,2] and v[,2], if I read the code correctly) for both models; if you see some weird shapes after the change, that could suggest where the problem lies. (The same applies to a[n] and delta_a[n].)
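
To sketch where such a correlation could come from (assuming I read the reformulated likelihood correctly): at t = 2 the exponent is (b[n] + delta_b[n]) * x in the additive version, but b[n] * (1 + delta_b[n]) * x in the factor version. In the factor version the data therefore constrain the product b[n] * (1 + delta_b[n]); if that product is well determined, b[n] and delta_b[n] have to trade off along a curved ridge, and whenever b[n] gets close to zero, delta_b[n] is barely constrained at all - exactly the kind of geometry that tends to show up as weird shapes in a pairs plot.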

I would also inspect the pairs plot for alpha and some elements of u[,1] (and similarly for b/delta_a/delta_b vs. u[,2]/v[,1]/v[,2], respectively), as this might also be a source of problems (which could have been tolerable in the first version of the model, but in combination with the additional issues after the change could become intolerable).

Also, the cauchy(0, 10) priors for the tau_u/tau_v and sigma are extremely wide - are you sure they represent your domain knowledge well? (Most likely this is not the cause of the issues you have, though.)
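
Just as a sketch of what I mean (the scale of 1 below is only a placeholder - the actual values would have to come from your domain knowledge), half-normal priors via the existing <lower=0> constraints are already much tighter:

  // half-normal priors (the <lower=0> declarations make these half-normal);
  // considerably tighter than cauchy(0, 10)
  target += normal_lpdf(sigma | 0, 1);
  target += normal_lpdf(tau_u | 0, 1);  // vectorized over both elements
  target += normal_lpdf(tau_v | 0, 1);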

Other than that I don’t see any obvious problem in the full model.


The issue is really tricky.

I ran the model again and checked all the pairs you suggested; the only one that looks anything like what one would expect is b vs. u[,2] (which shows a positive linear trend).

Then I ran the model a second time, with new random initial values and there was only 1 divergent transition out of 8000 iterations. Rhat fine, ESS fine. Posterior shapes fine.

I ran it a third time with a selected seed and we are back to almost 2000 divergent transitions, once again with weird posteriors (some just with heavy tails, others bimodal, with the divergent transitions concentrated in one of the two bulks).

It appears as if the divergences are somewhat influenced by initial values that eventually cause single chains to go way off. Does this indicate a prior issue after all?


In this situation my best guess would be multimodality. If you have a well-behaved, high-posterior-density mode and some other mode with lower density and weird geometry, you can easily get this type of behavior: sometimes all the chains init in the neighbourhood of the well-behaved mode and you see no problems; sometimes a chain or two gets stuck in (one of) the alternative mode(s), encounters difficulties, and you get a lot of divergences from that chain only. It is still possible that the problem is something different, but I would definitely investigate in this direction. If that turns out to be correct, the next step is to try to understand why you have multiple modes - i.e. why there is a local maximum of posterior density that some of the chains get stuck in.

And yes, this can unfortunately be quite hard :-/