Do I need Jacobian adjustment?

Hi everyone,

I’m developing this model, and I’m wondering if I need to include a Jacobian adjustment. The model currently runs fine and appears to give meaningful results. My current code is included below (I’m sorry it’s a bit long and messy). As you can see, vhat0 is a matrix defined in the transformed parameters block as a combination of data and parameters. In the model block, I put priors on this matrix (which makes a lot of sense in the context of the data). I assume this means I need to include a Jacobian adjustment?

(Any advice on how to do so in this case will also be greatly appreciated.)

Thanks for any help!

data {
  int<lower = 1> N;                       // dim1 of data
  int<lower = 1> J;                       // dim2 of data
  int<lower = 1> B;                       // bounds of data
  int<lower = -B, upper = B> Y[N, J];     // main data
  vector<lower = -B, upper = B>[N] V;     // more data
  int<lower = 0, upper = 1> M[N, J];      // indicator of missing values
}

parameters {
  matrix[N, 2] alpha_raw;                 // constant, raw
  matrix<lower = 0>[N, 2] beta_raw;       // factor loading, raw
  real theta[J];                          // latent factor
  vector<lower = 0>[2] gamma;             // sd of alpha
  vector<lower = 0>[2] sd_beta;           // sd of beta
  real<lower = 0> sd_vhat;                // sd of prior on vhat0
  real mean_vhat;                         // mean of prior on vhat0
  real<lower = 1> nu;                     // hyperparameter
  real<lower = 0> tau;                    // hyperparameter
  vector<lower = 0>[N] eta;               // i's average variance x J^2
  simplex[J] rho;                         // j's share of variance
  vector<lower = 0, upper = 1>[N] delta;  // mixing proportion
  real<lower = .65, upper = 1> phi;       // mean of prior on delta
  real<lower = 2> lambda;                 // concentration of prior on delta
}

transformed parameters {
  real<lower=0> alpha_delta = lambda * phi;
  real<lower=0> beta_delta = lambda * (1 - phi);
  matrix[N, 2] vhat0;                     // rescaled vectors
  matrix[N, 2] alpha;                     // constant
  matrix[N, 2] beta;                      // factor loading
  matrix<lower = 0>[N, J] sigma;          // sd's of errors
  matrix[N, J] log_lik;                   // pointwise log-likelihood
  alpha[, 1] = alpha_raw[, 1] * gamma[1]; 
  alpha[, 2] = alpha_raw[, 2] * gamma[2];       
  beta[, 1] = (beta_raw[, 1] + 1 - mean(beta_raw[, 1])) * sd_beta[1] - sd_beta[1] + 1; 
  beta[, 2] = (beta_raw[, 2] + 1 - mean(beta_raw[, 2])) * sd_beta[2] - sd_beta[2] + 1; 
  sigma = sqrt(eta) * to_row_vector(rho);  

  for (i in 1:N) {
    for (j in 1:J) {
      if (M[i, j] == 0) {
        log_lik[i, j] = log_mix( delta[i], 
          normal_lpdf(Y[i, j] | + alpha[i, 1] + beta[i, 1] * theta[j], sigma[i, j]), 
          normal_lpdf(Y[i, j] | - alpha[i, 2] - beta[i, 2] * theta[j], sigma[i, j]) );
      }
      else {log_lik[i, j] = 0;}
    }
  }  
  vhat0[, 1] = ((V - alpha[, 1]) ./ beta[, 1]);   // DEFINING vhat0
  vhat0[, 2] = ((V + alpha[, 2]) ./ - beta[, 2]); // DEFINING vhat0
}

model {
  alpha_raw[, 1] ~ normal(0, 1); 
  alpha_raw[, 2] ~ normal(0, 1); 
  beta_raw[, 1] ~ normal(1, 1); 
  beta_raw[, 2] ~ normal(1, 1); 
  gamma ~ cauchy(0, B);  
  sd_beta ~ cauchy(0, .5);
  sd_vhat ~ cauchy(0, B / 2.0);
  mean_vhat ~ normal(0, B / 5.0); 
  theta ~ normal(0, 10);  
  nu ~ cauchy(0, 50);
  tau ~ cauchy(0, J * B);
  eta ~ scaled_inv_chi_square(nu, tau);
  rho ~ dirichlet(rep_vector(5, J)); 

  vhat0[, 1] ~ normal(mean_vhat, sd_vhat); // CALLS FOR JACOBIAN ADJUSTMENT?
  vhat0[, 2] ~ normal(mean_vhat, sd_vhat); 

  delta ~ beta(alpha_delta, beta_delta);
  lambda ~ pareto(2, 1.5);

  target += sum(log_lik);
}

generated quantities {
  vector[N] vhat;
  vhat = (delta .* vhat0[, 1]) + ((1 - delta) .* vhat0[, 2]);
}


Hey! I’m not going to pretend to know what’s going on in your model… Haha. But, as I see it, the transformation you do on vhat0 is linear, right? I think linear transformations don’t need a Jacobian correction – the Jacobian correction accounts for unequal squishing and stretching in different locations of the parameter space and is thus needed for non-linear transformations. Linear transformations are kind of defined by equal stretching/squishing of space. Sorry for the loose terminology, I’m not a particularly math-y guy, haha – so you should also verify this. In short: I think you don’t need a Jacobian correction here.
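
Roughly what I have in mind is this kind of toy case (made-up quantities a and b passed in as data, nothing from your model): when the coefficients are fixed data, the Jacobian term for a prior on the linear transform is a constant, so dropping it changes nothing.

data {
  real<lower = 0> a;          // fixed, nonzero slope (hypothetical)
  real b;                     // fixed intercept (hypothetical)
}
parameters {
  real x;
}
transformed parameters {
  real y = a * x + b;         // linear transform of x
}
model {
  y ~ normal(0, 1);           // prior stated on the transform
  // Jacobian term would be log|dy/dx| = log(a), a constant:
  // target += log(a);        // harmless to drop
}

Whether your vhat0 actually falls into this category (your alpha and beta are parameters, not data) is exactly the bit you should double-check.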

I hope this gives you something to start off from.


Let’s say everything subscripted with a 1 denotes the first column of whatever quantity it indexes, e.g. the first column of \hat{v}_{0} I’ll call \hat{v}_{0, 1}. Your model is

\hat{v}_{0, 1} = (V - \alpha_{1}) \mathop{/} \beta_{1} \\
\alpha_{1} = \alpha_{\text{raw}, 1} * \gamma_{1} \quad \beta_{1} = (\beta_{\text{raw}, 1} + 1 - \mu_{\beta_{\text{raw}, 1}}) * \sigma_{\beta, 1} - \sigma_{\beta, 1} + 1 \\
\alpha_{\text{raw}, 1} \sim \text{N}(0, 1^2) \quad \beta_{\text{raw}, 1} \sim \text{N}(1, 1^2) \quad \sigma_{\beta, 1} \sim \text{Cauchy}_{+}(0, 0.5) \quad \gamma_{1} \sim \text{Cauchy}(0, B)

for some fixed B.

Initial Questions:

  1. I don’t understand the transformation you use to define \beta_{1}. Can you elaborate on why you are doing this? It almost looks like some kind of non-centred parameterisation; subtracting \sigma_{\beta, 1} in particular is strange.
  2. Those Cauchy priors are heavy; \gamma_{1} is going to require a large amount of data to identify, and will be heavily influenced by the prior. Set up a prior predictive check pipeline to investigate the effect of those Cauchy priors in particular (a rough sketch of what I mean is just below this list).
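
Something like this would do as a starting point: pull out just the pieces the Cauchy prior touches, drop the likelihood entirely, and look at the implied draws (the scalar names here are just the column-1 pieces of your model on their own).

data {
  int<lower = 1> B;                       // same bound as in the real model
}
parameters {
  real alpha_raw_1;
  real<lower = 0> gamma_1;
}
model {
  // priors only, no likelihood
  alpha_raw_1 ~ normal(0, 1);
  gamma_1 ~ cauchy(0, B);                 // the heavy prior in question
}
generated quantities {
  real alpha_1 = alpha_raw_1 * gamma_1;   // implied prior draw for alpha
}

Run it with nothing but B as data, then compare the spread of alpha_1 against the values you consider plausible (your Y live in [-B, B], so implied alpha far outside that range would be a warning sign).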

Your specific question:

  1. You have priors for \alpha_{\text{raw}, 1}, \beta_{\text{raw}, 1}, \sigma_{\beta, 1} and \gamma_{1}, which induce a prior on \hat{v}_{0, 1}. Specifying an additional prior on \hat{v}_{0, 1}, as you do in line 70 of your Stan code, is somewhat incoherent. It could theoretically be made coherent via pooling, which @maxbiostat has talked about on these forums before here and here, but as written you are effectively “specifying two priors” for \hat{v}_{0, 1}.
  2. If you were to coherently specify a “joint prior” \text{p}(\alpha_{\text{raw}, 1}, \beta_{\text{raw}, 1}, \sigma_{\beta, 1}, \gamma_{1}, \hat{v}_{0, 1}), i.e. specify priors for its marginals in a way that still yields a normalised joint density, then I think a Jacobian correction would be necessary, since the relationship between \hat{v}_{0, 1} and \alpha_{\text{raw}, 1}, \beta_{\text{raw}, 1}, \sigma_{\beta, 1}, \gamma_{1} is non-linear. A toy example of what such a correction looks like in the one-to-one case is sketched below this list, but I’m still waiting for the caffeine from my coffee to kick in, so I could be wrong there.
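
For reference, the correction in the simple one-to-one case looks like this (a toy example with a single parameter, nothing to do with your model directly):

parameters {
  real log_sigma;
}
transformed parameters {
  real<lower = 0> sigma = exp(log_sigma);   // non-linear, one-to-one
}
model {
  sigma ~ cauchy(0, 1);      // prior stated on the transform
  target += log_sigma;       // Jacobian: log|d sigma / d log_sigma| = log_sigma
}

Whether and how this extends to your \hat{v}_{0, 1}, which is built from several parameters at once, is the part I would want to think through properly.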

Thank you for answering!

  1. You are right, it is a non-centred parameterization with some complicating features. The model was originally identified by giving the \beta's the following prior: \beta_1 \sim N(1,\sigma_{\beta,1}). This avoids reflection invariance and scaling invariance in the latent space that the model estimates. However, this prior creates funnels, which a non-centred specification fixes. The expression defining \beta is the result of a need to retain its mean of 1 while using a non-centred specification and letting its sd be estimated from the data. Subtracting \sigma_{\beta,1} and adding 1 keeps the mean close to 1. The current specification goes further and strictly forces the mean of the \beta's to be 1, which is a bit stronger than the original prior \beta \sim N(1,\sigma_{\beta,1}). (I could potentially drop that last part, i.e. remove + 1 - \mu_{\beta_{raw,1}} from the first parenthesis.) A stripped-down sketch of the construction is included after this list.

  2. In this case, the prior on \sigma_{\beta,1} is weakly informative given the role this parameter plays in the model. (The additional “prior” on \hat v_{0,1} also means that \sigma_{\beta,1} \sim \text{Cauchy}^{+}(0,.5) is only one piece of the joint prior for \alpha_{raw,1}, \beta_{raw,1}, \sigma_{\beta,1}, \gamma_1.) But prior predictive checks are of course always a good idea, so that’s a good suggestion.

  3. I agree it may be a bit incoherent to specify the priors through two different routes, but I also find it desirable in this case. The goal is indeed to specify more nuanced priors for \alpha_{raw,1}, \beta_{raw,1}, \sigma_{\beta,1}, \gamma_1 by placing a prior on \hat v_{0,1}. The thing is that we know \hat v comes from a population distribution, and some combinations of \alpha_{raw,1}, \beta_{raw,1}, \sigma_{\beta,1}, \gamma_1 will give meaningless values for \hat v. If we model \hat v_{0,1} hierarchically, as part of a population distribution, then we effectively rule out combinations of \alpha_{raw,1}, \beta_{raw,1}, \sigma_{\beta,1}, \gamma_1 that produce meaningless results for \hat v_{0,1}. One could try to achieve the same by changing the priors on \alpha_{raw,1}, \beta_{raw,1}, \sigma_{\beta,1}, \gamma_1, but this is also tricky and may not guarantee the desired result. The goal here is to use the information about the population distribution of the \hat v_{0,1}'s to rule out unreasonable results. (Tests on fake data suggest it works well in this regard.)

  4. Yep, this is where I’m stuck for now. I would appreciate any further thoughts!
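
To make point 1 concrete, here is the \beta construction on its own, stripped down to one column and nothing else:

data {
  int<lower = 1> N;
}
parameters {
  vector<lower = 0>[N] beta_raw;
  real<lower = 0> sd_beta;
}
transformed parameters {
  // centre beta_raw on its sample mean, rescale by sd_beta, and shift
  // so that mean(beta) is exactly 1; algebraically this equals
  // (beta_raw - mean(beta_raw)) * sd_beta + 1
  vector[N] beta = (beta_raw + 1 - mean(beta_raw)) * sd_beta - sd_beta + 1;
}
model {
  beta_raw ~ normal(1, 1);
  sd_beta ~ cauchy(0, 0.5);
  // centred alternative that produced the funnel:
  // beta ~ normal(1, sd_beta);
}

Dropping the + 1 - \mu_{\beta_{raw,1}} part, as mentioned above, would give (\beta_{raw,1} - 1)\sigma_{\beta,1} + 1, which keeps \beta centred near 1 without pinning the sample mean exactly.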

You should definitely adjust the prior on \sigma_{\beta,1} then; a \text{Cauchy}(0, 0.5) prior places substantial probability mass above 0.5 (a sanity check in R: 1 - pt(1, df = 1) = 0.25, and since \sigma_{\beta,1} is constrained positive, the half-Cauchy puts twice that, i.e. half its mass, above 0.5).

Re 3.: I would look at this. The setting is more complicated, but it is (I think) the same idea.

Re 4.: Given the discussion here, it would seem that you probably don’t need a Jacobian. The prior predictive check then becomes crucial in ensuring your prior actually reflects your intention.

Thanks for the references. I see Daniel Lakeland has also made some interesting points on his blog, here and here. As he notes in this third post, when a transformation collapses several parameters into one quantity, the Jacobian matrix does not have a determinant, so the usual adjustment of the target density is impossible. He argues that placing “priors” on such transformed, collapsed quantities is still a useful way to specify nuanced joint priors; the added term then acts as a “scaling factor” that upweights or downweights certain regions of the parameter space. (I guess the risk is that one may end up sampling from a somewhat different distribution than one might think by looking at the model.)
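
In toy form, I read the argument roughly like this (made-up parameters a and b, nothing from my model): the extra sampling statement on a collapsed quantity just multiplies the joint density by a factor, rather than defining a proper change of variables.

parameters {
  real a;
  real b;
}
transformed parameters {
  real q = a * b;              // many-to-one: no square Jacobian exists
}
model {
  a ~ normal(0, 1);            // ordinary priors on the parameters
  b ~ normal(0, 1);
  q ~ normal(0, 0.5);          // extra multiplicative factor on the joint
                               // density, not a prior in the usual sense
}

As far as I can tell, that is exactly what the vhat0 ~ normal(mean_vhat, sd_vhat) lines in my model are doing.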

I think I will go with the Lakeland approach, as long as the model gives me reasonable results in all kinds of model checks and outperforms existing models in recovering latent parameters from fake data. (It looks good so far.)

If anyone has further thoughts on this matter, I am very interested to hear them.
