Multivariate Likelihoods

Hey all,

I understand how to sample from multivariate densities using something like M + SLZ, where Z is a vector of standard random variates from the distribution(s), L is the Cholesky factor of a correlation matrix, and S is a diagonal scale matrix. One can use this to sample parameters from various combinations of multivariate densities, or from multivariate densities not yet included in Stan.
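
For concreteness, here's a minimal sketch of the kind of thing I mean, written as a standalone Stan program with mu, S, and L passed in as data (the names are just placeholders, and it would be run with the fixed_param sampler since there are no parameters):

data{
  int<lower=1> J;
  vector[J] mu;                    // location
  vector<lower=0>[J] S;            // scales
  cholesky_factor_corr[J] L;       // Cholesky factor of the correlation matrix
}
generated quantities{
  vector[J] z;
  vector[J] x;
  for(j in 1:J){
    z[j] = normal_rng(0, 1);       // standard normal variates
  }
  x = mu + diag_pre_multiply(S, L) * z;   // draw from multi_normal(mu, diag(S) * L * L' * diag(S))
}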

However, I’m curious whether one could use a similar approach for modeling multivariate likelihoods. Assume for the moment that the multivariate normal doesn’t exist in Stan, but we’re predicting multiple normally distributed outcomes. We want to assume that residual covariance exists and to account for it. Is it possible to do something like:
M = XB (M is an NxJ matrix of predicted values for N observations and J variables; B is a coefficient matrix; X is a design matrix).
Mhat = M + SLZ, where Z is a matrix of standard normal variates, L is the Cholesky factor of a correlation matrix, and S is a diagonal scale matrix.
y[,1] ~ normal(Mhat[,1], S[1,1]); ?

In other words, can you use the same trick for simulating from multivariate densities to create “multivariate likelihoods” that are really univariate likelihoods that account for covariance?

My future use case would be to use multivariate likelihoods for things like the skew-normal, skew-t, etc.; sampling from those multivariate forms is easy, but the multivariate density is terrible. If I could model the covariance of the residuals and still use the defined univariate skew likelihoods, that would be fantastic; I just have no idea whether that’s possible, or whether I should suck it up and code up the multivariate density.

In the interest of avoiding possible confusion, in Stan we never sample from distributions in the model. For example,

x ~ normal(0, 1);
y ~ normal(x, 1);

does not mean that we actually sample a random N(0, 1) variate and then plug that sample into the density for y. Instead we are building up a big joint density over the parameters and the data, and x ~ normal(0, 1) just means that we are defining x to be distributed according to a Gaussian distribution and adding the corresponding density, normal_lpdf(x | 0, 1), to the big joint density.
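
In code terms, the sampling statement is just shorthand for incrementing the log density (up to a constant that doesn't depend on the parameters):

x ~ normal(0, 1);
// contributes the same thing as
target += normal_lpdf(x | 0, 1);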

The trick that you reference, what we call a non-centered parameterization, is a way to specify the same model using different parameters and densities. This changes the joint density and, most importantly, the posterior geometry, sometimes making posterior computations easy and sometimes making posterior computations hard.

Okay, with all of that in mind, we can move on to your question. Just how general is the non-centered parameterization? Well, it applies only to those distributions that behave sufficiently well under translation and scaling.

If x ~ normal(0, 1) and we translate by mu then we get a new random variable distributed as x' ~ normal(mu, 1). Similarly, if we scale by sigma then we get a new random variable x' ~ normal(0, sigma). The same holds in the multivariate case – if x ~ multi_normal(0, I) then x' = mu + S * L * x is distributed as x' ~ multi_normal(mu, S L L^{T} S). Hence we can use a non-centered parameterization for any model with latent Gaussians.
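
As a concrete, minimal sketch of the univariate non-centered form in Stan (the priors on mu and tau here are just placeholders):

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde;                      // auxiliary standard normal parameter
}
transformed parameters {
  real theta = mu + tau * theta_tilde;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);                     // placeholder priors
  tau ~ normal(0, 5);
  theta_tilde ~ normal(0, 1);
}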

Unfortunately these closure properties don’t hold for many other distributions, and as far as I know they don’t hold for anything outside of the exponential family.

For example, you can’t arbitrarily translate a gamma variate but you can scale it. If x ~ gamma(alpha, beta), where beta is a rate as in Stan, then scaling by sigma gives x' = sigma * x ~ gamma(alpha, beta / sigma). That means that

y ~ gamma(alpha, beta);

is equivalent to

x ~ gamma(alpha, 1);
y = x / beta;
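
As a minimal Stan sketch of that non-centering (again, the priors are just placeholders):

parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> y_raw;                 // auxiliary gamma(alpha, 1) variate
}
transformed parameters {
  real<lower=0> y = y_raw / beta;      // implies y ~ gamma(alpha, beta) with rate beta
}
model {
  alpha ~ normal(0, 5);                // placeholder priors
  beta ~ normal(0, 5);
  y_raw ~ gamma(alpha, 1);
}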

You’ll have to work out if non-centering is right for you by working through the scaling properties of your desired distribution.

Hey Michael,

Thank you for the fast response. I should’ve clarified my statements about simulating/sampling, as you suggest, since yes, it is really contributing to the joint density. Sometimes I think in terms of ‘sampling’ because it’s intuitively appealing (and my first foray into Bayes was years ago with BUGS/JAGS), but then I switch to thinking about the joint density to better understand posterior pathologies.

So let’s say I do this: y is multivariate data, assumed to be distributed skew-normally [and let’s assume that Azzalini is correct, and that multivariate skew-normal samples can be generated using the y = mu + SLZ method, where Z ~ skew_normal(0, 1, alpha)].
My goal, intuitively, might be y ~ multi_skew_normal_cholesky(mu, SL), so I want to estimate mu, S, and LL’.
If I understand you correctly, this would be equivalent to:
x = (y - mu)(SL)^{-1}
x ~ skew_normal(0, 1, alpha)
Would this be an incorrect method of implementing the equivalent of a multi_skew_normal_cholesky likelihood?

Or, let’s try another, simpler example. Let’s say we had no multivariate normal density defined.
We assume that Y = Mu + SLZ, where Z ~ normal(0,1).
We want to know Mu, S, and LL’.

Could we do the following to obtain the equivalent of a multivariate likelihood on data Y, to infer about parameters Mu, S, and L?:

Z = (Y - Mu)/(SL)
Z ~ normal(0,1)

No, that works only when Mu and SL are constant. You can see where problems arise if you try to work out the density piece by piece. If Y is distributed according to your likelihood and mu is distributed according to a prior, then what is the distribution of (Y - mu)? Once you have that distribution, what is the distribution of (Y - mu) / SL? To work that out you’ll need to find a clever one-to-one transformation and then marginalize, and I bet you’ll quickly find that you recover…exactly the nasty density that you’re trying to avoid.
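
To make that first step concrete (reading the distribution of Y - mu as its marginal over mu): w = Y - mu has density p(w) = ∫ p(w + mu | mu) p(mu) dmu, i.e. the likelihood mixed over the prior, and the second step only piles a second transformation on top of that.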

The non-centered parameterization is used for latent parameters that we can swap out for auxiliary parameters from which we can recover the latent parameters with a deterministic transform. That’s all fine when everything is unobserved, but it is not applicable if any of the random variables are to be conditioned on later in the construction of the posterior.

I’m just curious why this may not work.

My thinking is that you can do the following already:

Z = (y - mu)/s, where mu is a mean and s is a scale parameter. Then Z ~ normal(0,1), and it gives the correct answer for mu and s, with a Jacobian adjustment of target += log(1/s).
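
Spelling out why that adjustment works (just the standard change of variables): if z = (y - mu)/s, then normal_lpdf(y | mu, s) = normal_lpdf((y - mu)/s | 0, 1) - log(s), and the -log(s) is exactly the target += log(1/s) term.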

I did something like that in a multivariate manner and got a close answer, though not a perfectly correct one – I imagine my Jacobian is incorrect. I purposely used only mu, S, L, and ~ normal(0,1), and got nearly the same results for mu, S, and L as multi_normal_cholesky, though still wrong. Something like this:

data{
  int N;
  int J;
  matrix[N,J] y;
}
parameters {
  vector[J] xi;
  vector<lower=0>[J] S;
  cholesky_factor_corr[J] L;
}
model {
  matrix[N,J] x;
  for(n in 1:N){
    x[n] = (y[n] - xi') / diag_pre_multiply(S,L);
  }
  xi ~ normal(0,10);
  S ~ cauchy(0,10);
  for(n in 1:N){
    x[n] ~ normal(0,1);
    target += log_determinant(inverse(diag_pre_multiply(S,L))); // May be incorrect, I never formally took a course involving partial derivatives.
  }
}

generated quantities{
  corr_matrix[J] R = multiply_lower_tri_self_transpose(L);
  cov_matrix[J] V = quad_form_diag(R,S);
}

In the univariate form, the analogy is the following, and it provides the correct estimates in the univariate case (excuse the inefficient code, it was modified from the above):

data{
  int N;
  vector[N] y;
}
parameters {
  real xi;
  real<lower=0> S;  // scale must be positive
}
model {
  xi ~ normal(0,10);
  S ~ cauchy(0,10);
  for(n in 1:N){
    (y[n] - xi)/S ~ normal(0,1);
    target += log(1/S);  // Jacobian for z = (y[n] - xi) / S
  }
}
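
(For reference, I believe that standardized loop plus the Jacobian term is the same, up to a constant, as just writing the likelihood directly, which is a useful sanity check:)

model {
  xi ~ normal(0,10);
  S ~ cauchy(0,10);
  y ~ normal(xi, S);   // same log density as the standardized loop plus target += log(1/S)
}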

Nevermind! I got it to work! Code is below. I wasn’t postmultiplying where I should’ve been.

I was able to estimate a multivariate normal distribution without actually using the multivariate normal distribution.
It uses the logic of Y = Mu + SLZ. Essentially, it finds a mu and a Cholesky covariance factor such that, when applied to the Y matrix, they yield uncorrelated, standardized variates. This inverts the process of sampling from a multivariate normal, without actually using the multivariate density function, so to speak.

Given Y, find Mu and SL such that (Y - Mu)/(SL)’ ~ Normal(0,1). This should let people use multivariate forms fairly easily, so long as one can express a multivariate distribution as M + SLZ, which /is/ the case for a skew-normal according to Azzalini (and skew-T for that matter).
Beautiful.

data{
  int N;
  int J;
  matrix[N,J] y;
}
parameters {
  vector[J] xi;
  vector<lower=0>[J] S;
  cholesky_factor_corr[J] L;
}
model {
  matrix[N,J] x;
  for(n in 1:N){
    x[n] = (y[n] - xi') / diag_pre_multiply(S,L)';  // note the transpose: x[n]' = (S L)^{-1} (y[n]' - xi)
  }
  for(n in 1:N){
    x[n] ~ normal(0,1);
    target += log_determinant(inverse(diag_pre_multiply(S,L)));
  }
}

generated quantities{
  corr_matrix[J] R = multiply_lower_tri_self_transpose(L);
  cov_matrix[J] V = quad_form_diag(R,S);
}

Hi Stephen,

doesn’t Stan require the absolute value in your log_determinant … statement?
What if you invert the Cholesky factor by backward solving and take the absolute value
of the product of the diagonal entries, since the determinant of a triangular matrix is just
the product of its diagonal?
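
Something like this is what I have in mind (just a sketch; since diag_pre_multiply(S, L) is lower triangular with a positive diagonal when S > 0, the absolute value is automatic):

// instead of
target += log_determinant(inverse(diag_pre_multiply(S, L)));
// one could write
target += -sum(log(diagonal(diag_pre_multiply(S, L))));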

Andre

Hey Andre,

Unless I misremember or misread the manual, log_determinant gives the log absolute determinant anyway.

The code above your post actually seems to work; it estimated three separate MVNs very precisely, with estimates equal to those from multi_normal_cholesky.

The issue now is using a similar method for other multivariate distributions. I tried it on the skew-normal and it didn’t estimate the parameters well at all, though I’m reasonably sure it would still work on a multivariate Student-t distribution, or other symmetric densities whose affine transformations are known to work.

Stephen,
you are right about the log_determinant. I did something similar before, and it worked out, but I wasn’t sure about the internals of Stan.
For anyone interested:

> x <- rWishart(1, df=19, diag(14))[,,1]
> sum(log(diag(chol(x))))*2
[1] 32.43718
> determinant(x,log=TRUE)
$modulus
[1] 32.43718
attr(,"logarithm")
[1] TRUE
$sign
[1] 1
attr(,"class")
[1] "det"

https://proofwiki.org/wiki/Determinant_of_Inverse_Matrix

Stephen,
taking Michael’s answers seriously, we still have the unsolved question of why our
code works here, while in some cases Stan will fail. (But I don’t believe it will.)