Log likelihood and loo for multinomial regression

I want to calculate loo for a multinomial regression model. This particular case is a linear model in isometric logratio coordinates, with overdispersion modelled by a multivariate normal in logratio coordinates. We have 125 observations, each of which can be described by a multinomial on 10 categories, with a total count of about 100 for each observation. We have 72 parameters (3 9-dimensional regression coefficient vectors and a 9-dimensional covariance matrix).

I’m getting log_lik values typically around -10. If we had a uniform model, the log_lik would be 100 log(1 / 10), which is about -230. So our model appears to be much better than a uniform model (and graphically, it looks like a good description of the data). But the Pareto k diagnostics are bad:

Pareto k diagnostic values:
Count Pct
(-Inf, 0.5] (good) 0 0.0%
(0.5, 0.7] (ok) 0 0.0%
(0.7, 1] (bad) 37 29.6%
(1, Inf) (very bad) 88 70.4%

Am I doing something wrong, or do I have too many parameters for loo to work well on this size of data set? There is obvious scope for constraining the covariance matrix, if so.

functions {
/** inverse ilr transformation of a vector x, using the inverse of the transpose of the V matrix of the ilr (tVinv)
  vector ilrinv(matrix tVinv, vector x, int ntaxa) {
         vector[ntaxa] z;
         vector[ntaxa] y;
         z = exp(tVinv * x);
         y = z / sum(z);
         return y;

data {
  int<lower = 0> ntaxa; // number of taxa
  int<lower = 0> nstills; // number of stills 
  matrix[ntaxa, ntaxa - 1] tVinv; //back-transformation matrix for ilr transformation
  int counts[nstills, ntaxa]; //observed counts
  vector[nstills] depth; //depth in metres
  vector[nstills] squareddepth; //squared-depth in metres

transformed data {
  int<lower = 1> s;
  s = ntaxa - 1;

parameters {
  vector[s] beta0; //intercept 
  vector[s] beta1; //depth effect
  vector[s] beta2; //squared-depth effect
  vector[s] z[nstills]; //transform into predicted logratio coordinates
  cholesky_factor_corr[s] LOmega; //Cholesky factor of prior correlation
  vector<lower=0>[s] tau; //prior scale on covariances

transformed parameters {
  cholesky_factor_cov[s] LSigma;
  vector[s] x[nstills]; //predicted logratio coordinates
  vector[ntaxa] rho[nstills]; //predicted relative abundances

  LSigma = diag_pre_multiply(tau, LOmega);
  for(i in 1:nstills){
    x[i] = beta0 + beta1 * depth[i] + beta2 * squareddepth[i] + LSigma * z[i];
    rho[i] = ilrinv(tVinv, x[i], ntaxa);


model {
  for(i in 1:nstills) {
    counts[i] ~ multinomial(rho[i]); // observation model
    z[i] ~ normal(0, 1);
  tau ~ cauchy(0, 2.5);
  LOmega ~ lkj_corr_cholesky(2);
  beta0 ~ cauchy(0, 2.5);
  beta1 ~ cauchy(0, 2.5);
  beta2 ~ cauchy(0, 2.5);

generated quantities{
  vector [nstills] log_lik;
  for (j in 1:nstills)
    log_lik[j] = multinomial_lpmf(counts[j] | rho[j]); //log likelihood for WAIC

Followup: given those apparent problems with loo, to what extent can I trust the results of model comparisons? I compared linear, quadratic and cubic models using compare() in loo. For all three models, more than 70% of the Pareto k were greater than 1. Nevertheless:

cubic vs quadratic:
elpd_diff se
0.6 4.1

quadratic vs linear:
elpd_diff se
-14.3 6.8

Plotting the posterior means (with 95% HPD intervals for the quadratic model) suggests that the quadratic model is clearly better than the linear model, but there isn’t much to choose between cubic and quadratic. That seems consistent with the results of compare().

linearquadraticcubic.pdf (82.6 KB)

Correction: it’s 9 categories, not 10 (I’d forgotten which problem I was working on), so 60 parameters, and a uniform model would give log_lik around 100 log(1 / 9), which is about -220. Doesn’t change things much.


I had feared that might be the case. The context is that a reviewer asked us to determine whether quadratic and cubic terms are needed. I think it’s fairly obvious from the plots that a linear model alone is not adequate, and that a cubic isn’t much better than a quadratic. We also have biological reasons for expecting at least a quadratic. Nevertheless, reviewers may not think that just looking at graphs is enough. Is there any easy way to satisfy them? For example, we could probably compute Bayes factors using the bridgesampling package, although the documentation suggests that we would need to rewrite our stan code to include constants.

I’d also like to understand why loo didn’t work well in this case.

The whole linear vs. quadratic vs. cubic thing is a bit of a forced choice and one where it is hard to be confident any one of them is true in order to justify calculating model probabilities via a Bayes Factor. I would do a spline and then see if it is acceptably close to a polynomial. I also don’t do Cauchy distributions.

Also, in your case it seems as if your observations are stills and the LOO concept you are using means “What if I were to omit still i?” If you are thinking about the predictive ability of the model for an unobserved still — as opposed to future data on an already observed still — then you should be integrating the still-specific parameters out to form a new likelihood function. Details at

Yes, predicting a new still is much more relevant than new data from an already-observed still (here, a still is a single frame from a video transect, with the organisms present at about 100 randomly-chosen points on each still forming a multinomial sample).

The StanCon talk was great. I think I can see what I need to do now, although I’m not sure that I will actually be able to do it in the limited time available. It does seem as though approximating the marginal density might be quite a lot of work, and easy to get wrong.

That is always the case.

Yes. You have more parameters than observations and only a weak prior in those parameters. For each still you have own z, and so the model can fit very well, but if you remove an observation the prior z[i] ~ normal(0, 1); is too much different than the posterior. Thus PSIS-LOO estimates are way too optimistic and Pareto-k diagnostic is correctly seeing the problem.

The model I think I’m fitting is hierarchical, with still-specific intercepts drawn from a multivariate normal(0, Sigma) distribution to deal with overdispersion. I wanted an LKJ prior on Sigma having hyperparameters tau (scale factor) and LOmega (Cholesky factor of covariance matrix). I thought that such a model was identifiable (if you treat the still-specific intercepts as latent variables rather than parameters), and I have checked with simulated datasets that we can recover Sigma as well as the coefficients Beta. If I integrate out the still-specific intercepts to get a marginal likelihood (as Ben suggested), so that the number of parameters does not increase with the number of stills, is PSIS-LOO likely to work?

Yes, other parameters than the the still-specific intercepts can be well identified and there is no problem of using your model for predictions etc. but IS-LOO has the weakness that it doesn’t work well if observations are highly influential for some parameters (as now is the case of for still-specific intercepts).

Yes. IS has trouble if we remove all the local observations for a parameter (in this case there is one local observation per still-specific intercept) and if the prior on such parameters is weak. You might be able to analytically integrate out still-specific intercepts already in the model, or you can do the marginalization afterwards using, e.g., quadrature as in [1802.04452] Bayesian comparison of latent variable models: Conditional vs marginal likelihoods