How to compute the predictive likelihood using Stan

No, this is not a Pareto-k warning. The diagnostic for WAIC is described in Section 2.2 of Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC; see the paragraph starting “The effective number of parameters”. This diagnostic is not as good as Pareto k, and I often think that maybe we should also compute Pareto k when people try to compute WAIC.

Yes, but that is not the value you want to know! That value is overoptimistic.

It means you should not trust your estimates and you should use loo instead. Sometimes loo works when WAIC fails, but sometimes they both fail; at least loo gives better diagnostics.
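For reference, a minimal sketch of what that looks like with the loo R package, using the same kind of pointwise log-likelihood matrix that waic() takes (stan_object is just an illustrative name for the fitted model object):

library(loo)

# Pointwise log-likelihood matrix: one row per posterior draw, one column per observation
log_lik <- as.matrix(stan_object, pars = "log_lik")

# loo() reports elpd_loo together with the Pareto-k diagnostics mentioned above
loo_result <- loo(log_lik)
print(loo_result)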

I guess you have an error in how you do the computations, but since I haven’t seen your code I’m not able to tell you why this happens.

I think you have mixed things up, as this doesn’t make sense. If you want me to be able to help you, please show the code.

That I can confirm: you can’t trust the results.


@avehtari Do you want to see my entire Stan code for the model? Or do you want to see other code (I only use output from my stan object directly in the loo package)?

Yes. Seeing your Stan model helps to check that you compute log_lik correctly. Seeing the code you use to extract things from the fitted model and how you compute waic, loo, and the test-set lpd helps to check that these are correct.

@avehtari My Stan code looks like the following (a two-level model with only a random intercept):

data {
  // Define variables in data
  // Number of level-1 observations (an integer)
  int<lower=0> Ni;
  // Number of level-2 clusters
  int<lower=0> Nj;
  
  // Number of fixed effect parameters
  int<lower=0> p;
  
  // Variables with fixed coefficient
  matrix[Ni,p] fixedEffectVars;
  
  // Level-2 cluster lookup for each level-1 observation
  int<lower=1> mLookup[Ni];
  
  // Continuous outcome
  int y[Ni];
  
  //Test data
  int<lower=0> Ntest;
  matrix[Ntest,p] fixedEffectVars_test;
  int<lower=1> mLookup_test[Ntest];
}

parameters {
  // Define parameters to estimate
  // Population intercept (a real number)
  real beta_0;
  
  // Fixed effects
  vector[p] beta;
  
  // Level-1 errors
  real<lower=0> sigma_e0[Nj];
  
  // Level-2 random effect
  real u_0jkz[Nj];
  real<lower=0> sigma_u0jk;
}

transformed parameters  {
  // Varying intercepts
  real beta_0j[Nj];
  real u_0jk[Nj];
  
  // Individual mean
  real mu[Ni];
  
  for (s in 1:Nj) {
    u_0jk[s] = sigma_u0jk * u_0jkz[s];
  }
  
  for (j in 1:Nj) {
    beta_0j[j] = beta_0 + u_0jk[j];
  }
  
  // Individual mean
  for (i in 1:Ni) {
    mu[i] = beta_0j[mLookup[i]]  +  fixedEffectVars[i,]*beta;
  }
}

model {
  // Prior part of Bayesian inference
  // Flat prior for mu (no need to specify if non-informative)
  
  // Random effects distribution
  u_0jkz ~ normal(0,1);
  sigma_u0jk ~ student_t(3, 0, 10);
  
  sigma_e0 ~ cauchy(0,25);

  // Likelihood part of Bayesian inference
  // Outcome model N(mu, sigma^2) (use SD rather than Var)
  for (i in 1:Ni) {
    real eta_n = mu[i];
    
    y[i] ~ normal(eta_n, sigma_e0[mLookup[i]]); //sigma_e0
  }
}

generated quantities{
  vector[Ntest] y_test;
  real log_p_new;
  vector[Ntest] log_p_news;
  real y_rep[Ni];
  vector[Ni] log_lik;
  for (i in 1:Ni) {
    real eta_n = mu[i];
    y_rep[i] = normal_rng(eta_n, sigma_e0[modelLookup[i]]);
    log_lik[i] = normal_lpdf(y[i]| eta_n, sigma_e0[modelLookup[i]]);
  }
  for (i in 1:Ntest){
    real eta_n = beta_0j[mLookup_test[i]] + fixedEffectVars_test[i,]*beta;

    y_test[i] = normal_rng(eta_n, sigma_e0[mLookup_test[i]]);
    log_p_news[i] = normal_lpdf(y_test[i]| eta_n, sigma_e0[mLookup_test[i]]);
   
  }
  log_p_new = sum(log_p_news);
}



Afterwards, I compute the waic values using

  waic_insample = waic(as.matrix(stan_object, pars = "log_lik"))

or

  pred_likelihood_outsample = waic(as.matrix(stan_object, pars = "log_p_news"))

This is wrong. You need waic or loo only when using log_lik, which is computed using the data used to condition the posterior. logMeanExp(log_lik) would be overoptimistic, and waic and loo can correct that by taking into account the effect of leaving out one observation at a time. If activations_test is not included in activations, then you only need to compute logMeanExp(log_p_news) to get the log predictive density for the test data. If activations and activations_test are not from the same distribution (or are just by chance quite different), then waic/loo based on activations can be very different from the test lpd based on activations_test.
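For concreteness, a minimal R sketch of that test-set computation, assuming log_p_news has been extracted from the fit as a draws-by-Ntest matrix (the object and helper names are only illustrative):

# One row per posterior draw, one column per test observation
log_p_news <- as.matrix(stan_object, pars = "log_p_news")

# Numerically stable log-mean-exp over the posterior draws
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

# Log predictive density per test observation, their sum (test elpd), and their average
lpd_test <- apply(log_p_news, 2, log_mean_exp)
sum(lpd_test)
mean(lpd_test)  # average log predictive density per test observation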


@avehtari thanks for your explanation!

My last question is what the explanation could be for the fact that the out-of-sample predictions are more accurate in terms of RMSE and MAE, but the average predictive likelihood out-of-sample (using your approach) is lower.

Can you rephrase the question? Can you provide values?

@avehtari I estimated the model previously mentioned (with a normal distribution) and the same model where the dependent variable follows a Poisson distribution instead of a normal distribution. For the model with the normal likelihood I obtain a WAIC value of 5000 (and out-of-sample RMSE and MAE values of 130 and 90), and for the model with the Poisson likelihood a WAIC value of 20000 (and out-of-sample RMSE and MAE values of 160 and 110). So my normal model seems to do a better job at predictions. However, the average predictive likelihood for the out-of-sample predictions equals -5.5 for the normal model and -3.9 for the Poisson model, which is the exact opposite of what I would expect.

You can’t compare normal and Poisson directly, as the first one is continuous and the second one is discrete: you have been comparing log densities to log probabilities. You need to discretise the normal model so that you are comparing log probabilities to log probabilities. RMSE and MAE comparisons are valid directly between continuous and discrete models.

@avehtari This is a really good remark which I had not thought of yet. Thanks! Do you have any idea of how to do this?

Yes, but before going there, there are still some unclear issues. In your code you have

  // Continuous outcome
  int y[Ni];

If y is continuous why Poisson?

I had another look and this can’t be your code. Generated quantities has activations, which is not defined. Looking at the values you reported, it seems you are doing something with y or activations or whatever which you are not showing. I would like to help, but I’m not able to if you keep hiding information and posting code which can’t even be compiled, and I don’t want to spend my time forcing the information out of you piece by piece. If you want me to keep helping, show the actual Stan code for both models and tell me whether you are using exactly the same data for them (best to show the code).

@avehtari I see what you mean. In my application my dependent variable is activations, but for clarity I replaced it with y and forgot to update it in the generated quantities block. I have now updated it everywhere, so the code above is the one I am actually using.
The code for my Poisson model looks exactly the same, only with normal_rng and normal_lpdf replaced by poisson_log_rng and poisson_log_lpmf respectively, and without the sigma_e0 terms.
Since my dependent variable is a count variable, I want to compare a Poisson and a normal model.

It’s likely that the normal model works better than the Poisson because the mean and the variance are not coupled. You could try a negative binomial model, which is discrete and has a parameter for additional overdispersion (and if that doesn’t work there are less common underdispersed options, too). This way you would stay within a discrete model comparison.
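To make the mean-variance point concrete, a small R illustration with made-up numbers (mu, size, and y_obs are hypothetical): the Poisson forces the variance to equal the mean, while the negative binomial adds a dispersion parameter.

mu <- 100    # hypothetical mean count
size <- 5    # negative-binomial dispersion (smaller = more overdispersion)

# Poisson: variance is forced to equal the mean
var_pois <- mu

# Negative binomial (mu/size parameterisation as in R's dnbinom): extra quadratic term
var_nb <- mu + mu^2 / size

# Log probabilities of one observed count under each model
y_obs <- 130
dpois(y_obs, lambda = mu, log = TRUE)
dnbinom(y_obs, size = size, mu = mu, log = TRUE)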

One way to discretize continuous models for count data is to compute the probability of y_i as \mbox{cdf}(y_i+0.5)-\mbox{cdf}(y_i-0.5). In Stan

log_lik[i] = log(exp(normal_lcdf(y[i] + 0.5 | mu, sigma)) - exp(normal_lcdf(y[i] - 0.5 | mu, sigma)));

There are further computational tricks for evaluating log(exp(a) - exp(b)) in a numerically stable way (Stan has log_sum_exp for log(exp(a) + exp(b)), and a log_diff_exp function is also available for the minus case).

Alternatively, we can approximate these probabilities by the density times the width of the interval. For count data this interval is 1, so if the scale of the normal is much larger than 1 and there is no significant mass below -0.5, then the log densities are decent approximations of the log probabilities. This holds only if you use the original values and don’t, for example, normalize the data.
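A minimal R sketch of this discretisation outside Stan, with made-up values for mu, sigma, and y (the same idea applies inside the generated quantities block):

# Stable log(exp(a) - exp(b)) for a >= b
log_diff_exp <- function(a, b) a + log1p(-exp(b - a))

# Hypothetical values for one observation and one posterior draw
mu <- 120; sigma <- 35; y <- 130

# Discretised normal log probability of the count y
lp_upper <- pnorm(y + 0.5, mu, sigma, log.p = TRUE)
lp_lower <- pnorm(y - 0.5, mu, sigma, log.p = TRUE)
log_prob_discrete <- log_diff_exp(lp_upper, lp_lower)

# Density-times-width approximation (interval width 1 for counts)
log_prob_approx <- dnorm(y, mu, sigma, log = TRUE)

# Both are now log probabilities, directly comparable to e.g. dpois(y, lambda, log = TRUE)
c(log_prob_discrete, log_prob_approx)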

Since your values are that far apart, there is likely something else going on, but it is hard to say more without the code (including the code outside of Stan) and some small version of your dataset.


I’m just dropping the math for the record here. Aki had to explain this to me for my case study. Averages of logs don’t correspond to logs of averages because the logarithm isn’t linear.

Now if you want to evaluate p(\tilde{y} \mid y) and have draws \theta^{(1)}, \ldots, \theta^{(M)} from the posterior, \theta^{(m)} \sim p(\theta \mid y), then

p(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).

But that won’t be numerically stable. Instead, we want to work with the log densities everywhere and stay on the log scale. This works out to

\begin{array}{rcl} \log p(\tilde{y} \mid y) & \approx & \log \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}) \\[8pt] & = & \log \frac{1}{M} \sum_{m=1}^M \exp \left( \log p(\tilde{y} \mid \theta^{(m)}) \right) \\[8pt] & = & -\log M + \log \sum_{m=1}^M \exp \left( \log p(\tilde{y} \mid \theta^{(m)}) \right) \\[8pt] & = & \textrm{log\_sum\_exp}_{m=1}^M \log p(\tilde{y} \mid \theta^{(m)}) - \log M. \end{array}

where log_sum_exp is the usual log-sum-of-exponents operation, which Stan supports for pairs or containers of inputs.
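In R, that last line is only a few lines of code for a single \tilde{y}, given the vector of per-draw log densities (log_lik_draws is just an illustrative name):

# Numerically stable log_sum_exp
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# log p(y_tilde | y) is approximately log_sum_exp of the per-draw log densities, minus log(M)
log_pred_density <- function(log_lik_draws) {
  log_sum_exp(log_lik_draws) - log(length(log_lik_draws))
}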


@avehtari thanks for your reply.

Indeed, it seems logical in my case to expect that the normal model performs better than the Poisson model. Therefore, I find it strange that the average log predictive likelihood for the Poisson model is higher, for which I cannot find a clear explanation. Furthermore, part of my research is indeed to also construct a negative binomial model, which works much better than the Poisson one. Unfortunately, I cannot share my data via this forum; do you have any other ideas on how to proceed?

For 1) the training data, 2) the test data, or 3) waic/loo corrected? In a previous message you wrote

I would expect your training data and test data to come from different distributions. Without showing the whole data, can you make PPC plots and a corresponding distribution plot for the test data?
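Something along these lines with the bayesplot package would do it (a sketch; y and y_rep are the quantities in the model above extracted from the fit, and y_test_obs is a made-up name for the held-out outcomes, which are not passed to Stan in the code shown):

library(bayesplot)

# PPC for the training data: overlay densities of replicated data on the observed outcome
y_rep <- as.matrix(stan_object, pars = "y_rep")
ppc_dens_overlay(y, y_rep[1:50, ])

# Same kind of plot for the test data, using the predictive draws y_test
y_test_draws <- as.matrix(stan_object, pars = "y_test")
ppc_dens_overlay(y_test_obs, y_test_draws[1:50, ])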