Alternative to LOO for simulation studies

abeeisnotabug · October 23, 2020, 1:19pm

I am testing various parameterizations of a stanmodel on simulated data. All of those models are “true” in the sense that they reflect the data generation process. However, besides parameter recovery, I am interested in the predictive performance of those parameterizations.

As I can gather from BDA, LOO is an estimator for the expected log predictive density. However, since I know the true data generating process, I am able to generate new data and can calculate the predictive density on this new data.

The process in R looks like this representative code:

my_data <- my_data_generator(n)
my_model <- stan_model("my_model.stan")
# ...where my_model.stan has a generated quantities block with the log_lik
my_stanfit <- sampling(my_model, my_data)
my_draws <- as.matrix(my_stanfit)

# Generate new datasets
my_new_data <- replicate(n_rep, my_data_generator(n))

clppds <- sapply(
  my_new_data,
  function(dataset) {
    this_gqs <- gqs(my_model, data = dataset, draws = my_draws)
    this_log_lik <- as.matrix(this_gqs)
    
    sum(log(colMeans(exp(this_log_lik))))
  }
)

mean(clppds)

With the command sum(log(colMeans(exp(this_log_lik)))) I am calculating equation (7.5) from BDA on page 169:

\displaystyle\text{computed lppd} = \sum_{i=1}^n \log\left(\frac{1}{S}\sum_{s=1}^S p(y_i | \theta^s)\right).

And then I average over those clppds to get an estimate for

\displaystyle E_f(\log p_{\text{post}}(\tilde{y}))=\int\log p_{\text{post}}(\tilde{y})f(\tilde{y})d\tilde{y}=\int\log\left[\int p(\tilde{y} | \theta)p(\theta | y)d\theta\right]f(\tilde{y})d\tilde{y},

where f is the true data generating process, y is my original data and \tilde{y} is my new data.

My question now obviously is:

Does that make sense?
How can I compute a SE of that estimate, i.e. which distribution does it follow or how can I work this out?
If I do this for multiple models, how do I compute the difference (naive, i.e. clppd_1 - clppd_2 or pointwise?) and the SE of this difference?

I did this for some models and compared it with LOO - the differences are quite big (more than 10*SE), so either they don’t estimate the same thing or something is wrong (at least both somewhat agree on the ordering).

model               my_thing   elpd_loo  se_elpd_loo
bp_rep1             -7140.384  -6545.76  67.39
bs_rep1             -7221.498  -6537.59  67.34
bstz_rep1           -7253.259  -6545.77  67.46
bstz_dt_ll2         -7512.211  -7817.66  68.46
bp_dt_ll2           -7515.083  -7832.83  68.90
bs_dt_ll2           -7515.736  -7837.60  68.53

I would be really happy if someone could give an insight.

Thank you

avehtari · October 23, 2020, 2:58pm

Yes.

For fixed my_data, you can use the same equation as for LOO, but unlike in LOO you can make that SE arbitrarily small by making my_new_data bigger and bigger (and eventually MCSE dominates, but you can make also arbitrarily small by increasing S).

It doesn’t matter if you make SE arbitrarily small.

But if you want to study also the variation over different my_data realizations, then that will add another source of variation that you can’t make arbitrarily small unless n goes to infinity. Again the same equations can be used.

Section 2 and Table 2 in Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison might help to recognize the different sources of variation and uncertainty.

EDIT: fixed the link

abeeisnotabug · October 23, 2020, 3:43pm

Thank you for the swift reply and the ressources you provided.

Do you have an idea, where the differences in the value of LOO and the aforementioned method might stem from? Admittedly, the diagnostics for LOO are terrible (exemplary output):

Computed from 16000 by 1000 log-likelihood matrix

         Estimate    SE
elpd_loo -11471.1 122.8
p_loo       991.1   4.5
looic     22942.2 245.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)       0    0.0%   <NA>      
 (0.5, 0.7]   (ok)         3    0.3%   248       
   (0.7, 1]   (bad)      853   85.3%   12        
   (1, Inf)   (very bad) 144   14.4%   5         
See help('pareto-k-diagnostic') for details.

Can I learn something from these results about the reliability of my estimates as well?

avehtari · October 23, 2020, 3:51pm

Yes, the diagnostic tells that your model is just memorizing the data and it’s not able to generalize at all from one observation to another (p_loo almost the same as the number of observations), and Pareto smoothed importance fails so that elpd_loo is likely to be too optimistic (99.7% of khats > 0.7).

abeeisnotabug · October 23, 2020, 4:38pm

Interesting. The model is a mixture model (in particular a mixture Partial Credit Model), where for every new observation (a vector) there is one new parameter but K new conditionally independent data points. Besides the LOO diagnostics, the model works well however, in the sense that specificity and sensitivity for the classification of individuals (given the true populations they stem from) are both around 95%. Therefore I was wondering if these types of predictive checks have a problem with mixture models in general?

And just to clarify - seeing that, as you say, elpd_loo might be too optimistic, is it safe to assume that elpd calculated using real new data from the same data generating process will give a better estimate than LOO?

avehtari · October 23, 2020, 6:46pm

Yes, But now that you told more about the problem then my question is that are you generating new data for the same individuals as in my_data or for new not yet seen individuals? This will have a very big difference in your case.

abeeisnotabug · October 24, 2020, 9:06am

Indeed. It also depends which part of the model is considered as “likelihood”, correct? As it is a mixture model, there has to be a distributional assumption, otherwise we have no chance at identification.

So I calculate the likelihood as follows (the same as in the model block where instead of log_lik[i]= it would be target+=):

  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(theta[i] - beta[c, k]))
        );
      }
      
      ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);
    }
    
    log_lik[i] = log_sum_exp(log(lambda) + ll_by_pers);
  }

I generate data for the same individuals, since I figure it does not make sense to keep the person parameters (\theta_i) and evaluate a new k-vector with the likelihood in which there are parameters estimated via the previous k-vectors. In other words, as I mentioned above, for every new individual there is a new parameter and in my opinion it would not make sense to evaluate one individual’s likelihood with another individual’s parameter.

If I wanted to do that, I would have to draw a new group membership and \theta_i for every new individual (depending on the posterior draws for lambda, mu_theta[c]and sigma_theta).

avehtari · October 25, 2020, 6:06pm

I didn’t find what is definition of I and i, but in cross-validation you remove the contribution of log_lik[i], and predict for the corresponding left out data. Does that match your new data generation?

abeeisnotabug · October 26, 2020, 8:42am

No, but I think I am lacking some understanding of how LOO works with this kind of data, where every new observational unit has its own parameter that depends on this unit’s realization of the experiment.

I would gladly appreciate if you could help me understand. This is the full stan code:

data {
  // Counts
  int<lower=1> I; // No. persons
  int<lower=1> K; // No. items
  int<lower=1> J; // No. parameters per item
  int<lower=1> C; // No. Classes
  
  // Data
  int<lower=1, upper=J+1> y[I, K];  // Data as Persons by Items Matrix
  
  // Priors
  // Mixture Parameters
  int<lower=1> lambda_prior;
  
  // Scale Parameter
  real scale_prior_mu;
  real<lower=0> scale_prior_sigma;
}

parameters {
  // Mixture Parameters
  simplex[C] lambda;
  
  // Item Parameters
  vector[J] beta[C, K];
  
  // Person Parameters
  ordered[C] mu_theta;
  real<lower=0> sigma_theta;
  vector[I] theta;
}

model {
  // Priors
  for (c in 1:C) {
    real beta_sum = 0;
    
    for (k in 1:K) {
      target += std_normal_lpdf(beta[c, k]);
      beta_sum += sum(beta[c, k]);
    }
    
    target += normal_lpdf(beta_sum | 0, 0.01 * J * K);
    
    target += std_normal_lpdf(mu_theta[c]);
  }
  
  target += dirichlet_lpdf(lambda | rep_vector(lambda_prior, C));
  
  target += lognormal_lpdf(sigma_theta | scale_prior_mu, scale_prior_sigma);
  
  // Likelihood
  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(theta[i] - beta[c, k]))
        );
      }
      
      ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);
    }
    
    target += log_sum_exp(log(lambda) + ll_by_pers);
  }
}

generated quantities {
  matrix<lower=0, upper=1>[I, C] cond_class_prob;
  vector[I] log_lik;

  for (i in 1:I) {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i, k] |
          append_row(0, cumulative_sum(theta[i] - beta[c, k]))
        );
      }
      
      ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);
    }
    
    log_lik[i] = log_sum_exp(log(lambda) + ll_by_pers);
    
    cond_class_prob[i] = to_row_vector(
      exp(log(lambda) + ll_by_pers - log_lik[i])
    );
  }
}

And this is what the data y look like:

   item_1 item_2 item_3 item_4 item_5 item_6 item_7 item_8 item_9 item_10
     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
...
 90      4      5      5      5      5      5      4      2      4       4
 91      5      5      4      5      4      5      2      5      4       4
 92      3      5      4      5      4      5      2      3      5       4
 93      5      5      4      4      4      5      5      3      4       4
 94      5      5      5      5      5      4      5      5      5       5
 95      4      2      3      5      3      5      3      1      2       2
 96      5      5      5      3      5      3      4      2      2       1
 97      5      4      5      2      5      4      3      5      3       1
 98      3      1      2      3      2      1      3      1      3       1
 99      5      5      5      4      4      5      5      2      3       5
100      5      5      5      5      5      5      5      5      4       4

Now assume we do the explicit, computationally expensive LOO, i.e. we leave out one observational unit, fit the model to the remaining data y_{-n} and predict for the left out data y_{n}.

If we forget about the mixture for a second, then the model can be factorized down to the item level (in other words the independent likelihood terms are constituted by a singular person’s response to a singular item y_{ik}, an integer). Then we can leave out a person’s response to one particular item and predict this person’s response to that item. All parameters are already calculated (drawn), so this is unproblematic.

But now, introducing the mixture, the likelihood cannot be factorized further than the person and then one observational unit is a vector y_i (i.e. a row in y). How can we predict for y_i if we don’t have \theta_i? We would need to calculate it or draw it, but in the first case it is not a prediction and in the second, how can we know that the drawn \theta will match the person’s \theta? To me it seems like a proper prediction could only be for one person’s future (independent) responses to the same items and that is how my_data is generated.

I hope what I am trying to say is somewhat clear, my question is how does LOO work in a case like this?

Edit: I just realized this is closely related to the partial pooling version of the 8 schools example in BDA. My question could thus be formulated as follows: If we leave out school j and calculate the fit for the remaining 7 schools, estimating 7 \theta_i which in turn inform \mu and \tau, how do we get an estimate for \theta_j then when calculating the lpd for the left out school j?

avehtari · October 26, 2020, 7:41pm

The way you compute log_lik in generated quantities corresponds to leaving out school j / student i and having 0 data for that school / student, and predicting only with the prior.

There are two issues here. One is whether cross-validation would work and whether importance sampling approximation for cross-validation would work.

If we simplify the issue a bit and drop the mixture model, and assume there would be just one observation per student and one parameter for each student, if the prior is proper for those parameters cross-validation would work, and if the those parameters are similar so that the hyperprior can be informative for the parameter without data the cross-validation could show that.

For importance sampling C, and with one observation per parameter , the prior has to be very informative so that each observation is not so influential that prior and posterior are very different.

Back to the more complex case. Observation in mixture could be handled in cross-validation using missing data approach (similar to how we implemented exact LOO in Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-tt models, but importance sampling works only if the likelihood has factorized form.

abeeisnotabug · October 29, 2020, 9:33am

Thank you for continuing to answer my questions and for the material you provided.

I typed up a question and started to be confused about the nature of the log_lik I wrote down. The original question was:

Does the value of elpd_loo I get from the LOO package correspond to doing the explicit LOO calculation via the following Stan code (that means running it for all i_pred in 1:I and then summing the posterior means of the ll_i_pred)?

data {
  // Counts
  int<lower=1> I; // No. persons
  int<lower=1> K; // No. items
  int<lower=1> J; // No. parameters per item
  int<lower=1> C; // No. Classes
  
  // Data
  int<lower=1, upper=J+1> y[I, K];  // Data as Persons by Items Matrix
  int<lower=1, upper=I> i_pred; // Index of person to leave out/predict
  
  // Priors
  // Mixture Parameters
  int<lower=1> lambda_prior;
  
  // Scale Parameter
  real scale_prior_mu;
  real<lower=0> scale_prior_sigma;
}

parameters {
  // Mixture Parameters
  simplex[C] lambda;
  
  // Item Parameters
  vector[J] beta[C, K];
  
  // Person Parameters
  ordered[C] mu_theta;
  real<lower=0> sigma_theta;
  vector[I] theta;
}

model {
  // Priors
  for (c in 1:C) {
    real beta_sum = 0;
    
    for (k in 1:K) {
      target += std_normal_lpdf(beta[c, k]);
      beta_sum += sum(beta[c, k]);
    }
    
    target += normal_lpdf(beta_sum | 0, 0.01 * J * K);
    
    target += std_normal_lpdf(mu_theta[c]);
  }
  
  target += dirichlet_lpdf(lambda | rep_vector(lambda_prior, C));
  
  target += lognormal_lpdf(sigma_theta | scale_prior_mu, scale_prior_sigma);
  
  // Likelihood
  for (i in 1:I) {
    if (i != i_pred) {
      vector[C] ll_by_pers = rep_vector(0, C);
    
      for (c in 1:C) {
        for (k in 1:K) {
          ll_by_pers[c] += categorical_logit_lpmf(
            y[i, k] |
            append_row(0, cumulative_sum(theta[i] - beta[c, k]))
          );
        }
      
        ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);
      }
    
      target += log_sum_exp(log(lambda) + ll_by_pers);
    }
  }
}

generated quantities {
  real ll_i_pred;
  
  {
    vector[C] ll_by_pers = rep_vector(0, C);
    
    for (c in 1:C) {
      for (k in 1:K) {
        ll_by_pers[c] += categorical_logit_lpmf(
          y[i_pred, k] |
          append_row(0, cumulative_sum(theta[i_pred] - beta[c, k]))
        );
      }
      
      ll_by_pers[c] += normal_lpdf(theta[i_pred] | mu_theta[c], sigma_theta);
    }
    
    ll_i_pred = log_sum_exp(log(lambda) + ll_by_pers);
  }
}

However, if I consider the mathematics behind my code I am not so sure what log_lik in the above code actually represents. The whole model block is the joint probability of the data and the parameters where the discrete class-membership parameters have been marginalized out, i.e.

\displaystyle\prod_{i=1}^I\left[\sum_{c_i=1}^C\ p(y_i | \theta_i, \beta, c=c_i)p(\theta_i | \mu_c, \sigma, c=c_i)p(c=c_i)\right]p(\beta)p(\mu)p(\sigma)p(\lambda)

The term inside the brackets thus evaluates to p(y_i, \theta_i | \beta, \mu, \sigma) and this is what is calculated inside the loop

for (i in 1:I) {
  vector[C] ll_by_pers = rep_vector(0, C);
    
  for (c in 1:C) {
    for (k in 1:K) {
      ...
    }
    
    ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);
  }
    
  target += log_sum_exp(log(lambda) + ll_by_pers);
}

So if we want to estimate

\displaystyle\log\int p(\tilde{y}_i|\theta_i,\beta,...)p(\theta,\beta,...|y)d(\theta,\beta,...)\approx\log\left(\frac{1}{S}\sum_{s=1}^Sp(\tilde{y}_i|\theta_i^s,\beta^s,...)\right)

then using the above formula, i.e.

\displaystyle\log\int p(\tilde{y}_i,\theta_i|\beta,...)p(\theta,\beta,...|y)d(\theta,\beta,...)\approx\log\left(\frac{1}{S}\sum_{s=1}^Sp(\tilde{y}_i,\theta_i^s|\beta^s,...)\right)

would give the wrong estimate the way I understand it. Shouldn’t we then need to scratch line

ll_by_pers[c] += normal_lpdf(theta[i] | mu_theta[c], sigma_theta);

from the log_lik calculation?

Another way to do this I found in the paper Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods. There they use use what they call the marginal likelihood where the latent variable (\theta_i) has been integrated out for every i, i.e. we approximate

\displaystyle\log\int p(\tilde{y}_i|\beta,...)p(\beta,...|y)d(\beta,...)=\log\int\left[\int p(\tilde{y}_i,\theta_i|\beta,...)d\theta_i\right]p(\beta,...|y)d(\beta,...),

but then my question is: What are the conceptual differences between both alternatives? In both cases we predict for another observational unit from a new person since \theta_i either has to be predicted or is integrated out.

Furthermore, is there a way to compute LOO (either explicitly or through importance sampling) for the way I did it in the initial post (i.e. predicting new observations for the same clusters/persons with the \theta_i we already calculated)?

If you have the time to shed some light on my confusions, I would gladly appreciate it. Thank you.

avehtari · November 3, 2020, 6:29pm

The code is complex enough that I can’t be sure there are no mistakes, but if loo packages diagnostics say that PSIS-LOO works then PSIS-LOO estimate by loo package is likely to be close to exact leave-one-out cross-validation.

Above or in the earlier post?

target += log_sum_exp(log(lambda) + ll_by_pers);

these terms such that they can be easily left out in cross-validation. The loop is i in 1:I so it’s over persons. And if log_lik[i] would be this term, you can use loo package to do PSIS-LOO approximation of exact leave-one-person-out. Leaving the likelihood term for ith person out, means there are no data informing theta[i], and then distribution theta[i]is just the prior for theta[i], and predictions for the left out person i are like for a new person.

Due to the term log_sum_exp(log(lambda) + ll_by_pers) it’s not as easy to do PSIS based leave-one-c-out or leave-one-k-out (c and k are the indeces you use). You can still do the exact leave-one-c-out or leave-one-k-out by modeling the left out parts as missing data (as we do in [1810.10559] Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models for another non-factorized model).

This is not related to whether predicting for a new or or seen person. It’s about making the computation better behaving by marginalization when all the information related to group specific parameter has been left out.

Not exactly. There is one possible way to get closer, but as there is no paper showing the theory or experiments it’s currently just guessing and I need to make some checks before recommending it.

abeeisnotabug · November 9, 2020, 10:56am

Thank you for your detailed answer. I would like to expand on this point:

I did the following experiment with my data: I fitted the Model with 128.000 and calculated the running LOO (with loo::loo) and log(mean(exp(log_lik_*))) both for the conditional (log_lik_c) and marginal (log_lik_m) likelihood. The results are plotted here (in log-scale):

elpd_plots.pdf (28.9 KB)

We can see that the loo for log_lik_m is almost immediately stable, whereas the loo for log_lik_c doesn’t stabilize at all, even with 128.000 posterior samples. What has driven me up the wall, however, is that elpd_same_data is not the same for the two likelihoods and don’t appear to ever come close to each other, even though the marginal likelihood should just facilitate computation.

Then I looked at the math. Let’s drop the mixture/hyperprior complications and focus on a model that has two sets of parameters, person-specific \theta_i and global \beta and point-wise person-level (conditional) likelihood f_c(y_i|\theta_i,\beta). Consider the following integral for old data y and a new datapoint from person i, \tilde{y_i}:

\displaystyle\int f_c(\tilde{y_i}|\theta_i,\beta) p(\theta_i,\beta|y) d\theta_i d\beta.

Since the new data are assumed to be conditionally independent given the parameters, this integral evaluates to

\displaystyle\int f_c(\tilde{y_i}|\theta_i,\beta) p(\theta_i,\beta|y) d\theta_i d\beta=\int f_c(\tilde{y_i}|\theta_i,\beta, y) p(\theta_i,\beta|y) d\theta_i d\beta
\displaystyle=\int f_c(\tilde{y_i},\theta_i,\beta|y)d\theta_i d\beta=p(\tilde{y_i}|y).

On the other hand, if we consider the marginal point-wise likelihood f_m(\tilde{y_i}|\beta)=\int f_c(\tilde{y_i}|\theta_i,\beta)d\theta_i and try to evaluate

\displaystyle\int f_m(\tilde{y_i}|\beta) p(\theta_i,\beta|y) d\theta_i d\beta,

the \tilde{y_i} are not longer conditionally independent from the y (since they come from the same persons, so if, generally speaking, y_i is high, it is more likely that \tilde{y_i} is high as well). Therefore

\displaystyle\int f_m(\tilde{y_i}|\beta) p(\theta_i,\beta|y) d\theta_i d\beta=E_{\beta|y}[f_m(\tilde{y_i}|\beta)]\neq p(\tilde{y_i}|y).

Thus using the marginal likelihood, we have used another utility function. Of course, this only applies if we use new observations from the same persons, because if we consider new observations \tilde{z_j} from a different person j with different (and independent!) \theta_j, then the new data are conditionally independent, given only the global parameters:

\displaystyle\int f_m(\tilde{z_j}|\beta) p(\theta,\beta|y) d\theta d\beta=\int f_m(\tilde{z_j}|\beta) p(\beta|y) d\beta
\displaystyle=\int f_m(\tilde{z_j}|\beta,y) p(\beta|y) d\beta=p(\tilde{z_j}|y).

Thus I would argue, it is related to wether we predict for new or the same persons - using the marginal log likelihood, we are necessarily predicting for new persons (otherwise we’re calculating a different quantity) and using the conditional log-likelihood, we are necessarily predicting for the same persons (since the \theta_i from the posterior distribution are no longer exchangeable). Do you agree?

However, as the plots above show, the conditional log likelihood is useless, since the loo estimate based on it doesn’t stabilize with increasing posterior sample size.

avehtari · November 9, 2020, 3:47pm

Did you compute also the elpd with the simulated new data? And do you simulate the new data for the same persons or new persons?

Matches the behavior also reported by Furr et al (and discussed also in Vehtari et al). Marginalization can change the focus (see the discussion in Furr et al).

I agree that predicting for the same persons and for new persons is in general a different quantity. I did get confused 1) which one you want to estimate, 2) are you happy with a) the simulation result with simulated new data or b) exact computation for cross-validation, and 3) do you need importance sampling approach for something. This is interesting but complex topic, and it might be easier to have a video call and then summarize the discussion here.

Not just any loo estimate, but importance sampling is the one that fails, so it seems the full posterior and cross-validation posterior are too different, and distribution of the importance ratios has too thick tail (earlier post showed almost all khat>0.7, which indicates that increasing sample size doesn’t help unless it’s bigger than the number of atoms in the universe).

abeeisnotabug · November 17, 2020, 1:17pm

Yes, I computed running elppd for 128 new datasets, both for the same and for new persons (I haven’t gotten around yet to implement the formulas for the SE you pointed me to, but it is on the agenda). The plot looks as follows:

elpd_plots_new_data.pdf (12.1 KB)

We can also compare the numerical values:

The “final” elpd_loo estimate with 128000 posterior samples:

  Type        Iterations Parameter Estimate    SE
  <chr>            <int> <chr>        <dbl> <dbl>
1 conditional     128000 elpd_loo    -5728.  71.2
2 marginal        128000 elpd_loo    -5982.  70.1

The elppd_new_data averaged over all 128 repetitions for data from the same persons…

  log_lik_type running_mean repetition data_type 
  <chr>               <dbl>      <int> <chr>     
1 log_lik_c          -5505.        128 same_theta
2 log_lik_m          -5966.        128 same_theta

…and 128 sets of new persons:

  log_lik_type running_mean repetition data_type
  <chr>               <dbl>      <int> <chr>    
1 log_lik_c          -6323.        128 new_theta
2 log_lik_m          -5987.        128 new_theta

We see that the conditional elppd for the same persons is considerably higher than the marginal one and also than for new persons, which makes sense. Moreover, the marginal elppd for new data with new persons, -5987, matches pretty well the elpd_loo estimate obtained with PSIS, -5982 (70.1). On the other hand, the marginal elppd for new data with the same persons is further from this value, and, looking at the plots, stably so. This inequality, to me, is explained by my calculations earlier.

I am very sorry for the confusion, I will try to summarize:

To keep things simple, let’s say I have models with varying restrictions whose performance I want to compare in a simulation study.
When I started out I wasn’t aware of the conceptual differences relating to the different quantities, this I learned through this discussion. Now that I am, of course I am interested in estimating both quantities (new/same persons) and investigating if there are differences between the models (i.e. can it be that some models would predict better in one circumstance and vice versa?).
Since I want to squeeze out a maximum number of repetitions in a minimum number of time, I would like to use PSIS wherever I can so I don’t have to recalculate 100-or-so generated data blocks each time. From this discussion it seems like I can rely on the PSIS elpd_loo in the case of marginalized likelihoods, but not for the conditional ones, predicting future outcomes for the same persons.

This last point is a bit of a bummer, since, even though I don’t have to recalculate the generated quantities in every repetition hundredfold, I still have to do numerical integration in every iteration, which takes quite some time. So naturally I am interested in learning more about what you are referring to here:

Finally:

I greatly appreciate the offer and would love to take you up on it. If it still stands, I would be truly grateful if you would send me a date and time via PM whenever it suits you.

Thank you.

Topic		Replies	Views
ArviZ loo_i Interfaces loo , arviz	19	1581	May 11, 2021
How to calculate log_lik in generated quantities of a multivariate regression model CmdStan cmdstan , loo	15	3765	March 9, 2022
Calculating LOO-CV for a multinormal regression model Modeling loo	32	2285	April 7, 2020
Two-armed bandit hierarchical reinforcement learning model - interpreting conflicting loo and posterior predictive check results Modeling specification , loo , posterior-predictive , hierarchical-model , reinforcement-learning	7	655	January 17, 2024
LOO Model Comparison Alternative Modeling rstan , techniques , loo , cmdstanr	3	94	March 27, 2025

Alternative to LOO for simulation studies

Related topics