How to calculate log_lik in generated quantities of a multivariate regression model

Dear Stan community,

This question might be silly. I am learning the loo package for model comparison. To do so, one needs to calculate log_lik in the generated quantities block. Most of the examples I could find, for instance Writing Stan programs for use with the loo package • loo, Bayesian data analysis - CmdStanR demos and Extract pointwise log-likelihood from a Stan model — extract_log_lik • loo, deal with univariate models.

I wonder how to calculate log_lik for my model, which is a multivariate regression whose parameters follow a hierarchical structure. Let’s say my data is Y[M \times N], so M variates and N observations. Should I simply calculate log_lik for each variate like below? (I simplified the hierarchical structure for beta here.)

for (m in 1:M) {
  for (n in 1:N) {
    log_lik[m, n] = normal_lpdf(y[m, n] | X[n] * beta, sigma);
  }
}

Doing this I will have M log_lik vectors, and then I am lost on how to conduct the model comparison. Do I compare models for each variate separately? That does not seem to make sense. Could you please guide me a bit? Thanks a lot.

Michelle

I think the loo package expects log_lik to be a vector, so I’d do:

generated quantities {
  vector[N * M] log_lik;
  { // "local" block so temp doesn't get saved to the output
    matrix[M, N] temp;
    for (m in 1:M) {
      for (n in 1:N) {
        temp[m, n] = normal_lpdf(y[m, n] | X[n] * beta, sigma);
      }
    }
    log_lik = to_vector(temp);
  }
}
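
On the R side, loo then just needs the posterior draws of that log_lik vector. A minimal sketch, assuming the model was fit with rstan and the fitted object is called fit (the object name is just a placeholder):

library(rstan)
library(loo)

# draws of the flattened log_lik: one column per data point
log_lik_draws <- extract_log_lik(fit, parameter_name = "log_lik", merge_chains = FALSE)

# relative effective sample sizes, used by the PSIS diagnostics
r_eff <- relative_eff(exp(log_lik_draws))

loo_fit <- loo(log_lik_draws, r_eff = r_eff)
print(loo_fit)

Model comparison is then loo_compare() on the loo objects of the candidate models. There is one elpd contribution per data point regardless of which variate it came from, so you don’t compare per variate; you just need both models to flatten Y in the same order so the pointwise values line up.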

Thanks a lot @mike-lawrence

Hi @mike-lawrence, do you mind if I ask more questions here? Why don’t we use normal_id_glm_lpdf for each variate m, so that we have a log_lik vector of size M where each element is the log-likelihood calculated by normal_id_glm_lpdf from the N observations of a specific variate m? Is it because log_lik has to be computed for each single data point?

Would you be able to post your whole model?


Yes, here it is. It is just a linear regression with hierarchical \beta and I wrote the log_lik as you suggested

data {
  int<lower=0> n_group;
  int<lower=0> n_indiv;
  int<lower=0> n_obs;
  int<lower=0> n_des;
  array[n_indiv] int<lower=1, upper=n_group> idx_group;
  array[n_indiv] int<lower=1, upper=n_indiv> idx_indiv;
  array[n_indiv] vector[n_obs] Y;
  matrix[n_obs, n_des] X;
}

parameters {
  vector[n_des] mu;
  matrix[n_group, n_des] z_group;
  matrix[n_indiv, n_des] z_indiv;
  array[n_indiv] real<lower=0> sigma;
  real<lower=0> sigma_group;
  real<lower=0> sigma_indiv;
}

model {
  mu ~ normal(0,10);
  to_vector(z_group) ~ std_normal();
  to_vector(z_indiv) ~ std_normal();
  sigma_group ~ normal(0,10);
  sigma_indiv ~ normal(0,10);
  sigma ~ normal(0, 10);       
  for (n in 1:n_indiv) {
    Y[n] ~ normal_id_glm(X, 0,
    mu + sigma_group * z_group[idx_group[n]]' + sigma_indiv * z_indiv[idx_indiv[n]]',
    sigma[idx_indiv[n]]);
  }
}

generated quantities {
  vector[n_indiv * n_obs] log_lik;
  matrix[n_indiv, n_obs] temp;
  for (n in 1:n_indiv) {
    for (t in 1:n_obs) {
      temp[n, t] = normal_lpdf(Y[n, t] | X[t] *
                   (mu + sigma_group * z_group[idx_group[n]]' + sigma_indiv * z_indiv[idx_indiv[n]]'),
                   sigma[idx_indiv[n]]);
    }
  }
  log_lik = to_vector(temp);
}

This may be silly, but I wonder if I can write log_lik as below. It is like calculating only one log_lik per variate (one row of the data matrix Y), instead of using two for loops to calculate one log_lik per element of Y.

generated quantities {
  vector[n_indiv] log_lik;
  vector[n_indiv] temp;
  for (n in 1:n_indiv) {
    temp[n] = normal_id_glm_lpdf(Y[n] | X, 0,
              (mu + sigma_group * z_group[idx_group[n]]' + sigma_indiv * z_indiv[idx_indiv[n]]'),
              sigma[idx_indiv[n]]);
  }
}

OK, so first, two small things:

  1. I don’t think your variable idx_indiv is actually doing any “work” in the model; it’s a vector of length n_indiv consisting of the numbers 1 through n_indiv, and it’s used to index both z_indiv and sigma inside the loop, but the looping variable n would suffice in both cases, so you can have just:
for (n in 1:n_indiv) {
  Y[n] ~ normal_id_glm(X, 0,
                       mu + sigma_group * z_group[idx_group[n]]' + sigma_indiv * z_indiv[n]',
                       sigma[n]);
}
  2. As I understand it, this is a pretty standard hierarchical model, similar to SUG 1.13, with the exception that you don’t have any individual-level covariates, so you could actually use the code there (use the version at the bottom; it’s much faster) and pass in an n_indiv \times 1 matrix as that model’s data variable x (and k = 1 too). You could also hand-edit things to remove the individual-level predictors and their correlation structure, but I guess that will get you pretty close to what you already have. The main benefit of looking at that code, then, is seeing how it achieves vectorization of the likelihood (i.e. no loop) via a data variable used for indexing.

To your core question as to whether your proposed log_lik makes sense: what I had will compute a log likelihood for each observation within each individual, whereas what you’re proposing yields the equivalent of summing those values within each individual. I’m not sure what the theoretical consequences of that kind of summation are; maybe a loo expert like @avehtari could chime in?
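
To make the relationship concrete, here is a rough R sketch (placeholder names; it assumes log_lik_point is the draws-by-(n_indiv * n_obs) matrix from the pointwise generated quantities block, e.g. from extract_log_lik(), and that n_indiv and n_obs are available in R). Summing the pointwise columns within each individual gives exactly the per-row log_lik that the normal_id_glm_lpdf version would produce, and loo() can be run on either matrix:

library(loo)

# Stan's to_vector() flattens temp[n_indiv, n_obs] column-major,
# so the individual index cycles fastest across the columns
col_indiv <- rep(seq_len(n_indiv), times = n_obs)

# sum the pointwise log-likelihoods within each individual
log_lik_row <- sapply(seq_len(n_indiv), function(i) {
  rowSums(log_lik_point[, col_indiv == i, drop = FALSE])
})

loo_point <- loo(log_lik_point)  # leave one observation out
loo_row   <- loo(log_lik_row)    # leave one individual (one row of Y) out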


Thanks a lot @mike-lawrence. I will read the code example you suggested; I was also looking for a demo of how to vectorize the code.

I agree that this individual-level index is not necessary from a modelling point of view. I used it because I did not arrange the rows in Y in a specific order, but assigned an ID to each of them. Then, using the individual-level index, I could retrieve the parameters for each ID more conveniently for posterior predictive simulations.

It looks like you are not assigning anything to log_lik

Using such log_lik with PSIS-LOO corresponds to leave-one-row-out cross-validation.


When you have multiple observations per individual, using the univariate approach for log_lik is essentially assessing the extent to which the model can predict each observation for an individual separately. On the other hand, the multivariate (leave-one-row-out) approach is assessing the ability to predict all observations for an individual at once.

Am I on the right track with that @avehtari ?


Thanks. I think I forgot to add log_lik = temp

Do you think leave-one-row-out would be similar to leave-one-point-out in terms of model comparison for this simple multivariate linear regression, since we may not assume heteroscedasticity and the beta parameter is for the whole row, not each single point?

If I may ask one more question here: imagine I have two models, each fitted to multiple data sets. To do the model comparison, can I just sum up the elpd_loo of one model fitted on each data set, and for the SE of elpd_loo compute \sqrt{se(elpd\_loo)_{data1}^2 + se(elpd\_loo)_{data2}^2 + \cdots}? I ask this because sometimes the models are fitted to different data sets separately, and each data set can be considered independent; merging all of them into one big data set and fitting once is computationally expensive. Thank you very much.
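
Concretely, what I have in mind is something like this (a rough R sketch; loo_list is a hypothetical list holding the loo() result of the same model on each data set, and the quadrature formula for the SE is exactly the part I am unsure about):

library(loo)

# loo_list: hypothetical list of loo objects, one per data set, for the same model
elpd_per_set <- sapply(loo_list, function(x) x$estimates["elpd_loo", "Estimate"])
se_per_set   <- sapply(loo_list, function(x) x$estimates["elpd_loo", "SE"])

# sum the elpd_loo values and combine the SEs in quadrature,
# treating the data sets as independent
elpd_total <- sum(elpd_per_set)
se_total   <- sqrt(sum(se_per_set^2))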

Even then, there can be a big difference between how easy it is to predict one left-out item given the other items in that row and how easy it is to predict the whole row given only the other rows. This depends on how much variation there is between rows and within rows. Some discussion here: Cross-validation for hierarchical models

It is possible that you would be able to see model misspecification also with leave-one-item-out, or get the same ranking when comparing predictive performance and see that one model is much better than the other (and if one model is not clearly better than the other, then don’t use cross-validation for model selection).

The way to proceed depends on the exchangeability assumptions on these data sets, e.g., do you assume all observations are exchangeable or do you assume they are exchangeable only within each data set.


Thanks a lot @avehtari

I would assume that they are exchangeable only within each data set. I am sorry to ask so many questions; to avoid wasting your time, do you have any reading material recommendations so that I can learn some theory and not have to ask one question after another? :) I noticed you were using the update() function in some loo examples for subsets of data. I haven’t fully understood it, but do you think I should go in that direction (using the update() function)?

And what do you assume about the new data? If we assume that each data set represents some group, do you assume the not-yet-observed data comes from the same groups you have already observed, or from new groups? And can these groups be assumed to be exchangeable? If you can tell more about your data collection process and the aim of the modeling, that can also help inform what to recommend.

No problem asking. I would already have provided a link if I knew of something that would answer your question, but unfortunately I’m not aware of material that discusses the different options so that you could easily figure out which one corresponds to your case and then proceed from there. I’m slowly writing material that tries to provide answers to all the questions I’ve seen here and elsewhere, and it’s great that you ask when something is not clear, so I know which things need more attention.

Sorry for my previously vague information. My data were collected simultaneously for ~3k individuals; from our domain knowledge we can group them into 20 groups, so each has ~150 individuals, and the groups are independent of each other. Within each group, the variable of interest follows a hierarchical structure, so I modeled them using the code shown above. There are also different ways to represent the hierarchical structure, so I built several models following different assumptions, one even without a multilevel structure. I would like to evaluate which model predicts out-of-sample data best, so that I have some insight for selecting and interpreting the corresponding hierarchical structure. Or maybe the model comparison will tell me that I don’t need a hierarchical structure at all. To do so, I just repeat the same model fitting for all 20 groups, and I can conduct all the diagnostics, posterior predictive simulations, and model comparisons group by group. However, I would also like to combine all groups’ elpd_loo so that I have one evaluation for the whole data set. It is just a way to summarise the model comparison.

Thank you so much for this work. To be honest, the difference between leave-one-out and leave-one-row-out is still not clear to me. I thought the Pareto k calculation would take into account the number of parameters and the effective sample sizes. In multivariate regression the number of parameters is the same whether we do leave-one-out or leave-one-row-out, but the number of data points considered in log_lik can be very different depending on the row length. For instance, if I have 1000 variates and each contains 5k data points in Y, leave-one-out will consider 5 \cdot 10^6 data points, whereas leave-one-row-out will only have 1000 elements in log_lik. In my case, when I do leave-one-out, all Pareto k are < 0.5, but with leave-one-row-out, 95% are larger than 1.

Your explanation clarified the background, and it’s good that you think of it as a summary. LOO for each group corresponds to assuming the new out-of-sample data comes from the same groups. If you sum the elpds together, you are assuming equal weight for all groups in the summary (you could use different weights if you assumed some groups are more likely in the future). Now, when you want to consider whether the difference in the summary between the hierarchical and non-hierarchical models is big, there is uncertainty, as the summaries for each group and the total summary over all groups are based on a finite number of observations and a finite number of groups. As you are assuming a hierarchical structure of groups and individuals within groups, you could also use a hierarchical model to analyse the uncertainty in the full population mean. If the between-group variation dominates, you could approximate the uncertainty in the summary by using the point estimates of the elpds for each group and computing the SE from those. Or, if the within-group variation dominates, you can pool all the individual elpd values from all groups into one vector and compute the variance from there.
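
For concreteness, the two approximations could be computed along these lines (a sketch; loo_list is a placeholder name for the list of per-group loo objects):

# point estimates of elpd_loo for each of the G groups
elpd_g <- sapply(loo_list, function(x) x$estimates["elpd_loo", "Estimate"])
G <- length(elpd_g)
elpd_total <- sum(elpd_g)

# (a) between-group variation dominates: SE of the total
#     from the group-level point estimates alone
se_between <- sqrt(G) * sd(elpd_g)

# (b) within-group variation dominates: pool all pointwise elpd values
#     and compute the SE of their sum
elpd_i <- unlist(lapply(loo_list, function(x) x$pointwise[, "elpd_loo"]))
se_within <- sqrt(length(elpd_i)) * sd(elpd_i)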

Pareto-k is a diagnostic for how variable the importance sampling weights are.

A nice thing about cross-validation is that we don’t need to count parameters. The loo package reports p_loo, but it is obtained not by counting parameters but by comparing the full-posterior predictive log score to the cross-validation predictive log score.
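
In symbols, p_loo is just the difference of two sums of pointwise log scores: p\_loo = \widehat{lpd} - \widehat{elpd}_{loo}, where \widehat{lpd} = \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{(s)})\right) and S is the number of posterior draws.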

When you leave more observations out, the posterior changes more, and the full-data posterior is no longer a good importance sampling proposal distribution for the leave-one-row-out posterior. If 95% of the Pareto k’s are > 1, it hints that you also have one parameter per row, and when you remove all the data that directly influence that parameter, its posterior changes a lot. In such cases K-fold CV is the easiest option for reliable computation. Instead of cross-validation, you can also look at the posterior of the hierarchical model parameters, as that can sometimes tell you enough about whether the data provide useful information about the hyperparameters, and then you don’t need cross-validation (unless the prediction task is important).
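
A minimal sketch of the K-fold route in R: the fold helper below is from the loo package; fit_fold() is a hypothetical placeholder for refitting the Stan model on the training individuals and returning the log_lik matrix (draws by held-out points) for the held-out individuals, and idx_group is the per-individual group index from the data block, assumed to be available in the R session:

library(loo)
library(matrixStats)  # for colLogSumExps()

K <- 10
# assign each individual to a fold, keeping group proportions roughly balanced
folds <- kfold_split_stratified(K = K, x = idx_group)

elpd_pointwise <- c()
for (k in 1:K) {
  # hypothetical refit: train on folds != k, return held-out log_lik for fold k
  log_lik_heldout <- fit_fold(train = which(folds != k), test = which(folds == k))
  S <- nrow(log_lik_heldout)
  # pointwise elpd: log of the posterior-draw average of the likelihood
  elpd_pointwise <- c(elpd_pointwise, colLogSumExps(log_lik_heldout) - log(S))
}

elpd_kfold <- sum(elpd_pointwise)
se_elpd_kfold <- sqrt(length(elpd_pointwise)) * sd(elpd_pointwise)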
