Following the discussion in “Target += normal_lpdf(eta|0,1); vs. eta ~ normal(0,1)” and reading Aki’s tutorial in the loo package (“Writing Stan programs for use with the loo package”; http://mc-stan.org/loo/articles/loo2-with-rstan.html), I am still unclear how the use of an explicit “target +=” vs. a sampling statement in a Stan model’s model block affects the accuracy of a model comparison using loo.
I think that I understand that the sampling form of the model statement omits a constant offset in the calculation of the model’s total log_likelihood (which is irrelevant for optimizing the given model’s parameters), whereas the explicit “target +=” includes this constant. Therefore the sampling statement may be marginally faster, but may not give an absolutely accurate value of the final model log_likelihood needed for model comparison using loo. However, jscolar commented that either form of the model statement could be used for model comparisons using loo (but not when working with Bayes factors), and pointed to Aki’s tutorial. Aki’s tutorial computes a final log_likelihood separately in an added “generated quantities” block.
So does this mean that loo actually uses only a separately calculated “log_lik”, computed this way in an added “generated quantities” block, and that a corresponding block has to be added at the end of the model definition to be able to use loo accurately?
Thanks in advance to aki and jscolar for their clarification of my confusion.
To use loo you need to compute the pointwise log likelihood (conditional on the parameters), which is a function of the parameter estimates but not (directly) a function of target. This computation is generally done either in generated quantities or in R/python/whatever using the fitted model object. Dropping the normalizing constants during model fitting doesn’t affect the fitted parameter estimates, which can still be used in the usual way to calculate the pointwise log likelihood.
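To make the point about normalizing constants concrete, here is a minimal sketch in plain Python (not Stan). The data values, the two parameter settings, and the helper names `normal_lpdf`, `normal_lpdf_dropped`, and `total` are all made up for illustration. It shows that dropping the parameter-free constant shifts the total by the same amount at every parameter value, which is why sampling is unaffected, and that the pointwise log-likelihood can still be computed afterwards from the full density.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Full log density of Normal(mu, sigma) at y, constants included."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def normal_lpdf_dropped(y, mu, sigma):
    """Same log density with the parameter-free term -0.5*log(2*pi) dropped."""
    return -math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

y = [0.3, -1.2, 0.8, 2.1]   # hypothetical data
theta_a = (0.0, 1.0)        # hypothetical (mu, sigma)
theta_b = (0.5, 1.5)        # another hypothetical (mu, sigma)

def total(lpdf, theta):
    """Sum of the log density over all data points at one parameter setting."""
    mu, sigma = theta
    return sum(lpdf(yi, mu, sigma) for yi in y)

# The gap between the two versions is the same at both parameter settings,
# so it cannot change which parameter values the sampler prefers.
offset_a = total(normal_lpdf, theta_a) - total(normal_lpdf_dropped, theta_a)
offset_b = total(normal_lpdf, theta_b) - total(normal_lpdf_dropped, theta_b)

# The pointwise log-likelihood is computed from the parameter values with the
# full density, regardless of how the model block was written.
log_lik_a = [normal_lpdf(yi, *theta_a) for yi in y]
```

Here `offset_a` and `offset_b` agree up to floating-point rounding, and `log_lik_a` has one entry per data point.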
You need a computation of the pointwise log-likelihood to be able to use loo at all. The pointwise log-likelihood is different from the log-posterior (aka target aka lp__) in a Stan model.
Thanks, jscolar. This is very helpful. I now know what I have to do. Could you point me to a reference that will explain the difference between the pointwise log-likelihood and the log posterior?
I think a brief explanation will be more helpful than a reference.
There are two key differences between the pointwise log-likelihood and the log posterior. One is the likelihood versus posterior part. The other is the pointwise part.
target/lp__ is the logarithm of something that is proportional to the product of the prior probability of the parameter values times the likelihood of the data conditional on the parameter values. Note that “something that is proportional to” means that we can freely add or drop any constant multiplicative terms, including normalizing constants.
As I said, the pointwise log-likelihood differs in two key ways. First, it doesn’t include anything about the prior probability–it’s just the likelihood of the data. This is the likelihood versus posterior part.
Second, loo requires the likelihood evaluated for each data point separately. To get target we calculate the likelihood (conditional on parameters) for each data point, then we multiply all these likelihoods together, and then we multiply that by the joint prior density, and we get a single number per iteration that is proportional to the posterior, and we take the logarithm (actually, under the hood we work with logarithms from the beginning, adding rather than multiplying).
To get the pointwise log likelihood, we compute the likelihood for each point, and then we just take the logarithm and stop! (note that we typically work on the log scale from the very beginning, so you won’t see the step of “take a logarithm” as an explicit line in the Stan code.) At each iteration we get a vector of log likelihoods, with one element corresponding to the likelihood of each data point (conditional on the parameter estimates at that iteration). This is what we mean by pointwise and it’s what loo needs to work its magic.
So to recap: the pointwise log-likelihood differs from the log posterior because it doesn’t multiply by the prior density and it doesn’t multiply together the likelihoods for each point (equivalently, it doesn’t sum the log-likelihoods); instead it returns a separate log-likelihood for each point.
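The recap can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data, the single parameter value `mu`, the known standard deviation of 1, and the normal(0, 10) prior on `mu`. The pointwise log-likelihood is a vector with one entry per data point, while the (unnormalized) log posterior is a single number.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Full log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

y = [0.3, -1.2, 0.8, 2.1]   # hypothetical data
mu = 0.4                    # parameter value at one hypothetical iteration

# Pointwise log-likelihood: one value per data point, no prior involved.
log_lik = [normal_lpdf(yi, mu, 1.0) for yi in y]

# Log prior for mu under an assumed normal(0, 10) prior.
log_prior = normal_lpdf(mu, 0.0, 10.0)

# Unnormalized log posterior: log prior plus the summed log-likelihoods.
# This plays the role of target / lp__ for this iteration (up to a constant).
log_posterior = log_prior + sum(log_lik)
```

`log_lik` is what would be stored for loo at this iteration; `log_posterior` is the single number the sampler works with.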
Ah! I get it! Likelihood vs. posterior probability. I should have picked up on that. And the pointwise means that it is the entire distribution of pointwise likelihoods (actually, log-likelihoods) at each data point that Aki’s loo needs.
I think that your brief explanation is opening up my understanding of what Stan is doing. Thinking for the moment in terms of observed values of a single variable assumed to be normally distributed, an essentially frequentist calculation finds the optimal likelihood estimate of mean and sd given the data. The calculated likelihood model is invariant with respect to the probabilities of the successive N random draws from the stipulated prior, so the samplewise likelihood can be calculated a single time. (And I guess that it is this that Aki’s loo needs.) Similarly, the prior probabilities of each of the actual samples can be computed from the prior assumptions (at least so long as the priors are proper). The sample-wise product of these two, when normalized, gives the posterior probability of each sample, the distribution of which is the posterior probability distribution. I guess that if the prior distribution is not proper but is defined and finite, it can be normalized and the above will still work. I am foggy about how the above would work with an infinite or undefined prior probability distribution such as dunif[-Inf, Inf]. Does the above more or less reflect what Stan does?
In a frequentist calculation there is no prior at all. Additionally, loo is intrinsically Bayesian and requires a posterior distribution for the pointwise log-likelihood, so you will never get what loo needs out of a frequentist analysis of a model.
When you say “sample-wise” do you mean at each iteration of the MCMC? If so, you’re exactly right. This product, summarized across many MCMC iterations, gives the posterior distribution for (something proportional to) the posterior probability.
But if you say “sample-wise” to mean “pointwise” (i.e. the probabilities associated with each observation or “sample” in the data), then you still have a significant conceptual confusion that would be worth clearing up. (There’s no such thing as the pointwise prior, and it would be a big mistake to multiply the prior probability by the likelihood of each point; the prior probability needs to enter the computation of the posterior once, not N times). Happy to elaborate further if you were indeed using “sample-wise” to mean “pointwise”.
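As a small numerical check of why the prior must enter once rather than N times, here is a sketch in plain Python. The data, the parameter value, the known standard deviation of 1, and the normal(0, 10) prior are all hypothetical. Attaching the prior to every data point counts it N times, so the result is off by (N - 1) times the log prior.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Full log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

y = [0.3, -1.2, 0.8, 2.1]                          # hypothetical data
mu = 0.4                                           # hypothetical parameter value
log_prior = normal_lpdf(mu, 0.0, 10.0)             # assumed normal(0, 10) prior on mu
log_lik = [normal_lpdf(yi, mu, 1.0) for yi in y]   # pointwise log-likelihood

# Correct: the prior enters the log posterior exactly once.
correct = log_prior + sum(log_lik)

# Mistaken: multiplying the prior into each point's likelihood
# (adding log_prior to every element) counts the prior N times.
wrong = sum(log_prior + ll for ll in log_lik)

# The discrepancy is (N - 1) extra copies of the log prior.
extra = wrong - correct
```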
Sorry for the slow response. Sometimes “the need to earn” trumps “the need to learn.” And it took me a few days to be sure that I understood what I had said in order to understand what you said. I know that frequentist calculations have no prior. I gather, then, that loo needs BOTH the posterior probability distribution and the “pointwise log_likelihood.” The latter is what I need to understand.
First I have to correct something that I said earlier – that “the samplewise likelihood can be calculated a single time.” I was still thinking of the model likelihood at the single set of parameter values that maximize the model likelihood. But the likelihood is a function of the proposed parameters, not a constant. So what is needed is to calculate the model likelihood at each set of proposed parameters – that is at each “draw” from the prior. The posterior probability distribution is then the product of the probability of the sample parameters times the likelihood of the model at those parameter values. That would be the lp__ for that draw, calculated in the model block.
But the overall likelihood of the model given the drawn parameters is the product of the likelihood of the model with the given parameters at each data point. The data extracted from the wells model in Aki’s tutorial includes a log_lik value matrix with N draws times M data points, and it is these that are collected in the “generated quantities” block and from which the summary values for model likelihood at each data point given in the printed summary of the Stan model are computed. Aki’s loo requires either the summary, or more likely, the matrix of log_liks, and I now assume that these are the required “pointwise log_likelihood.”
Yeah, you’ve got it. Just a couple of minor tightenings of the language:
Actually, loo just needs the pointwise log-likelihood. This is related to the original point about not needing the normalizing constants in the sampling statements to do loo. These constants would be needed to calculate the lp__ in a way that is consistent across models, but they are not needed to calculate the pointwise log-likelihood.
These aren’t draws from the prior, but rather draws from the posterior. The point of MCMC sampling is that it gives us samples from the posterior.
This is the posterior probability evaluated at that set of parameter values. It’s a single probability, not a full probability distribution. When we talk about the posterior probability distribution, we typically mean the posterior distribution over the model parameters themselves (i.e. the distribution that characterizes our a posteriori knowledge of the parameter values).
Yes, exactly. The pointwise log-likelihood is the log-likelihood for each point individually, and it changes at each posterior iteration, yielding the N x M matrix that you describe.
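For anyone who wants to see the shape of that matrix, here is a minimal sketch in plain Python. The draws are hard-coded, hypothetical (mu, sigma) pairs standing in for real posterior draws, and the data are made up; in practice the matrix would come from the log_lik values saved in generated quantities or be computed from the fitted model object.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Full log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

y = [0.3, -1.2, 0.8]                                       # M = 3 hypothetical data points
draws = [(0.1, 1.0), (0.4, 1.2), (0.2, 0.9), (0.5, 1.1)]   # N = 4 hypothetical (mu, sigma) draws

# One row per posterior draw, one column per data point.
log_lik = [[normal_lpdf(yi, mu, sigma) for yi in y] for (mu, sigma) in draws]
n_draws, n_points = len(log_lik), len(log_lik[0])

# Per-data-point averages over draws, analogous to the log_lik means
# shown in a printed model summary.
col_means = [sum(row[j] for row in log_lik) / n_draws for j in range(n_points)]
```

The full `log_lik` matrix carries the per-draw values; `col_means` is only a summary of it, with one number per data point.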