Sample steps

Hello,

I had a few questions and wanted to avoid multiple posts:

(1) Can you outline the code steps when running a sampling statement for a normal model with just two parameters (mu and sigma), like y ~ normal(mu, sigma)? When the HMC trajectory stops at a parameter set like mu = 5, sigma = 2, I think:
(a) the log likelihood of the prior for mu is added to the target
(b) the log likelihood of the prior for sigma is added to the target
(c) the log likelihood calculated at each value of y will be added to the target
(d) then steps a, b, and c will be repeated for every iteration. Is that correct? I’ve seen it stated that with enough data the prior will have no weight, so how much data is needed to outweigh the prior? Do you have a reference for that? And do you have a reference for what takes place before step (a)?
(2) On page 2 of this LOO paper: https://arxiv.org/pdf/1507.04544.pdf, formula 3 refers to the “computed log pointwise predictive density,” which is used in the elpd_loo calculation. But the loo package uses a variable named “log_lik” and also refers to calculating the log likelihood on page 8: “Thus, to compute these measures of predictive fit in Stan, the user needs to explicitly code the factors of the likelihood (actually, the terms of the log-likelihood) as a vector”. It seems like both “predictive density” and “log likelihood” are used to refer to the same thing in the paper. Can you clarify whether the elpd is calculated using a log predictive density or a log likelihood?
(3) What is the acceptable number for effective sample size? Do you have a reference for this?
(4) For reproducibility, when using the stan() function, if one sets the seed argument, does that guarantee the posterior vectors will be identical when rerunning the model?

Thank you!

The log priors and log likelihood are both added to the target log density (the equivalent of multiplying the priors and the likelihood to get something proportional to the posterior).

Whether or not the prior or likelihood contributes more to the posterior depends on how much data there is, how explanatory the likelihood model is, and how strong and informative the priors are.

You can use a formula for the minimum number of effective samples, such as the one found here. You’ll still have to decide on your significance and tolerance levels (probably 0.05 for both), and then it depends on how many parameters are in your model (i.e., the number of parameters returned on each MCMC step in Stan).

The returned posterior draws are not guaranteed to be the same for chains seeded in the same place, because the chains could be initialized somewhere the posterior is very flat or has a saddle point. Generally speaking, however, for a sufficiently “nice” posterior (not degenerate, acceptable curvature), chains should converge on the parameter set that gives the global likelihood maximum.

(1) I’m not sure what you mean by “stops at a parameter set like mu = 5 and sigma = 2.” HMC just keeps sampling indefinitely. Each sample will have a value for mu and sigma. If you want to know how HMC works, check out Radford Neal’s intro paper. If you want to know how Stan works, there’s the system paper and also the reference manual, the latter of which has all the gory details. It’s not strictly the (a)–(d) you outline. Instead, it starts with the change-of-variables adjustment for constrained parameters—that’s part of the density. Then we add all of the target increments in the transformed parameters and model blocks. That just defines a differentiable density, which we then use for Hamiltonian Monte Carlo (the no-U-turn sampler by default). Also, steps (a)–(c) may be calculated hundreds of times per iteration. They get calculated on each leapfrog step in the simulation of the Hamiltonian dynamics. The number of evaluations is reported in the output.
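As a rough sketch of how those target increments accumulate for the two-parameter normal model, here is a minimal Python illustration. The normal(0, 10) prior on mu and half-normal(0, 5) prior on sigma are assumed purely for illustration (the question didn’t specify priors), and this is not Stan’s actual implementation:

```python
import math

def normal_lpdf(x, mu, sigma):
    # log of the normal density N(x | mu, sigma)
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def log_target(mu, sigma, y):
    """Unnormalized log posterior for y ~ normal(mu, sigma).

    Illustrative priors: mu ~ normal(0, 10), sigma ~ normal(0, 5)
    with sigma > 0 (the truncation constant is dropped, since
    additive constants don't affect sampling).
    """
    target = 0.0
    target += normal_lpdf(mu, 0.0, 10.0)    # (a) log prior for mu
    target += normal_lpdf(sigma, 0.0, 5.0)  # (b) log prior for sigma
    for y_i in y:                           # (c) log likelihood, term by term
        target += normal_lpdf(y_i, mu, sigma)
    return target

y = [4.2, 5.1, 5.9]
print(log_target(5.0, 2.0, y))
```

In the real sampler this function (and its gradient) is evaluated once per leapfrog step, which is why (a)–(c) can run hundreds of times per iteration.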

(2) The ELPD is the expected log posterior predictive density of new data, which is calculated using posterior predictive inference based on the likelihood. Specifically, if we have observed data y and a new data point y', we are trying to estimate the expected value of \log p(y' \mid y) for new data points y', where p(y' \mid y) = \int p(y' \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. You might find this chapter of the User’s Guide useful, which goes into more details of the calculation.
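That integral is typically approximated by averaging the likelihood over posterior draws. A small Python sketch, where the posterior draws are faked with a fixed RNG seed purely for illustration (in practice they come from a fitted model):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    # normal density N(x | mu, sigma)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical posterior draws of (mu, sigma); faked for illustration.
random.seed(1)
draws = [(5.0 + random.gauss(0, 0.3), abs(2.0 + random.gauss(0, 0.2)))
         for _ in range(4000)]

def log_pred_density(y_new, draws):
    # log p(y' | y) ~= log of the average of p(y' | theta^(s)) over draws s
    dens = [normal_pdf(y_new, mu, sigma) for mu, sigma in draws]
    return math.log(sum(dens) / len(dens))

print(log_pred_density(5.5, draws))
```

Note that the average is taken on the density scale, not the log scale: it is the log of a mean of likelihood values, not a mean of log likelihood values.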

(3) This is a matter of researcher preference. Standard error (i.e., standard deviation of the estimator) for MCMC is sd / sqrt(ESS), where sd is the standard deviation of the quantity being estimated. We usually recommend around ESS = 100, because at that point the standard error is 1/10th of the posterior standard deviation.

We generally recommend at least an ESS of 50 per chain so that our ESS estimates themselves are stable. You can find people recommending thousands, and you can find David MacKay recommending a dozen in his book.
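The sd / sqrt(ESS) rule above is easy to check numerically (the posterior sd of 2.0 here is an arbitrary assumption):

```python
import math

def mcse(sd, ess):
    # Monte Carlo standard error of a posterior-mean estimate:
    # posterior sd of the quantity divided by sqrt(effective sample size)
    return sd / math.sqrt(ess)

posterior_sd = 2.0  # assumed posterior sd of the quantity of interest
for ess in (100, 400, 1600):
    print(ess, mcse(posterior_sd, ess))
# quadrupling ESS halves the standard error
```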

(4) Reproducibility conditions are discussed in this chapter of the reference manual. You need to keep the hardware and all the software and compiler settings the same, too, because C++ doesn’t guarantee numerics other than to within IEEE standards.

I’m not exactly sure what this means because we get a sample, not a point estimate out of MCMC. And most of our models are hierarchical and have no maximum likelihood estimate because the density is unbounded (it grows as hierarchical variance shrinks to zero and low-level parameters shrink to the population mean; or as David MacKay said, EM goes boom!).

While we should get mixing with well-behaved posteriors, the result is MCMC samples, not a point estimate. An MCMC sampler will not sample around the posterior mode in high dimensions, even in cases where there is such a mode. What MCMC does is sample in the region of highest posterior probability mass (which is where density times volume is highest, not where density is highest). I wrote a small case study to try to explain why this is: Typical Sets and the Curse of Dimensionality

If the posterior is symmetric and unimodal, then the posterior means will converge to the posterior mode, which is the maximum a posteriori estimate, not the maximum likelihood estimate.

If the model is “well behaved” then in the infinite data limit the posterior will converge to a delta function around the maximum likelihood estimate.

Sorry, I meant to say the global posterior probability maximum, not the global likelihood maximum, by which I meant the tallest peak in the posterior rather than a maximum point. And I didn’t word it correctly, but the samples are what I was referring to when I said the “parameter set.” I can see how the word set could be taken to imply the set of point estimates that best capture the data for every parameter in the model, but I meant the set of values for each parameter around which the samples converge.

Thank you for the response!

For #1 I will check the references. When you say, “Then we add all of the target increments in the transformed parameter and model blocks. That just defines a differentiable density which we then use for Hamiltonian Monte Carlo”: the sum of the target increments is the lp__ output by Stan, correct? The target increment sum itself is not used by HMC, correct?

Question 2 is really about semantics. It appears the loo paper uses both “predictive density” and “likelihood” to refer to the same thing, but I thought density implies the parameters are fixed and likelihood implies the parameters are random. I was thinking likelihood should be the language used, since the parameters are random in the Bayesian framework. Are there really two distinct objects being referenced, one with random parameters and one with fixed?

Correct.

Not correct. It is used both to represent the potential energy in the Hamiltonian and to probabilistically select a point along a simulated Hamiltonian trajectory to use as the next sample.

If you’d like to dig up specific passages in the loo paper that you’re interested in, I’ll take a look. In the loo context the predictive density is often expressed pointwise, while the likelihood is often taken to be a product over all points.

I don’t think that this distinction is intended here. FWIW, I don’t think that the term “likelihood” is typically restricted to settings where parameters are random (e.g. “maximum likelihood estimation”).

Regarding passages:

On page 2 of the LOO paper: https://arxiv.org/pdf/1507.04544.pdf formula 3 refers to the “computed log pointwise predictive density” and that is used in elpd loo calculation but the loo package uses a variable named “log lik” and also refers to calculating the log likelihood on page 8: “Thus, to compute these measures of predictive fit in Stan, the user needs to explicitly code the factors of the likelihood (actually, the terms of the log-likelihood) as a vector”.

The abstract also refers to the log likelihood: “Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values”. Here the “log-likelihood evaluated at the posterior simulations of the parameter values” is formula 3, which is titled “log pointwise predictive density”.

Again, this is semantics. Likelihood and density are conceptually two different things (random parameters vs. non-random), and it seems the paper uses both to refer to the formula on page 3.

Yes, these are all referring to the same thing. Each element of the response has an associated leave-one-out likelihood given, iteration-wise, by evaluating a probability density function (corresponding to the leave-one-out predicted sampling distribution of the response) at the location of the observed response. In this sense, the likelihood is a density.

I’m not familiar with the semantic tradition you mention of restricting “likelihood” to settings where parameters are random (frequentists use likelihoods all the time), nor of restricting “density” to settings where parameters are fixed (a Bayesian posterior is a probability density function).

Edit: I think I understand the semantic issue now. We talk about the probability density at the location of the data as a shorthand for the following: construct a probability density function for the expected sampling distribution of unknown response data (which is the prediction task in the leave-one-out context, as the datum is withheld), and then evaluate that function at the location of the withheld data. Note also that “likelihood” need not imply that the parameters are treated as random; it just implies that we evaluate the density as a function of hypothetical parameter values. Again, frequentists treat the parameters as fixed, but have no qualms about discussing the likelihood function.

The value of lp__ (two final underscores) includes all the target increments, but it also includes all the change-of-variables adjustments for constrained parameters.
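For a concrete example of such an adjustment: a parameter constrained to sigma > 0 is sampled on the unconstrained scale via sigma = exp(sigma_unc), and the log absolute Jacobian of that transform, log |d sigma / d sigma_unc| = sigma_unc, is added to the log density. A Python sketch, where the half-normal(0, 5) prior on sigma is an illustrative assumption:

```python
import math

def lp_for_sigma(sigma_unc):
    """Log density contribution for sigma > 0, parameterized on the
    unconstrained scale via sigma = exp(sigma_unc).

    Assumes an illustrative half-normal(0, 5) prior on sigma.
    """
    sigma = math.exp(sigma_unc)  # inverse transform back to the constrained scale
    log_prior = (-0.5 * (sigma / 5.0) ** 2 - math.log(5.0)
                 - 0.5 * math.log(2 * math.pi) + math.log(2.0))  # half-normal lpdf
    log_jacobian = sigma_unc  # log |d sigma / d sigma_unc| = log sigma
    return log_prior + log_jacobian

print(lp_for_sigma(0.0))  # sigma = 1: prior term only, Jacobian term is 0
```

Without the Jacobian term, sampling sigma_unc would target the wrong distribution for sigma; that term is included in lp__ even though it never appears in the model block.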

Convergence for MCMC means convergence to the stationary distribution, meaning that draws converge to marginal draws from the target distribution. It does not mean that the draws themselves approach the posterior modes. You can explore this using an N-dimensional multivariate standard normal distribution. In this case, the squared distances of the draws from the origin (the posterior mode) will have a chiSquare(N) distribution (because the squared distance is a sum of N squared standard normals). As N increases, this is increasingly bounded away from the origin, which is the posterior mode. On the other hand, the posterior mean of the draws will converge to the mode at a rate of 1 / sqrt(M) for M draws, assuming we have geometric ergodicity (i.e., fast enough convergence that an MCMC central limit theorem holds). The posterior means only converge to the posterior mode in the case of symmetric posteriors.
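You can see this concentration away from the mode with a quick simulation (pure Python with a standard normal target; 2000 draws is an arbitrary choice):

```python
import math
import random

random.seed(0)

def mean_distance_from_mode(N, draws=2000):
    # Average Euclidean distance of N-dimensional standard-normal
    # draws from the origin (the posterior mode).
    total = 0.0
    for _ in range(draws):
        total += math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(N)))
    return total / draws

# Distances concentrate near sqrt(N), far from the mode at the origin.
for N in (1, 10, 100):
    print(N, round(mean_distance_from_mode(N), 2))
```

For N = 100 the typical draw sits at distance roughly 10 from the mode, and essentially no draws land near the origin, even though that is where the density is highest.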

In asymmetric distributions, the modes and means and medians don’t line up (look up the gamma distribution, for example).

This is a bit confusing because loo requires the pointwise likelihoods for its computation.

The expected log posterior predictive density (ELPD) is just what it says on the tin—an expectation of log predictive density, i.e., \mathbb{E}[\log p(\widetilde{Y} \mid y)], where y is the observed data and \widetilde{Y} is a random variable representing new data (over which the expectation is taken), and where the posterior predictive density is given by

p(\tilde{y} \mid y) = \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] = \int_{\mathbb{R}^D} p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta.

The likelihood is \mathcal{L}(\theta) = p(y \mid \theta), which is a function of the parameters for a fixed data set y. It means the same thing in Bayes and frequentist stats. But if you’re a frequentist, you never treat the parameters as random! If \theta is fixed, then p(y \mid \theta) defines a density over y which we call the “sampling distribution” in Bayes. It’s also the data generating distribution in frequentist analyses, but they’d be careful to write a likelihood function as \mathcal{L}(\theta) = p(y; \theta) so as not to imply it’s a conditional distribution, which would only be defined if \theta were treated as random.

Yes it’s semantics. But in order to communicate, it helps if everyone’s on the same page as to conventional interpretations of technical terms. (I love semantics and literally wrote a book on linguistic semantics back when I worked on linguistics). The paper’s being technically precise here in its terminology, but I suspect the use of a variable named log_lik in Stan is causing confusion. The log likelihood function \mathcal{L} applied to a parameter vector \theta returns the log density of the data given that parameter, \mathcal{L}(\theta) = p(y \mid \theta).

The ELPD is not the likelihood; it’s the expected log posterior predictive density. You can see from the formula defining it that it is an expectation over the posterior predictive distribution. It involves the likelihood as a term. The variable log_lik in Stan is called that because its value is the pointwise log likelihood evaluated at posterior draws of the parameter vector. sum(log_lik) is the value of the log likelihood function at the posterior draw.

What loo does is simulate a kind of cross-validation to estimate the log predictive density of one held-out point based on the other points. That is, it’s a kind of leave-one-out cross-validation estimator of the ELPD.
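In code, the unsmoothed version of that estimator works directly on the pointwise log-likelihood matrix. This is a deliberately naive Python sketch: PSIS in the loo package smooths the raw importance weights, and the tiny log_lik matrix here is made up for illustration:

```python
import math

def log_mean_exp(v):
    # numerically stable log of the mean of exp(v)
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v) / len(v))

# Hypothetical S x N matrix of pointwise log likelihoods:
# log_lik[s][i] = log p(y_i | theta^(s)) for posterior draw s.
log_lik = [
    [-1.2, -0.8, -2.1],
    [-1.0, -0.9, -1.9],
    [-1.3, -0.7, -2.3],
]

S = len(log_lik)
N = len(log_lik[0])

# Raw importance-sampling LOO: for each point i, the importance
# ratios are 1 / p(y_i | theta^(s)), so
#   elpd_loo_i = -log_mean_exp(-log_lik[:, i])
elpd_loo = sum(
    -log_mean_exp([-log_lik[s][i] for s in range(S)])
    for i in range(N)
)
print(elpd_loo)
```

The per-point structure is why the paper's quantities are "pointwise": each elpd_loo_i is an estimate of the log predictive density for one held-out observation, and they are summed at the end.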