But I don’t see a multi_normal_lcdf in the docs. Is there a reason for that? I think I could write multi_normal_lcdf in terms of normal_lcdf (cf. this Stack Overflow answer for the bivariate case). Is that the best way? Thanks.
Unfortunately, multi_normal_lcdf doesn’t exist in Stan. I’m working (slowly) on getting a bivariate normal cdf and lcdf into the math library. The algorithm is more complex than the one in the Stack Overflow answer you linked to; we’re implementing something like the algorithm presented in section 2.2 here: http://www.math.wsu.edu/faculty/genz/papers/bvnt.pdf
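In the meantime, if you just want to sanity-check values outside Stan, SciPy exposes a multivariate normal CDF computed with Genz-style numerical integration. A minimal bivariate sketch (the wrapper name is just illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def binormal_cdf(x1, x2, rho):
    # P(X1 <= x1, X2 <= x2) for standard normals with correlation rho.
    # SciPy's cdf uses numerical integration (Genz-style) internally.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([x1, x2])
```

For the origin, the closed form Phi2(0, 0; rho) = 1/4 + arcsin(rho)/(2*pi) gives an easy check on the numerical answer.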
Essentially, you’ll declare a parameter vector<lower=0,upper=1>[k] u[N] that you can connect to your observations vector[k] x[N] with some clever inverse transform sampling laid out in that Google Groups post.
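For concreteness, the inverse transform trick can be sketched outside Stan like this (Python, lower truncation only; the helper name and interface are my own, not from the Google Groups post). Each uniform u[i] is mapped through the conditional truncated-normal quantile function given the earlier coordinates, via the Cholesky factor:

```python
import numpy as np
from scipy.stats import norm

def rtmvnorm_lower(mu, Sigma, lb, u):
    """One draw from N(mu, Sigma) truncated below at lb, componentwise,
    given uniforms u in (0, 1) via sequential inverse transform sampling."""
    L = np.linalg.cholesky(Sigma)
    k = len(mu)
    z = np.zeros(k)
    for i in range(k):
        # lower bound on z[i] implied by y[i] >= lb[i], given z[0..i-1]
        a = (lb[i] - mu[i] - L[i, :i] @ z[:i]) / L[i, i]
        p = norm.cdf(a)
        # map u[i] into the untruncated tail mass above a, then invert
        z[i] = norm.ppf(p + (1.0 - p) * u[i])
    return mu + L @ z
```

With mu = 0, Sigma = I, and lb = 0, every component of the returned draw lies in the positive orthant by construction.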
I hate to necro an old thread, but I’ve been playing with Ben’s solution. While it works great for sampling from a truncated normal, Ben suggests in the PDF that you can also use it to learn the parameters of the truncated normal, and that doesn’t work because you still need the normalizing constant of the truncation. You can prove this mathematically, but the easiest way to see it is to take the example and add priors to mu and Sigma, so that make_stuff specifies the conditional density of y | mu, Sigma. You’ll have to make the priors tight to prevent divergent transitions (I used K=2 with mu ~ N(0, 0.1), diag(Sigma) ~ N(1, 0.1), and assumed Sigma was diagonal). Then the marginal of mu should be just its prior, but Stan will subtly sample from a slightly different distribution. In my examples you can only detect this with a large sample size: it looks OK to the naked eye, but a KS test will catch it.
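The check described here can be run with scipy.stats.kstest against the prior CDF. In this sketch the draws are synthetic stand-ins generated from the correct prior (so the test passes); feeding it actual Stan draws of mu from the model missing the normalizing constant is where the mismatch shows up at large sample sizes:

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(1)
# Stand-in for posterior draws of mu; replace with the draws from Stan.
draws = rng.normal(0.0, 0.1, size=20000)

# Compare the empirical distribution of the draws to the mu ~ N(0, 0.1) prior.
stat, pval = kstest(draws, lambda x: norm.cdf(x, loc=0.0, scale=0.1))
```

A small KS statistic (and non-tiny p-value) means the marginal of mu is indistinguishable from its prior at this sample size; the subtle bias described above drives the statistic up as the number of draws grows.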
Hi @Matthew_Simpson, do you have a solution that corrects for what you describe, like math or Stan code? Indeed, when I sample from mu ~ std_normal(), the marginal of mu should be its prior, but it isn’t. I don’t observe any problem in the HMC diagnostics.
There is no known solution @samlevy. You need the normalizing constant of the multivariate truncated normal as a function of mu and sigma, and this is simply not analytically tractable for 3+ dimensions. Or at least no one knows how to use NUTS/HMC in this case, AFAIK. If you manage to figure it out, you could have a nice publication - this would have applications to multinomial probit discrete choice models, for example.
What you can do is use NUTS/HMC for the latents conditional on the parameters (y|mu, sigma), and for the parameters conditional on the latents (mu, sigma | y), and construct a NUTS-within-Gibbs sampler in this way. But this is entering original research territory - e.g. I’m not sure how well the tuning algorithms Stan and other NUTS implementations use will work in the context of a Gibbs sampler, and I’m not sure if you can use Stan in this way.
I see. I was trying to use @bgoodri’s code to infer the mean and/or covariance of a truncated multivariate normal distribution. My ultimate goal was to use NUTS to sample from a constrained Gaussian process: I write the covariance function as a joint kernel of the GP and its derivative GP, and imposing a certain number of constraints on the derivative part (to get a shape-constrained original GP) is akin to sampling from a large truncated multivariate normal. The mean mu and the hyperparameters of the covariance function are unknown and should be inferred. But as far as I understand what you said, that’s not feasible unless I treat mu and the covariance as data rather than parameters.
@Mark_Shipman / others interested: here’s a general case where one runs into this problem.
Take simple linear normal regression, y ~ N(mx + c, sigma), and extend it to panel data (y versus x separately for each of many groups) by including in m and c random effects that are normally distributed between groups. If we include the group-specific values of m and c as parameters to be numerically sampled, then the y values within each group are conditionally independent of each other, and our full likelihood looks like y ~ N(m_group x + c_group, sigma) as usual. However, as this is a linear normal model, we can analytically marginalise over all the random effects, which becomes particularly desirable when there are many groups and/or very few measurements in some groups (e.g. this allows us to include groups that have only a single measurement). This marginalisation breaks the aforementioned conditional independence, making the set of y values for the same group jointly normally distributed with an arbitrary covariance matrix. I show how to do that in Stan here. If a group has N observations and M of these are censored, we need the N-dimensional normal with M of the dimensions integrated over the censoring range. It’s easy to imagine that in some applications of group-specific simple linear normal regression, N and M could get large.
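The within-group covariance that marginalisation induces can be sketched numerically. This assumes independent random effects on intercept and slope (the function name is hypothetical, not from the case study):

```python
import numpy as np

def group_marginal_cov(x, tau_c, tau_m, sigma):
    """Marginal covariance of y within one group after integrating out the
    group's random intercept c_group ~ N(0, tau_c) and random slope
    m_group ~ N(0, tau_m), assumed independent of each other."""
    x = np.asarray(x, dtype=float)
    Z = np.column_stack([np.ones_like(x), x])   # design for (c_group, m_group)
    D = np.diag([tau_c ** 2, tau_m ** 2])       # random-effect covariance
    # Cov(y) = Z D Z' + sigma^2 I
    return Z @ D @ Z.T + sigma ** 2 * np.eye(len(x))
```

The off-diagonal entries, tau_c^2 + tau_m^2 x_i x_j, are exactly the correlation that breaks conditional independence; each group’s y vector is then jointly multivariate normal with this covariance.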
That case study is really beautiful (and concise!). Thanks for sharing. Do you have plans for the case study or might you be OK with our including it in our list of official Stan case studies? We should really write something up along these lines for the Stan User’s Guide.
In the next release, we should have nested Laplace approximation built in. It’s a black-box method that will work best for normals, though not as well as doing things analytically.
Thanks Bob. Will follow up by email for the case study.
On the use of priors with Hard Interval Boundaries that are not part of the physical constraints of the problem (HIBs henceforth), I’ll try to argue their corner.
I think your topic illustrates that HIBs are an issue if they cut into where there would otherwise be an appreciably large fraction of the posterior mass, but not if they cut into where there would otherwise be a negligible fraction of posterior mass. My takeaway from that is not to avoid HIBs in general, but to plot the posterior and prior overlaid to compare them, as part of checking model fit (i.e. a posterior that bunches at a non-physical boundary is an example of prior-posterior tension, motivating a less strict prior if on reflection you believe the data more than your initial model choice). I make such plots in all real analyses and most of my case studies, just skipped it in that one to save time.
I take the point made elsewhere that HIBs are not uninformative: one could argue that it’s quite a strong assumption that the mass outside any finite range is exactly zero instead of just very small. However, depending on the range chosen and the context, it may be an acceptable assumption. I consider this analogous to the choice of any one likelihood completely excluding every alternative likelihood, notably more general/flexible likelihoods that contain your current one within them (just as priors over an infinite range encompass any finite range). Prior and likelihood are both models and both wrong, but if they have an acceptable fit to the data and are useful, being as simple as possible is desirable. I find uniform distributions simpler to communicate to a wider audience - a more rudimentary understanding of probability distributions suffices to get the idea that we restricted possible parameter values to this range, all values equally likely, instead of e.g. we used a prior of exp(1). And HIBs can be useful provided the range includes some values in each of the qualitative regimes between which we want to distinguish. e.g. a parameter for how much benefit or harm some treatment causes might theoretically take any real value, but if we truncate the prior to a finite range including implausibly large harms and implausibly large benefits and the posterior concentrates narrowly at modest benefits, our finite range has provided sufficient flexibility for the inference needed.
One issue is when you use a too-narrow interval. That is solved by using wider intervals.
But if you know the edges are implausible, don’t you want to build that into your model rather than saying they’re just as likely as more plausible values? On the other hand, it might not make much difference in practice if you have a lot of data.
An even simpler option would be improper flat priors. Computationally, flat priors have the advantage of not requiring a scaled and shifted inverse logit transform and associated change of variables adjustment.
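For reference, the transform Stan applies to a parameter declared with `<lower=a, upper=b>` is a scaled-and-shifted inverse logit; a minimal sketch of it and the log-Jacobian adjustment that flat priors avoid (helper name hypothetical):

```python
import numpy as np

def constrain(x, a, b):
    """Map an unconstrained real x to the interval (a, b) via a
    scaled-and-shifted inverse logit, returning the constrained value
    and the log absolute Jacobian of the transform."""
    inv_logit = 1.0 / (1.0 + np.exp(-x))
    y = a + (b - a) * inv_logit
    # log |dy/dx| = log(b - a) + log(inv_logit) + log(1 - inv_logit)
    log_jac = np.log(b - a) + np.log(inv_logit) + np.log1p(-inv_logit)
    return y, log_jac
```

With an improper flat prior on an unbounded parameter, none of this work (the exp, the logs, and their gradients) is needed, which is the computational advantage being described.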
I feel your pain here. I’m curious how your audience handles the rest of the model if they aren’t familiar with parametric distributions. Or are they choking on the prior because they’re frequentists who don’t use shrinkage who are afraid of putting their thumb on the scale?
Anyway, the bottom line is that we built Stan so that users could fit the models they wanted, so I’m glad you can do that.
On the other hand, it might not make much difference in practice if you have a lot of data.
This is indeed the happy view from where I sit in the Big Data Institute. I agree that when data is scarce, priors should be more carefully chosen to reflect actual beliefs. When data is rich, we have the luxury of choosing the prior that most accessibly or persuasively makes our point. (I think a related point is choosing symmetrical-about-the-null priors for effects even when our actual prior is shifted asymmetrically from the null, provided the data are informative, because this is more persuasive to an audience whose prior is more strongly towards the null. e.g. we may believe a priori that our favourite treatment is helpful not harmful, but some others don’t, and if we use their prior rather than our own, the resulting posterior is more persuasive.)
Thanks, I didn’t appreciate that there was any computational hit from interval priors instead of improper flat priors. I think this will always be outweighed by my preference for a proper prior.
how your audience handles the rest of the model if they aren’t familiar with parametric distributions
I usually aim to include in my audience people with no training in statistics at all - people with an interest in the research question and our answer, even if they can’t follow the statistical modelling. For me, a key output of Bayesian inference is a plot like one of the two attached examples. I’m not basing this on experience, but imagining myself in the shoes of someone unfamiliar with Bayesian inference, I think the normal(0, 5) prior example here looks a little more mysterious than the uniform(-10, 10) prior.