New marginalisation section for the Stan User's Guide

Hi all,

Recently, we wrote a paper describing an R package for fitting version of the Dawid-Skene model using Stan (see here for a previous post on the forums publicizing the paper). In the paper we also provided a mathematical description of the process of marginalizing out the discrete latent class parameter of the Dawid-Skene model, which was required to allow Stan to fit the model.

We believe that the description in the paper may be useful to the broader Stan community who are interested in fitting models with discrete parameters. Therefore, we have written a summarized version of that section from the paper, focusing on the parts most relevant to Stan and we would like to contribute this to the Stan User’s Guide. As this is a relatively large contribution, we wanted to make sure the Stan community and development team were happy with the idea before we submit it via a pull request. We also wanted to allow interested members of the community to comment on and improve our contribution.

You can read and edit the contribution on overleaf here. We welcome any constructive criticism and feedback on the draft.

Best,

Jeffrey Pullin, Lyle Gurrin and Damjan Vukcevic

10 Likes

In general I feel pretty positive about this (not that my opinion should count for much). I do have several suggestions that are major enough that I’m reluctant to dive in an implement them as edits without discussion first:

1. There’s a lot of math to wade through before we get to any real statement of the main point, which as I see it is this: To recover posterior expectations for discrete parameters that have been marginalized out, we need to condition on any relevant continuous parameters as well as on the data.
2. Personally, I don’t find it useful to write in abstract mathematical terms about how to recover posterior estimates for the continuous parameters. I prefer to think about it this way: we have a marginalized model with exclusively continuous parameters upon which we can do inference in the completely usual way. No need to muddle or confuse this. The subtlety arises iff we want to do inference on the discrete quantities that are no longer parameters in our model.
3. As written, I think that emphasizing conditioning estimates for marginalized quantities on the (continuous) model parameters overshadows the equally important need to condition the estimates on the data. For example, we can formulate a zero-inflated Poisson using a latent discrete parameter Z \sim Bernoulli(p) that tells us whether we are in the zero-inflation regime or not. It’s not good enough to condition posterior inference about Z on p. We also need to condition on the data. In particular, if the observed count is greater than zero, then we are certain that Z = 1 (where Z = 0 corresponds to the zero-inflated state). And if the observed count is zero, then Z will be zero with probability greater than 1 - p.
4. To me the title is ambiguous, because it could just as well apply to a section on the mathematics of how to marginalize out discrete parameters. A better title, IMO, would be something like “Recovering (or maybe Obtaining) posterior inference on marginalized discrete parameters”

Thus, my overall suggestion would be to begin with a simple statement like:

Inference on the parameters of a marginalized model is straightforward, but recovering inference on the discrete quantities that have been marginalized out is more subtle. In particular this inference must be properly conditioned on both the continuous covariates and the data.

Then, I would drastically condense everything prior to line 79.

Finally, I would prominently include a simple worked example.

6 Likes

Hi @jsocolar,

Thanks so much for your detailed feedback!

It’ll be a little while before we have time to digest and respond to your thoughts, so in the meantime I just wanted to say thanks for taking the time to respond.

Best,
Jeffrey

3 Likes

Hi Jacob,

We have now updated our draft contribution based on your feedback.

To respond to your feedback we have:

1. That the purpose of the section is to provide a mathematical explanation of recovering marginalized parameters, and,
2. That understanding the mathematics in the section is not necessary to fit models with marginalized parameters in Stan.
• Changed the title to better reflect the focus on recovering marginalized parameters.
• Added text clarifying that we are suppressing the conditioning of data purely for ease of notation

We have also:

• Added section headings, which we hope will make the contribution easier to follow, and
• Corrected some minor typos

We agree that adding a adding a simple worked example would improve the guide, but we feel it is not needed to explain the mathematics and therefore would be better inserted elsewhere.

Thank you once again for your detailed thoughts!

If there is no other substantial feedback on this new version of the contribution we plan to submit it by PR to the Stan User’s Guide sometime in the next few days.

Best,

Jeffrey Pullin, Lyle Gurrin and Damjan Vukcevic

4 Likes

I’m interested in reading but it says I do not have permission when I click on the link.

1 Like

Hi @spinkney,

Edit: I think I’ve now fixed the link in my reply to Jacob.

Best,
Jeffrey

1 Like

Works, thanks!

1 Like

Hi @jeffreypullin

I do still worry that the current draft will mislead people in practice. Recall that the users guide is going to be read, and translated into applied statistical practice, by people who don’t necessarily have much background in math or probability. Perhaps I am oversensitive precisely because I very nearly made (a mistake analogous to) the following mistake a while back in my own applied work:

Suppose I am a Stan user and I want to fit (for the sake of a simple example), a zero-inflated Poisson which I write mathematically as:

Z \sim Bernoulli(p)
Y \sim Poisson(Z*\lambda)

Marginalized Stan code (not optimized) is

data {
int N; // number of data
int<lower=0> Y[N] // response
}
parameters {
real<lower = 0, upper = 1> p;  // probability that we are *not* in the zero-inflation regime
real<lower = 0> lambda; // Poisson parameter
}
model {
for (i in 1:N) {
if (Y[i] == 0) {
target += log_sum_exp(bernoulli_lpmf(0|p), bernoulli_lpmf(1|p) + poisson_lpmf(0|lambda);
} else {
target += poisson_lpmf(y[i]|lambda) + bernoulli_lpmf(1|p);
}
}


Now suppose the fitted values of p and lambda are p = 0.5 and lambda = 10. And suppose we want to recover a posterior for Z[1].

If I’m a user who is uncomfortable with marginalization (and therefore I am reading through the users guide to do this properly), my main takeaway would be that I can estimate Pr(Z = 1) (in your document this is Pr(Y=k)) by evaluating Pr(Z = 1 | p, \lambda). So what do I make of this? I’m accustomed to generative modeling and I know how to simulate data for given values of p and \lambda, and I know that Z is independent of \lambda when I simulate the non-marginalized model from the generative process. So naturally, I take the posterior for Z[1] to be bernoulli_rng(p).

The problem is that this is wrong. In this example, it’s very wrong. If Y[1] is one, then Z[1] is one with probability one. And if Y[1] is zero, then Z[1] is one with probability near zero.

In my personal experience, this is the main confusion about recovering the marginalized parameters, and I think that without some kind of tweak, this draft deepens, rather than ameliorates, the confusion. What is missing is a reminder (or several) that in order to properly condition on the continuous parameters, we need to also condition on the data. And not just in the sense of conditioning the continuous parameter estimates on the data. (edit to add: We need to notice that our likelihood itself conditions on the data with its if statement). The conditional expectation for \mathbb{E}(Z|p,\lambda) (as in the current draft), read in a vacuum, is simply p. What we need is \mathbb{E}(Z|p,\lambda,Y). I think that explaining this carefully is the main thing that a users guide section ought to accomplish here.

Again, I think adding this section to the users guide will be really useful, and I feel really positive about the direction you’re taking. Thank you!

7 Likes

Hi @jsocolar,

Thanks once again for your feedback!

We agree that our contribution could be a little misleading in the way you describe; we will update it soon to clarify the need to condition on data.

Best,
Jeffrey

1 Like

Hi @jsocolar and all,

Thank you for your continuing feedback!

We have now updated the contribution to explicitly mention the need to condition on the observed data when recovering the latent parameters.

Best,
Jeffrey

Things for clarity and Stan users

At the end of 2.2, I think it’s nice to reference that expectation property as the law of iterated expectations so someone this is new to can look it up.

In 2.2 and 2.3 and possibly elsewhere. There is plenty of confusion about whether Stan “draws from” a distribution when one writes x ~ std_normal(); (it does not). I think it would be helpful to say how using the law of iterated expectations allows us to write a valid Stan program on the continuous Euclidean space that the Stan sampler can explore. However, you can recover these by actually drawing from the distribution in the generated quantities block. I think it would be useful to add which part of the Stan program that these things would go. For example, I’d remove the “Stan as usual” and be more explicit about what that means.

In the summary you mention Rao-Blackwellization but this is not mentioned anywhere else. Can you add a definition or explanation either earlier (better) or in a footnote (sufficient)?

3 Likes

Hi @spinkney,

Best,
Jeffrey

2 Likes

Hi all,

Thanks to everyone who provided feedback on the contribution!

A PR to the Stan docs repository adding the final version of the contribution has now been created here.

Best,
Jeffrey

2 Likes