Why an equal-weight average in change point detection?

Hello, I am reading 8.2 Change point models | Stan User’s Guide (mc-stan.org) and am confused by this part: “Posterior distribution of the discrete change point”. There it says

   p(s \mid D) \propto q(s \mid D) = \frac{1}{M}\sum_{m=1}^{M} \exp(\mathrm{lp}[m, s]).

But why does every draw get an equal weight of 1/M? Why not this:

  p(s \mid D) = \sum_{e,\,l} p(s \mid e, l, D)\, p(e, l \mid D), summed over all parameters e and l?

Thanks,
Hongbo

This is right, but marginalizing over e and l yields the first expression in your post.
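Spelled out (with e^{(m)}, l^{(m)} denoting the m-th posterior draw of the parameters), the marginalization is

p(s \mid D) = \int p(s \mid e, l, D)\, p(e, l \mid D)\, de\, dl \approx \frac{1}{M}\sum_{m=1}^{M} p(s \mid e^{(m)}, l^{(m)}, D),

where the 1/M comes from the draws being (approximately) samples from p(e, l \mid D).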

This is how inference from MCMC samples works in general. For example, if we want to estimate the mean of a posterior distribution from a set of MCMC samples, we take the mean of the samples with equal weighting.
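As a minimal toy sketch of what “equal weighting” means for a posterior mean (plain NumPy, with simulated values standing in for real MCMC draws):

```python
import numpy as np

# Simulated stand-in for M posterior draws of a scalar parameter;
# in practice these would be extracted from a fitted Stan model.
rng = np.random.default_rng(1)
draws = rng.normal(loc=2.0, scale=0.5, size=4000)

# Posterior mean estimate = equal-weight (1/M) average of the draws.
posterior_mean = draws.sum() / draws.size   # same as draws.mean()
print(posterior_mean)
```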

Thanks! The problem might be that p(e, l | D) as computed by MCMC is not a true distribution, as it doesn’t sum to 1, so p(e, l | D) cannot be used for the integration directly. But I still don’t quite understand why 1/M is used. Maybe it’s an approximation?

1/M is how you take an average across draws

Edit: an important point here is that if we have a bunch of iteration-wise probabilities of a single binary outcome (i.e. a posterior distribution for that probability), then we can summarize them into a single posterior probability of the binary outcome by taking the (arithmetic) mean across the draws.
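A small sketch of that point, with simulated values standing in for real draws of an iteration-wise event probability:

```python
import numpy as np

# Simulated stand-in for M posterior draws of Pr(Y = 1 | parameters),
# i.e. an iteration-wise probability of a single binary outcome.
rng = np.random.default_rng(2)
prob_draws = rng.beta(2.0, 5.0, size=4000)

# Pr(Y = 1 | D) = E[Pr(Y = 1 | parameters) | D], estimated by the
# arithmetic mean of the per-draw probabilities.
p_event = prob_draws.mean()
print(p_event)
```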

Thanks!

Found that the Stan User’s Guide has a later section explaining the math behind this: 8.5 The mathematics of recovering marginalized parameters | Stan User’s Guide (mc-stan.org).

For this specific Change Point Detection model in 8.2 Change point models | Stan User’s Guide (mc-stan.org), based on Section 8.5, I think we can reason as follows:

\Pr[S = s\mid D] = \mathbb{E}[I(S = s) \mid D] where I(S = s) is the indicator function, which equals 1 when S = s and 0 otherwise

= \mathbb{E}[\mathbb{E}[I(S = s) \mid D, e, l] \mid D] by the law of iterated expectations (the outer expectation is over p(e, l \mid D))

= \mathbb{E}[\Pr(S = s \mid D, e, l) \mid D]

= \mathbb{E}\Big[\frac{\Pr(S = s, D \mid e, l)}{\Pr(D \mid e, l)} \,\Big|\, D\Big]

\approx \frac{1}{M}\sum_{m=1}^{M}\frac{\Pr(S = s, D \mid e^{(m)}, l^{(m)})}{\Pr(D \mid e^{(m)}, l^{(m)})}

Here M is the number of draws of the parameters (e, l) produced by MCMC, and (e^{(m)}, l^{(m)}) is the m-th draw. These draws, however, are not independent, because MCMC uses a Markov chain to move from one draw to the next. To use the above estimate, the assumption is that M is big enough that the law of large numbers (in its MCMC/ergodic form) ensures the average is close to the true expectation.
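To make the last line concrete, here is a minimal sketch of how one might compute it from draws of the lp quantity the model exposes; the array shapes and the simulated lp values below are assumptions for illustration, not the guide’s code:

```python
import numpy as np

# Hypothetical lp: an (M, T) array whose entry lp[m, s] is a draw of
# log Pr(S = s, D | e^(m), l^(m)); here random numbers stand in for
# values extracted from a real fit.
rng = np.random.default_rng(0)
M, T = 4000, 50
lp = rng.normal(size=(M, T))

# Per draw: Pr(S = s | D, e^(m), l^(m)) = exp(lp[m, s]) / sum_s' exp(lp[m, s']),
# computed stably via log-sum-exp; the denominator plays the role of Pr(D | e^(m), l^(m)).
log_norm = np.logaddexp.reduce(lp, axis=1, keepdims=True)
per_draw_probs = np.exp(lp - log_norm)        # each row sums to 1

# Equal-weight (1/M) Monte Carlo average over draws gives the estimate of p(s | D).
p_s_given_D = per_draw_probs.mean(axis=0)     # length-T vector, sums to 1
print(p_s_given_D[:5])
```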

Hopefully this understanding is correct and helps clear things up.