Hierarchical model, binning, no closed-form integral

I have a difficult time conceptualizing a multilevel model where some observations are binned. The following is a simplification of my actual model for the purposes of asking this question.

I have parameters \theta, \varepsilon, \mu, \Sigma, and latent variables x_i, y_i for i = 1, \dots, N.

I am assuming that

\begin{bmatrix} x_i \\ y_i \end{bmatrix} \sim N(\mu, \Sigma)

come from a bivariate normal distribution.

The data is X_i, and a bracket [a_i, b_i] for each i. X_i is simple:

X_i \sim N(x_i, \varepsilon)

For the bracket, let f(x_i, y_i, \theta) be a parametric function family. For each i, all I know is that f(x_i, y_i, \theta) \in [a_i, b_i] (f is a structural model, and this is survey data). [edit: sometimes b_i = \infty].

Now my likelihood is an indicator function, and it is unlikely to work nicely with HMC in \mathbb{R}^n.

Is there some kind of standard trick I can use? I thought of adding a noise term, but I am not sure that its magnitude would be identifiable.

There are two possible solutions, but I’m not sure you’ll find the first satisfactory, and I’m not sure you’ll find the second workable.

  1. Approximate your piecewise-constant likelihood with a continuous function, i.e. one that is sigmoid in shape near a_i and b_i. The sharper the sigmoid, the better the approximation, but also the worse the sampler is likely to perform, since the curvature of the likelihood becomes more extremely variable.
  2. Analytically find the region of parameter space that satisfies the constraints imposed by the bracket and f, find a representation of this region that uses unconstrained parameters, and sample in that space.

Regarding option 1, I like the logistic function, as it needs only one exponential evaluation. Control the height h, midpoint \mu, and bandwidth \sigma (inversely proportional to “steepness”) thus:

\frac{h}{1+e^{- \frac{x- \mu}{\sigma}}}

Experiment with \sigma and see how Stan slows down as it shrinks, find a tolerable level. Negative \sigma means a decreasing ramp.
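To make the idea concrete, here is a minimal Python sketch (purely illustrative, outside Stan; the function names are mine): a soft bracket indicator built as the difference of two logistic ramps, one rising at a and one rising at b, which approaches the hard indicator 1_{x \in [a, b]} as \sigma shrinks:

```python
import math

def logistic(x, mu, sigma, h=1.0):
    """Logistic ramp with height h, midpoint mu, bandwidth sigma."""
    return h / (1.0 + math.exp(-(x - mu) / sigma))

def soft_indicator(x, a, b, sigma):
    """Smooth stand-in for 1_{x in [a, b]}: an increasing ramp at a
    minus an increasing ramp at b. Approaches the hard indicator
    as sigma -> 0."""
    return logistic(x, a, sigma) - logistic(x, b, sigma)

# Inside the bracket the value is near 1, outside near 0:
print(soft_indicator(3.5, 3.0, 4.0, 0.01))  # ~1
print(soft_indicator(5.0, 3.0, 4.0, 0.01))  # ~0
```

In a Stan model the log of this quantity would be added to the target; the trade-off is exactly the one described above, since small \sigma means sharp curvature near the bin edges.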

By the way, some people are never happy with this approach and that’s OK too.


I assume f() is monotonic, or you're going to be in a much harder place. Also, do you have at least some idea of the form of f()?


f() comes from a numerical solution computed outside Stan; it is approximated using Chebyshev polynomials. I think it is monotonic in practice, aside from the usual tiny numerical wobbles the approximation may introduce.

@jsocolar’s suggestion (1) may be the practical solution; my understanding is that, with the right distribution, it is equivalent to introducing another error \eta_i and assuming that

f(\dots) + \eta_i \in [a_i, b_i]

I can start with a large variance for \eta_i, see how it works, and just scale it down to the point where I am still satisfied with MCMC convergence. I may just assume that \eta_i \sim N(0, \sigma_\eta) directly, greasing the wheels.
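Concretely, marginalizing \eta_i \sim N(0, \sigma_\eta) out of the constraint turns the indicator into a smooth probability, P(f + \eta_i \in [a_i, b_i]) = \Phi\!\left(\frac{b_i - f}{\sigma_\eta}\right) - \Phi\!\left(\frac{a_i - f}{\sigma_\eta}\right). A small Python sketch (illustrative only; the names are mine) of this quantity and of how \sigma_\eta interpolates between a gentle likelihood and the hard indicator:

```python
from math import erf, sqrt, inf

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bracket_prob(f_val, a, b, sigma_eta):
    """P(f + eta in [a, b]) for eta ~ N(0, sigma_eta).
    b may be inf for the open-ended top bracket."""
    upper = 1.0 if b == inf else Phi((b - f_val) / sigma_eta)
    return upper - Phi((a - f_val) / sigma_eta)

# Large sigma_eta: a gentle, easy-to-sample likelihood term;
# small sigma_eta: close to the hard indicator.
print(bracket_prob(3.5, 3.0, 4.0, 1.0))    # well below 1
print(bracket_prob(3.5, 3.0, 4.0, 0.01))   # ~1
print(bracket_prob(5.0, 3.0, 4.0, 0.01))   # ~0
```

In Stan this would be the difference of two `normal_cdf` calls (or `1 - normal_cdf` for the open-ended bracket), entering the target on the log scale.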

This raises the question, though, of why I am not simply pretending that the binned data was observed at its midpoint, with normal noise whose standard deviation is around bin width/\sqrt{12} (the standard deviation of a uniform distribution), or adjusted mildly upward.

Is there a downside to doing this? Apart from having to assume something for the last open-ended bin.
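For reference, a tiny Python sketch (the helper name is hypothetical) of the midpoint-plus-uniform-noise conversion described above:

```python
from math import sqrt

def midpoint_pseudo_obs(a, b, inflate=1.0):
    """Replace an observed bin [a, b] with a pseudo-observation at the
    midpoint, with normal noise whose sd matches that of a uniform
    on [a, b] (width / sqrt(12)), optionally inflated a little."""
    width = b - a
    return 0.5 * (a + b), inflate * width / sqrt(12.0)

m, s = midpoint_pseudo_obs(3.0, 4.0)
print(m, s)  # 3.5 and ~0.2887
```

The open-ended top bin has no midpoint or width, so it would still need an ad-hoc choice, as noted above.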

I think that’s an acceptable alternative. Some might think it’s not as principled as an interval-censoring approach, but it could be very close, and whatever you choose requires assumptions.


Since you have f(), you could take another monotonic function, such as a CDF with an inverse CDF available (not necessarily closed-form, since you could use the numerical inverse normal CDF), and use the composition of that function with f() to do @jsocolar's point 2, sampling everything in unconstrained space. The downside is that this makes interpretation of the model's parameters murky, so in generated quantities, or outside of Stan, you transform back to the scale you want.

Sorry, I don’t understand, can you please elaborate a bit?

I hope you don’t mind if I ask some clarifying questions.

  1. What’s the difference between parameters and latent variables? Are they both unknown and hence coded as “parameters” in Stan?

  2. What does it mean that your likelihood is an indicator function? Do you mean it’s a discrete distribution with 100% of its mass on one point? If so, I don’t see how you could do statistics with it or model survey data with it.

  3. I’m also not sure what you’re calling a likelihood when there are latent variables. There is a standard distinction made in the frequentist literature, where they try to finesse unknowns philosophically by calling them “missing data.” In that situation, there’s usually missing data z and observed data a (y being taken already). Then you have the complete-data likelihood p(a, z \mid \dots) and the likelihood p(a \mid \dots) that marginalizes the “missing data” out.

  4. What is a structural model? It looks like f is a function, but what kind?

It’d be easier to help if you could write down the joint model using mathematical notation.

P.S. The x, y notation is going to be confusing for Stan users since we followed Gelman et al.'s BDA in using x for unmodeled data and y for modeled data.

I appreciate it! It is very easy to fall into the trap of not formulating a question well, and then not getting answers.

  1. Of course: in Bayesian terminology, my “latent variables” are parameters.

  2. the mental model I have of binned data is that a data point x is reported as being in a bin [a, b] when x \in [a, b], i.e. via the indicator function 1_{x \in [a, b]}. If x depends on the model parameters, my likelihood will be an indicator function. Is that not so? (To be honest, I am more concerned about the inverse image being an irregular subset of \mathbb{R}^n, for HMC/NUTS.)

  3. I am not marginalizing anything, the likelihood is what all Bayesian textbooks call the likelihood. That and the prior give me the posterior that Stan samples from.

  4. f comes from an economic model (a Bayesian Nash Equilibrium in a game) that is solved numerically, then the solution is approximated for the purpose of sampling in Stan. It is differentiable.

I realize that the original example is confusing, so here is a simplified one:

  1. Let g(\cdot \mid \theta) denote the pdf of a parametric distribution family, e.g. \mathrm{Normal}(\theta, 1) or similar.
  2. Let f: \mathbb{R} \to \mathbb{R} be a continuously differentiable function, nothing else is assumed.
  3. For i = 1, \dots, N, draw y_i \sim g(\cdot \mid \theta); the y_i are not observed. What is observed are bin indices k_i such that f(y_i) \in [a_{k_i}, a_{k_i+1}]. The bins form a partition of the real line.

Let D denote the data, i.e. the bin boundaries a_k and the bin indices k_i.

Then my likelihood is

p(D \mid \theta, y_1, \dots, y_N) = \prod_i g(y_i \mid \theta) \cdot 1_{f(y_i) \in [a_{k_i}, a_{k_i+1}]}

Imposing a proper prior p(\theta) completes the model.

Everything would be easy if f were invertible (just substitute z_i = f(y_i), etc.), but it is not necessarily invertible.
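A quick numeric illustration of why invertibility matters: for f(x) = x^2 the preimage of a bin [3, 4] is two disjoint intervals, [-2, -\sqrt{3}] \cup [\sqrt{3}, 2], so z_i = f(y_i) is not a one-to-one change of variables:

```python
from math import sqrt

# Preimage of the bin [a, b] under f(x) = x**2, assuming 0 < a < b:
a, b = 3.0, 4.0
lo, hi = sqrt(a), sqrt(b)
print([(-hi, -lo), (lo, hi)])  # two disjoint intervals

def in_preimage(x, a=3.0, b=4.0):
    """True when f(x) = x**2 lands in the bin [a, b]."""
    return a <= x * x <= b

# Both branches satisfy the bracket constraint:
assert in_preimage(1.9) and in_preimage(-1.9)
assert not in_preimage(0.0)
```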

I experimented with it and found that it has the unpleasant property that the bin width affects the peak of the likelihood. For a very simplified example, consider f(x) = x^2 and [a, b] = [3, 4]; then the likelihoods are

[figure: likelihood curves for this example]
I think I am better off with the “this quantity was binned with a small noise” approach. I have coded it in Stan and now the chains are running, I am curious how it will go.
