Fitting a distribution with data on probability mass in bins

Imagine I ask someone to allocate probability mass to discrete bins of realizations of a variable: for example, what is the probability that X \in (-1, 1], that X \in (1, 3], and so on. I also ask them for the mean outcome they expect.

Let us simulate this scenario. Suppose the person believes the correct distribution is a standard normal. When I ask them to allocate probability to the bins (-\infty, -1], (-1, 1], (1, +\infty), they give me the probabilities 0.159, 0.683, 0.159 (`c(pnorm(-1), pnorm(1) - pnorm(-1), 1 - pnorm(1))` in R). They also give me a point estimate for the mean, 0. I now want to infer which values of \sigma are plausible given those bin probabilities.

A way I was thinking of writing this likelihood function is:

L = F(\text{UB} | \text{Point estimate}, \sigma)^\text{C} \cdot (1-F(\text{UB} | \text{Point estimate}, \sigma))^{(1-\text{C})}

where F is the CDF of the normal distribution; UB is the upper bound of the interval (in the example, `c(-1, 1, Inf)`; the last bin is ignored because it is redundant and to avoid numerical issues); the point estimate is provided by the person; and C is the cumulative probability of the bin (0.159, 0.841, 1.000; `pnorm(c(-1, 1, Inf))` in R).
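On the log scale, this target can be written in a few lines of base R (this `log_lik` helper is my own sketch, not part of the Stan model below, but it computes the same quantity):

```r
# Per-bin log target: C * log F + (1 - C) * log(1 - F), summed over bins.
log_lik <- function(sigma,
                    point_estimate  = 0,
                    upper_bounds    = c(-1, 1),
                    cumulative_prob = pnorm(c(-1, 1))) {
  F <- pnorm(upper_bounds, mean = point_estimate, sd = sigma)
  sum(cumulative_prob * log(F) + (1 - cumulative_prob) * log(1 - F))
}
log_lik(1)
#> [1] -0.8748665
```

The value at sigma = 1 matches `$value` from the `rstan::optimizing` run below.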

The solution to the maximum likelihood problem is straightforward: we need a \sigma such that F(\text{UB} | \text{Point estimate}, \sigma) = \text{C}. Running Stan's optimizer (see below) finds the solution very quickly: \sigma = 1 for this example.
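Since the point estimate is 0, the condition F(UB | 0, \sigma) = C is \Phi(UB / \sigma) = C, so each informative bin pins down \sigma = UB / \Phi^{-1}(C). A quick sanity check in R:

```r
# Each informative bin implies the same sigma: sigma = UB / qnorm(C)
upper_bounds    <- c(-1, 1)
cumulative_prob <- pnorm(upper_bounds)
upper_bounds / qnorm(cumulative_prob)
#> [1] 1 1
```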

However, when running HMC the sampler struggles badly, exploring all the way to infinity. I know I am not using priors, but even when I add one, the algorithm explores parameter values that are clearly unreasonable for the problem.

Below I show the code for this example. In the first approach, no prior is introduced. In the second I introduce a prior just to constrain the most extreme outcomes, but it is still not enough to stop the sampler from wandering far into unreasonable values: for example, sigma's 75th percentile is around 13. If sigma were 13, the cumulative probabilities of the bins would look very different (0.469, 0.531, 1.000 instead of 0.159, 0.841, 1.000). In addition, as you can see in the chart of sigma against lp__, the maximum is clearly not close to 1, the solution to the maximum likelihood problem found with `rstan::optimizing`.
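To see how far off sigma = 13 is, the implied cumulative bin probabilities can be recomputed directly:

```r
# Cumulative probabilities at the bin edges implied by sigma = 13
round(pnorm(c(-1, 1, Inf), mean = 0, sd = 13), 3)
#> [1] 0.469 0.531 1.000
```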

So either I am doing something very wrong with the modelling and there is a better way to do this, or there is a numerical issue I am not aware of. Do you have any ideas?

Thank you for your time.

library(rstan)
#> Loading required package: StanHeaders
#> 
#> rstan version 2.32.3 (Stan version 2.26.1)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
#> For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
#> change `threads_per_chain` option:
#> rstan_options(threads_per_chain = 1)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

data_to_pass_stan <- 
  list(
    N = 2,
    point_estimate = c(0, 0),
    upper_bounds = c(-1, 1),
    cumulative_prob = pnorm(c(-1, 1))
    )

example_stan_dist <- rstan::stan_model("infer-distribution-simple.stan")
print(example_stan_dist)
#> S4 class stanmodel 'anon_model' coded as follows:
#> data {
#>   int<lower=0> N;
#>   vector[N] point_estimate;
#>   vector[N] upper_bounds;
#>   vector[N] cumulative_prob;
#> }
#> parameters {
#>   real<lower=0> sigma;
#> }
#> model {
#>   for (n in 1:N) {
#>     target += (cumulative_prob[n]) * normal_lcdf(upper_bounds[n] | point_estimate[n], sigma) +
#>     (1-cumulative_prob[n]) * normal_lccdf(upper_bounds[n] | point_estimate[n], sigma);
#>   }
#> }

rstan::optimizing(example_stan_dist, data = data_to_pass_stan)
#> $par
#>     sigma 
#> 0.9999986 
#> 
#> $value
#> [1] -0.8748665
#> 
#> $return_code
#> [1] 0
#> 
#> $theta_tilde
#>          sigma
#> [1,] 0.9999986

samples <- 
  rstan::sampling(example_stan_dist, data = data_to_pass_stan)
#> Warning: There were 2027 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#bulk-ess

samples
#> Inference for Stan model: anon_model.
#> 4 chains, each with iter=2000; warmup=1000; thin=1; 
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#> 
#>                mean se_mean   sd         2.5%           25%           50%
#> sigma 9.300592e+307     NaN  Inf 7.87289e+306 5.173977e+307 9.403159e+307
#> lp__   7.074900e+02    0.05 0.86  7.05270e+02  7.071500e+02  7.077500e+02
#>                 75%        97.5% n_eff Rhat
#> sigma 1.349518e+308 1.75899e+308   NaN  NaN
#> lp__   7.081100e+02  7.08370e+02   320 1.01
#> 
#> Samples were drawn using NUTS(diag_e) at Fri Nov 10 10:04:25 2023.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

example_stan_dist_prior <- rstan::stan_model("infer-distribution-simple-prior.stan")
print(example_stan_dist_prior)
#> S4 class stanmodel 'anon_model' coded as follows:
#> data {
#>   int<lower=0> N;
#>   vector[N] point_estimate;
#>   vector[N] upper_bounds;
#>   vector[N] cumulative_prob;
#> }
#> parameters {
#>   real<lower=0> sigma;
#> }
#> model {
#>   sigma ~ exponential(.1);
#>   for (n in 1:N) {
#>     target += (cumulative_prob[n]) * normal_lcdf(upper_bounds[n] | point_estimate[n], sigma) +
#>     (1-cumulative_prob[n]) * normal_lccdf(upper_bounds[n] | point_estimate[n], sigma);
#>   }
#> }

rstan::optimizing(example_stan_dist_prior, data = data_to_pass_stan)
#> $par
#>     sigma 
#> 0.9114252 
#> 
#> $value
#> [1] -0.9700811
#> 
#> $return_code
#> [1] 0
#> 
#> $theta_tilde
#>          sigma
#> [1,] 0.9114252

samples_with_prior <- 
  rstan::sampling(example_stan_dist_prior, data = data_to_pass_stan)
#> Warning: There were 31 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems

samples_with_prior
#> Inference for Stan model: anon_model.
#> 4 chains, each with iter=2000; warmup=1000; thin=1; 
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#> 
#>        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
#> sigma  9.17    0.27 9.27  0.59  2.57  6.00 12.97 35.10  1205    1
#> lp__  -0.41    0.01 0.56 -1.97 -0.57 -0.21 -0.03  0.02  1753    1
#> 
#> Samples were drawn using NUTS(diag_e) at Fri Nov 10 10:04:26 2023.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

library(tidybayes)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

samples_with_prior %>% 
  spread_draws(lp__, sigma) %>% 
  ggplot(aes(x = sigma, y = lp__)) + 
  geom_point() + 
  scale_x_log10()

Created on 2023-11-10 with reprex v2.0.2

Have you plotted your target density? You can use rstan to evaluate the log density at values of sigma on a grid and then plot them. Then you know what the sampler should be doing.
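For what it's worth, here is a quick sketch of that idea, reimplementing the target (without the prior) in base R rather than evaluating it through rstan:

```r
# Evaluate the prior-free target on a log-spaced grid of sigma and plot it
upper_bounds    <- c(-1, 1)
cumulative_prob <- pnorm(upper_bounds)
target <- function(sigma) {
  F <- pnorm(upper_bounds, mean = 0, sd = sigma)
  sum(cumulative_prob * log(F) + (1 - cumulative_prob) * log(1 - F))
}
sigma_grid <- exp(seq(log(0.1), log(1000), length.out = 200))
plot(sigma_grid, sapply(sigma_grid, target), log = "x", type = "l",
     xlab = "sigma", ylab = "log target")
```

The curve peaks at sigma = 1 but levels off near 2 * log(0.5) ≈ -1.39 as sigma grows (both CDF values tend to 0.5), which suggests the tail of this "posterior" carries infinite mass without a prior.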

I think the bigger problem here is that you’re not defining a generative statistical model of the type sampling is intended to solve. Instead, you have some kind of constraint satisfaction problem, which may introduce an unintended distribution.

My general advice is to think about Bayesian models generatively. Here, you only have a single parameter sigma. With the parameter value and constants (here just the size N), you should be able to generate the modeled data of point_estimate, upper_bounds and cumulative_prob. Or maybe upper bounds are given as constants? It helps to clarify which variables are fixed and which are random.

To help you think about this, what is the model for different respondents returning different values for the location parameter? Presumably they’re generated with some noise around the true value, which I would suggest making latent. That is, introduce a location parameter mu and then assume that the respondents have something like point_estimate[n] ~ normal(mu, tau), where you can fit tau too. Then, given N, mu, tau, and sigma, you can generate the point_estimate. You’d also need an observation model for their cumulative probability estimates (those are also point estimates, which is why I would not recommend names like “point_estimate” for variables—here, I might use estimated_location if I wanted to be verbose or just y if I was being terse).
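A minimal sketch of that direction might look like the following (the exponential prior on tau is a placeholder of mine, and the observation model for the elicited cumulative probabilities is left out):

```stan
data {
  int<lower=0> N;
  vector[N] point_estimate;  // respondents' reported locations
}
parameters {
  real mu;             // latent true location
  real<lower=0> tau;   // respondent noise
}
model {
  tau ~ exponential(1);              // placeholder prior
  point_estimate ~ normal(mu, tau);  // respondents scatter around mu
  // ... plus an observation model for the elicited cumulative probabilities
}
```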

@andrewgelman often says that adding uncertainty to a deterministic model in this way “greases the wheels of commerce.” I think you’ll find it easier to reason about the model and you’ll find it easier to fit.