Issues with overestimates of the mean of a beta likelihood

Dear all,
I am trying to model probability outcomes using a Beta likelihood with mean/precision parameterization.

Briefly on the data generating process:
I have several models that estimates the probability of a latent binary variable (using other data) and I am currently trying leverage the output of all the models/methods for a large number of events in a datasets (roughly 2k - 12k events in one dataset).
Unfortunately (with my current model setup), I’ve noticed a weird behavior for very small values where the mean is sometimes inferred to be larger than the largest observed value.
I.e., max(y_i)<\mu_i, for some set of observations y_i and their inferred mean \mu_i.

Here is my (current) model setup for the i:th event estimated by the j:th method:
\log(\mu^*_i)\sim\mathcal{N}^-(0, 5)\quad\quad\quad\quad\quad\quad\quad\;\;\,\text{//Means Prior}
\mu_i = 1 - \mu^*_i\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\;\;\quad\quad\quad\quad\text{//Means transform}
t_i\sim Logistic(\eta_i,\sigma_t)\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\;\;\text{//Precision priors}

The choices for modeling the mean was so that smaller means would be closer to the largest prior density (as these are the ones that are problematic).
For the precision, I performed an approximation error analysis as follow:

  1. produce a range of means (\mu_i s) in (0,1).
  2. For each mean, start with some small precision (t_i).
  3. Draw a sample of {2, 3, 4, 5, 6} measurements from \beta(\mu t_i, (1-\mu_i)t_i).
  4. Calculate the sample mean and calculate the approximation error: 100*\left|1-\frac{\ bar{X}}{\mu_i}\right|.
  5. If error less than {1, 10, 50}% break, else do t_i\gets1.1*t_i and repeat 3-5.

Plotting the final precision against \mu_i was roughly linear on a log-log plot.
As such, I’ve tried a power-law regression on the precision using the mean (among a myriad of other things) but this only succeeded if the simulated data had a strong power-law dependency of the precision on the mean and did not generalize well on real data.

In fact, the mean estimate can sometimes be several magnitudes larger than the largest observed probability. I.e., sometimes the largest probability of a method could be 4*10^{-4} and the estimated mean 4*10^{-2}.
This issue seems to come from the fact that the precision also gets drastically underestimated for data with smaller means.

Here are example figures to illustrate the problem.
Comparing the sample mean with the mu parameter to simulate the data:

While when comparing the models inferred mean it produces strong underestimates for smaller means:

And here is just a table comparing the smallest inferred means with the real parameters:

# A tibble: 1,000 × 8
   index   eta     mu mu_star   t_i   mu_true ti_true mu_sample
   <int> <dbl>  <dbl>   <dbl> <dbl>     <dbl>   <dbl>     <dbl>
 1   981 0.798 0.0262   0.974  2.18 1.17e- 11   15.7  1   e-300
 2   666 0.793 0.0263   0.974  2.15 2.39e-  7    1.42 1   e-300
 3   229 0.832 0.0270   0.973  2.11 2.20e- 30    6.81 1   e-300
 4   402 0.823 0.0271   0.973  2.30 1   e-150   16.1  1   e-300
 5   933 0.811 0.0275   0.972  2.22 7.42e-  7    1.16 1   e-300
 6   151 0.798 0.0276   0.972  2.34 9.61e-  4    2.72 1.07e- 95
 7   140 0.812 0.0280   0.972  2.20 1.28e-  5   10.4  2.77e-148
 8    53 0.796 0.0283   0.972  2.31 4.99e-  7   10.4  1   e-300
 9   650 0.789 0.0284   0.972  2.33 1.74e-  4    6.29 6.54e- 43
10   540 0.767 0.0286   0.971  2.12 1.53e-  5   17.8  1   e-300
# ℹ 990 more rows
# ℹ Use `print(n = ...)` to see more rows

Where the smallest inferred mean is 0.03 while the true values is 1.17e-11 though some of these appear have been complete shrunk due to random chance (see random data generator at the end of post) as indicated by the sample mean.
Examining e.g., index 151 then this data is indeed much smaller than the inference:

[1] 7.728962e-238 1.957332e-272  5.335602e-95 5.915391e-115 5.434505e-276

Does anyone have any suggestions to setup this type of problem?
I’ve also tried a normal approximation for log(y) but it tends to have the same problems with overestimates.
Any suggestions are very welcomed!
Also, let me know if there is anything I can do to improve the post.

And here is the R code to reproduce the behavior with simulated data and the plots shown:

# Simulation function
beta_generator <- function(N, M = 5) {
  # Priors
  # Ensure that each row gets roughly the same mean across simulations
  mu_dist <- rbeta(N, .5, 1.5)
  # Reset seed
  # Mean precision
  mu_prec <- 1e1*mu_dist^-.005
  # Mean and precisions
  mu_i    <- rbeta(N, mu_prec*mu_dist, mu_prec*(1-mu_dist))
  t_i <- abs(rnorm(N, 1e1*mu_i^-.001, 5))
  # Likelihood
  y <- matrix(rbeta(N*M, mu_i*(t_i), (1 - mu_i)*(t_i)), N, M)
  # Avoid too large and too small values which causes log(0) error.
  y <- apply(y, c(1, 2), min, 1 - .Machine$double.eps*100000000)
  y <- apply(y, c(1, 2), max, 1e-300)
  # For running SBC
    variables = list(
      t_i = t_i,
      mu = mu_i
    generated = list(
      N = N, M = M, y = y
# Packages
p_load(dplyr, tidyr, ggplot2, patchwork, cmdstanr)
# Sim data
sim_data <- beta_generator(1000)
# Produce stan inputs
stan_input <- list(
  N = sim_data$generated$N,
  M = sim_data$generated$M,
  y = sim_data$generated$y
# Stan model see end of post for model code
cmdstan_model <- cmdstanr::cmdstan_model("mwe_stan.stan")

# Fit the model
fit <- cmdstan_model$sample(stan_input, parallel_chains = 4, iter_warmup = 500, iter_sampling = 500)

# Get the inferred parameters
infered_pars <- fit$summary() %>% 
  select(1, mean) %>% 
  slice(-1:-2) %>% 
  separate_wider_regex(variable, patterns = c("variable" = ".*", "\\[", "index" = "[0-9]*", "\\]")) %>% 
  spread(variable, mean) %>% 
    index = as.integer(index)
  ) %>% 
  arrange(index) %>% 
    mu_true = sim_data$variables$mu,
    ti_true = sim_data$variables$t_i,
    mu_sample = apply(sim_data$generated$y, 1, mean)
  ) %>% 

# Plot them
infered_pars %>% 
  ggplot(aes(mu, mu_true)) +
  geom_point() +

p_sim <- infered_pars %>% 
  ggplot(aes(mu, t_i)) +
  geom_point() +

p_true <- infered_pars %>% 
  ggplot(aes(mu_true, ti_true)) +
  geom_point() +

p_sim + p_true

# Table
infered_pars %>% 
data {
  int<lower=0> N;                      // Number of events
  int<lower=0> M;                      // Number of methods
  array[N, M] real<lower=0,upper=1> y; // Observed probabilities

parameters {
  real<lower=0> sl; // Precision variance
  vector<lower = 0>[N] eta; // Precision mean
  vector<lower = 0>[N] t_i; // Precision
  vector<lower = 0, upper = 1>[N] mu_star; // mu = 1 - mu_star
transformed parameters {
  vector<lower = 0, upper = 1>[N] mu = 1 - mu_star; // mean parameter
model {
  // Priors
  eta ~ std_normal();
  sl ~ exponential(1);
  t_i ~ logistic(eta, sl);
  for (i in 1:N) {
    real inv_log_mu = log(mu_star[i]);
    target += -inv_log_mu; // Jacobian
    inv_log_mu ~ normal(0, 5);
    // Likelihood
    y[i] ~ beta_proportion(mu[i], t_i[i]);

Sorry nobody got back to you over the holidays! Hope this isn’t too late to help.

You’ve included a pretty strong prior here in the form of exponentiating a negative half normal(0, 5). Most of the probability mass is below 0.1. Was that intentional? It’d be more usual to use an unconstrained logistic-scale prior or a beta prior directly on the proportion. Not that there’s anything fundamentally wrong with using a negative half normal on the log scale.

Why is the t_i getting a logistic prior around \eta_i? Something needs to constrain the t_i to be positive, which I see you do in the Stan model.

Why parameterize in terms of 1 - mu instead of interns of mu?

This is just a complicated way to roll your own lognormal distribution. I think this:

real inv_log_mu = log(mu_star[i]);
target += -inv_log_mu; // Jacobian
inv_log_mu ~ normal(0, 5);

can be replaced with:

mu_star[I] ~ lognormal(0, 5);

If you want to keep as is, the first variable should probably be called log_mu_star for consistency. I don’t know where the inv came in. The Jacobian is right, but confusing given the naming. I would suggest instead to use log_mu as the parameter and then transform back to a probability with exp(). This will avoid any manual Jacobian calculations. But I think the lognormal formulation is simpler.

Overall, I’m not sure why you’re defining things this way. I think it’d be easier in the model to define t_i to have a positive distribution of some kind, such as a Pareto (what Gelman suggests in Chapter 5 of BDA, though not by that name). I don’t know why you’d want to give a positive constrained variable a logistic distribution.

None of this explains why you’re getting estimates greater than the values, but I think it’d help to simplify the rest of the code into something more natural for the scale/constraints and then reassess what’s going on. I can’t quite get my head around these priors, but presumably that’s where the issue’s coming in.