Using Inverse-CDFs and Quantile-Parameterized Distributions as Priors

Hi all,

I have some survey data where respondents are asked to guess a median and 90% interval for the probability of X. So for example, a respondent might say:

  • 5th percentile: 0.001
  • Median: 0.01
  • 95th percentile: 0.1

I’m hoping to fit a beta distribution to these data in a way that uses the full uncertainty given by the respondents.

My approach doesn’t work

My first thought was to use each respondent’s quantiles and ask ‘what is the Beta distribution that has these quantiles?’, then use that as the basis for a measurement error/missing data approach to accounting for the uncertainty. But this doesn’t always work, especially when the quantiles all sit near one extreme: a Beta distribution has only two parameters, so it can’t in general match three elicited quantiles, and for some respondents I can’t get even a close fit.
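
To make the failure concrete, here’s a minimal sketch of the kind of quantile-matching fit I mean (illustrative only; the optimizer, loss, and starting values here are arbitrary choices, not my actual code):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta as beta_dist

elicited = np.array([0.001, 0.01, 0.1])  # respondent's 5th / 50th / 95th percentiles
probs = np.array([0.05, 0.5, 0.95])

def loss(log_params):
    a, b = np.exp(log_params)  # optimize on the log scale so a, b stay positive
    return np.sum((beta_dist.ppf(probs, a, b) - elicited) ** 2)

fit = minimize(loss, x0=np.log([1.0, 10.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
print(beta_dist.ppf(probs, a_hat, b_hat))  # generally won't reproduce `elicited` exactly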

This other approach seems like a great alternative

Recently I was pointed towards Ben Goodrich’s 2020 StanCon talk, where he talks about the advantages of using an inverse CDF to define priors from elicited quantiles. This seems to be the paper that the above StanCon talk became? Ben also seems to have implemented some of these distributions in Stan at the time.

My question:

Has anyone been working with Quantile-Parameterized Distributions in Stan in the years since Ben Goodrich’s talk? I’m thinking of using some of the code from the above repo for my analysis, but I’m not sure which of the distributions makes the most sense for my research context. It would also be great to see more examples of Stan programs that use these as priors, if any exist.

Thanks in advance!


Let’s ping @bgoodri and get a response from him!

I think you could do this with two half-normals meeting at the median on the log odds scale. That is, you have values L, M, H for the 0.05, 0.5, and 0.95 quantiles. Translate them to the log odds scale using logit() so everything is on the (-inf, inf) scale.

Step 1: Let normal(y | M, sigmaL) be such that its 0.05 quantile is L.

Step 2: Let normal(y | M, sigmaH) be such that its 0.95 quantile is H.

Step 3: Let p(y | L, M, H) = normal(y | M, sigmaL) if y <= M
and p(y | L, M, H) = normal(y | M, sigmaH) if y > M.
Each branch puts probability 0.5 on its own side of M, so the density integrates to 1 and has median M (it jumps at M when sigmaL != sigmaH).

Step 4: Carry out the change of variables back from the logit scale to the x in (0, 1) scale to get a prior on (0, 1). I don’t have the energy to do this right now, but I can simulate.
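
For reference, step 4 is the standard change of variables: if p_y is the piecewise normal density on the logit scale, the implied density on (0, 1) is p_y(logit(x)) / (x * (1 - x)), since d/dx logit(x) = 1 / (x * (1 - x)). A quick sketch of that (just the textbook Jacobian, nothing fancier):

import numpy as np
import scipy.stats as stats

def piecewise_density_x(x, M, sigmaL, sigmaH):
    # Density on (0, 1) implied by the piecewise normal on the logit scale:
    # p_x(x) = p_y(logit(x)) / (x * (1 - x)).
    y = np.log(x / (1 - x))                   # logit(x)
    sigma = np.where(y <= M, sigmaL, sigmaH)  # lower half below M, upper half above
    return stats.norm.pdf(y, loc=M, scale=sigma) / (x * (1 - x))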

I asked GPT to write the Python code to figure out the bounds because I can never remember what the quantile functions are called in R or Python. Here’s what it gave me for the lower interval:

import numpy as np
import scipy.stats as stats

def logit(p):
    return np.log(p / (1 - p))

mu = logit(0.01)         # elicited median on the logit scale
X = logit(0.001)         # elicited lower quantile on the logit scale
Z = stats.norm.ppf(0.1)  # standard normal quantile; should be 0.05 --- see note below
sigma = (X - mu) / Z     # scale that puts the lower quantile at X
mu, sigma

That returns (-4.59511985013459, 1.8037783189251733). For the right-hand approximation, GPT gave me (-4.59511985013459, 1.8710899089371236), so it’s really close to just being normal(-4.6, 1.85). But let’s simulate on the unconstrained scale, push the draws back to the constrained scale, and plot a histogram.
[Figure 1: histogram of the simulated prior draws on the (0, 1) scale, with vertical lines at 0.001, 0.01, and 0.1]

Here’s the Python code for plotting:

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_histogram, geom_vline

def ilogit(u):
    return 1 / (1 + np.exp(-u))

# Draw from each full normal, then keep the left half of one and the right
# half of the other to get draws from the piecewise density.
simL = np.random.normal(loc=-4.59, scale=1.8, size=10000)
simR = np.random.normal(loc=-4.59, scale=1.87, size=10000)
sim_logit = np.append(simL[simL < -4.59], simR[simR >= -4.59])
sim = ilogit(sim_logit)  # map back to the (0, 1) scale

df = pd.DataFrame({'sim': sim})
(ggplot(df, aes(x='sim'))
 + geom_histogram()
 + geom_vline(xintercept=[0.001, 0.01, 0.1], color="blue", size=0.5))

Oops. I guess I really wanted 0.05 and 0.95 quantiles here, as this is what I get:

>>> np.quantile(sim, [0.1, 0.5, 0.9])
array([0.00096753, 0.0101658 , 0.10169932])

Anyway, I also don’t have the energy to redo this, but it’s basically Z = stats.norm.ppf(0.1) changing to Z = stats.norm.ppf(0.05) for the lower interval (and stats.norm.ppf(0.9) changing to stats.norm.ppf(0.95) for the upper one). I should have been more careful writing the code the first time.
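
For completeness, here’s the corrected calibration with a quick simulation check (same approach as above, just using the intended 0.05 and 0.95 quantiles; the seed is arbitrary):

import numpy as np
import scipy.stats as stats

def logit(p):
    return np.log(p / (1 - p))

def ilogit(u):
    return 1 / (1 + np.exp(-u))

L, M, H = 0.001, 0.01, 0.1                       # elicited 0.05 / 0.5 / 0.95 quantiles
mu = logit(M)
sigmaL = (logit(L) - mu) / stats.norm.ppf(0.05)  # lower half-normal scale
sigmaH = (logit(H) - mu) / stats.norm.ppf(0.95)  # upper half-normal scale

rng = np.random.default_rng(1234)
simL = rng.normal(mu, sigmaL, 100_000)
simR = rng.normal(mu, sigmaH, 100_000)
sim = ilogit(np.append(simL[simL < mu], simR[simR >= mu]))
print(np.quantile(sim, [0.05, 0.5, 0.95]))       # should land close to (L, M, H)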


Hi Bob, thanks so much for this!

I think I’ve correctly programmed your approach in Stan. It seems to work well overall: under simulation it returns essentially identical inferences to a model that uses the ‘true’ beta distributions for the measurement error. Here’s the code:

Simulating the data in R


library(tidyverse)

set.seed(199)  

calculate_quantiles <- function(alpha, beta) {
  quantiles <- qbeta(c(0.05, 0.5, 0.95), shape1 = alpha, shape2 = beta)
  names(quantiles) <- c("q5", "median", "q95")
  return(quantiles)
}

simulated_params <- tibble(
  alpha = c(1, 2, 3, 4, 5),
  beta = c(5, 4, 3, 2, 1)
) |>
  mutate(quantiles = map2(alpha, beta, calculate_quantiles)) |>
  unnest_wider(quantiles)

Fit directly to the simulated beta distributions

data {
    int<lower=0> N;  // Number of experts
    vector<lower=0>[N] alpha;  // Alpha parameters for each expert
    vector<lower=0>[N] beta;   // Beta parameters for each expert
}

parameters {
    vector<lower=0, upper=1>[N] true_response;  // The 'true' responses, initially unknown
    real<lower=0, upper=1> mu; // Overall mean (intercept) for the true responses
    real<lower=0> kappa; // Concentration parameter
}

model {
    // Priors for the 'true' responses based on each expert's alpha and beta values
    for (n in 1:N) {
        true_response[n] ~ beta(alpha[n], beta[n]);
    }

    true_response ~ beta_proportion(mu, kappa); // Intercept-only model with beta likelihood

    // Priors for the intercept and concentration parameter
    mu ~ beta(2, 2); // Prior for the overall mean
    kappa ~ normal(5, 1); // Prior for the concentration parameter
}

Fit to the quantiles using the piecewise normals approach

data {
    int<lower=0> N;  // Number of experts
    real L_quantile;
    real H_quantile;
    vector[N] L;     // Value of lower quantile for each respondent
    vector[N] M;     // Median for respondent
    vector[N] H;     // Value of upper quantile for each respondent
}
transformed data {
  vector[N] logit_M; 
  vector[N] logit_L; 
  vector[N] logit_H;
  vector[N] sigma_L;
  vector[N] sigma_H;
  for (n in 1:N) {
    logit_M[n] = logit(M[n]);   
    logit_L[n] = logit(L[n]);
    logit_H[n] = logit(H[n]);
    sigma_L[n] = (logit_L[n] - logit_M[n]) / inv_Phi(L_quantile);
    sigma_H[n] = (logit_H[n] - logit_M[n]) / inv_Phi(H_quantile);
  }
}
parameters {
    vector[N] true_response_logit;  // The 'true' responses on the logit scale
    real<lower=0, upper=1> mu; // Overall mean (intercept) for the beta distribution of the true responses
    real<lower=0> kappa; // Concentration parameter
}
model {
    // Applying the piecewise normal prior around the median
    for (n in 1:N) {
      if (true_response_logit[n] < logit_M[n]) {
        true_response_logit[n] ~ normal(logit_M[n], sigma_L[n]);
      } else {
        true_response_logit[n] ~ normal(logit_M[n], sigma_H[n]);
      }
    }

    // Intercept-only model with beta likelihood. Note: no Jacobian adjustment
    // is needed for the inv_logit() here, because the piecewise normal above
    // is already a density on the logit scale, so its Jacobian cancels the
    // inv_logit one; the compiler warning about the transformed left-hand
    // side can be ignored.
    inv_logit(true_response_logit) ~ beta_proportion(mu, kappa);

    mu ~ beta(2, 2); // Prior for the likelihood mean
    kappa ~ normal(5, 1); // Prior for the concentration parameter
}

The individual priors themselves end up looking broadly similar to the ‘true’ beta priors, but they don’t quite match some of the more skewed true distributions:

[figure: fitted piecewise-normal priors overlaid on the true beta densities]

And they don’t quite match the quantiles of the true distributions:

[figure: comparison of prior quantiles with the true beta quantiles]

Possibly these issues are just related to problems with my implementation, though.

Curious whether @bgoodri has any thoughts on this approach, and how it compares to the Quantile-Parameterized Distributions approach I’d mentioned initially.