How do I evaluate my model, in terms of bias, coverage, etc.?

Ok, I’ve spent a few days looking for the best practice of evaluating my model in this case.

With a frequentist MLE model, what I do is:

  1. simulate N datasets with known \theta,
  2. find the MLE for \theta.
  3. Calculate biases, variance, MSE, coverage…

Which I can then use to say something about how well my model behaves given parameter \theta.

Now, I tried replicating this procedure with a Bayesian model, but a few things I notice are:

  1. I need to find a prior which can be fit automatically, but simply using the \theta as prior feels like cheating because in the real world I wouldn’t know the actual \theta. I thought of fitting a simple glm in an attempt to replicate having prior knowledge on \theta, and using that as a prior distribution mean.
  2. MCMC produces a distribution, and not an actual point estimate.

So my workflow currently is:

  1. simulate N datasets with known \theta,
  2. find a prior for \theta using GLM
  3. Use the prior in stan, produce posterior distribution
  4. Calculate biases, variance, and MSE based on posterior mean. Calculate coverage based on posterior quantiles.

But I don’t see this workflow mentioned anywhere I looked, and it feels like there should be a more efficient way of doing this.

I read a bit on posterior predictive check, but it seems like it doesn’t do quite what I want, which is ultimately answering the question: “Given N data sets - how does this model perform in actually finding \theta?”

One thing I thought about doing is using stan’s optimization, which will produce a point estimate, and I can then calculate the hessian and get the SEs to use for coverage - Is this a reasonable approach?

1 Like

I think the most comprehensive reference here would be the Bayesian Workflow paper: [2011.01808] Bayesian Workflow

1 Like

Reading this, it seems I am interested in part 4 of the workflow - working with fake data.

What I am doing is basically an SBC, but instead of sampling from a prior distribution - I’m sampling from a single point prior. For example, a simulation would be something like:

sim <- replicate(1000, rpois(N, lambda)

Where I have the same lambda throughout all simulated data. If I understand correctly, an SCB will look more like:

lambda <- rnorm(1, lambda_mean, prior_sd) # defining a normal distribution prior just for simplicity here
sim <- rep(1000, rpois(N, lambda)

I understand how this better connects the posterior distribution with simulations, as the latter are derived from a distribution as well. And reading the Stan user guide, it seems like every iteration of the MCMC both creates fake data and fits the data, so this is way (way) more efficient than what I did previously.

So, my new Stan code will take this form?

transformed data {
 data generation process



generated quantities {
calculate bias/ etc.

Now it’s a question of my prior choice… What I am trying to do, is find the parameters that define the unobserved grey curve which is used from data generation.

My methodology now is using a glm to find the parameters for the red curve, and use these a priors. It’s not perfect but that’s the only way I can think of to simulate a real world scenario. So let’s say \beta is the estimated parameter from the red curve.

The issue I have is that, with not enough data, I will get posterior estimates which are simply my model priors, and hence in past iterations of this work, with real value \theta I got biases of 0 (when using \theta as model prior), \theta (when using 0 as prior mean), and lastly \beta - \theta (I think, need to double check) when using \beta as prior.

Ok, I translated the DGP to Stan and follow the instructions here:

functions {
  array [,] int simulate_introduction_rng(int N, int B_total, array [] int sampling_vector, real b0_sim, real b1_sim){
    array[N,7] int out_mat;
    vector[N] rate_sim;
    vector[N] prob_invasive;
    array[N] int samples;
    array[N] int yearly_introduced;
    array[N] int detected_i_species; 
    array[N] int undetected_i_before;
    array[N] int undetected_i_after;
    array[N] int detected_i_so_far;
    array[N] int detected_n_species;
    array[N] int undetected_n_before;
    array[N] int undetected_n_after;
    array[N] int detected_n_so_far;
    array[N] int t;
    for (i in 1:N) {
      t[i] = i;
      rate_sim[i] = exp(b0_sim + b1_sim * i);
    yearly_introduced = poisson_rng(rate_sim);
    detected_i_species  = rep_array(0, N);
    undetected_i_before = rep_array(0, N);
    undetected_i_after  = rep_array(0, N);
    detected_i_so_far   = rep_array(0, N);
    detected_n_species  = rep_array(0, N);
    undetected_n_before = rep_array(0, N);
    undetected_n_after  = rep_array(0, N);
    detected_n_so_far   = rep_array(0, N);
    undetected_i_before[1] = yearly_introduced[1];
    undetected_n_before[1] = B_total;
    prob_invasive[1] = 1.0 * undetected_i_before[1] / (undetected_n_before[1] + undetected_i_before[1]);
    samples[1] = binomial_rng(sampling_vector[1], prob_invasive[1]);
    detected_i_species[1] = samples[1];
    detected_n_species[1] = sampling_vector[1] - samples[1];
    undetected_i_after[1] = undetected_i_before[1] - detected_i_species[1];
    undetected_n_after[1] = undetected_n_before[1] - detected_n_species[1];
    detected_i_so_far[1] = cumulative_sum(detected_i_species)[1];
    detected_n_so_far[1] = cumulative_sum(detected_n_species)[1];
    for (i in 2:N){
      undetected_i_before[i] = yearly_introduced[i] + undetected_i_after[i-1];
      undetected_n_before[i] = undetected_n_after[i-1];
      prob_invasive[i] =  1.0 * undetected_i_before[i] / (undetected_n_before[i] + undetected_i_before[i]);
      samples[i] = binomial_rng(sampling_vector[i], prob_invasive[i]);
      detected_i_species[i] = min(samples[i], undetected_i_before[i]);
      detected_n_species[i] = min((sampling_vector[i] - samples[i]), undetected_n_before[i]);
      undetected_i_after[i] = undetected_i_before[i] - detected_i_species[i];
      undetected_n_after[i] = undetected_n_before[i] - detected_n_species[i];
      detected_i_so_far[i] = cumulative_sum(detected_i_species)[i];
      detected_n_so_far[i] = cumulative_sum(detected_n_species)[i];
    out_mat[,1] = t;
    out_mat[,2] = yearly_introduced;
    out_mat[,3] = detected_i_species;
    out_mat[,4] = detected_i_so_far;
    out_mat[,5] = undetected_n_before;
    out_mat[,6] = detected_n_species;
    out_mat[,7] = detected_n_so_far;
    return out_mat;

data {
  int <lower = 1> N;  // number of rows in the data
  int <lower = 1> B_total; // assumed number of ind in group B
  array[N] int <lower = 1> sampling_vector; // number of 
  real b0_mu; // prior for betas
  real b1_mu; // prior for betas

transformed data {
  real b0_sim = normal_rng(b0_mu, 0.1); // params for data creation
  real b1_sim = normal_rng(b1_mu, 0.001);   // params for data creation
  array[N,7] int <lower = 0> sampling_mat;
  sampling_mat = simulate_introduction_rng(N, B_total, sampling_vector, b0_sim, b1_sim);

parameters {
  real b0;
  real b1;

transformed parameters {
  vector<lower = 0>[N] unrecorded_invasives;
  vector[N] rate;
  for (i in 1:N){
    rate[i] = exp(b0 + b1 * i);
    unrecorded_invasives[i] = cumulative_sum(rate)[i] - sampling_mat[i,4];

  for (i in 1:N){
    sampling_mat[i,3] ~ beta_binomial(sampling_vector[i], unrecorded_invasives[i], sampling_mat[i,5]);
  b0  ~ normal(b0_mu, 0.1);
  b1  ~ normal(b1_mu , 0.001);

generated quantities {
  array[2] int <lower = 0, upper = 1> lt_sim
      = { b0 < b0_sim , b1 < b1_sim };

I’m left with a rank for this specific run. So, the next step would be something like this:

data_scb <- list(
  N = 150, 
  sampling_vector = rpois(150, 10),
  B_total = 1000,
  b0_mu = 1,
  b1_mu = 0.01

n_trials = 30
ranks <- matrix(NA, n_trials , 2)
for (n in 1:n_trials ){
  samples <- scb_model$sample(data = data_scb, 
                        chains = 3, 
                        parallel_chains = 3, 
                        iter_warmup = 200,
                        iter_sampling = 1000, thin = 2,
                        show_messages = F, refresh = 0)
  specific_rank <- as_draws_rvars(samples$draws(variables = "lt_sim"))
  specific_rank <- draws_of(specific_rank $lt_sim)
  ranks[n,] <- colSums(specific_rank )

If I understand, then I am hoping to get a uniform distribution for the ranks, between 0 and number of iterations. That indicates that the model “works”.

But still, it seems like it doesn’t answer my question, does it?

But the more I think about it, these simulations should assume my prior is correct…

@martinmodrak and @hyunji.moon might have insights for you.


I think at least part of the confusion is the difference in goals for the frequentist and Bayes approaches. In freq world you can construct many different estimators for the same quantity and those will have different properties (bias, variance, …). So you care both whether your estimator is good and whether it’s implemented correctly. In a Bayesian analysis, you seek the posterior distribution, which is uniquely defined by your model. So you “only” care whether the posterior is computed correctly, which is what SBC is good at.

To run SBC in R, I encourage you to look at the SBC R package which I co-develop. An IMHO good intro to the how and why of SBC is also in the introduction to the SBC and test quantities preprint.

Still, even as a Bayesian, you might often want to use the posterior to make a decision. Given a utility function, you can evaluate the expectation of the utility across the decisions reached over whole prior (or just some selected values). You can recover the counterparts to common freq. properties with correct choice of decision + utility (e.g. if the decision is to pick a point estimate, you may take the posterior mean and take negative distance from true value as utility and you get something like bias). But you can choose the decision space and utility to match your actual real world problem. You can also define action space and utility and then find the decision rule that maximizes expected utility.

Does that make sense?

1 Like

Prior-posterior, global-local, hardware-software are all relative concept, so I would just say don’t feel bad about the above feeling.

Be generous in what you perceive as data, by including the byproduct of your workflow (misfit of managerial, statistical and computational model) which informs developing family of models through iterative update.