Poor parameter recovery from simulated sieve fractionation data

Hi all,

This is my first post and my first attempt at using Stan for anything more than the Statistical Rethinking course by the fantastic Richard McElreath. Thus I apologize in advance for the silly problem and the poorly written code.

I am trying to model the effect of an additive to the size distribution of granular sludge. The goal is to verify whether the sludge granules become bigger on average with the additive. However, before doing any of that, I wanted to build a very simple generative model for my data and test if I could write a model that recovers the “true” parameters.

In particular, my data are weights of each size fraction separated with N sieves. I started with 6 size intervals in mm: (+Inf, 3.35]; (3.35, 2.28]; (2.28, 1.18]; (1.18, 0.60]; (0.60, 0.40]; and (0.40, 0]. To generate these fractions weights I simulate 1e6 sludge particles with size distributed according to a log-normal distribution. This is done in the following code.

## load libraries
library("tidyverse")
library("conflicted")

conflicts_prefer(
  dplyr::filter,
  dplyr::lag
)

## Generate data
set.seed(42)

generate_synthetic_sludge_data <- function(mu, sigma, sample_weight, n_particles = 1e6) {
  particle_sizes <- rlnorm(n_particles, meanlog = mu, sdlog = sigma)
  total_weight <- sum(particle_sizes)
  normalized_weights <- sample_weight * (particle_sizes / total_weight)

  sieve_sizes <- c(3.35, 2.28, 1.18, 0.6, 0.4, 0)

  fractions <- numeric(length(sieve_sizes))
  for (i in seq_along(sieve_sizes)) {
    if (i == 1) {
      ## All larger-or-equal particles for the largest sieve
      fractions[i] <- sum(normalized_weights[particle_sizes >= sieve_sizes[i]])
    } else if (i == length(sieve_sizes)) {
      ## All particles smaller than the second-to-last sieve for the smallest sieve
      fractions[i] <- sum(normalized_weights[particle_sizes < sieve_sizes[i - 1]])
    } else {
      ## All particle with size between two sieves for the other sieves
      fractions[i] <- sum(normalized_weights[particle_sizes >= sieve_sizes[i] &
        particle_sizes < sieve_sizes[i - 1]])
    }
  }

  return(fractions)
}

# True parameters
true_mu <- -0.8
true_sigma <- 0.8

# Generate synthetic data for multiple samples
n_samples <- 10
sample_weights <- runif(n_samples, 120, 150)

synthetic_data <- map_dfr(1:n_samples, function(i) {
  fractions <- generate_synthetic_sludge_data(true_mu, true_sigma, sample_weights[i])
  tibble(
    sample_id = paste0("sample_", i),
    starting_weight = sample_weights[i],
    `(+Inf, 3.35]` = fractions[1],
    `(3.35, 2.28]` = fractions[2],
    `(2.28, 1.18]` = fractions[3],
    `(1.18, 0.60]` = fractions[4],
    `(0.60, 0.40]` = fractions[5],
    `(0.40, 0]` = fractions[6]
  )
})

I then try to recover the true parameters using Stan with the following code.

## Prepare data for Stan
fractions_matrix <- as.matrix(synthetic_data[, 3:8])
data_list <- list(
  N_samples = n_samples,
  N_sieves = 6,
  sieve_sizes = c(3.35, 2.28, 1.18, 0.6, 0.4, 0),
  ## Round to integers for multinomial distribution
  fractions = round(fractions_matrix),
  total_mass = rowSums(fractions_matrix)
)

# Compile and fit the model
fit <- stan(
  file = "simple-model.stan", data = data_list,
  iter = 4000, chains = 4, warmup = 2000
)

The simple-model.stan file content is.

data {
  int<lower=1> N_samples;
  int<lower=1> N_sieves;
  vector<lower=0>[N_sieves] sieve_sizes;
  int<lower=0> fractions[N_samples, N_sieves];
  vector<lower=0>[N_samples] total_mass;
}

parameters {
  real mu;
  real<lower=0> sigma;
}

model {
  // Priors
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 1);

  // Likelihood
  for (n in 1:N_samples) {
    vector[N_sieves] probs;

    for (i in 1:N_sieves) {
      if (i == 1) {
        probs[i] = 1 - lognormal_cdf(sieve_sizes[i], mu, sigma);
      } else if (i == N_sieves) {
        probs[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma);
      } else {
        probs[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma) -
                   lognormal_cdf(sieve_sizes[i], mu, sigma);
      }
    }

    fractions[n] ~ multinomial(probs);
  }
}

generated quantities {
  real median_size = exp(mu);
  real mean_size = exp(mu + square(sigma)/2);
  vector[N_sieves] predicted_fractions;

  for (i in 1:N_sieves) {
    if (i == 1) {
      predicted_fractions[i] = 1 - lognormal_cdf(sieve_sizes[i], mu, sigma);
    } else if (i == N_sieves) {
      predicted_fractions[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma);
    } else {
      predicted_fractions[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma) -
                               lognormal_cdf(sieve_sizes[i], mu, sigma);
    }
  }
}

When I look at fit I get the correct sigma, but an over-estimated mu:

> print(fit, pars = c("mu", "sigma", "median_size", "mean_size"))
Inference for Stan model: anon_model.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu          -0.16       0 0.02 -0.21 -0.18 -0.16 -0.14 -0.12  5506    1
sigma        0.80       0 0.02  0.76  0.78  0.80  0.81  0.83  5707    1
median_size  0.85       0 0.02  0.81  0.84  0.85  0.87  0.89  5532    1
mean_size    1.17       0 0.03  1.11  1.15  1.17  1.19  1.23  5756    1

Samples were drawn using NUTS(diag_e) at Thu Oct  3 13:50:16 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

I tried a parameter recovery study with a parameters grid for mu between -1.5 and 0.5, and sigma between 0.4 and 1.2. (mu_values <- seq(-1.5, 0.5, by = 0.5); sigma_values <- seq(0.4, 1.2, by = 0.2); param_grid <- expand.grid(mu = mu_values, sigma = sigma_values)). And sigma keeps being estimated correctly, but mu is more and more over-estimated as sigma grows. I attached the results of the mu estimate for this study.

The obvious question is: what am I doing wrong? Probably many things, but I would really appreciate any help!! Thanks a lot for even reading all of this. I hope it made some sense.

— Attachments
mu-estimates-for-true-sigma.pdf (7.4 KB)

The predicted_fractions in your Stan model are number of particles per interval, but the simulation calculates sums of sizes; the fractions are mass fractions.
This causes bias: the intervals with larger particles in them are assumed to have more particles than there really are and the intervals with smaller particles appear to have fewer particles.

Instead of CDF you need to calculate the expected mass in the interval. (Complicated, but I believe there’s a closed-form expression)

The part where you multiply the fraction by a random sample_weight, round to integer and then fit a multinomial seems iffy but I don’t know an alternative off the top of my head.