Hi all,
This is my first post and my first attempt at using Stan for anything beyond the Statistical Rethinking course by the fantastic Richard McElreath, so I apologize in advance for the silly problem and the poorly written code.
I am trying to model the effect of an additive on the size distribution of granular sludge. The goal is to verify whether the sludge granules become bigger on average with the additive. Before doing any of that, however, I wanted to build a very simple generative model for my data and check whether I could write a Stan model that recovers the "true" parameters.
In particular, my data are the weights of the size fractions separated by N sieves. I started with 6 size intervals in mm: (+Inf, 3.35]; (3.35, 2.28]; (2.28, 1.18]; (1.18, 0.60]; (0.60, 0.40]; and (0.40, 0]. To generate these fraction weights, I simulate 1e6 sludge particles with sizes drawn from a log-normal distribution and assign each particle a weight proportional to its size. This is done in the following code.
## Load libraries
library("tidyverse")
library("conflicted")
library("rstan")
conflicts_prefer(
  dplyr::filter,
  dplyr::lag
)
## Generate data
set.seed(42)
generate_synthetic_sludge_data <- function(mu, sigma, sample_weight, n_particles = 1e6) {
  particle_sizes <- rlnorm(n_particles, meanlog = mu, sdlog = sigma)
  total_weight <- sum(particle_sizes)
  ## Each particle's weight is proportional to its size, scaled so the
  ## whole sample weighs sample_weight
  normalized_weights <- sample_weight * (particle_sizes / total_weight)
  sieve_sizes <- c(3.35, 2.28, 1.18, 0.6, 0.4, 0)
  fractions <- numeric(length(sieve_sizes))
  for (i in seq_along(sieve_sizes)) {
    if (i == 1) {
      ## Largest sieve: all particles at least as large as the top sieve size
      fractions[i] <- sum(normalized_weights[particle_sizes >= sieve_sizes[i]])
    } else if (i == length(sieve_sizes)) {
      ## Smallest sieve: all particles smaller than the second-to-last sieve size
      fractions[i] <- sum(normalized_weights[particle_sizes < sieve_sizes[i - 1]])
    } else {
      ## Other sieves: all particles with sizes between two consecutive sieve sizes
      fractions[i] <- sum(normalized_weights[particle_sizes >= sieve_sizes[i] &
        particle_sizes < sieve_sizes[i - 1]])
    }
  }
  return(fractions)
}
# True parameters
true_mu <- -0.8
true_sigma <- 0.8
# Generate synthetic data for multiple samples
n_samples <- 10
sample_weights <- runif(n_samples, 120, 150)
synthetic_data <- map_dfr(1:n_samples, function(i) {
  fractions <- generate_synthetic_sludge_data(true_mu, true_sigma, sample_weights[i])
  tibble(
    sample_id = paste0("sample_", i),
    starting_weight = sample_weights[i],
    `(+Inf, 3.35]` = fractions[1],
    `(3.35, 2.28]` = fractions[2],
    `(2.28, 1.18]` = fractions[3],
    `(1.18, 0.60]` = fractions[4],
    `(0.60, 0.40]` = fractions[5],
    `(0.40, 0]` = fractions[6]
  )
})
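As a quick sanity check, the six fractions in each row add back up to the sample's starting weight (up to floating-point error):

## Sanity check: the fraction weights partition the whole sample
all.equal(unname(rowSums(synthetic_data[, 3:8])), synthetic_data$starting_weight)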
I then try to recover the true parameters with Stan: the rounded fraction weights are modelled as multinomial counts, with bin probabilities given by differences of the log-normal CDF at consecutive sieve sizes. The following code prepares the data and fits the model.
## Prepare data for Stan
fractions_matrix <- as.matrix(synthetic_data[, 3:8])
data_list <- list(
  N_samples = n_samples,
  N_sieves = 6,
  sieve_sizes = c(3.35, 2.28, 1.18, 0.6, 0.4, 0),
  ## Round to integers for multinomial distribution
  fractions = round(fractions_matrix),
  total_mass = rowSums(fractions_matrix)
)
# Compile and fit the model
fit <- stan(
  file = "simple-model.stan", data = data_list,
  iter = 4000, chains = 4, warmup = 2000
)
The content of simple-model.stan is:
data {
  int<lower=1> N_samples;
  int<lower=1> N_sieves;
  vector<lower=0>[N_sieves] sieve_sizes;
  int<lower=0> fractions[N_samples, N_sieves];
  vector<lower=0>[N_samples] total_mass;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Priors
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 1);
  // Likelihood
  for (n in 1:N_samples) {
    vector[N_sieves] probs;
    for (i in 1:N_sieves) {
      if (i == 1) {
        probs[i] = 1 - lognormal_cdf(sieve_sizes[i], mu, sigma);
      } else if (i == N_sieves) {
        probs[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma);
      } else {
        probs[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma) -
                   lognormal_cdf(sieve_sizes[i], mu, sigma);
      }
    }
    fractions[n] ~ multinomial(probs);
  }
}
generated quantities {
  real median_size = exp(mu);
  real mean_size = exp(mu + square(sigma)/2);
  vector[N_sieves] predicted_fractions;
  for (i in 1:N_sieves) {
    if (i == 1) {
      predicted_fractions[i] = 1 - lognormal_cdf(sieve_sizes[i], mu, sigma);
    } else if (i == N_sieves) {
      predicted_fractions[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma);
    } else {
      predicted_fractions[i] = lognormal_cdf(sieve_sizes[i-1], mu, sigma) -
                               lognormal_cdf(sieve_sizes[i], mu, sigma);
    }
  }
}
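To double-check the binning logic outside of Stan, here is a small R helper (my own check, not part of the model) that computes the same interval probabilities with plnorm. By construction they sum to one, so the simplex requirement of multinomial is satisfied:

## P(lower <= size < upper) for each sieve interval, mirroring the Stan code
interval_probs <- function(mu, sigma, sieves = c(3.35, 2.28, 1.18, 0.6, 0.4)) {
  upper <- c(Inf, sieves)  # the top bin is open towards +Inf
  lower <- c(sieves, 0)    # the bottom bin reaches down to 0
  plnorm(upper, mu, sigma) - plnorm(lower, mu, sigma)
}
sum(interval_probs(true_mu, true_sigma))  # exactly 1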
When I look at fit, I get the correct sigma but an over-estimated mu:
> print(fit, pars = c("mu", "sigma", "median_size", "mean_size"))
Inference for Stan model: anon_model.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.

             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu          -0.16       0 0.02 -0.21 -0.18 -0.16 -0.14 -0.12  5506    1
sigma        0.80       0 0.02  0.76  0.78  0.80  0.81  0.83  5707    1
median_size  0.85       0 0.02  0.81  0.84  0.85  0.87  0.89  5532    1
mean_size    1.17       0 0.03  1.11  1.15  1.17  1.19  1.23  5756    1

Samples were drawn using NUTS(diag_e) at Thu Oct 3 13:50:16 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
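For reference, the model's bin probabilities at the true parameters can be compared with the average simulated mass fractions, using the interval_probs helper from above:

probs_true <- interval_probs(true_mu, true_sigma)
obs_fracs <- colMeans(fractions_matrix / rowSums(fractions_matrix))
round(rbind(model = probs_true, simulated = obs_fracs), 3)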
I also tried a parameter recovery study over a grid of mu between -1.5 and 0.5 and sigma between 0.4 and 1.2:

mu_values <- seq(-1.5, 0.5, by = 0.5)
sigma_values <- seq(0.4, 1.2, by = 0.2)
param_grid <- expand.grid(mu = mu_values, sigma = sigma_values)

Sigma keeps being estimated correctly, but mu is more and more over-estimated as sigma grows (loop sketch below). I attached the results of the mu estimates for this study.
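The loop over the grid looked roughly like this (a condensed sketch; one fit per grid point, reusing the data-generating helper and Stan model from above):

## One synthetic data set and one fit per (mu, sigma) grid point
recovery_results <- pmap_dfr(param_grid, function(mu, sigma) {
  sw <- runif(n_samples, 120, 150)
  fracs <- t(sapply(1:n_samples, function(i) {
    generate_synthetic_sludge_data(mu, sigma, sw[i])
  }))
  dl <- list(
    N_samples = n_samples, N_sieves = 6,
    sieve_sizes = c(3.35, 2.28, 1.18, 0.6, 0.4, 0),
    fractions = round(fracs), total_mass = rowSums(fracs)
  )
  f <- stan(file = "simple-model.stan", data = dl,
            iter = 4000, chains = 4, warmup = 2000, refresh = 0)
  est <- summary(f, pars = c("mu", "sigma"))$summary[, "mean"]
  tibble(true_mu = mu, true_sigma = sigma,
         est_mu = est["mu"], est_sigma = est["sigma"])
})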
The obvious question is: what am I doing wrong? Probably many things, but I would really appreciate any help!! Thanks a lot for even reading all of this. I hope it made some sense.
— Attachments
mu-estimates-for-true-sigma.pdf (7.4 KB)