Bias in some parameters but not all from simulated data


I uploaded two covariates in file test_rstan_2.txt (header, covariates in columns, one row per time point):
test_rstan_2_data.txt (1.8 KB)

I combined these covariates in the vector q (one value per time point) and then I simulated data like this dat = cumul(q) + noise. I want to estimate three parameters: two used to compute q from the covariates, the third being the error stdev. Only a subset of all time points are used for inference.

Here is the function generating q from the covariates:

M77 <- function(x, params){
  ## extract data
  PAR <- x$PAR
  LAI <- x$LAI
  plantDensity <- x$plantDensity
  epsilon_PAR <- x$epsilon_PAR
  ## extract parameters
  k_B <- params$k_B
  RUE <- params$RUE
  ## compute biomass
 q <- ((PAR * epsilon_PAR) / plantDensity) *
    (1 - exp(-k_B * LAI)) *

Here is how I simulated the data with R:

tmp <- read.table("test_rstan_2_data.txt", header=TRUE, sep="\t")
covars <- list("T"=nrow(tmp), "PAR"=tmp$PAR, "LAI"=tmp$LAI,
               "plantDensity"=160, "epsilon_PAR"=0.95)
true_params <- list("k_B"=0.59, "RUE"=2.8, "sigma_e2"=0.1)
cumulq <- M77(x=covars, params=true_params)
dat <- abs(rnorm(n=length(cumulq),
length(sub_idx <- seq(1, length(cumulq), by=15))
plot(1:covars$T, dat, type="b", las=1, col="darkgrey")
points(sub_idx, dat[sub_idx], col="red", pch=19)

Here is Stan’s code:

functions {
  vector M77(int T, vector PAR, vector LAI, int plantDensity, real epsilon_PAR,
             real k_B, real RUE){
    vector[T] q;
    q = ((PAR * epsilon_PAR) / plantDensity) .* (1 - exp(-k_B * LAI)) * RUE;
    return cumulative_sum(q);

data {
  int<lower=1> T;
  int<lower=1> plantDensity;
  real<lower=0,upper=1> epsilon_PAR;
  vector[T] PAR;
  vector[T] LAI;
  real<lower=0> y[T];

parameters {
  real<lower=0,upper=1> k_B;
  real<lower=0> RUE;
  real<lower=0> sigma_e;

model {
  vector[T] q;
  k_B ~ normal(0.5, 0.5);
  RUE ~ normal(2.5, 0.5);
  sigma_e ~ cauchy(0, 0.1);
  q = M77(T, PAR, LAI, plantDensity, epsilon_PAR, k_B, RUE);
  y ~ normal(q, sigma_e);

I saved it into a character variable and then into a file:

modelName <- "M77"
stanF <- paste0(modelName, ".stan")
write(stan_code, stanF)

Then it compiled successfully:

cm <- stanc(file=stanF, model_name=modelName)
sm <- stan_model(stanc_ret=cm)

Sampling worked well (n_eff above 1500 for all three parameters):

stan_dat <- list("T"=length(sub_idx), "PAR"=covars$PAR[sub_idx],
                 "plantDensity"=160, "epsilon_PAR"=0.95,
fit <- sampling(sm, data=stan_dat,
                warmup=500, iter=1000, chains=4, cores=4, thin=1)

Confirmed when looking at the trace plots:

traceplot(fit, pars=c("k_B")) + geom_hline(yintercept=true_params$k_B)
traceplot(fit, pars=c("RUE")) + geom_hline(yintercept=true_params$RUE)
traceplot(fit, pars=c("sigma_e")) + geom_hline(yintercept=sqrt(true_params$sigma_e2))

The parameters k_B and RUE are well estimated.
However, sigma_e is systematically over-estimated: its posterior mean is around 5 whereas the true value used to simulate the data is \sqrt(0.1).
Is it expected? Should I change something?

Thanks in advance!

1 Like

Is there a reason for that Cauchy prior on the standard deviation ? What happens if you use a half standard normal?

I used this prior following Gelman (Bayesian Analysis, 2006), in an attempt to be weakly informative.

With sigma_e ~ normal(0, 0.3);, the posterior mean of sigma_e is lower, around 2.3, but it’s still way higher than the true value.

1 Like

Variances are not exactly the easiest things to estimate. To truly speak of bias, you would need to run this experiment a few times, with different seeds and then compute the bias directly. Everything I am saying is conditional on there being no bugs on your code, which I’m unfortunately not able to run at the moment.