Having trouble recovering known parameters from simulated data

Hello, all.

I’m interested in using Stan for a project involving Poisson (and related) regressions. I am simulating data to make sure I can properly recover the parameters from a simple model, but I’m having more trouble than I expected. Consistently across simulation attempts, some of the posterior samples estimate the true values fairly poorly (even when the prior is specified as the true generating distribution). This makes me think that I have a fundamental misunderstanding of data simulation, [Bayesian] statistics, Stan, or CmdStanPy (or likely all of the above).

Simulated data:

y_i \sim \textrm{Poisson}\left(\lambda_i\right)

\lambda_i \sim \textrm{Normal}\left(\mu=3, \sigma=1\right)

Stan Code

data {
  int<lower=0> N;
  int<lower=0> D;
  int y[N, D];
}

parameters {
  vector<lower=0>[D] lambda;
}

model {
  lambda ~ normal(3, 1);

  for (i in 1:D) {
    y[, i] ~ poisson(lambda[i]);
  }
}

Python Code

import cmdstanpy
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng()

N = 1000  # number of samples
D = 20    # number of microbes

lams = rng.normal(loc=3, scale=1, size=D)  # draw lambda values from a normal
# (note: Normal(3, 1) can occasionally produce a negative draw, which would
# be an invalid Poisson rate; with mu=3, sigma=1 this is rare but possible)
tbl = np.zeros([N, D])

for i, lam in enumerate(lams):
    g = rng.poisson(lam=lam, size=N)
    tbl[:, i] = g

sm = cmdstanpy.CmdStanModel(stan_file="poisson.stan")
data = {
    "N": N,
    "D": D,
    "y": tbl.astype(int),
}
sm_fit = sm.sample(data=data, output_dir="output")

model_lams = sm_fit.stan_variable("lambda")

fig, axs = plt.subplots(4, 5, figsize=(8, 8))
for i, ax in enumerate(axs.flatten()):
    ax.hist(model_lams[:, i], bins=30)
    ax.axvline(x=lams[i], color="black")
    ax.set_yticks([])
    ax.set_title(f"$\\lambda = {round(lams[i], 2)}$")


Hi,

I don’t see a problem recovering the parameters here. You’ll notice that the 95% Bayesian credible intervals for all \lambda_i include the generating values. In general, there’s no theoretical guarantee that a BCI will include the parameter value, even in a well-posed model and using a well-calibrated algorithm.
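To make that coverage check concrete, here is a small numpy-only sketch that computes 95% credible intervals from posterior draws and counts how many contain the generating values. The draws here are mocked up with a normal approximation so the snippet runs without Stan; with real CmdStanPy output you would use `sm_fit.stan_variable("lambda")` for `model_lams` instead.

```python
import numpy as np

rng = np.random.default_rng(42)

# Mocked stand-ins for the thread's objects (an assumption of this sketch):
# `lams` are the generating values, and `model_lams` fakes posterior draws
# using a normal approximation with sd ~ sqrt(lam / N).
D, N = 20, 1000
lams = np.abs(rng.normal(loc=3, scale=1, size=D))
model_lams = rng.normal(loc=lams, scale=np.sqrt(lams / N), size=(4000, D))

# 95% credible interval per parameter, from the 2.5th and 97.5th percentiles
lo, hi = np.percentile(model_lams, [2.5, 97.5], axis=0)
covered = (lo <= lams) & (lams <= hi)
print(f"{covered.sum()} of {D} intervals contain the generating value")
```

Under a well-calibrated model you should expect roughly 95% coverage across many simulated datasets, not 100% in every run.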

As an aside, I think you could speed your code up by flattening the matrices for the sampling statement. Something like

to_vector(y) ~ poisson(to_vector(lambda));

Smarter folks such as @andrjohns can advise.

To second @maxbiostat’s comment, I don’t believe there’s an issue here.

To verify, I ran your example (in R):

mod = "
data {
  int<lower=0> N;
  int<lower=0> D;
  int y[N, D];
}

parameters {
  vector<lower=0>[D] lambda;
}

model {
  lambda ~ normal(3, 1);

  for (i in 1:D) {
    y[, i] ~ poisson(lambda[i]);
  }
}"

library(rstan)

set.seed(2021)

N = 1000
D = 20
lams = rnorm(D,3,1)
tbl = sapply(lams,\(x){ rpois(N,x) })

stanmod = stan(model_code=mod,data=list(N=N,D=D,y=tbl),seed=2021)

# Display the generated numbers next to the posterior means
#   and credible intervals
data.frame("RNG"  = lams,
           "Mean" = summary(stanmod, pars = "lambda")$summary[, 1],
           "LCI"  = summary(stanmod, pars = "lambda")$summary[, 4],
           "UCI"  = summary(stanmod, pars = "lambda")$summary[, 8])

With the result:
                RNG     Mean      LCI      UCI
lambda[1]  2.877540 2.828699 2.718493 2.939306
lambda[2]  3.552457 3.549054 3.427303 3.669426
lambda[3]  3.348650 3.330822 3.224224 3.439157
lambda[4]  3.359632 3.364761 3.250505 3.480708
lambda[5]  3.898054 3.898179 3.776413 4.020687
lambda[6]  1.077430 1.102559 1.036426 1.170559
lambda[7]  3.261744 3.336227 3.225798 3.445680
lambda[8]  3.915566 4.105481 3.986568 4.227281
lambda[9]  3.013772 3.067185 2.960315 3.176200
lambda[10] 4.729963 4.720221 4.585500 4.854437
lambda[11] 1.917795 1.933399 1.848622 2.021750
lambda[12] 2.727175 2.773565 2.672875 2.874267
lambda[13] 3.181995 3.259899 3.153894 3.369741
lambda[14] 4.508542 4.443493 4.315403 4.571365
lambda[15] 4.604470 4.521787 4.389557 4.655715
lambda[16] 1.158524 1.228154 1.162206 1.295991
lambda[17] 4.623310 4.622765 4.494721 4.754011
lambda[18] 3.131389 3.081907 2.977046 3.187859
lambda[19] 4.481122 4.563573 4.430324 4.696709
lambda[20] 4.513318 4.496616 4.367645 4.627121

All posterior means are within a decimal place or two of the RNG values, and every generating value falls comfortably within its credible interval.
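For anyone following along in Python rather than R, the same RNG-vs-posterior comparison table can be assembled with pandas. This is a sketch with mocked draws (an assumption so it runs standalone); in the real workflow `model_lams` would come from `sm_fit.stan_variable("lambda")`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)

# Mocked posterior draws for illustration only; substitute the real
# CmdStanPy draws in practice.
D, N = 20, 1000
lams = np.abs(rng.normal(loc=3, scale=1, size=D))
model_lams = rng.normal(loc=lams, scale=np.sqrt(lams / N), size=(4000, D))

comparison = pd.DataFrame(
    {
        "RNG": lams,
        "Mean": model_lams.mean(axis=0),
        "LCI": np.percentile(model_lams, 2.5, axis=0),
        "UCI": np.percentile(model_lams, 97.5, axis=0),
    },
    index=[f"lambda[{i + 1}]" for i in range(D)],
)
print(comparison)
```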
