# Having trouble recovering known parameters from simulated data

Hello, all.

I’m interested in using Stan for a project involving Poisson (and related) regressions. I’m simulating data to make sure I can recover known parameters from a simple model, but I’m having more trouble than I expected. Consistently across simulation attempts, some of the parameter estimates match the true values fairly poorly (even when I specify the prior as the true generating distribution). This makes me think I have a fundamental misunderstanding of simulating data, [Bayesian] statistics, Stan, or CmdStanPy (or, likely, all of the above).

# Simulated data:

y_i \sim \textrm{Poisson}\left(\lambda_i\right)

\lambda_i \sim \textrm{Normal}\left(\mu=3, \sigma=1\right)

# Stan Code

data {
  int<lower=0> N;
  int<lower=0> D;
  int y[N, D];
}

parameters {
  vector<lower=0>[D] lambda;
}

model {
  lambda ~ normal(3, 1);

  for (i in 1:D) {
    y[, i] ~ poisson(lambda[i]);
  }
}


# Python Code

import cmdstanpy
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng()

N = 1000  # number of samples
D = 20    # number of microbes

lams = rng.normal(loc=3, scale=1, size=D)  # draw lambda values from norm
tbl = np.zeros([N, D])

for i, lam in enumerate(lams):
    g = rng.poisson(lam=lam, size=N)
    tbl[:, i] = g

sm = cmdstanpy.CmdStanModel(stan_file="poisson.stan")
data = {
    "N": N,
    "D": D,
    "y": tbl.astype(int),
}
sm_fit = sm.sample(data=data, output_dir="output")

model_lams = sm_fit.stan_variable("lambda")

fig, axs = plt.subplots(4, 5, figsize=(8, 8))
for i, ax in enumerate(axs.flatten()):
    ax.hist(model_lams[:, i], bins=30)
    ax.axvline(x=lams[i], color="black")
    ax.set_yticks([])
    ax.set_title(f"$\\lambda = {round(lams[i], 2)}$")
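Rather than eyeballing the histograms, it can help to count how many of the 95% credible intervals actually contain the generating values. A minimal sketch of that check follows; note that `true_lams` and `draws` below are synthetic stand-ins for `lams` and `sm_fit.stan_variable("lambda")` from the script above:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 20

# Stand-ins for the real objects: replace with `lams` and
# `sm_fit.stan_variable("lambda")` when running against the actual fit.
true_lams = rng.normal(3, 1, size=D)
draws = true_lams + rng.normal(0, 0.06, size=(4000, D))  # fake posterior draws

# 95% equal-tailed credible interval per parameter, from the draws
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
covered = (lo <= true_lams) & (true_lams <= hi)
print(f"{covered.sum()}/{D} credible intervals contain the generating value")
```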


Hi,

I don’t see a problem recovering the parameters here. You’ll notice that the 95% Bayesian credible intervals for all \lambda_i include the generating values. In general, there’s no theoretical guarantee that a BCI will include the parameter value, even in a well-posed model and using a well-calibrated algorithm.
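To make that concrete: even a perfectly calibrated 95% interval misses the true value about 1 time in 20. Here is a quick simulation using a conjugate Gamma(9, rate = 3) prior, a hypothetical stand-in for the Normal(3, 1) prior (chosen so the posterior is available in closed form), drawing \lambda from the prior each time so the intervals are exactly calibrated on average:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gamma(9, rate=3) prior has mean 3 and sd 1 -- a conjugate stand-in
# for the Normal(3, 1) prior in the model above.
a0, b0 = 9.0, 3.0
n, sims = 1000, 400

hits = 0
for _ in range(sims):
    lam = rng.gamma(shape=a0, scale=1.0 / b0)  # draw lambda from the prior
    y = rng.poisson(lam=lam, size=n)           # simulate data
    a_post, b_post = a0 + y.sum(), b0 + n      # conjugate Gamma posterior
    # normal approximation to the Gamma posterior (a_post ~ 3000, so fine)
    mean, sd = a_post / b_post, np.sqrt(a_post) / b_post
    hits += mean - 1.96 * sd <= lam <= mean + 1.96 * sd

print(f"coverage: {hits / sims:.3f}")  # hovers around 0.95, not 1.0
```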

As an aside, I think you could speed your code up by flattening the matrices for the sampling statement. Something like

to_vector(y) ~ poisson(to_vector(lambda));


Smarter folks such as @andrjohns can advise.
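One bookkeeping caveat with flattening: `to_vector(lambda)` has length D while the flattened `y` has length N * D, so `lambda` would need replicating (e.g. via `rep_matrix`), and the replication order has to match the flattening order. A numpy sketch of that bookkeeping (not Stan code): the column-by-column likelihood equals the flattened one only when `lambda` is tiled to line up with the row-major flattening of `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 20
lams = np.clip(rng.normal(3, 1, size=D), 0.1, None)
y = rng.poisson(lam=lams, size=(N, D))  # broadcasting: column i uses lams[i]

# Unnormalized Poisson log likelihood, summed column by column...
ll_cols = sum((y[:, i] * np.log(lams[i]) - lams[i]).sum() for i in range(D))

# ...and the flattened version: y.ravel() is row-major, so lambda must be
# tiled (the whole vector repeated N times) to pair up correctly.
lam_flat = np.tile(lams, N)
ll_flat = (y.ravel() * np.log(lam_flat) - lam_flat).sum()

print(np.isclose(ll_cols, ll_flat))  # the two orderings agree
```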

To second @maxbiostat’s comment, I don’t believe there’s an issue here.

To verify, I ran your example (in R):

mod = "
data {
  int<lower=0> N;
  int<lower=0> D;
  int y[N, D];
}

parameters {
  vector<lower=0>[D] lambda;
}

model {
  lambda ~ normal(3, 1);

  for (i in 1:D) {
    y[, i] ~ poisson(lambda[i]);
  }
}"

library(rstan)

set.seed(2021)

N = 1000
D = 20
lams = rnorm(D, 3, 1)
tbl = sapply(lams, \(x) rpois(N, x))

stanmod = stan(model_code = mod, data = list(N = N, D = D, y = tbl), seed = 2021)

# Display the generated numbers next to the posterior means
# and credible intervals
data.frame("RNG"  = lams,
           "Mean" = summary(stanmod, pars = "lambda")$summary[, 1],
           "LCI"  = summary(stanmod, pars = "lambda")$summary[, 4],
           "UCI"  = summary(stanmod, pars = "lambda")$summary[, 8])

With the result:
                RNG     Mean      LCI      UCI
lambda[1]  2.877540 2.828699 2.718493 2.939306
lambda[2]  3.552457 3.549054 3.427303 3.669426
lambda[3]  3.348650 3.330822 3.224224 3.439157
lambda[4]  3.359632 3.364761 3.250505 3.480708
lambda[5]  3.898054 3.898179 3.776413 4.020687
lambda[6]  1.077430 1.102559 1.036426 1.170559
lambda[7]  3.261744 3.336227 3.225798 3.445680
lambda[8]  3.915566 4.105481 3.986568 4.227281
lambda[9]  3.013772 3.067185 2.960315 3.176200
lambda[10] 4.729963 4.720221 4.585500 4.854437
lambda[11] 1.917795 1.933399 1.848622 2.021750
lambda[12] 2.727175 2.773565 2.672875 2.874267
lambda[13] 3.181995 3.259899 3.153894 3.369741
lambda[14] 4.508542 4.443493 4.315403 4.571365
lambda[15] 4.604470 4.521787 4.389557 4.655715
lambda[16] 1.158524 1.228154 1.162206 1.295991
lambda[17] 4.623310 4.622765 4.494721 4.754011
lambda[18] 3.131389 3.081907 2.977046 3.187859
lambda[19] 4.481122 4.563573 4.430324 4.696709
lambda[20] 4.513318 4.496616 4.367645 4.627121


All posterior means are within a decimal place or two of the RNG values, and all generating values fall comfortably within the credible intervals.
