Hello, all.
I’m interested in using Stan for a project involving Poisson (and related distributions) regressions. I am simulating data to make sure I can properly retrieve parameters from a simple model but I’m having more trouble than I thought. Consistently across simulation attempts some of the parameter samples are estimating the true values fairly poorly (even specifying the prior as the true underlying distribution). This makes me think that I have a fundamental misunderstanding of simulating data, [Bayesian] statistics, Stan, or CmdStanPy (or likely all of the above).
Simulated data:
y_i \sim \textrm{Poisson}\left(\lambda_i\right)
\lambda_i \sim \textrm{Normal}\left(\mu=3, \sigma=1\right)
Stan Code
data {
int<lower=0> N;
int<lower=0> D;
int y[N, D];
}
parameters {
vector<lower=0>[D] lambda;
}
model {
lambda ~ normal(3, 1);
for (i in 1:D){
y[, i] ~ poisson(lambda[i]);
}
}
Python Code
import cmdstanpy
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng()
N = 1000 # number of samples
D = 20 # number of microbes
lams = rng.normal(loc=3, scale=1, size=D) # draw lambda values from norm
tbl = np.zeros([N, D])
for i, lam in enumerate(lams):
g = rng.poisson(lam=lam, size=N)
tbl[:, i] = g
sm = cmdstanpy.CmdStanModel(stan_file="poisson.stan")
data = {
"N": N,
"D": D,
"y": tbl.astype(int)
}
sm_fit = sm.sample(data=data, output_dir="output")
model_lams = sm_fit.stan_variable("lambda")
fig, axs = plt.subplots(4, 5, figsize=(8, 8))
plt.subplots_adjust(hspace=0.5)
for i, ax in enumerate(axs.flatten()):
ax.hist(model_lams[:, i], bins=30)
ax.axvline(x=lams[i], color="black")
ax.set_yticks([])
ax.set_title(f"$\\lambda = {round(lams[i], 2)}$")