I’m working with data that has zeros during covid. Using a lognormal model (of y+1), I’m finding that these zeros blow up my posterior distribution: I get a huge variance. (I’m using PyStan 2.19.1.1.)
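To see why I suspect the zeros specifically, here is a quick back-of-the-envelope check (not part of the model, and the numbers are only rough) of how much a single zero stretches the spread of log(y + 1) when the typical values are around 2e8:

import numpy as np

# One zero among values of order 2e8: log(0 + 1) = 0, while the rest sit
# around log(2e8 + 1), roughly 19. That single point inflates the standard
# deviation on the log scale from 0 to almost 2.
vals = np.full(100, 2e8)
vals_with_zero = vals.copy()
vals_with_zero[0] = 0.0
print(np.std(np.log(vals + 1)))            # 0.0
print(np.std(np.log(vals_with_zero + 1)))  # roughly 1.9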
Here’s a toy example that illustrates the problem. I generate exponential data shifted away from zero, then create a second dataset with two of the values forced to be very small.
import numpy as np
import pandas as pd
import pystan
import arviz as az

N = 100
np.random.seed(123)

# Shifted exponential data; values are centered around 2e8.
y = np.random.exponential(scale=1e8, size=N) + 1e8
df = pd.DataFrame(y, columns=['y'])

# Second dataset: force two of the observations to be very small.
df1 = df.copy()
df1.loc[12:13, 'y'] = 1

model_data1 = {
    'N': len(df),
    'y': df['y'],
}
model_data2 = {
    'N': len(df1),
    'y': df1['y'],
}
This is how the data looks in df1; note that the minimum value is 1.
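(For anyone reproducing this without the plot, a quick pandas summary shows the same thing; this is just a convenience check, not part of the model pipeline.)

# Quick look at the altered data: the minimum is 1, everything else is around 2e8.
print(df1['y'].describe())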
Stan code (this is the string passed to PyStan as model_code below):
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real theta;
  real<lower=0> sigma;
}
model {
  y ~ lognormal(rep_vector(theta, N), sigma);
}
generated quantities {
  real y_rep[N] = lognormal_rng(rep_vector(theta, N), sigma);
}
Compile the model, fit it, and plot the posterior predictive density, first for the unaltered data:
sm = pystan.StanModel(model_code=model_code, verbose=True)
sm_fit1 = sm.sampling(data=model_data1, iter=1000, chains=4)
ppc_data1 = az.from_pystan(posterior=sm_fit1, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data1, data_pairs={"y": "y_rep"})
Looks okay:
Now for the data with small values:
sm_fit2 = sm.sampling(data=model_data2, iter=1000, chains=4)
ppc_data2 = az.from_pystan(posterior=sm_fit2, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data2, data_pairs={"y": "y_rep"})
Whoa!
Note the scale of the x-axis.
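For what it’s worth, comparing the parameter summaries of the two fits (a side check, not needed to reproduce the plots) shows it is sigma that explodes for the altered data:

# Compare posterior summaries of theta and sigma across the two fits.
print(az.summary(ppc_data1, var_names=["theta", "sigma"]))
print(az.summary(ppc_data2, var_names=["theta", "sigma"]))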
When I set the minimum value to be 1e5 = 100000, I get this (recall the data is centered around 2e8):
This is a bit better.
And when the minimum value is 1e7 = 10000000, I get:
So I get a better fit when the data stays away from zero.
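In case the setup for those plots matters: the extra datasets are just copies of df with a different floor for the two altered observations, and the spread of log(y) shrinks as that floor moves up toward the bulk of the data. A quick sketch (variable names here are just for illustration):

# For each floor, build the altered dataset and look at the spread of log(y);
# the standard deviation on the log scale shrinks as the floor rises.
for floor in [1, 1e5, 1e7]:
    df_min = df.copy()
    df_min.loc[12:13, 'y'] = floor
    print(floor, np.log(df_min['y']).std())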
Is this behavior something special about having values near zero? Or is it about the variance in the data? What is the takeaway here?