Stan doesn't like (near-) zeros?

maswiebe · November 11, 2021, 2:13am

I’m using data that has zeros during covid. Using a lognormal model (of y+1), I’m finding that this blows up my posterior distribution, getting a huge variance. (I’m using PyStan 2.19.1.1.)

Here’s a toy example to illustrate the problem. I generate exponential data, then create a second dataset with some values forced to be very small.

N = 100
np.random.seed(123)
y = np.random.exponential(scale=1e8, size=N) + 1e8
df = pd.DataFrame(y, columns=['y'])
df1 = df.copy()
df1.loc[12:13, 'y'] = 1

model_data1 = {
  'N': len(df),
  'y': df['y'],
}
model_data2 = {
  'N': len(df1),
  'y': df1['y'],
}

This is how the data looks in df1; note that the minimum value is 1.

Stan code:

data {
  int<lower=1> N;
  vector[N] y;
}

parameters {
  real theta;
  real<lower=0> sigma;
}

model {
  y ~ lognormal(rep_vector(theta,N),sigma);
}

generated quantities {
    real y_rep[N] = lognormal_rng(rep_vector(theta,N), sigma);
}

Compile and get the posterior predictive density, first for the unaltered data:

sm = pystan.StanModel(model_code=model_code, verbose=True)

sm_fit1 = sm.sampling(data=model_data, iter=1000, chains=4)
ppc_data1 = az.from_pystan(posterior=sm_fit1, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data1, data_pairs={"y": "y_rep"})

Looks okay:

Now for the data with small values:

sm_fit2 = sm.sampling(data=model_data2, iter=1000, chains=4)
ppc_data2 = az.from_pystan(posterior=sm_fit2, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data2, data_pairs={"y": "y_rep"})

Whoa!

Note the scale of the x-axis.

When I set the minimum value to be 1e5 = 100000, I get this (recall the data is centered around 2e8):

This is a bit better.

And when the minimum value is 1e7 = 10000000, I get:

So I get a better fit when the data stays away from zero.

Is this behavior something special about having values near zero? Or is it about the variance in the data? What is the takeaway here?

jsocolar · November 11, 2021, 3:05am

The takeaway is that the best fit lognormal when the data contain those tiny values doesn’t fit well at all. The logarithm of 2e8 is approximately 19. The logarithm of 1e8 is roughly 18.4. The logarithm of 1 is zero. So we need a Gaussian that puts a lot of probability density in the vicinity of 18 or 19, but also puts non-negligible probability density at zero. The only way to achieve this is to pick a gaussian that also puts a non-negligible probability density up near 30 or higher. And indeed you see in your plot that the x axis runs up to 5e13, whose logarithm is about 31.5.

To accommodate the probability mass in the vicinity of 2e8 while also accommodating outliers down around 1, you need a distribution whose logarithm has a much longer left tail than a Gaussian.

maswiebe · November 11, 2021, 6:45pm

Great answer, thank you!

maswiebe · November 12, 2021, 6:35pm

Here’s the histogram when the minimum value is 1:

And here it is after taking log:

So essentially, I’m trying to fit a Normal here, which means adding equal probability mass on the right tail, somewhere around 36. The posterior only goes to 31.5, though. Can anyone explain that?

jsocolar · November 12, 2021, 6:57pm

The best-fit normal is perhaps centered a bit further left than you’re assuming.

Carlos_Parada · November 14, 2021, 1:55am

Where is your data taken from? Is it possible that those near-0 measurements are errors?

If not, you’re probably going to want a mixture model of some sort — it’s unlikely the data really are coming from a log-normal, given the huge gap on the log scale.

maswiebe · November 14, 2021, 9:05pm

It’s revenue data, which went to 0 during March 2020 as stores closed.

Any explainers for mixture modelling that you recommend?

jmh530 · November 15, 2021, 3:39pm

Think about it like there is a probability, p, that the value is zero and probability 1-p that it is distributed by some other distribution

maswiebe · November 17, 2021, 2:32am

Hmm, looks like this is not possible for continuous distributions:

Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.

spinkney · November 17, 2021, 2:40am

You can inflate in an epsilon around zero using a normal with a small sd. Depending on the problem this may be a workable solution

stijn · November 17, 2021, 2:58am

If you know which stores closed, you can include this as a covariate. I actually think you want to include it as an interaction with all other variables (If closed all the other variables would not matter).

If you don’t know which stores closed, but you know that all stores with 0 revenue were closed and no open store had 0 revenue it might be easier to just remove the 0s depending on the goal of your model.

More broadly, it seems that you know the mechanism by which stores have 0 revenue, you can try to model that mechanism directly. I think mixture models are more appropriate if you don’t know (for sure) which observations will have 0 revenue.

Topic		Replies	Views
Trouble restricting samples to positive values Modeling techniques , fitting-issues	2	581	June 27, 2020
The myth of log_prob from Stan model? PyStan	7	960	December 7, 2020
Hurdle lognormal distribution Modeling mixture	11	2359	October 9, 2020
About Pystan Convergence Issue Modeling specification	6	1114	August 9, 2020
Toy ODE model problem -- posterior predictives not consistent with real values Modeling fitting-issues , specification	14	1047	June 4, 2019

Stan doesn't like (near-) zeros?

Related topics