Stan doesn't like (near-) zeros?

I’m using data that has zeros during covid. Using a lognormal model (of y+1), I’m finding that this blows up my posterior distribution, getting a huge variance. (I’m using PyStan 2.19.1.1.)

Here’s a toy example to illustrate the problem. I generate exponential data, then create a second dataset with some values forced to be very small.

N = 100
np.random.seed(123)
y = np.random.exponential(scale=1e8, size=N) + 1e8
df = pd.DataFrame(y, columns=['y'])
df1 = df.copy()
df1.loc[12:13, 'y'] = 1

model_data1 = {
  'N': len(df),
  'y': df['y'],
}
model_data2 = {
  'N': len(df1),
  'y': df1['y'],
}

This is how the data looks in df1; note that the minimum value is 1.

Stan code:

data {
  int<lower=1> N;
  vector[N] y;
}

parameters {
  real theta;
  real<lower=0> sigma;
}

model {
  y ~ lognormal(rep_vector(theta,N),sigma);
}

generated quantities {
    real y_rep[N] = lognormal_rng(rep_vector(theta,N), sigma);
}

Compile and get the posterior predictive density, first for the unaltered data:

sm = pystan.StanModel(model_code=model_code, verbose=True)

sm_fit1 = sm.sampling(data=model_data, iter=1000, chains=4)
ppc_data1 = az.from_pystan(posterior=sm_fit1, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data1, data_pairs={"y": "y_rep"})

Looks okay:
image

Now for the data with small values:

sm_fit2 = sm.sampling(data=model_data2, iter=1000, chains=4)
ppc_data2 = az.from_pystan(posterior=sm_fit2, posterior_predictive=["y_rep"], observed_data=["y"])
ax = az.plot_ppc(ppc_data2, data_pairs={"y": "y_rep"})

Whoa!


Note the scale of the x-axis.

When I set the minimum value to be 1e5 = 100000, I get this (recall the data is centered around 2e8):

image
This is a bit better.

And when the minimum value is 1e7 = 10000000, I get:

So I get a better fit when the data stays away from zero.

Is this behavior something special about having values near zero? Or is it about the variance in the data? What is the takeaway here?

The takeaway is that the best fit lognormal when the data contain those tiny values doesn’t fit well at all. The logarithm of 2e8 is approximately 19. The logarithm of 1e8 is roughly 18.4. The logarithm of 1 is zero. So we need a Gaussian that puts a lot of probability density in the vicinity of 18 or 19, but also puts non-negligible probability density at zero. The only way to achieve this is to pick a gaussian that also puts a non-negligible probability density up near 30 or higher. And indeed you see in your plot that the x axis runs up to 5e13, whose logarithm is about 31.5.

To accommodate the probability mass in the vicinity of 2e8 while also accommodating outliers down around 1, you need a distribution whose logarithm has a much longer left tail than a Gaussian.

9 Likes

Great answer, thank you!

Here’s the histogram when the minimum value is 1:

And here it is after taking log:

So essentially, I’m trying to fit a Normal here, which means adding equal probability mass on the right tail, somewhere around 36. The posterior only goes to 31.5, though. Can anyone explain that?

The best-fit normal is perhaps centered a bit further left than you’re assuming.

1 Like

Where is your data taken from? Is it possible that those near-0 measurements are errors?

If not, you’re probably going to want a mixture model of some sort — it’s unlikely the data really are coming from a log-normal, given the huge gap on the log scale.

It’s revenue data, which went to 0 during March 2020 as stores closed.

Any explainers for mixture modelling that you recommend?

Think about it like there is a probability, p, that the value is zero and probability 1-p that it is distributed by some other distribution

Hmm, looks like this is not possible for continuous distributions:

Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.

You can inflate in an epsilon around zero using a normal with a small sd. Depending on the problem this may be a workable solution

1 Like

If you know which stores closed, you can include this as a covariate. I actually think you want to include it as an interaction with all other variables (If closed all the other variables would not matter).

If you don’t know which stores closed, but you know that all stores with 0 revenue were closed and no open store had 0 revenue it might be easier to just remove the 0s depending on the goal of your model.

More broadly, it seems that you know the mechanism by which stores have 0 revenue, you can try to model that mechanism directly. I think mixture models are more appropriate if you don’t know (for sure) which observations will have 0 revenue.

1 Like