Efficient modeling of poisson likelihoods with gaussian observation level errors

I often come across problems with overdispersed poisson-distributed data. For example time series of counts of animals in a given area, e.g. one count per year over 30 years. In these situations it makes sense to me to model these using a gaussian observation level error, a logarithmical link function and a poisson likelihood. The gaussian error would for example represent the effect of weather.

In practice I have the problem that the hyperparameter that controls the size of the observation level errors can vary over several orders of magnitude, which leads to sampling problems. (also with non-centered parameterisations)

I’ve thus recently started trying to approximately integrate out the observation level error, using two different approaches. Basically the idea is to create a custom _lpmf function that includes both the poisson and the gaussian component. But developing them is quite time intensive so I wonder if a solution might already be out there.
Do you know of any existing solutions?
For example I’ve heard of a laplace integrator that has or is beeing developed in stan, but couldn’t find much info about it. Also I wonder whether someone might already have coded a custom lmpf-function that uses an analytical approximation.

Just for completeness:
My first approach is to just use a, not very smart, grid based integrator.
The second approach is to approximate the distribution by a mixture of a negative binomial distribution and a lognormal distribution. This probably would work quite well for high numbers of animals, but for low numbers it fails because the lognormal distribution is continuous while my data is discrete.

1 Like

If you can write your model in equations, it might make it easier for us to see what can be done. There might be a closed form solution lurking about, but for me at least its hard to gauge at the moment.

if you can write us a theoretical equation to see better nothing is better than a mathematical equation to solve a problem with an equation in front of us we can play with the flexibility of mathematics

This is the basic model:



linearPredictor: vector, representing the combined effect of all predictors
y: vector, data to predict, e.g. numberOfBirds counted
e: vector, latent observation level error representing e.g. effect of weather
sd: skalar, a model parameter representing the size of the observation level deviations

And the idea was to find a function “d” so it can be rewritten as:
Thus getting rid of the latent vector e.

But I’d also be happy with any other solution that helps sampling these types of models.

The problem with the model is that the posterior of sd can span orders of magnitude, as sometimes the data allows it to get arbitrarily close to zero. I was playing around with zero-avoiding priors, but excluding areas near zero does noticeably change the shape of the posterior. I tried a non-centered parameterisation, but this fails too, which I think is because it introduces non-linear correlations between model parameters (which, again, vary by orders of magnitude), while the sampler is only able to adapt to linear correlations.

The non-centered version looked like this:


Looks very similar to the models discussed in


Not sure if you’ve already mentioned it, but for your model:


If you assume that e is Gamma-distributed with shape and scale parameters \frac{1}{\phi}:

y \sim Poisson(\exp(\mu + e)) \\ e \sim Gamma(\frac{1}{\phi},\frac{1}{\phi})

then this reduces to a negative-binomial, and the latent vector e can be integrated out:

y \sim NegBin(\mu, \phi)

This is the neg_binomial_2_log distribution in Stan (you don’t need to exponentiate \mu with the _log suffix):

y ~ neg_binomial_2_log(mu, phi);

Additionally, if you’re using this in the context of regression, and the mean parameter \mu is actually an intercept and covariates:

\mu = \alpha + x\beta

Then you can use the neg_binomial_2_log_glm distribution:

y ~ neg_binomial_2_log_glm(x, alpha, beta, phi);
1 Like

That works well if sd (in my formula) is small, but as sd becomes larger than 1 the negative binomial is quite different from lognormal+poisson.
At large sd values the shape of the distribution is dominated by the lognormal, thats why I started to experiment with mixtures of negative-binomial and lognormal. From my experiments it seems like such mixtures could work well for large values of the linear predictor, but not so much for small ones as the log-normal is not a discrete distribution.
I’m not aware of a discrete distribution that “corresponds to” lognormal for large values of sd, but i might well be overlooking something. If anyone knows such a distribution that might help for the mixture solution.

Right, I poked around a bit but I think closed-form marginalisation is out of reach. Nevertheless, we can do a few calculations to help us understand the situation a little bit better.

Consider the following hierarchical model:

\begin{align*} W &\sim \mathcal{P}(\lambda), \\ \lambda &= \exp(\mu + Z), \\ Z &\sim \mathcal{N}(0, \sigma^2), \end{align*}

with \mu \in \mathbb{R} and \sigma \in \mathbb{R}^+.

We would like to compute E[W] and \text{Var}(W).
We will use the laws of total expectation and total variance

\begin{align*} E[X] &= E_Y[E_{X|Y}[X|Y]], \\ \text{Var}(X) &= E_Y[\text{Var}_{X|Y}(X|Y)] + \text{Var}_Y(E_{X|Y}[X|Y]). \end{align*}

So let us first compute E[W|Z] and \text{Var}(W|Z).
From the Poisson distribution we know that if U \sim \mathcal{P}(\nu) then E[U] = \text{Var}(U) = \nu, hence E[W|Z = z] = \text{Var}(W|Z = z) = \exp(\mu + z).
Next we need

\begin{align} E[W] &= E_Z[\exp(\mu + z)] = \int_{-\infty}^\infty \exp(\mu + z) \frac{\exp(-z^2/2\sigma^2)}{\sqrt{2\pi\sigma^2}}dz, \\ &= \exp(\mu + \sigma^2/2). \end{align}


\begin{align} \text{Var}(W) &= E_Z[\exp(\mu + z)] + \text{Var}_Z(\exp(\mu + z)),\\ &= \exp(\mu + \sigma^2/2) + E_Z[\left(\exp(\mu + z)\right)^2] - \left(E_Z[\exp(\mu + z)]\right)^2, \\ &= \exp(\mu + \sigma^2/2) + \left(\exp(\sigma^2)-1\right)\exp(2\mu + \sigma^2). \end{align}

Sanity check

N <- 1E6
mu <- log(5)
sigma2 <- .5
Z <- rnorm(N, mean = 0, sd = sqrt(sigma2))
W <- rpois(N, lambda = exp(mu + Z))
mean(W); exp(mu + sigma2/2)
[1] 6.416886
[1] 6.420127
 var(W); (exp(sigma2)-1) * exp(2*mu + sigma2) + exp(mu + sigma2/2)
[1] 33.14435
[1] 33.15914


No wonder things go awry when the normal random effect’s variance grows, look at what happens to the variance of W when \sigma^2 grows. I don’t quite know what do to from here other than maybe fix the linear predictor (\mu in the calculations above) and see what the induced distribution on the first two moments looks like under various priors for the random effect variance, \sigma^2.

Thanks for the calculations! Although I’m not sure if I understand where you’re going with this. I think that scaling of the variance with sigma is not what is causing the problems of these models, as the latent observation error is on the scale of the linear predictor, where the variance is well behaved.

Here’s a real world example showing one of the datasets I’m modelling with this. What this example shows is that real world data actually does have such high sigma values. The example model has posterior values of sigma^2 of over 1, which is in the range where the negative-binomial approximation breaks down. Here’s the posterior of sigma for one of them:
x-axis logarithmical
blue: prior (normal(0,3), preliminary prior for testing purposes)
red: posterior

Here’s the data and model predictions:
Its a local-linear-slope model, year on the x-axis, number of counted individuals on the (logarithmical) y-axis.
That model works ±ok-ish, but doesn’t sample well enougth for publication quality yet on some of the datasets I’m using it for. The problems of this particular model are only partially related to this thread.

During writing this answer I realized that I made some mistakes in the previous posts:
(Note: It seems I can’t update my posts above anymore to correct the error)
Contrary to what I wrote there, as can be seen from the example model, the problem, at least with the models I currently work on, is not that sigma varies over orders of magnitude (it doesn’t!, although I think i remember I had that problem in the past with other similar models). Rather the problem in my current models is (at least that’s what I think) the correlations between the latent variables used to represent the observation level errors with other latent variabels in the model (e.g. the markov-chain representing population size in structural time series models). Thus what i wrote about the non-centered parameterisation I tried is also wrong. What I non-centered was not the observation level error, but the latent markov-chain in the structural time-series models. This works for “curvy” datasets, but when the datapoints are ± on a straight line, thus allowing for the posterior of the stepsize of population changes between the timesteps in the model to get close to zero, then I get the problem described previously that i need a noncentered parameterisation for the latent markov-chain which (i think) causes problems with the latent gaussian observation level errors because it makes the correlations between them and the non-centered markov-chain non-linear, which means that the sampler can no-longer adapt to them.

So, disregarding the particular problems I have with my current models, the motivation of this thread actually was the general pattern I saw in my work that I gravitate to the type of models explained in the opening post, which can lead to problems for various kinds of reasons. So the idea of this thread was to get rid of the latent observation level error to reduce the number of possible variable interactions, reduce dimensionality, etc, which would bring different benefits for different models.

I’ve only had a quick scheme and I’ll take a deeper look at the conversation. In the meantime a few comments.

As @nhuurre pointed out, we studied a similar model when prototyping HMC using an embedded Laplace approximation. The idea here is to do an approximate marginalization, although with a Poisson likelihood, the approximation is typically very good. Assuming the linear predictor is fixed data, I believe it should be possible to use the same code.

The StanCon notebook includes prototype code and a discussion of the examples: https://charlesm93.github.io/files/lgm_stan.pdf. The functions are not exposed to Stan, given their prototypical nature, but you should be able to install the branch with the code and try it out. I’m happy to provide assistance if you decide to go down this path.

(we’re still actively developing this feature and I’ll put together a post with updates)

1 Like