Trying to understand lp__, as it is not producing the results I expect when modelling count data with a Poisson distribution

Apologies, I feel that this is a fairly basic question; however, after searching high and low I can find no answer. There is a similar post here; however, it doesn’t seem to answer my question.

This stems from trying to find Bayesian point estimates from a Poisson regression model where some of the data can be quite sparse. Note that I need both a point estimate and a credible interval, and the relationship between these is not as expected.

I have isolated my difficulty in understanding what is going on to the following:

Given a simple Poisson model with an improper prior in Stan as follows:

poisson_model <- "data{
    int<lower=1> n;
    int<lower=0> x[n];
}

parameters{
    real<lower=0> lambda;
}

model{
  x ~ poisson(lambda);  // no prior statement: implicit flat prior on lambda
}"
library(rstan)
set.seed(1)
n <- 2
lambda <- 10
x <- rpois(n , lambda)

mod <- stan_model(model_code = poisson_model)
map <- optimizing(mod, data = list(n = n, x = x))
fit_mod <- sampling(mod, data = list(n = n, x = x), iter = 12000, warmup = 2000, chains = 1) 
# using just 1 chain, as there are no convergence issues in such a simple model

map$par                              # point estimate from optimizing(): sum(x)/n = 9
summary(fit_mod)
mcmc <- extract(fit_mod)
mcmc$lambda[which.max(mcmc$lp__)]    # lambda at the highest lp__: 9.5 here

I can use this to find the MLE (or MAP, if I include a prior) using the optimizing function in rstan. This comes out as expected and is equal to sum(x)/n, or 9 in the example above.

However, when I sample from the model using the sampling function and look at the value of lambda that maximises lp__, it turns out to be equal to (sum(x) + 1)/n, or 9.5 in the example above.

My expectation was that the maximum of lp__ would occur at the mode of the posterior of lambda, and therefore at the MLE of lambda. It turns out instead that lp__ is maximised at the mean of the posterior of lambda. I am clearly missing something basic.

A further observation is that the posterior distribution of lambda is a Gamma(sum(x) + 1, n) distribution, which is what you get if the prior is a Gamma(alpha, beta) distribution with alpha = 1 and beta approaching 0.
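For example, this can be checked visually against the draws above (a quick sketch, reusing mcmc, x, and n from the code above):

# compare the sampled posterior of lambda with the analytical Gamma(sum(x) + 1, n)
a <- sum(x) + 1  # shape; computed first so the data x isn't shadowed inside curve()
hist(mcmc$lambda, breaks = 50, freq = FALSE, main = "Posterior of lambda")
curve(dgamma(x, shape = a, rate = n), add = TRUE, col = "red")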

I am happy to do the hard reading if there is a suitable reference or link to a response.

1 Like

What you are describing seems right to me. Unless I’m missing something, the MAP should be nothing more than the mode of the posterior, i.e. the draw with the highest lp__ in the chain. Short of a bug (or something I could be missing), there is no reason it shouldn’t be.

The problem, however, may be as simple as a fixed seed guaranteeing the same result for the same method, but not across different methods. So maybe instead you can run the same thing with several different seeds and see if the same problem occurs; if on average you get the right point estimate for both methods, then it should be fine. I’m also guessing (sum(x) + 1)/n = 9.5 is just a coincidence (you can always find a number s that will relate your estimate to sum(x) and n).

Is the result that the posterior is a Gamma distribution an analytical one? Even if it is, it doesn’t guarantee (to some stochastic degree) that the sampled posterior will have the mode you expect. Unless someone else more familiar with the implementation of the optimizing function in Stan has other comments, I’d try to rule out random variation from chain to chain before thinking about a bug or something more complex.
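Something like this would do for the multi-seed check (a sketch; it reuses mod, n, and x from the original post and the seed argument to sampling()):

# lambda at the maximum lp__ for several different seeds
sapply(1:5, function(s) {
  f <- sampling(mod, data = list(n = n, x = x), iter = 12000, warmup = 2000,
                chains = 1, seed = s, refresh = 0)
  d <- extract(f)
  d$lambda[which.max(d$lp__)]
})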

Unless I’m missing something (always possible), in your example

`x = c(8, 10)` 
`sum(x)/n = (8+10)/2 = 9.5`

The MLE and MAP (with a flat prior) are both 9.5.

Did you include a typo in the above? MLE = (10 + 8)/2 = 9.0

I’m almost certain that the seed is not the problem. While the sampling is random and may be affected by the seed, the point where lp__ is maximal will be the same as long as the relevant domain is covered by the sampling. I also don’t think it is a bug, but rather a gap in my understanding. I have added to the question above, as I have found that restating the model slightly changes where lp__ is maximal and produces the MLE. While I think this is a step along the way to understanding what is happening, I am still confused.

A few things to note here. Firstly, the ~ syntax drops any normalising constants from the log-probability calculations, so you should use the target += poisson_lpmf(x | lambda) form instead if you’re interested in inference with the lp__ value.
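For instance, the model block above could be restated as (a sketch, keeping the data and parameters blocks of the original model unchanged):

model{
  // retains the log(x!) normalising constants in target, and hence in lp__
  target += poisson_lpmf(x | lambda);
}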

Additionally, the use of the lower bound on the rate parameter lambda means that an additional correction needs to be made to the log-probability calculations to account for the transformation to the unconstrained scale. I believe this is just the addition of log(lambda) at each iteration.

On top of this, you also need to account for the implicit uniform prior that is placed on lambda. If you instead explicitly specified a prior (again using the target += notation), that would help you in being able to recover the lp__ value from a given set of inputs.
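Putting those pieces together, lp__ for the original model should be recoverable from the draws like this (a sketch; it assumes mcmc and x from the example above are in scope, and a flat prior contributing nothing):

# reconstruct lp__ for each draw of lambda from the constrained model
lp_manual <- sapply(mcmc$lambda, function(l) {
  sum(dpois(x, l, log = TRUE)) +  # full Poisson log-likelihood
    sum(lfactorial(x)) +          # cancel the -log(x!) terms in dpois(), which ~ drops from lp__
    log(l)                        # Jacobian term for the <lower=0> constraint
})
head(cbind(mcmc$lp__, lp_manual))  # should agree up to floating-point error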

3 Likes

Further to my post above, I have slightly restated my model as follows:

poisson_model <- "data{
    int<lower=1> n;
    int<lower=0> x[n];
}

parameters{
    real log_lambda;  // unconstrained, so sampling applies no Jacobian adjustment
}
transformed parameters{
  real<lower=0> lambda = exp(log_lambda);
}
model{
  x ~ poisson(lambda);
}"

When I run this new model, I now find that the value of lp__ is maximal at the MLE of lambda (9 in my example).
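For completeness, the check is the same as before (a sketch, assuming the restated model string above is compiled in place of the original):

mod2 <- stan_model(model_code = poisson_model)  # the restated model above
fit_mod2 <- sampling(mod2, data = list(n = n, x = x), iter = 12000, warmup = 2000, chains = 1)
mcmc2 <- extract(fit_mod2)
mcmc2$lambda[which.max(mcmc2$lp__)]  # now comes out close to sum(x)/n = 9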

2 Likes

Oops!

1 Like

I don’t see a gap in your understanding of the concepts you describe, which seem correct to me. I’m not sure I understand how what @andrjohns describes would affect what the MAP estimate is, but that is something on the implementation side and in the treatment of the log-probability, not the basic concept. But that may just be it.

Like you say, that is true under the assumption that the posterior has enough resolution for the MAP to be precisely picked out. I agree that a difference of 0.5 seems large for such a simple model, and I wouldn’t expect it to vary that much between runs; but the fact that you actually recover the true value with a simple transformation in your restated model supports your understanding being correct and something else being the cause.

Comparing multiple runs with both models is a relatively easy empirical test; alternatively, you could dig deeper into the implementation aspects that may be causing the issue. Those are my two suggestions, but maybe someone can indeed point to a conceptual gap I’m missing; I just don’t see it.

A reason why the restated model would be consistent but not the original is possibly that the sampled parameter log_lambda no longer has a lower bound, and so no correction to the log-probability is being made while sampling.
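To spell out the arithmetic under that reading (a sketch, in the notation of the thread and assuming the flat prior): with the <lower=0> constraint, Stan samples log(lambda), so up to dropped constants

lp__ = sum(x) * log(lambda) - n * lambda + log(lambda)

Setting the derivative with respect to lambda to zero gives sum(x)/lambda - n + 1/lambda = 0, i.e. lambda = (sum(x) + 1)/n = 9.5. In the restated model the sampled parameter log_lambda is unconstrained, the extra log(lambda) term never appears, and lp__ is maximised at the MLE, sum(x)/n = 9.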

2 Likes

Hi, thanks for your help; you were right that the issue was to do with the fact that lambda is constrained.

I have now managed to get the behaviour that I expected by adding the log of the Jacobian (i.e. log(1 / lambda)) to lp__. This adjusted value is now maximal at the MLE/MAP and agrees with the optimizing function.
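For reference, the adjustment on the draws from the original constrained model looks like this (a minimal sketch):

# remove the Jacobian term from lp__ so it peaks at the lambda-space mode
lp_adjusted <- mcmc$lp__ + log(1 / mcmc$lambda)
mcmc$lambda[which.max(lp_adjusted)]  # now close to the MLE/MAP, sum(x)/n = 9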

What I have learnt also illustrates the importance of the parameterisation chosen (and, by extension, the implied prior), as both the mean and the whole posterior of lambda are different under my change of parameterisation.

Thanks again for your help.

2 Likes

Just to add more context: similar discussions have been held in
Pystan log-probability confusing! and How exactly is lp__ being calculated, which could shed more light on why all this happens :-)

6 Likes

Thanks - this is very helpful and supports my understanding. Would definitely vote for Bob to be king of Stan.