Large Poisson model with Individual effects is too slow

Brief description of the problem

I need to speed up the estimation of a hierarchical Poisson demand model with individual effects for a large dataset.
I wonder if there are any obvious performance improvements I could do, like vectorization, reparametrizations, etc. Also I would like some opinions on whether this problem is at all feasible in Stan.
Also, when I try to estimate it with a ~1.5MM observations sample (about 8000 products and 16 groups) it gets stuck. I think it might be a memory issue.


I want to implement a demand model for electronic products. I’m using a hierarchical Poisson model where the coefficients of interest are the price responses and where I need to include product specific and week effects.
The levels of the hierarchy are determined by the different groups products belong to (each product belongs to only one group). The motivation for a hierarchical model is that some products have limited data or minimal price variation during the sampled period and it is desirable that their estimates be pulled towards their group means.

I observe daily sales, prices and other product covariates.

An important assumption I’m making is that I need within week variation to identify the “true” price response (let’s take this as given for the purpose of the question). I could do this by including product-week dummy variables but doing this is problematic. To avoid the “incidental parameters problem” associated with these individual effects, and also to reduce the number of parameters to be estimated I use a transformation of the Poisson that leads to an equivalent likelihood function that looks like the product of multinomials (*). This leads to a model in which there are no dummies but the other parameters of the Poisson are identified through within week variation for each product.

Stan Code

This is the Stan model in which the likelihood is given by the product of within product-week observations.

// Poisson model using product-week FE and multinomial representation so that
// the FE are concentrated out and not estimated.

data {
  int<lower=1> N;          // nb of obs
  int<lower=1> J;          // nb of products
  int<lower=1> G;          // nb of groups
  int<lower=1> W;          // nb product-week
  int<lower=1> counts[W];  // W counts, i.e. nb of obs in each product-week (normally 7)
  int<lower=1> product[N]; // index of product for row n; map obs to product
  int<lower=1> group[J];   // group of product j
  int<lower=0> y[N];       // number of sales
  real p[N];               // prices

parameters {
  real mu_beta[G];         
  real<lower=0> sigma_beta[G];
  real beta_raw[J]; 

transformed parameters {
  vector[N] lambda;
  real beta[J];
  // center and scale parameters
  for (j in 1:J) 
    beta[j] = mu_beta[group[j]] + sigma_beta[group[j]] * beta_raw[j]; // price coefficient, different for each product
  // poisson parameter
  for (n in 1:N)
    lambda[n] = p[n] * beta[product[n]];

model {
  // priors
  beta_raw   ~ normal(0, 1);
  mu_beta    ~ normal(0, 2.5);
  sigma_beta ~ normal(0, 2.5);
  // likelihood of the data
    int pos = 1;
    for (aw in 1:W) {
      segment(y, pos, counts[aw]) ~ multinomial( softmax(segment(lambda, pos, counts[aw])) );
      pos = pos + counts[aw];

R Code

This code generates a synthetic data set and estimates the model. It assumes the Stan file is called discourse_poisson_FE.stan.

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# In my application G = 16, and Jg > 500 which results
# in a dataset > 1,500,000.

G = 4             # number of groups
Jg = 50           # number of products per group
J = Jg * G        # total number of products
W = 27            # number of weeks
week_days = 7
D = W * week_days # numbe of days

mu_price = 50 + (rbeta(G, 3, 3) - 0.5) * 50
sigma_price = 100 + (rbeta(G, 5, 5) - 0.5) * 50

mu_beta = rbeta(n = G, 4, 3) * 2
sigma_beta = rbeta(1000, 2, 5) * 6 + 0.1

mu_alpha = 0.5
sigma_alpha = 1.267

mu_theta = rnorm(0, 2.5)

# Generate dataset. I draw from a Zero inflated Poisson as this is closest 
# to my application.

simdata = CJ(ASIN = 1:J, dd = 1:D)
simdata[, ww     := rep(1:(max(dd)/week_days), each=week_days), by=ASIN]
simdata[, AW_ix  := .GRP, by=.(ASIN, ww)]
simdata[, group  := cut(ASIN, G, labels = F)]

simdata[, price  := 1/100 * rtrunc(.N, a = 0, b = 10000, spec = 'norm', mean = mu_price[group], sd = sigma_price[group]), by=group]
  simdata[, beta  := rtrunc(n = 1, a = 0, b = 1.5, 
                            spec = 'norm', mean = mu_beta[group], 
                            sd = sigma_beta[group]), by=ASIN]

simdata[, alpha   := rnorm(1, beta, sigma_alpha), by=AW_ix]
simdata[, lambda  := exp(0 * alpha - beta * price)]
simdata[, theta   := rnorm(1, -1, 2.5), by=ASIN]
simdata[, nonzero := as.integer(runif(.N) < plogis(theta))]
simdata[, totalquantity       := nonzero * rpois(.N, lambda)]

simdata[, elast    := (1/lambda - 1/(exp(-lambda) + 1)) * beta * mean(price), by=ASIN]

stopifnot( 0 == nrow(simdata[!complete.cases(simdata)]) )

# create list ro pass to Stan
setkey(simdata, AW_ix)

data_stan = list(
  N = simdata[, .N],
  J = simdata[, uniqueN(ASIN)],
  G = simdata[, uniqueN(group)],
  W = simdata[, uniqueN(AW_ix)],
  product = simdata[, ASIN],
  group = simdata[, first(group), by=ASIN][, V1], 
  y = simdata[, totalquantity],
  p = simdata[, price],
  counts = simdata[, .N, by=AW_ix][, N]

model_file = 'discourse_poisson_FE.stan'
compiled_model = stan_model(model_file)

bfit = sampling(compiled_model,

(*) See, and Lancaster 2004.

Lancaster reparameterizations are critical when using Stan, although no one does them. However, his 2004 code was written for BUGS and that makes things a bit more confusing when adapting them to Stan. Basically, you want to put the parameters you integrated out back in. In the Poisson case, Lancaster writes something equivalent to

log(alphastar_g) = log(alpha_g)  + log_sum_exp(x[g] * beta);

so in the parameters block you declare

parameters {
  real log_alphastar[groups];

and in the transformed parameters block you map to

transformed parameters {
  real log_alpha[groups];
  for (g in 1:groups) 
    log_alpha[g] = log_alphastar[g] - log_sum_exp(x[g] * beta);

Finally, in the model block you use the log_alpha in conjunction with a poisson_log_lpmf likelihood with any prior you want on log_alphastar. This is actually faster than doing the multinomial thing when you call poisson_log_lpmf just once. It also makes things easier to do zero-inflation or overdispersion.

Thank you Ben for your answer.

Part of the reason I used the multinomial transformation is to avoid estimating the fixed effects (aka individual effects) for each product-week. The reason is that the number of these parameters grows more than proportionally with the number of products and I end up with millions of them. Even though the theory says that estimating them should be fine because the Poisson model doesn’t suffer from incidental parameters bias, it is still very burdensome numerically to have so many parameters.

Looking at your suggestion it appears to me that Stan would be estimating them in your approach. Is that right?

That Stan program can’t be right because it won’t parse.

If it gets stuck, you might want to try starting with a smaller stepsize.

We really need multinomial_logit to avoid the explicit softmax.

You can vectorize this

for (n in 1:N)
    lambda[n] = p[n] * beta[product[n]];


lambda = p .* beta[product];

It won’t be any more efficient, just safer in that you can’t get the wrong loop bounds or sizes and still have it work. Same for

for (j in 1:J) 
    beta[j] = mu_beta[group[j]] + sigma_beta[group[j]] * beta_raw[j]; 


beta = mu_beta[group] + sigma_beta[group] .* beta_raw;

You’ll need to declare things like beta as vectors.

You can also make those transformed parameters into local variables in the model, but then you won’t have them reconstructed easily.

There’s also the over-arching consideration of using K or K-1 free parameters to characterize the result. You’re using K, which can cause (soft) identifiability issues because only the prior is identifying the model for you . There’s a discussion in the manual.

Yes. If there are millions of them, then that is not going to be fast. But more generally with these sorts of models, if you condition on the sum and integrate out the intercepts, then you can’t do posterior prediction meaningfully. So, you are really confining yourself to only being able to talk about the price elasticity.

Thanks Bob! I will try your suggestions.
I have edited my Stan code and I checked that now it runs on my end.

Ben, I’m not sure I understand your suggestion. Do you have a complete toy model of your solution?

In your answer:

  • Is log_alphastar the poisson rate parameter?
  • And is log_alpha the dummy for the individual effects (fixed effects in econ jargon)?
  • How could we write the terms to be add to the log likelihood?
  • What is the advantage of your method with respect to defining the fixed effect dummies as parameters and the poisson parameter as a transformed parameter?


@bgoodri — I was going to put this in the manual, but I couldn’t quite tell either from your post or from Lancaster’s paper or book, what exactly was being reparameterized to what. Is there a multinomial being reparameterized as a Poisson or vice-versa?

What I need to see to understand this is a before and after model with the distributions other than the priors filled in. From Lancaster’s notation, isn’t log(alpha_b) is the linear component of the GLM? But then that makes me think log_alphastar should be a derived quantity but you have it as a parameter.

I do get that Lancaster is using the basic Poisson-multinomial identity that lets you take a bunch of Poissons and draw the overall size once, then the distribution within that size as a multinomial:

PROD_n  Poisson(y[n] | lambda[n])
Poisson(sum(y) | sum(lambda)
* Multinomial(y | lambda / sum(lambda))

In Stan code, that’d be

poisson_log_lpmf(y | log_lambda)
poisson_log_lpmf(sum(y) | log_sum_exp(log_lambda))
+ multinomial_lpmf(y | softmax(log_lambda))

I’m not quite sure if there’s a constant missing somewhere there.

Also, it’d be nice if we had a multinomial_logit along the lines of categorical_logit—then we cold drop the softmax.

@pauper42 - Out of curiosity, how long did it take for your model to run, and on what hardware?

OK. Let’s back up. Suppose the outcome is some count like "number of triples hit by batter i in year t" with J batters and T years and the predictor is or includes his age. Clearly, some batters are more likely to hit triples than other batters of the same age. Frequentist statisticians would just insert a “random effect” for each batter and integrate them out of the likelihood. Bayesian statisticians would include a separate intercept for each batter with a normal prior that has an unknown standard deviation.

But Lancaster is talking to econometricians who would otherwise insert a “fixed effect” for each batter in the design matrix X (and omit the intercept). These fixed effects cannot be estimated consistently as the number of batters goes to infinity; they can only be estimated consistently if the number of batters stays fixed and the number of years on which each batter is observed goes to infinity. Unlike the case of a linear, Gaussian, identity link model, the lack of consistency when estimating the batter-specific intercepts infects the estimator of the coefficient on age, making it an inconsistent as well.

Lancaster proposes a reparameterization from each alpha to the corresponding alphastar, which allows the likelihood for batter i to factor into two parts and one part only involves the alphastar parameters. If you integrate out alphastar, then you are left with a multinomial likelihood for beta conditional on N, which is the total number of triples that batter i hits over all observed years. You can do that multinomial logit regression and consistently estimate beta as the number of batters goes to infinity, whether your point estimator is a MLE or a posterior mode.

The OP stated that there were “millions” of alpha_i parameters, so it is understandable why someone would want to reparameterize, integrate out the alphastar_i parameters, and get the posterior distribution of beta. While that can be done, it should not be done, in general, because you cannot draw from the posterior predictive distribution. Or at least you can only draw from the posterior predictive distribution conditional on the sum of the outcome for each unit over time, which the likelihood conditions on but isn’t really “given” in many situations.

Ignoring computational considerations, what most people who use Stan should be doing is declaring vector[J] log_alphastar; in the parameters block and mapping to vector[J] log_alpha; in the transformed parameters block (with the log_sum_exp thing in Lancaster). Then use log_alpha and beta to construct the linear predictor for poisson_log_lpmf. The Hessian of the log-likelihood is block diagonal, with one block pertaining to the log_alphastar parameters and one block pertaining to the beta parameters. This makes things easier for NUTS and for what it is worth, the posterior mode of beta is consistent as the number of units goes to infinity (although the posterior modes of log_alphastar are not consistent estimators).

The rest of Lancaster talks about similar reparameterizations with other models in order to integrate the reparameterized “fixed effect” out analytically before using BUGS. The takeaway point of this from a Stan user’s perspective is that you should reparameterize to make the Hessian block diagonal or at least block diagonal on average over the posterior.