Negative Binomial Regression worse mean prediction performance than log_poisson

I’m trying to fit a descriptive model of page views per interval for particular sites through a varying-intercept fixed slope model. At a high level, the model description is that each site gets its own offset o_p, which modifies global trends including daily, weekly, and yearly periodicities and a cubic polynomial on the site’s remaining lifetime. So pageviews are
o_p \sim \text{N}(4.5, 5)
\phi_p^{-1 /2} \sim \text{student_t}(2, 0, 1)
\text{pageviews} \sim \text{NB}(\exp(o_p + X \beta), \phi_p)

I’m not inverting the qr_thin_R since I don’t need to recover the coefficients, I’m really just want the residual distributions

data {
    int<lower=1> num_events;
    int<lower=num_events> num_time_bins;
    int<lower=1, upper=num_events> event[num_time_bins];

    vector[num_time_bins] time_to_event;
    vector[num_time_bins] second_of_year;
    vector[num_time_bins] second_of_week;
    vector[num_time_bins] second_of_day;

    int<lower=0> num_pageviews[num_time_bins];
transformed data {
    int num_time_features = 23;
    matrix[num_time_bins, num_time_features] X_time;
    matrix[num_time_bins, num_time_features] Q_time;
    for (i in 1:num_time_bins) {
        X_time[i] = [
            pow(log(time_to_event[i]), 2.0),
            pow(log(time_to_event[i]), 3.0),
            sin(1.0 * 2.0 * pi() * second_of_year[i]),
            cos(1.0 * 2.0 * pi() * second_of_year[i]),
            sin(2.0 * 2.0 * pi() * second_of_year[i]),
            cos(2.0 * 2.0 * pi() * second_of_year[i]),
            sin(1.0 * 2.0 * pi() * second_of_week[i]),
            cos(1.0 * 2.0 * pi() * second_of_week[i]),
            sin(2.0 * 2.0 * pi() * second_of_week[i]),
            cos(2.0 * 2.0 * pi() * second_of_week[i]),
            sin(3.0 * 2.0 * pi() * second_of_week[i]),
            cos(3.0 * 2.0 * pi() * second_of_week[i]),
            sin(4.0 * 2.0 * pi() * second_of_week[i]),
            cos(4.0 * 2.0 * pi() * second_of_week[i]),
            sin(1.0 * 2.0 * pi() * second_of_day[i]),
            cos(1.0 * 2.0 * pi() * second_of_day[i]),
            sin(2.0 * 2.0 * pi() * second_of_day[i]),
            cos(2.0 * 2.0 * pi() * second_of_day[i]),
            sin(3.0 * 2.0 * pi() * second_of_day[i]),
            cos(3.0 * 2.0 * pi() * second_of_day[i]),
            sin(4.0 * 2.0 * pi() * second_of_day[i]),
            cos(4.0 * 2.0 * pi() * second_of_day[i])
    Q_time = qr_thin_Q(X_time) * sqrt(num_time_bins - 1);
parameters {
    vector[num_events] event_offset;
    vector<lower=0>[num_events] inv_sqrt_phi;
    vector[num_time_features] time_coefficients;
transformed parameters {
    vector[num_time_bins] mu_final = exp(event_offset[event] + Q_time * time_coefficients);
    vector<lower=0>[num_time_bins] phi_final;
    for (i in 1:num_time_bins) {
        phi_final[i] = pow(inv_sqrt_phi[event[i]], -2.0);
model {
    event_offset ~ normal(4.5, 5);
    inv_sqrt_phi ~ student_t(2, 0, 1);
    time_coefficients ~ normal(0, 5);

    num_pageviews ~ neg_binomial_2(mu_final, phi_final);
generated quantities {
    vector[num_time_bins] n = phi_final;
    vector[num_time_bins] r = phi_final ./ (mu_final + phi_final);
    vector[num_time_bins] log_likelihood;
    vector[num_time_bins] quantile;
    for (i in 1:num_time_bins) {
        log_likelihood[i] = neg_binomial_2_lpmf(num_pageviews[i] | mu_final[i], phi_final[i]);
        quantile[i] = neg_binomial_2_cdf(num_pageviews[i], mu_final[i], phi_final[i]);

I also fit a poisson version of the same model, which is underdispersed. Below is the distribution of residual quantiles, ideal is uniform(0, 1)

the negative binomial is better calibrated, though it fails to capture fat tailed pageview events. Maybe mixing a zeta distribution would help, though I’m at a bit of a loss there

What confuses me the most though is that mean prediction for the poisson model is considerably better. Here’s the poisson
bokeh_plot (2)

and the negative binomial
bokeh_plot (4)

R squared for the poisson is ~0.85 and for the negative binomial is ~0.75

Since I’m using the same features and form for the negative binomial \log(\mu) and the poisson \log(\lambda), I would expect the former to be a better descriptive model of the data since it’s essentially the same with an additional parameter for increasing the variance. I thought maybe low dispersion parameters were causing sharp curvature in the posterior surface, creating problems for the gradient descent’s step-size adaptation, but typical \phi are in the respectable range of 6-8. Any idea what’s going wrong here? Maybe I should give the site offset and dispersions proper hierarchical priors?

Other relevant information I can’t explain:

I’m fitting the following using the optimizer since hamiltonian monte carlo refuses to work at all. My computer jet engines start to take off, then they go away and the subprocess hangs.

I’m doing about 90,000 datapoints with 50 groups, which doesn’t seem terribly large, but even using l-BFGS the fitting is pretty slow. 20 gradient evaluations takes about 15 seconds, and the optimization terminates after about a minute and a half with gradient evaluations in the low hundreds. For comparison, the poisson model takes maybe 30 seconds with a thousand gradient evaluations.

With large counts like that, what about also trying a normal or a lognormal likelihood?

Start with smaller data sizes and try to figure out what the limitation is. The computational problems could just be a symptom of a modeling problem (non-identifiability or something). If you get lots of max-treedepth warnings, or divergences, etc, that’s an indicator of a modeling problem.