# Help with Poisson model

I’m having some trouble with a Poisson model. In this model, y_it denotes the number of hospitalizations from practice i at time t, and n_it is the number of patients in that practice.

This is the Stan code that I wrote:

data {
int<lower=0> I;              // number of practices
int<lower=0> T;              // number of time periods
int<lower=1> y[I*T];         // Count outcome
int<lower=0> n[I*T];          // number of patients in period t
int<lower=0> K;              // number of predictors
vector[K] x[I*T];             // predictor matrix
int<lower=1,upper=I> practice[I*T];          // practice index for each row
int<lower=-1,upper=4> time[I*T];          // time covariate for each row
int<lower=1,upper=5> index_time[I*T];          // time period index for each row
}

parameters {
vector[I] eta_a;            // standardized practice intercepts
real<lower=0> tau_a;         // scale of practice intercepts
vector[I] eta_b;            // standardized practice slopes
real<lower=0> tau_b;         // scale of practice slopes
vector[K] beta;              // regression coefficients
real<lower=0> gamma[T];      // year specific intercept
}

transformed parameters {
vector[I] a;
vector[I] b;
real lp[I*T];
real<lower=0> nmu[I*T];      // Poisson rate: exposure times exp(lp)
a = tau_a * eta_a; // Matt trick
b = tau_b * eta_b; // Matt trick

for (it in 1:(I*T)) {
// Linear predictor
lp[it] = dot_product(x[it], beta) + gamma[index_time[it]]  + a[practice[it]] + b[practice[it]]*time[it];
nmu[it] = n[it]*exp(lp[it]);
}
}

model {
eta_a ~ student_t(4, 0, 1);
eta_b ~ student_t(4, 0, 1);
tau_a ~ normal(0, 1);
tau_b ~ normal(0, 1);
gamma ~ normal(0, 1);
beta ~ normal(0, 1);
y ~ poisson(nmu);   // likelihood
}

generated quantities {
real <lower=0> yhat[I*T];
for (it in 1:(I*T)) {
yhat[it] = 1000*nmu[it]/n[it];
}
}


I’m able to fit the model without any errors or warnings. The effective sample size and the Rhat for the parameters all look good. That said, I see the following weird behavior:

• tau_a = 0.1728, yet this is what I get when I plot a and b: [plots of a and b omitted]. How can it be that tau_a is so small while a can take the value -12? Any suggestions on how to fix this?

Thanks!

Ignacio

The way I’d code this is by making x a vector and sticking to the log scale:

y ~ poisson_log(log(n) + x * beta + gamma[index_time] + a[practice] + b[practice] .* time);
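The offset form works because multiplying the rate by the exposure on the natural scale is the same as adding log(n) on the log scale. A quick numeric check of the identity, with made-up values for the exposure and linear predictor:

```python
import math

# Hypothetical exposure and linear predictor, just to check the identity
n, lp = 1200.0, -5.3

# n * exp(lp) on the natural scale equals exp(log(n) + lp) on the log scale
assert math.isclose(n * math.exp(lp), math.exp(math.log(n) + lp))
print(n * math.exp(lp))
```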


I’m guessing you mean to define the posterior predictive distribution for y, which would be poisson_rng(y_hat[i]). As is, you’re missing the sampling uncertainty from the Poisson. There’s not much use for the posterior over just y_hat.
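To see why the sampling uncertainty matters, here is a small numeric illustration (all values hypothetical, not from this model): drawing Poisson counts on top of the rate draws adds the Poisson variance that the rate draws alone are missing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of one cell's Poisson rate
mu_draws = rng.lognormal(mean=np.log(50), sigma=0.1, size=4000)

# Full posterior predictive: add Poisson noise on top of each rate draw
y_rep = rng.poisson(mu_draws)

# The predictive draws are more dispersed than the rate draws alone
print(mu_draws.var(), y_rep.var())
```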

What I find easier for keeping track of non-centered parameterizations is to define

a = a_std * sigma_a;


It keeps the notation aligned: a_std is the standard normal version of a and sigma_a is the scale of a.

Those student-t priors on your standardized effects are very heavy-tailed, which makes the implied priors on a and b very wide. I’m guessing what I called a won’t get so large with a normal prior on a_std.
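For a sense of how heavy those tails are: with tau_a ≈ 0.17, a = -12 requires a standardized draw near -69, which is essentially impossible under a standard normal but merely rare under a t with 4 degrees of freedom (the upper 0.05% point of a t(4) is about 8.61 versus 3.29 for a normal). A quick simulation (illustration only, not output from this model):

```python
import numpy as np

rng = np.random.default_rng(3)
draws_t = rng.standard_t(df=4, size=200_000)
draws_n = rng.standard_normal(size=200_000)

# The t(4) draws reach far beyond anything a standard normal produces
print(np.abs(draws_t).max())   # typically in the tens
print(np.abs(draws_n).max())   # typically around 4.5
```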


First, you are amazing @Bob_Carpenter thanks for all the help! I’m struggling with how to implement:

Right now I declare vector[K] x[I*T]; // predictor matrix

How would I make it a vector?

Oops, meant matrix.

matrix[I * T, K] x;
...
vector[K] beta;
...
vector[I * T] x_beta = x * beta;
...


but you can also just use x * beta as a vector[I * T]—I only wrote it out here to indicate all the types clearly.
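The equivalence can be sketched numerically (NumPy standing in for Stan, with made-up sizes): the single matrix-vector product gives the same numbers as the row-by-row dot_product loop in the original transformed parameters block.

```python
import numpy as np

rng = np.random.default_rng(1)
IT, K = 6, 3                  # hypothetical sizes: I*T rows, K predictors
x = rng.normal(size=(IT, K))  # predictor matrix
beta = rng.normal(size=K)

# One matrix-vector product replaces the row-by-row dot_product loop
x_beta = x @ beta
loop = np.array([np.dot(x[i], beta) for i in range(IT)])
assert np.allclose(x_beta, loop)
```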

Thanks @Bob_Carpenter! I was able to get the model to run with the suggestions you made:

data {
int<lower=0> I;              // number of practices
int<lower=0> T;              // number of time periods
int<lower=1> y[I*T];         // Count outcome
vector[I * T] n;          // number of patients in period t
int<lower=0> K;              // number of predictors
matrix[I * T, K] x;             // predictor matrix
int<lower=1,upper=I> practice[I*T];          //
vector<lower=-1,upper=4>[I*T] time;          //
int<lower=1,upper=5> index_time[I*T];          //
}

parameters {
vector[I] a_std;              //
real<lower=0> sigma_a;            //
vector[I] b_std;              //
real<lower=0> sigma_b;            //
vector[K] beta;                 //
vector[T] gamma;                // year specific intercept
}

transformed parameters {
vector[I] a;
vector[I] b;
vector[I * T] x_beta = x * beta;
a = sigma_a * a_std ;            // Matt trick
b = sigma_b * b_std ;            // Matt trick
}

model {
sigma_a ~ normal(0, 1);
sigma_b ~ normal(0, 1);
a_std ~ normal(0, 1);
b_std ~ normal(0, 1);
gamma ~ normal(0, 1);
beta ~ normal(0,1);
y ~ poisson_log(log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time);   // likelihood
}

generated quantities {
vector<lower=0>[I*T] ytilda;
for (it in 1:(I*T)) {
ytilda[it] = poisson_rng(y[it]);
}
}


I have a final question. Did I implement the generated quantities block right? My intention is to use bayesplot to do some posterior predictive checks for different models. For example,

library(bayesplot)
ytilda <- as.data.frame(fit, pars='ytilda')
ppc_dens_overlay(y = bsdata$y, yrep = as.matrix(ytilda[sample(x = 1:nrow(ytilda), size = 50),]))

Thanks again!

Ignacio

I don’t think so, as you’re providing the data y as the parameter to a Poisson RNG. But you didn’t say what you want it to do; I don’t know bayesplot, but maybe the question will make sense to @jonah. If you just want to generate PPCs for y, then what you want to do is this:

generated quantities {
vector[I * T] y_hat = log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time;
int y_sim[I * T];
for (it in 1:(I * T)) {
y_sim[it] = poisson_log_rng(y_hat[it]);
}
}

to generate the y_sim (what you called ytilda).

I made the change you suggested but now I get the following error:

Exception: poisson_log_rng: Log rate parameter is 21.8036, but must be less than 20.7944 (in 'modeldf1c1eabc8_5e215340a9834a686c6d7e96112d9feb' at line 47)

:_( I’m still trying to figure out why I’m getting that error. I thought looking at max(y_hat) could be informative, and I think that value is reasonable:

max(y_hat)
7.511698

@jonah, @Bob_Carpenter, any suggestions? I have no clue what it is that I am doing wrong :_(

Since the Poisson RNGs return an int, I believe what is happening here is that Stan is trying to keep the rate parameter vaguely within range of 32-bit integers (2^30, or e^20.7944). That’s a very large rate. Does it make sense?

@syclik Was there a thing at one point to move to 64-bit ints? Did this include the RNGs at all?

Yes, and we’re going to be upgrading to 64-bit integers for this and other reasons. I never suspected anyone would be using RNGs to generate integers in the billions. I’m still not sure why this makes sense here: by the time your numbers are that big, normal approximations work very well.

If I understand what you are saying correctly, this means that when I draw from the posterior of my simple model I get some predictions that are larger than 2,147,483,647, which makes me think that my model is dead wrong. After all, in my data max(y) = 1,772.

On the other hand, max(y_hat) = 7.511698, which implies \lambda = e^7.511698 ≈ 1,829.317. This number makes me think that my simple model fits the data reasonably well, compared to a model that predicts y > 2,147,483,647. Is my model dead wrong, am I using poisson_log_rng wrong, or do I have some other problem? Thanks for all the help!

Ignacio

Is this happening during warmup or sampling? It can wind up happening out in the tails during warmup, and if you wind up saving the warmup, generated quantities gets executed. If it fails, it stops sampling altogether. I thought we’d fixed that so it went on with NaN values, but apparently not; if not, it’s on our ever-growing to-do list.

I tried save_warmup = FALSE:

fit <- sampling(model_poisson, data = practices_dat,
                chains = 4, show_messages = T,
                cores = cores,
                control = list(max_treedepth = 18),
                iter = iter,
                warmup = warmup,
                save_warmup = FALSE)

some chains had errors; consider specifying chains = 1 to debug
here are whatever error messages were returned
[] Stan model 'stan-2fc0657e77e1' does not contain samples.
[] Stan model 'stan-2fc0657e77e1' does not contain samples.
[] Stan model 'stan-2fc0657e77e1' does not contain samples.
[] Stan model 'stan-2fc0657e77e1' does not contain samples.

Not sure if this is helpful information or not.

I’d suggest following that advice and setting chains = 1. RStan loses debug info with multiple chains. Are you up to date with the latest RStan? A bunch of other messages were being suppressed, and I believe the latest RStan addresses all of those issues other than the parallelism.

This is what I get when I run it with chains = 1:

> fit <- sampling(model_poisson, data = practices_dat,
+                 chains = 1, show_messages = T,
+                 cores = 1,
+                 control = list(max_treedepth = 18),
+                 iter = iter,
+                 warmup = warmup,
+                 save_warmup = FALSE)

SAMPLING FOR MODEL 'stan-364053436475' NOW (CHAIN 1).

Gradient evaluation took 0.001 seconds
1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Adjust your expectations accordingly!

Iteration:    1 / 3000 [  0%]  (Warmup)
Iteration:  300 / 3000 [ 10%]  (Warmup)
Iteration:  600 / 3000 [ 20%]  (Warmup)
Iteration:  900 / 3000 [ 30%]  (Warmup)
Iteration: 1200 / 3000 [ 40%]  (Warmup)
Iteration: 1500 / 3000 [ 50%]  (Warmup)
Iteration: 1501 / 3000 [ 50%]  (Sampling)
Iteration: 1800 / 3000 [ 60%]  (Sampling)
Iteration: 2100 / 3000 [ 70%]  (Sampling)
Iteration: 2400 / 3000 [ 80%]  (Sampling)
Iteration: 2700 / 3000 [ 90%]  (Sampling)
Iteration: 3000 / 3000 [100%]  (Sampling)

 Elapsed Time: 477.549 seconds (Warm-up)
               266.095 seconds (Sampling)
               743.644 seconds (Total)

"Error in sampler$call_sampler(args_list[[i]]) : "
"  Exception: poisson_log_rng: Log rate parameter is 23.6146, but must be less than 20.7944  (in 'model36406d6a5f2c_stan_364053436475' at line 47)"
"error occurred during calling the sampler; sampling not done"


And this is my sessionInfo

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
 LC_COLLATE=English_United States.1252
 LC_CTYPE=English_United States.1252
 LC_MONETARY=English_United States.1252
 LC_NUMERIC=C
 LC_TIME=English_United States.1252

attached base packages:
 stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 forcats_0.2.0        stringr_1.2.0        dplyr_0.7.4
 purrr_0.2.4          tidyr_0.7.2          tibble_1.4.1
 RevoUtilsMath_10.0.1 RevoUtils_10.0.7     RevoMods_11.0.0
 MicrosoftML_1.0.0    mrsdeploy_1.0        RevoScaleR_9.0.1
 lattice_0.20-35      rpart_4.1-11

loaded via a namespace (and not attached):
 tidyselect_0.2.3       reshape2_1.4.3         haven_1.1.0
 colorspace_1.3-2       stats4_3.4.3           CompatibilityAPI_1.1.0
 rlang_0.1.6            pillar_1.0.1           foreign_0.8-69
 matrixStats_0.52.2     foreach_1.4.5          bindr_0.1
 plyr_1.8.4             munsell_0.4.3          gtable_0.2.0
 cellranger_1.1.0       rvest_0.3.2            codetools_0.2-15
 psych_1.7.8            inline_0.3.14          knitr_1.18
 mrupdate_1.0.0         parallel_3.4.3         curl_3.1
 broom_0.4.3            Rcpp_0.12.14           scales_0.5.0
 jsonlite_1.5           gridExtra_2.3          mnormt_1.5-5
 hms_0.4.0              stringi_1.1.6          grid_3.4.3
 cli_1.0.0              tools_3.4.3            magrittr_1.5
 lazyeval_0.2.1         crayon_1.3.4           pkgconfig_2.0.1
 xml2_1.1.1             lubridate_1.7.1        rstudioapi_0.7
 assertthat_0.2.0       httr_1.3.1             iterators_1.0.9
 R6_2.2.2               nlme_3.1-131           compiler_3.4.3


This is the relevant part of the error message. Stan can’t simulate from poisson_log_rng if the log rate parameter is greater than 20.7944 (that is, if the rate exceeds 2^30).
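The 20.7944 bound is just log(2^30): poisson_log_rng has to return a value that fits in a signed 32-bit integer, so it rejects log rates that would routinely overflow it. A sketch of the bound and of the normal approximation Bob mentions for huge rates (illustration only; poisson_log_rng itself is a Stan function):

```python
import math
import numpy as np

# The cap in the error message is log(2^30)
print(math.log(2 ** 30))               # 20.7944...

# For a rate this large the Poisson is approximately normal with
# mean mu and standard deviation sqrt(mu), so a draw can be sketched as:
rng = np.random.default_rng(4)
mu = math.exp(21.8)                    # roughly the offending rate from the error
draw = round(mu + math.sqrt(mu) * rng.standard_normal())
print(draw > 2 ** 31 - 1)              # bigger than a 32-bit int can hold
```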

Which brings me back to my previous question:

What is the range of n?

If you assign the linear predictor to y_hat as Bob suggested, you can extract y_hat and easily track down the location of the problem.

Can you check your parameter values?

dump_file_base <- "practices"
options(max.print=1000000, width = 999)
sink(paste0(dump_file_base, ".diagnosis"))
print(summary(fit))
sink()
• n goes from 250 to 4704
• y_hat from 3.615584 to 7.511698

If I keep y_sim in my code, the practices.diagnosis file contains this:

Stan model 'stan-364053436475' does not contain samples.
NULL


For diagnosis, please temporarily disable the poisson_log_rng part in the generated quantities block:

generated quantities {
vector[I * T] y_hat = log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time;
// for (it in 1:(I * T)) {
//   y_sim[it] = poisson_log_rng(y_hat[it]);
// }
}


Then run again and get the diagnosis file.
The rows are named by variable.
Look for high values in the mean or 97.5% columns of your y_hat.
Using the mean column, try to calculate y_hat by hand and figure out
which parameter (a_std, b_std, …) in your model causes the high values.

Some of these commands may help:

extr <- extract(fit)
boxplot(extr$beta)
boxplot(extr$x_beta)
apply(extr$x_beta, 2, mean)
apply(extr$x_beta, 2, sd)