In this model, y_it denotes the number of hospitalizations from practice i at time t, and n_it is the number of patients in that practice.
This is the Stan code that I wrote:
data {
int<lower=0> I; // number of practices
int<lower=0> T; // number of time periods
int<lower=1> y[I*T]; // Count outcome
int<lower=0> n[I*T]; // number of patients in period t
int<lower=0> K; // number of predictors
vector[K] x[I*T]; // predictor matrix
int<lower=1,upper=I> practice[I*T]; // practice ID for each observation
int<lower=-1,upper=4> time[I*T]; // time value for each observation
int<lower=1,upper=5> index_time[I*T]; // time-period index for each observation
}
parameters {
vector[I] eta_a; // standardized practice intercepts
real<lower=0> tau_a; // scale of practice intercepts
vector[I] eta_b; // standardized practice slopes
real<lower=0> tau_b; // scale of practice slopes
vector[K] beta; //
real<lower=0> gamma[T]; // year specific intercept
}
transformed parameters {
vector[I] a;
vector[I] b;
real lp[I*T];
real<lower=0> nmu[I*T];
a = tau_a * eta_a; // Matt trick
b = tau_b * eta_b; // Matt trick
for (it in 1:(I*T)) {
// Linear predictor
lp[it] = dot_product(x[it], beta) + gamma[index_time[it]] + a[practice[it]] + b[practice[it]]*time[it];
nmu[it] = n[it]*exp(lp[it]);
}
}
model {
eta_a ~ student_t(4, 0, 1);
eta_b ~ student_t(4, 0, 1);
tau_a ~ normal(0,1);
tau_b ~ normal(0,1);
gamma ~ normal(0, 1);
beta ~ normal(0,1);
y ~ poisson(nmu); // likelihood
}
generated quantities {
real<lower=0> yhat[I*T];
for (it in 1:(I*T)) {
yhat[it] = 1000*nmu[it]/n[it];
}
}
I’m able to fit the model without any errors or warnings. The effective sample size and the Rhat for the parameters all look good. That said, I see the following weird behavior:
tau_a = 0.1728, and this is what I get when I plot a and b: [plot of a and b omitted]
How can it be that I get a tau_a so small and an a can take the value -12? Any suggestions on how to fix this?
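For concreteness, here is the arithmetic behind the puzzle (a rough sketch; the -12 is read off the plot, and the t_4 kernel is written out by hand):

```python
import math

tau_a = 0.1728       # posterior estimate from the fit
a_extreme = -12.0    # roughly where the plotted a bottoms out

# Under the non-centered parameterization a = tau_a * eta_a,
# an a of -12 requires a very extreme standardized value:
eta_needed = a_extreme / tau_a
print(round(eta_needed, 1))  # -69.4

# eta_a ~ student_t(4, 0, 1) has heavy tails, but a value near -69 is still
# far out: the t_4 log density falls off like -2.5 * log(1 + eta^2 / 4).
log_kernel = -2.5 * math.log(1 + eta_needed ** 2 / 4)
print(round(log_kernel, 1))
```

So either the posterior really is pulling eta_a dozens of prior standard deviations from zero, or the summaries are being read off different draws.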
The way I’d code this is by making x a matrix and sticking to the log scale:
y ~ poisson_log(log(n) + x * beta + gamma[index_time] + a[practice] + b[practice] .* time);
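The log(n) term works as an exposure offset: n * exp(lp) and exp(log(n) + lp) are the same rate, so nothing changes except that everything stays on the log scale until the distribution call. A quick numeric check (n and lp are made up for illustration):

```python
import math

# Made-up values for a single observation.
n = 2500.0
lp = -3.2  # x*beta + gamma[...] + a[...] + b[...]*time for this row

rate_original = n * math.exp(lp)   # nmu[it] in the first version of the model
log_rate = math.log(n) + lp        # what poisson_log receives
assert math.isclose(rate_original, math.exp(log_rate))
```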
I’m guessing you mean to define the posterior predictive distribution for y, which would be poisson_rng(y_hat[i]). As is, you’re missing the sampling uncertainty from the Poisson. There’s not much use for the posterior over just y_hat.
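To sketch what "adding the Poisson sampling uncertainty" means, here is a small stand-in for Stan's poisson_rng (Knuth's algorithm; the rate draws are invented):

```python
import math
import random

def poisson_rng(lam, rng=random):
    # Knuth's multiplicative algorithm; fine for modest rates like these.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

random.seed(1)
rate_draws = [9.5, 10.2, 11.1, 9.8]   # invented posterior draws of the rate
y_rep = [poisson_rng(lam) for lam in rate_draws]
# y_rep scatters around the rates; summarizing only rate_draws would
# understate the spread of the predictive distribution.
```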
What I find easier for keeping track of non-centered parameterizations is to define
a = a_std * sigma_a;
It keeps the notation aligned: a_std is the standard-normal version of a, and sigma_a is the scale of a.
Those student-t priors on your scales are very wide. I’m guessing what I called sigma_a won’t get so large with a normal prior.
Thanks @Bob_Carpenter! I was able to get the model to run with the suggestions you made.
data {
int<lower=0> I; // number of practices
int<lower=0> T; // number of time periods
int<lower=1> y[I*T]; // Count outcome
vector[I * T] n; // number of patients in period t
int<lower=0> K; // number of predictors
matrix[I * T, K] x; // predictor matrix
int<lower=1,upper=I> practice[I*T]; // practice ID for each observation
vector<lower=-1,upper=4>[I*T] time; // time value for each observation
int<lower=1,upper=5> index_time[I*T]; // time-period index for each observation
}
parameters {
vector[I] a_std; // standardized practice intercepts
real<lower=0> sigma_a; // scale of practice intercepts
vector[I] b_std; // standardized practice slopes
real<lower=0> sigma_b; // scale of practice slopes
vector[K] beta; //
vector[T] gamma; // year specific intercept
}
transformed parameters {
vector[I] a;
vector[I] b;
vector[I * T] x_beta = x * beta;
a = sigma_a * a_std; // Matt trick
b = sigma_b * b_std; // Matt trick
}
model {
sigma_a ~ normal(0, 1);
sigma_b ~ normal(0, 1);
a_std ~ normal(0,1);
b_std ~ normal(0,1);
gamma ~ normal(0, 1);
beta ~ normal(0,1);
y ~ poisson_log(log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time); // likelihood
}
generated quantities {
vector<lower=0>[I*T] ytilda;
for (it in 1:(I*T)) {
ytilda[it] = poisson_rng(y[it]);
}
}
I have a final question. Did I implement the generated quantities block right? My intention is to use bayesplot to do some posterior predictive checks for different models.
I made the change you suggested but now I get the following error:
Exception: poisson_log_rng: Log rate parameter is 21.8036, but must be less than 20.7944 (in 'modeldf1c1eabc8_5e215340a9834a686c6d7e96112d9feb' at line 47)
Since poisson_rng returns an int, I believe what is happening here is that Stan is trying to keep the rate parameter vaguely within range of 32-bit integers (2^30, or e^20.7944). That’s a very large rate. Does it make sense?
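The numbers in the exception line up with exactly that cap:

```python
import math

print(math.log(2 ** 30))   # 20.794..., the cap quoted in the error message
log_rate = 21.8036         # the offending log rate from the exception
rate = math.exp(log_rate)
print(rate)                # ~2.9e9, past the 32-bit int max of 2^31 - 1
assert rate > 2 ** 31 - 1
```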
@syclik Was there a thing at one point to move to 64bit ints? Did this include the rngs at all?
Yes, and we’re going to be upgrading to 64-bit integers for this and other reasons. I never suspected anyone would be using RNGs generating integers in the billions. I’m still not sure why this makes sense. By the time your numbers are that big, normal approximations work very well.
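The normal-approximation point can be sketched like this (a stand-in, not Stan's actual implementation):

```python
import math
import random

def poisson_big_rng(lam, rng=random):
    # Normal approximation Poisson(lam) ~ N(lam, sqrt(lam)); the relative
    # error shrinks like 1/sqrt(lam), so it is excellent for lam in the billions.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))

random.seed(0)
lam = 1.0e9
draws = [poisson_big_rng(lam) for _ in range(5)]
# each draw lands within a few multiples of sqrt(lam) ~ 31,623 of lam
```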
If I understand what you are saying correctly this means when I draw from the posterior of my simple model I get some predictions that are larger than 2,147,483,647. Which makes me think that my model is dead wrong. After all, in my data max(y)= 1,772.
On the other hand, max(yhat) = 7.511698, which implies \lambda = exp(7.511698) = 1,829.317. This number makes me think that my simple model fits the data reasonably well, compared to a model that predicts y > 2,147,483,647.
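Making the arithmetic explicit (reading yhat on the log scale, as above) shows just how far the failing draw is from anything the data support:

```python
import math

print(math.exp(7.511698))   # ~1829.3: the implied lambda, near max(y) = 1772
print(math.log(1772))       # ~7.48: a log rate consistent with the data
print(math.exp(21.8036))    # ~2.9e9: the failing draw's rate, absurd here
```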
Is my model dead wrong, am I using poisson_log_rng wrong, or do I have some other problem?
Is this happening during warmup or sampling? It can wind up happening out in the tails during warmup, and if you wind up saving the warmup, generated quantities gets executed. If it fails, it stops sampling altogether.
I thought we’d fixed that so it went on with NaN values, but apparently not. If not, it’s on our ever-growing to-do list.
fit <- sampling(model_poisson, data = practices_dat,
                chains = 4, show_messages = TRUE,
                cores = cores,
                control = list(max_treedepth = 18),
                iter = iter,
                warmup = warmup,
                save_warmup = FALSE)
some chains had errors; consider specifying chains = 1 to debug
here are whatever error messages were returned
[[1]]
Stan model 'stan-2fc0657e77e1' does not contain samples.
[[2]]
Stan model 'stan-2fc0657e77e1' does not contain samples.
[[3]]
Stan model 'stan-2fc0657e77e1' does not contain samples.
[[4]]
Stan model 'stan-2fc0657e77e1' does not contain samples.
I’d suggest following that advice and setting chains = 1; RStan loses debug info with multiple chains. Are you up to date with the latest RStan? A bunch of other messages were being suppressed, and I believe the latest RStan addresses all of those issues other than the parallelism.
Then run again and get the diagnosis file. The first row names the variables. Look for high values in the mean or 97.5% column for your yhat. Then use the mean column to calculate y_hat by hand and figure out which parameter (a_std, b_std, …) in your model causes the high values.
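That scan can be automated. A sketch, assuming a plain draws CSV (header row of parameter names, one row per draw); the file name, contents, and threshold here are made up:

```python
import csv
import statistics

def flag_extreme(path, threshold):
    # Flag parameters whose posterior mean is suspiciously large in magnitude.
    with open(path) as f:
        rows = list(csv.DictReader(f))
    means = {name: statistics.mean(float(r[name]) for r in rows)
             for name in rows[0]}
    return {name: m for name, m in means.items() if abs(m) > threshold}

# Tiny made-up draws file to demonstrate the scan.
with open("draws.csv", "w") as f:
    f.write("a_std,yhat\n0.1,25.0\n-0.2,30.0\n")

flagged = flag_extreme("draws.csv", threshold=20.0)
print(flagged)  # only yhat exceeds the threshold
```

On a real output file, any yhat column whose mean (or upper quantile) approaches the 20.79 log-rate cap would be the place to start tracing back through a_std, b_std, and beta.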