I’m just getting started with Stan (thanks to a class from Jonah, Rob, and Scott) and I had a question about how much prior information I should try to put into priors vs. eliciting that information from the data.

Suppose I’m trying to predict integer counts y with a simple Poisson model that depends purely on a single categorical variable (in this case sex, male or female). I set up the following simulated data.

N <- 1000
sex <- sample(1:2, N, replace = TRUE)
# male = 1, female = 2
lambda <- 1 + ifelse(sex == 1, 0.3, -0.2)
y_true <- sapply(lambda, rpois, n = 1)
poi_data <- list(N = N, J = length(unique(sex)), sex_idx = sex)

Even though I hypothetically don’t know the true data generating process, suppose that I know that (1) counts vary by sex and (2) counts are higher for males than females. I want to do some prior predictive checking to see if I chose a good prior for my model. I have the following Stan program:

data {
int<lower = 0> N;
int<lower=1> J; // number of levels
array[N] int<lower=1, upper=J> sex_idx;
}
generated quantities {
vector[N] y_sim;
vector[J] alpha;
for(j in 1:J){
alpha[j] = lognormal_rng(1, 1);
}
for(n in 1:N){
y_sim[n] = poisson_rng(alpha[sex_idx[n]]);
}
}

Because the prior on the sex coefficient has \sigma_\alpha = 1> 0, I’m including that I think the coefficient varies between males and females. However, I’m not saying anything about which coefficient I think is the larger one. Is this something I should be including in my priors, or is it better to allow the data to provide this information through the likelihood?

data {
int<lower = 0> N;
int<lower=1> J; // number of levels
array[N] int<lower=1, upper=J> sex_idx;
array[N] int y;
}
parameters{
real alpha_mean ;
real<lower=0> alpha_sd ;
vector[J] alpha;
}
model {
alpha_mean ~ std_normal();
alpha_sd ~ std_normal() ;
alpha ~ lognormal_rng(alpha_mean, alpha_sd);
y ~ poisson(alpha[sex_idx]);
}

This would reflect what some would term “random effects of sex”, where you are conveying that the set of levels for the variable sex that are observed in the data are a subset of a larger population of levels. Now, sometimes folks will use this “subset sample from a larger population” framework to help “regularize”/“mutually-inform” models for the individual levels when they actually do reflect the full set of categories (ex. counties in a state), but with only two levels it is much more common to model things with a “fixed effect of sex”, as in:

...
parameters{
vector[J] alpha;
}
model {
alpha ~ lognormal_rng(1,1);
y ~ poisson(alpha[sex_idx]);
}

If your expectation that counts are higher than males derives from your dataset that you are analyzing, then you should not encode this information in your prior, as you would in effect be using these data twice, first to inform the prior and second in the likelihood.

If on the other hand, you have a prior expectation based on previous information that is independent of your dataset it would be useful to include this information in the prior.

CAVEATS:

Some models can benefit from informative priors for computational reasons. In these cases, it might be appropriate with great caution to use a prior informed by the data.

Even if you have a genuine prior expectation (independent of your data) for the direction of an effect, including that information in your prior can sometimes muddle the presentation of your results, because it becomes less straightforward to understand whether the modeled effect is due primarily to the prior versus evidence contained in your data. Thus, it’s sometimes useful to intentionally omit prior information about effect directionality, particularly if the likelihood does a good job of constraining the effect anyway, if only to convince less-knowledgable readers of the appropriateness of the analysis.