Prior information in varying intercepts

Matt_Kosko · July 28, 2021, 4:18pm

I’m just getting started with Stan (thanks to a class from Jonah, Rob, and Scott) and I had a question about how much prior information I should try to put into priors vs. eliciting that information from the data.

Suppose I’m trying to predict integer counts y with a simple Poisson model that depends purely on a single categorical variable (in this case sex, male or female). I set up the following simulated data.

N <- 1000
sex <- sample(1:2, N, replace = TRUE)
# male = 1, female = 2
lambda <- 1 + ifelse(sex == 1, 0.3, -0.2)
y_true <- sapply(lambda, rpois, n = 1)
poi_data <- list(N = N, J = length(unique(sex)), sex_idx = sex)

Even though I hypothetically don’t know the true data generating process, suppose that I know that (1) counts vary by sex and (2) counts are higher for males than females. I want to do some prior predictive checking to see if I chose a good prior for my model. I have the following Stan program:

data {
  int<lower = 0> N;
  int<lower=1> J; // number of levels
  array[N] int<lower=1, upper=J> sex_idx;
}

generated quantities {
  vector[N] y_sim;
  vector[J] alpha;
  for(j in 1:J){
    alpha[j] = lognormal_rng(1, 1);
  }
  for(n in 1:N){
    y_sim[n] = poisson_rng(alpha[sex_idx[n]]);
  }
}

Because the prior on the sex coefficient has \sigma_\alpha = 1> 0, I’m including that I think the coefficient varies between males and females. However, I’m not saying anything about which coefficient I think is the larger one. Is this something I should be including in my priors, or is it better to allow the data to provide this information through the likelihood?

mike-lawrence · July 28, 2021, 5:02pm

If you code your inference-focused model as:

data {
  int<lower = 0> N;
  int<lower=1> J; // number of levels
  array[N] int<lower=1, upper=J> sex_idx;
  array[N] int y; 
}
parameters{
  real alpha_mean ;
  real<lower=0> alpha_sd ;
  vector[J] alpha;
}
model {
  alpha_mean ~ std_normal();
  alpha_sd ~ std_normal() ;
  alpha ~ lognormal_rng(alpha_mean, alpha_sd);
  y ~ poisson(alpha[sex_idx]);
}

This would reflect what some would term “random effects of sex”, where you are conveying that the set of levels for the variable sex that are observed in the data are a subset of a larger population of levels. Now, sometimes folks will use this “subset sample from a larger population” framework to help “regularize”/“mutually-inform” models for the individual levels when they actually do reflect the full set of categories (ex. counties in a state), but with only two levels it is much more common to model things with a “fixed effect of sex”, as in:

...
parameters{
  vector[J] alpha;
}
model {
  alpha ~ lognormal_rng(1,1);
  y ~ poisson(alpha[sex_idx]);
}

jsocolar · July 28, 2021, 6:02pm

If your expectation that counts are higher than males derives from your dataset that you are analyzing, then you should not encode this information in your prior, as you would in effect be using these data twice, first to inform the prior and second in the likelihood.

If on the other hand, you have a prior expectation based on previous information that is independent of your dataset it would be useful to include this information in the prior.

CAVEATS:

Some models can benefit from informative priors for computational reasons. In these cases, it might be appropriate with great caution to use a prior informed by the data.
Even if you have a genuine prior expectation (independent of your data) for the direction of an effect, including that information in your prior can sometimes muddle the presentation of your results, because it becomes less straightforward to understand whether the modeled effect is due primarily to the prior versus evidence contained in your data. Thus, it’s sometimes useful to intentionally omit prior information about effect directionality, particularly if the likelihood does a good job of constraining the effect anyway, if only to convince less-knowledgable readers of the appropriateness of the analysis.

jackyjoy123 · August 4, 2021, 1:03pm

jsocolar:

If your expectation that counts are higher than males derives from your dataset that you are analyzing , then you should not encode this information in your prior, as you would in effect be using these data twice, first to inform the prior and second in the likelihood.

If on the other hand, you have a prior expectation based on previous information that is independent of your dataset it would be useful to include this information in the prior.

CAVEATS:

Some models can benefit from informative priors for computational reasons. In these cases, it might be appropriate with great caution to use a prior informed by the data.

Even if you have a genuine prior expectation (independent of your data) for the direction of an effect, including that information in your prior can sometimes muddle the presentation of your results, because it becomes less straightforward to understand whether the modeled effect is due primarily to the prior versus evidence contained in your data. Thus, it’s sometimes useful to intentionally omit prior information about effect directionality, particularly if the likelihood does a good job of constraining the effect anyway, if only to convince less-knowledgable readers of the appropriateness of the analysis.

thanks for the awesome information.

Topic		Replies	Views
Why not put a prior on the entire model? Modeling rstan , specification	25	1005	June 27, 2023
Set an informative prior brms	5	2562	February 19, 2023
Setting priors on specific values of random intercept Modeling	8	513	October 15, 2019
Priors: imposed form outside calculation or calculated from data within the model Modeling bioinformatics	9	1205	June 21, 2017
Although priors are set exactly as posterior, sampling from priors gives invalid predictions Modeling rstan , brms	6	649	June 17, 2021

Prior information in varying intercepts

Related topics