Prior for a second-degree term?

I have ~30 binary covariates with SDs around 0.5, and one quantitative covariate with an SD of 3.85. Along the lines of BDA3 (pp. 412-420), I use N(0, 2.5) priors on the binary covariates, i.e. a prior SD approximately 5x the SD of the covariates. Modeled on this, I decided to use a prior of N(0, 3.85/5) on the quantitative covariate (correct me if that’s bad!). I’m not a big fan of scaling my covariates because of the interpretability issues.

The problem is that I now need to fit a second-degree polynomial to the quantitative covariate, i.e. a \beta_1 x_1 + \beta_2 x_1^2 type of thing. The squared covariate has an SD of 91. Is there a rule of thumb for how to define a prior for the 2nd-degree term? If not, I’ll probably end up using something like N(0, 0.15), but it would be better to base my choice on an authoritative source.

It’s pretty straightforward to unscale again in the GQ block:

data{
	int n;
	vector[n] y ;
	vector[n] x ;
}
transformed data{
	real y_m = mean(y) ;
	real y_s = sd(y) ;
	vector[n] y_ = (y-y_m)/y_s ;

	real x_m = mean(x) ;
	real x_s = sd(x) ;
	vector[n] x_ = (x-x_m)/x_s ;
}
parameters{
	real<lower=0> noise_ ;
	real intercept_ ;
	real beta_x_ ;
}
model{
	noise_ ~ weibull(2,1) ; // peaked at .8, zero at zero, ~2% mass above 2
	intercept_ ~ std_normal() ;
	beta_x_ ~ std_normal() ;
	y_ ~ normal( intercept_ + (beta_x_ * x_) , noise_ ) ;
}
generated quantities{
	// back-transform the parameters to the original scales of y and x
	real noise = noise_ * y_s ;
	real beta_x = beta_x_ * y_s / x_s ;
	real intercept = intercept_ * y_s + y_m - beta_x * x_m ;
}

I’ve never read BDA3, but it strikes me that any recommendations like that presumably rely on y being standardized too? (as I’ve done in the code above)


Unfortunately all of that code is Greek to me. I’m strictly a brms user.

You are right that the BDA folks do recommend scaling the covariates. However, I figure it’s not so dangerous to omit that step when most of the covariates are binary. There will be slight prior bias, but not much. In fact, I probably would scale those variables if they were strictly binary, but many are qualitative factors which are only converted into sets of binary dummies during model fitting. Within the R data frame they are single vectors taking >2 nominal values. I’ve never heard of anyone conveniently scaling such things.

So I’m presently still looking for a sensible prior for that quadratic term with an SD of 91.

Scaling the covariates is different from scaling the outcome. Now that I think of it, prior recommendations for any kind of covariate only make sense in the context of a scaled outcome.

(Deleted brainfart)

For ordinal outcomes with >2 levels, you want ordinal regression. See here and here.

Ahhh, that makes more sense. Make sure you do prior predictive checks. I think that even N(0, 2.5) for a binary covariate is going to put a surprising amount of mass at the extremes of the probability space, especially when the effects of many covariates combine additively on the logit scale (i.e. multiplicatively on the odds scale).


This is categorical regression with the logit link. The outcome has 4 unordered categories. The BDA folks scale their covariates with a binary outcome, but AFAIK categorical (multinomial) regression is just a generalization of binary logistic regression, so I think the same principles should apply.

EDIT: I’m unlikely to budge from the N(0, 2.5) thing at this point because it has many appealing properties. It shrinks just the right amount, from what I’ve experienced so far. What I’m confused about is what to do with the quadratic term.
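
For reference, the usual form of that generalization (standard notation, not anything specific to this model): with K unordered outcome categories and category 1 as the reference, the categorical-logit model is

P(y = k \mid x) = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)}, \quad \eta_1 = 0, \quad \eta_k = \alpha_k + x^\top \beta_k \ (k = 2, \dots, K),

so each non-reference category gets its own intercept and coefficient vector, and K = 2 reduces to ordinary binary logistic regression with the logit link.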

Unless you have a very specific mechanism by which you strongly expect a polynomial effect of that covariate, I strongly recommend against using a polynomial model. Polynomials are typically used to permit a degree of non-linearity in effects, but there are much more powerful tools for that. If the covariate has only a hundred or fewer unique values, use a Gaussian process. If it has more unique values, use a GP-approximating generalized additive model. Both are available in brms.
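
For concreteness, a minimal brms sketch of those two options; the data frame dat, the covariate names, and the simulated values are placeholders rather than anything from the actual data set:

library(brms)

# Simulated stand-in data: a 4-category outcome, the quantitative covariate,
# and one binary covariate as a placeholder for the other predictors
set.seed(1)
dat <- data.frame(
  x = round(rnorm(200, 0, 3.85), 1),
  z = factor(sample(c("a", "b"), 200, replace = TRUE)),
  y = factor(sample(paste0("cat", 1:4), 200, replace = TRUE))
)

# Gaussian process over x (practical when x has up to roughly a hundred unique values)
fit_gp <- brm(y ~ gp(x) + z, data = dat, family = categorical())

# Spline smooth of x (the GAM-style approximation, for many unique values)
fit_gam <- brm(y ~ s(x) + z, data = dat, family = categorical())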


Oops, this should read “ordinal outcomes”. Edited above so others aren’t confused later.

I start with a large model and do backward-elimination of unnecessary terms. The 2nd-degree term is there at the beginning to check for a non-monotone effect. It will likely get dropped sooner or later, and then I’ll try log transformations vs the linear term. But I know nothing about Gaussian processes or any of that additive business, and neither do most of my readers. The 2nd-degree term is a simple way to check for nonmonotonicity that even nonspecialists can (probably) understand, hence I’m asking about a guideline for its prior.
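
For what it’s worth, a sketch of how a prior could be attached to that quadratic term in brms, reusing the simulated stand-in data frame dat from the sketch a couple of posts up. The N(0, 0.77) and N(0, 0.15) scales are simply the values floated in this thread, not a recommendation, and the exact class/coef/dpar names should always be read off get_prior() first:

library(brms)

dat$x2 <- dat$x^2   # an explicit column avoids having to guess brms's internal name for I(x^2)

# Inspect the parameter names brms expects priors for (one set per non-reference category)
get_prior(y ~ x + x2 + z, data = dat, family = categorical())

# Example priors for one of the category-specific linear predictors
# ("mucat2" is the dpar name implied by the simulated factor levels; check get_prior());
# in practice you would repeat these for the remaining dpars (mucat3, mucat4)
pr <- c(
  set_prior("normal(0, 0.77)", class = "b", coef = "x",  dpar = "mucat2"),
  set_prior("normal(0, 0.15)", class = "b", coef = "x2", dpar = "mucat2")
)

fit <- brm(y ~ x + x2 + z, data = dat, family = categorical(), prior = pr)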

Fair enough, but be aware of the consequences of diffuse priors on logit-scale parameters, particularly when binary covariates are 0/1 coded:

probs1 <- numeric(1000)
for(i in 1:1000){
  # First term: an observation with a 1 for every one of the 30 binary covariates,
  # each coefficient drawn from its N(0, 2.5) prior.
  # Second term: an intercept drawn from a standard normal prior.
  probs1[i] <- boot::inv.logit(sum(rnorm(30, 0, 2.5)) + rnorm(1))
}

# Prior predictive distribution for an observation with zeros for every covariate
probs0 <- boot::inv.logit(rnorm(1000))

hist(probs0, main = "prior pushforward distribution for reference category")
hist(probs1, main = "prior pushforward distribution for non-reference category")

If this worries you, then you might consider effects coding of the binary predictors (so that the pushforward distributions don’t depend on the arbitrary choice of reference category) and/or narrower priors.


This is the first time I’ve heard the terms ‘pushforward distribution’ or ‘effects coding of binary predictors’. The binary predictors aren’t really the problem here, though; they could conceivably be scaled with little difficulty. The problem is the nominal-scale, multi-df covariates, which are hard to scale because of the way they are internally represented in R.

And the biggest problem is the second-degree term with an SD of 91, which I’d love to hear someone’s opinion on how to set a prior for.


“Pushforward distribution” can be more-or-less understood to mean the (marginal) prior predictive distribution for some quantity of interest.
“Effects coding” means coding binary predictors as -1/1 rather than 0/1.
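
A minimal base-R illustration of the two codings, using a hypothetical two-level factor f:

f <- factor(c("a", "b", "a", "b"))

# Default treatment (dummy) coding: 0/1 relative to the reference level "a"
model.matrix(~ f)[, -1]    # 0 1 0 1

# Sum-to-zero ("effects") coding: -1/1, so no level is singled out as the reference
contrasts(f) <- contr.sum(2)
model.matrix(~ f)[, -1]    # 1 -1 1 -1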

My suggestion for the prior on the second-degree term is to examine the prior predictive distribution for whatever quantities domain knowledge can be brought to bear on, and choose a prior that is as diffuse as possible without putting substantial prior predictive mass on outlandish outcomes/conclusions. Doing this well is HARD! One of the challenges that you might encounter is that even fairly standard priors on the other coefficients might already be yielding weird prior predictive distributions. In general, you probably won’t be able to separate your choice of prior on the second-degree polynomial term from the rest of your priors, because you need to worry about whether your joint prior is placing an unreasonable concentration of prior mass on probabilities near zero and one, which is a ubiquitous bugbear for prior specification on the joint distribution of logit-scale coefficients. I wish this were easier.

EDIT:

The problem is that the way to a good prior, especially on the logit scale, is to write down a good joint prior over the parameters, so that you are getting plausible pushforward distributions for the probabilities. Thus, your choice about a good prior for one term isn’t really separable from your choice of priors on the other parameters. If you really don’t care about the pushforward distributions on the probability scale (but you do!) then you could just use an arbitrarily diffuse prior on the second-degree term (but don’t do that!).
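
To make that concrete, here is one possible shape for such a check, collapsed to a single logit for simplicity (the real 4-category model would use a softmax over category-specific linear predictors). The prior scales are just the ones that have come up in this thread, and the grid assumes x is roughly centered:

set.seed(1)
n_draws <- 1000
x_grid  <- seq(-8, 8, length.out = 50)   # roughly +/- 2 SD for a covariate with SD 3.85

# Draw all coefficients jointly from the priors under discussion
intercept <- rnorm(n_draws, 0, 5)
b_bin     <- matrix(rnorm(n_draws * 30, 0, 2.5), nrow = n_draws)   # 30 binary covariates
b_x       <- rnorm(n_draws, 0, 0.77)                               # linear term
b_x2      <- rnorm(n_draws, 0, 0.15)                               # quadratic term

# Pushforward for an observation with all binaries = 1, across the x grid
eta <- intercept + rowSums(b_bin) + outer(b_x, x_grid) + outer(b_x2, x_grid^2)
p   <- plogis(eta)   # n_draws x 50 matrix of implied probabilities

hist(p, main = "prior pushforward of the probability, all binaries = 1")
matplot(x_grid, t(p[1:20, ]), type = "l", lty = 1,
        xlab = "x", ylab = "prior draws of P(outcome)")

The “all binaries = 1” scenario and the plotted curves are of course just one slice; in practice you would look at whichever pushforward quantities domain knowledge actually speaks to.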


Many thanks for your informative posts, @jsocolar. After a night’s sleep, I was able to comprehend your histogram example and what it illustrates, i.e. that the 0/1 coding causes the reference level of a binary covariate to be treated differently from the non-reference level, prior-wise. However, in my case the Intercept has a prior of N(0, 5), which is even more diffuse than the one on the betas. This prior is also based on the section of BDA3 mentioned earlier. Thus, when conducting your histogram experiment, I find that the prior probability distribution is highly U-shaped for both the reference category and the non-reference category. This doesn’t worry me particularly, given that at least one authoritative source thinks it is not a problem:

This is good enough for me. In a perfect world, I would indeed scale at least the binary covariates; they can take only two values, so I don’t see how interpretation could be badly compromised. The problem is that I’d then also have to figure out a way to subject the multicategory covariates to the same scaling, without the major complication of having to manually create dummies for every non-reference category of every multicategory covariate.
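
For what it’s worth, a small illustration of the dummy expansion in question: model.matrix() shows the 0/1 columns that a multicategory factor turns into at fitting time (the factor g and its levels are hypothetical), and those generated columns are what any scaling would have to act on:

g <- factor(c("red", "green", "blue", "green", "red"))   # a hypothetical 3-level covariate
x <- c(1.2, 3.4, 0.7, 2.2, 5.0)

mm <- model.matrix(~ g + x)[, -1]   # drop the intercept; columns are ggreen, gred, x
mm

# Each non-reference level becomes its own 0/1 column, so "scaling the covariate"
# would mean scaling these generated columns rather than the original factor:
scale(mm)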

References:

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin (2014). Bayesian Data Analysis. 3rd ed. CRC Press.

Agresti, Alan (2013). Categorical Data Analysis. 3rd ed. Hoboken, New Jersey: John Wiley & Sons.