Prior for a second-degree term?

I have ~30 binary covariates with SDs around 0.5, and one quantitative covariate with an SD of 3.85. Along the lines of BDA3 (pp. 412-420), I use N(0, 2.5) priors on the binary covariates, i.e. a prior SD approximately 5x the SD of the covariates. Modeled on this, I decided to use a prior of N(0, 3.85/5) on the quantitative covariate (correct me if that’s bad!). I’m not a big fan of scaling my covariates because of the interpretability issues.

The problem is that I now need to fit a second-degree polynomial to the quantitative covariate, i.e. a \beta_1 x_1 + \beta_2 x_1^2 type of thing. The squared covariate has an SD of 91. Is there a rule of thumb for how to define a prior for the 2nd-degree term? If not, I’ll probably end up using something like N(0, 0.15), but it would be better to base my choice on an authoritative source.

It’s pretty straightforward to unscale again in the GQ block:

data{
	int n;
	vector[n] y ;
	vector[n] x ;
}
transformed data{
	real y_m = mean(y) ;
	real y_s = sd(y) ;
	vector[n] y_ = (y-y_m)/y_s ;

	real x_m = mean(x) ;
	real x_s = sd(x) ;
	vector[n] x_ = (x-x_m)/x_s ;
}
parameters{
	real<lower=0> noise_ ;
	real intercept_ ;
	real beta_x_ ;
}
model{
	noise_ ~ weibull(2,1) ; // peaked at .8, zero at zero, ~2% mass above 2
	intercept_ ~ std_normal() ;
	beta_x_ ~ std_normal() ;
	y_ ~ normal( intercept_ + (beta_x_ * x_) , noise_ ) ;
}
generated quantities{
	// back-transform the parameters to the original scales of y and x
	real noise = noise_ * y_s ;
	real beta_x = beta_x_ * y_s / x_s ;
	real intercept = intercept_ * y_s + y_m - beta_x * x_m ;
}

I’ve never read BDA3, but it strikes me that any recommendations like that presumably rely on y being standardized too? (as I’ve done in the code above)


Unfortunately all of that code is Greek to me. I’m strictly a brms user.

You are right that the BDA folks do recommend scaling the covariates. However, I figure it’s not so dangerous to omit that step when most of the covariates are binary. There will be slight prior bias, but not much. In fact, I probably would scale those variables if they were strictly binary, but many are qualitative factors which are only converted into sets of binary dummies during model fitting. Within the R data frame they are single vectors taking >2 nominal values. I’ve never heard of anyone conveniently scaling such things.

So I’m presently still looking for a sensible prior for that quadratic term with an SD of 91.

Scaling the covariates is different from scaling the outcome. Now that I think of it, prior recommendations for any kind of covariate only make sense in the context of a scaled outcome.

(Deleted brainfart)

For ordinal outcomes with >2 levels, you want ordinal regression. See here and here.

Ahhh, that makes more sense. Make sure you do prior predictive checks. I think that even N(0, 2.5) for a binary covariate is going to put a surprising amount of mass at the extremes of the probability space, especially when the effects of many covariates combine additively on the logit scale (i.e. multiplicatively on the odds scale).


This is categorical regression with the logit link. The outcome has 4 unordered categories. The BDA folks scale their covariates with a binary outcome, but AFAIK categorical (multinomial) regression is just a generalization of binary logistic regression, so I think the same principles should apply.

EDIT: I’m unlikely to budge from the N(0, 2.5) thing at this point because it has many appealing properties. It shrinks just the right amount, from what I’ve experienced so far. What I’m confused about is what to do with the quadratic term.
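
For reference, the usual form of that generalization (standard notation, not anything specific to this model): with K unordered outcome categories and category 1 as the reference, the categorical-logit model is

P(y = k \mid x) = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)}, \quad \eta_1 = 0, \quad \eta_k = \alpha_k + x^\top \beta_k \ (k = 2, \dots, K),

so each non-reference category gets its own intercept and coefficient vector, and K = 2 reduces to ordinary binary logistic regression with the logit link.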

Unless you have a very specific mechanism by which you strongly expect a polynomial effect of that covariate, I strongly recommend against using a polynomial model. Polynomials are typically used to permit a degree of non-linearity in effects, but there are much more powerful tools for that. If the covariate has only a hundred or fewer unique values, use a Gaussian process. If it has more unique values, use a GP-approximating generalized additive model. Both are available in brms.
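
For concreteness, a minimal brms sketch of those two options; the data frame dat, the covariate names, and the simulated values are placeholders rather than anything from the actual data set:

library(brms)

# Simulated stand-in data: a 4-category outcome, the quantitative covariate,
# and one binary covariate as a placeholder for the other predictors
set.seed(1)
dat <- data.frame(
  x = round(rnorm(200, 0, 3.85), 1),
  z = factor(sample(c("a", "b"), 200, replace = TRUE)),
  y = factor(sample(paste0("cat", 1:4), 200, replace = TRUE))
)

# Gaussian process over x (practical when x has up to roughly a hundred unique values)
fit_gp <- brm(y ~ gp(x) + z, data = dat, family = categorical())

# Spline smooth of x (the GAM-style approximation, for many unique values)
fit_gam <- brm(y ~ s(x) + z, data = dat, family = categorical())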


Oops, this should read “ordinal outcomes”. Edited above so others aren’t confused later.

I start with a large model and do backward-elimination of unnecessary terms. The 2nd-degree term is there at the beginning to check for a non-monotone effect. It will likely get dropped sooner or later, and then I’ll try log transformations vs the linear term. But I know nothing about Gaussian processes or any of that additive business, and neither do most of my readers. The 2nd-degree term is a simple way to check for nonmonotonicity that even nonspecialists can (probably) understand, hence I’m asking about a guideline for its prior.
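
For what it’s worth, a sketch of how a prior could be attached to that quadratic term in brms, reusing the simulated stand-in data frame dat from the sketch a couple of posts up. The N(0, 0.77) and N(0, 0.15) scales are simply the values floated in this thread, not a recommendation, and the exact class/coef/dpar names should always be read off get_prior() first:

library(brms)

dat$x2 <- dat$x^2   # an explicit column avoids having to guess brms's internal name for I(x^2)

# Inspect the parameter names brms expects priors for (one set per non-reference category)
get_prior(y ~ x + x2 + z, data = dat, family = categorical())

# Example priors for one of the category-specific linear predictors
# ("mucat2" is the dpar name implied by the simulated factor levels; check get_prior());
# in practice you would repeat these for the remaining dpars (mucat3, mucat4)
pr <- c(
  set_prior("normal(0, 0.77)", class = "b", coef = "x",  dpar = "mucat2"),
  set_prior("normal(0, 0.15)", class = "b", coef = "x2", dpar = "mucat2")
)

fit <- brm(y ~ x + x2 + z, data = dat, family = categorical(), prior = pr)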

Fair enough, but be aware of the consequences of diffuse priors on logit-scale parameters, particularly when binary covariates are 0/1 coded:

probs1 <- numeric(1000)
for(i in 1:1000){
  # First term: an observation with a 1 for every one of the 30 binary covariates,
  # each coefficient drawn from its N(0, 2.5) prior.
  # Second term: an intercept drawn from a standard normal prior.
  probs1[i] <- boot::inv.logit(sum(rnorm(30, 0, 2.5)) + rnorm(1))
}

# Prior predictive distribution for an observation with zeros for every covariate
probs0 <- boot::inv.logit(rnorm(1000))

hist(probs0, main = "prior pushforward distribution for reference category")
hist(probs1, main = "prior pushforward distribution for non-reference category")

If this worries you, then you might consider effects coding of the binary predictors (so that the pushforward distributions don’t depend on the arbitrary choice of reference category) and/or narrower priors.


This is the first time I’ve heard the terms ‘pushforward distribution’ or ‘effects coding of binary predictors’. The binary predictors aren’t really the problem here, though; they could conceivably be scaled with little difficulty. The problem is the nominal-scale, multi-df covariates, which are hard to scale because of the way they are internally represented in R.

And the biggest problem is the second-degree term with an SD of 91, which I’d love to hear someone’s opinion on how to set a prior for.


“Pushforward distribution” can be more-or-less understood to mean the (marginal) prior predictive distribution for some quantity of interest.
“Effects coding” means coding binary predictors as -1/1 rather than 0/1.
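
A minimal base-R illustration of the two codings, using a hypothetical two-level factor f:

f <- factor(c("a", "b", "a", "b"))

# Default treatment (dummy) coding: 0/1 relative to the reference level "a"
model.matrix(~ f)[, -1]    # 0 1 0 1

# Sum-to-zero ("effects") coding: -1/1, so no level is singled out as the reference
contrasts(f) <- contr.sum(2)
model.matrix(~ f)[, -1]    # 1 -1 1 -1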

My suggestion for the prior on the second-degree term is to examine the prior predictive distribution for whatever quantities domain knowledge can be brought to bear on, and choose a prior that is as diffuse as possible without putting substantial prior predictive mass on outlandish outcomes/conclusions. Doing this well is HARD! One of the challenges that you might encounter is that even fairly standard priors on the other coefficients might already be yielding weird prior predictive distributions. In general, you probably won’t be able to separate your choice of prior on the second-degree polynomial term from the rest of your priors, because you need to worry about whether your joint prior is placing an unreasonable concentration of prior mass on probabilities near zero and one, which is a ubiquitous bugbear for prior specification on the joint distribution of logit-scale coefficients. I wish this were easier.

EDIT:

The problem is that the way to a good prior, especially on the logit scale, is to write down a good joint prior over the parameters, so that you are getting plausible pushforward distributions for the probabilities. Thus, your choice about a good prior for one term isn’t really separable from your choice of priors on the other parameters. If you really don’t care about the pushforward distributions on the probability scale (but you do!) then you could just use an arbitrarily diffuse prior on the second-degree term (but don’t do that!).
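
To make that concrete, here is one possible shape for such a check, collapsed to a single logit for simplicity (the real 4-category model would use a softmax over category-specific linear predictors). The prior scales are just the ones that have come up in this thread, and the grid assumes x is roughly centered:

set.seed(1)
n_draws <- 1000
x_grid  <- seq(-8, 8, length.out = 50)   # roughly +/- 2 SD for a covariate with SD 3.85

# Draw all coefficients jointly from the priors under discussion
intercept <- rnorm(n_draws, 0, 5)
b_bin     <- matrix(rnorm(n_draws * 30, 0, 2.5), nrow = n_draws)   # 30 binary covariates
b_x       <- rnorm(n_draws, 0, 0.77)                               # linear term
b_x2      <- rnorm(n_draws, 0, 0.15)                               # quadratic term

# Pushforward for an observation with all binaries = 1, across the x grid
eta <- intercept + rowSums(b_bin) + outer(b_x, x_grid) + outer(b_x2, x_grid^2)
p   <- plogis(eta)   # n_draws x 50 matrix of implied probabilities

hist(p, main = "prior pushforward of the probability, all binaries = 1")
matplot(x_grid, t(p[1:20, ]), type = "l", lty = 1,
        xlab = "x", ylab = "prior draws of P(outcome)")

The “all binaries = 1” scenario and the plotted curves are of course just one slice; in practice you would look at whichever pushforward quantities domain knowledge actually speaks to.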


Many thanks for your informative posts, @jsocolar. After a night’s sleep, I was able to comprehend your histogram example and what it illustrates, i.e. that the 0/1 coding causes the reference level of a binary covariate to be treated differently from the non-reference level, prior-wise. However, in my case the Intercept has a prior of N(0, 5), which is even more diffuse than the one on the betas. This prior is also based on the section of BDA3 mentioned earlier. Thus, when conducting your histogram experiment, I find that the prior probability distribution is highly U-shaped for both the reference category and the non-reference category. This doesn’t worry me particularly, given that at least one authoritative source thinks it is not a problem:

This is good enough for me. In a perfect world, I would indeed scale at least the binary covariates; they can take only two values, so I don’t see how interpretation could be badly compromised. The problem is that I’d then also have to figure out a way to subject the multicategory covariates to the same scaling, without the major complication of having to manually create dummies for every non-reference category of every multicategory covariate.
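
For what it’s worth, a small illustration of the dummy expansion in question: model.matrix() shows the 0/1 columns that a multicategory factor turns into at fitting time (the factor g and its levels are hypothetical), and those generated columns are what any scaling would have to act on:

g <- factor(c("red", "green", "blue", "green", "red"))   # a hypothetical 3-level covariate
x <- c(1.2, 3.4, 0.7, 2.2, 5.0)

mm <- model.matrix(~ g + x)[, -1]   # drop the intercept; columns are ggreen, gred, x
mm

# Each non-reference level becomes its own 0/1 column, so "scaling the covariate"
# would mean scaling these generated columns rather than the original factor:
scale(mm)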

References:

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin (2014). Bayesian Data Analysis. 3rd ed. CRC Press.

Agresti, Alan (2013). Categorical Data Analysis. 3rd ed. Hoboken, New Jersey: John Wiley & Sons.