Regularising prior on sd for ordinal regression

I’m running an ordinal probit regression in which respondents rate several ideas according to how important they think they are. There are about 10-15 items being rated, so I want to reduce the chance of observing false positives when comparing the importance ratings given to the different items. I think a decent way of doing this would be not to model the item being rated as a fixed effect, which would look like this:

importance ~ 1 + item + (1 | respondent)

But rather to model each item’s effect as a deviation from the average importance rating, as in a hierarchical model:

importance ~ 1 + (1 | item) + (1 | respondent)

In this way, as I understand it, if there is not much information about where responses to a specific item should sit, they will be shrunk towards the global mean across items. Firstly, is that a correct understanding of how this would work in principle? We actually have a couple of thousand respondents, so I suspect there will not be much shrinkage.
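For concreteness, here is a sketch of the two specifications written as brms calls (assuming brms syntax, since that is how I have written the formulas; the data frame and column names are placeholders for my actual data):

```r
library(brms)

# Sketch only: 'ratings', 'importance', 'item', and 'respondent' are placeholders
fit_fixed <- brm(importance ~ 1 + item + (1 | respondent),
                 data = ratings, family = cumulative("probit"))

fit_hier  <- brm(importance ~ 1 + (1 | item) + (1 | respondent),
                 data = ratings, family = cumulative("probit"))
```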

The second consideration is the prior to set on the sd of this item-level term. In some typical examples of multilevel models, I’ve seen people suggest exp(1), but I think this implies that a very large amount of variation is expected a priori between the items - especially as the response is measured on the probit scale, where +/-1 is a full standard deviation difference. I thought instead it might be good to test out a tighter prior on the sd components, such as exp(10), which restrains the range of values initially entertained quite substantially.
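Roughly the kind of comparison I had in mind (draws from the two exponential priors, interpreted on the probit scale of the latent response):

```r
# Implied prior quantiles for the item-level sd under the two rates
sd_exp1  <- rexp(1e5, rate = 1)    # exp(1): median ~0.69, 99th percentile ~4.6
sd_exp10 <- rexp(1e5, rate = 10)   # exp(10): median ~0.07, 99th percentile ~0.46

quantile(sd_exp1,  probs = c(0.5, 0.9, 0.99))
quantile(sd_exp10, probs = c(0.5, 0.9, 0.99))
```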

Does this seem appropriate as a general approach?

In addition, would you expect there to be much shrinkage if there are a couple of thousand observations per item, and all the items have the same number of observations? I ask because, in testing different priors, there seems to be essentially no shrinkage occurring. Surprisingly, this is the case even when I reduce the n per item to quite small numbers like 20 or 50, unless I use really extreme priors like exp(100), which does not seem reasonable in terms of actual prior expectations.
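This is the kind of check I have been running, sketched here with the same placeholder names as above (the cutoff of 20 per item and the exp(10) prior are just examples):

```r
library(brms)
library(dplyr)

# Cut each item down to a small n and compare per-item estimates with and
# without partial pooling ('ratings' is a placeholder for the real data)
small <- ratings %>%
  group_by(item) %>%
  slice_sample(n = 20) %>%
  ungroup()

fit_fixed <- brm(importance ~ 1 + item + (1 | respondent),
                 data = small, family = cumulative("probit"))
fit_hier  <- brm(importance ~ 1 + (1 | item) + (1 | respondent),
                 data = small, family = cumulative("probit"),
                 prior = prior(exponential(10), class = sd, group = item))

# Compare the spread of the item deviations from fit_hier (relative to the grand
# mean) with the spread of the item coefficients from fit_fixed (relative to the
# reference item); little difference suggests little shrinkage is happening
ranef(fit_hier)$item[, , "Intercept"]
fixef(fit_fixed)
```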


Yes, that makes sense. I don’t know how you could treat this as a fixed effect if item is just the item’s identifier. If item carried some value rather than just an identifier, then it wouldn’t make sense as a random effect.

Regularization toward the population mean (presumably not just straight shrinkage toward zero) will depend not on how many respondents there are, but on how much data informs each one, how tight a distribution they are consistent with, and to some extent the hyperprior. If there are a lot of respondents with very little data each, but they all come from a common distribution, you will see a lot of shrinkage. For instance, simulate from binomial(K, theta[j]) for K = 10 and j in 1:1000 and you’ll see a huge amount of shrinkage.
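For example, a quick sketch of that simulation (using a logit link for the simulated probabilities; this is just to visualize the shrinkage, not the model from the original question):

```r
library(brms)

# Many groups, only K = 10 trials each, all drawn from a common distribution
set.seed(1)
J <- 1000
K <- 10
theta <- plogis(rnorm(J, 0, 1))   # group-level success probabilities
sim <- data.frame(j = factor(1:J), K = K,
                  y = rbinom(J, size = K, prob = theta))

fit <- brm(y | trials(K) ~ 1 + (1 | j), data = sim, family = binomial())

# Raw proportions vs. partially pooled estimates: with only 10 trials per group,
# the pooled estimates are pulled strongly toward the overall mean
raw    <- sim$y / K
pooled <- plogis(fixef(fit)["Intercept", "Estimate"] +
                 ranef(fit)$j[, "Estimate", "Intercept"])
plot(raw, pooled, xlab = "raw proportion", ylab = "partially pooled estimate")
abline(0, 1, lty = 2)
```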

You mean you are going to take

beta_item[k] ~ normal(0, sigma)

and you want suggestions for a prior on sigma? That should depend on what you know about the item variation ahead of time. Usually, if the number of items is large, the posterior won’t be very sensitive to the prior as long as the data are consistent with it. But the only way to test this is to try different priors and see what their effect is. I didn’t understand what sd(10) meant as a suggested prior or what you were plotting.

The other question is whether to take something like a half-normal, which is consistent with zero (full pooling), or something like a lognormal, which isn’t consistent with zero.
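To illustrate the distinction (the specific scales here are arbitrary, not suggested values):

```r
# Half-normal vs. lognormal shapes near zero for a standard-deviation parameter
curve(2 * dnorm(x, mean = 0, sd = 0.5), from = 0, to = 2,
      xlab = "sd of item effects", ylab = "prior density")  # half-normal: positive density at 0
curve(dlnorm(x, meanlog = -1, sdlog = 1), from = 0, to = 2,
      add = TRUE, lty = 2)                                   # lognormal: density -> 0 at 0
```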

You’ll find logit a lot more efficient computationally. We even have an ordered logit built in.

Thanks @Bob_Carpenter for your thoughts and the detail you’ve gone into.

I didn’t understand what sd(10) meant as a suggested prior or what you were plotting.

Apologies, this was a typo: where I wrote “sd(10)” I meant “exp(10)”, as a prior on the standard deviation of the individual items around the mean of the items. A typical prior I’ve seen on this hyperparameter is exp(1), but I thought it could make sense to bind it more tightly around 0, so that a priori it is assumed there is not a huge amount of variation among the items - that is what I was hoping to show with the plot I presented.

I think I would be quite happy to allow a prior that is consistent with 0, such as the half-normal. Is it correct that if I just set the prior as normal on an sd parameter, it is effectively constrained to be half-normal, because the sd can’t be less than 0?
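That is, something like the following, assuming brms as above (the scale of 0.3 is just an illustrative value, and my understanding of the truncation is what I’m asking about):

```r
library(brms)

# What I have in mind: brms puts a lower bound of 0 on sd parameters, so I
# believe a normal prior here acts as a half-normal ('ratings' etc. are
# placeholders as above)
fit <- brm(importance ~ 1 + (1 | item) + (1 | respondent),
           data = ratings, family = cumulative("probit"),
           prior = prior(normal(0, 0.3), class = sd, group = item))
```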

You’ll find logit a lot more efficient computationally.

Thanks for pointing this out. I will test a logit model - if it is not significantly faster and doesn’t produce much more ESS, then I might stick with probit, because I find thinking on that scale much more intuitive than the logit!