Default prior for probit regression with many predictors fails to sample

I’ve been working through Exercise 15.5 of Regression and Other Stories, which asks to take a logit model and convert it to a probit.
I found that when fitting models with more than a few predictors, the sampler fails unless I set the prior explicitly.
Is there a better default weakly informative prior?

I’m running rstanarm version 2.21.1 on Ubuntu-20.04 (via WSL2).

To demonstrate, here’s an example where we try to fit a random binary outcome on 5 noise predictors with 1000 data points:

N <- 1000

fake_data <- data.frame(a=rbinom(N,1,0.5),
                        b=rbinom(N,1,0.5),
                        c=rbinom(N,1,0.5),
                        d=rbinom(N,1,0.5),
                        e=rbinom(N,1,0.5),
                        y=rbinom(N,1,0.5))

fit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                          family=binomial(link="probit"),
                          data=fake_data)

Most of the time when running this I get an error like:

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.
...
Chain 1: 
Chain 1: Initialization between (-2, 2) failed after 100 attempts. 
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done
Error in check_stanfit(stanfit) : 
  Invalid stanfit object produced please report bug

It seems that the more predictors there are the more likely this is to happen.
I haven’t come across this problem when working with logistic regression (using a “logit” link function in the above).

One way I can work around this is by setting a flat prior, with which the model seems to converge.

fit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                          family=binomial(link="probit"),
                          prior=NULL,
                          data=fake_data)

Another way to get it to converge is by setting a slightly tighter prior (which gives very similar estimates):

fit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                          family=binomial(link="probit"),
                          prior=rstanarm::normal(scale=1, autoscale=TRUE),
                          data=fake_data)

I don’t understand how Stan works internally, but I’m guessing that as we add more predictors there’s a greater chance of one being initialised far from its actual value (in this case 0).
I suspect that because the tail probabilities of the normal distribution go to zero much faster than those of the logistic distribution, this leads to numerical errors with link="probit" that don’t occur with link="logit".
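
A quick way to see the difference in tail behaviour (just an illustration of my guess, not what rstanarm computes internally):

eta <- c(5, 10, 20, 40)
rbind(probit = pnorm(-eta),   # normal tail probabilities
      logit  = plogis(-eta))  # logistic tail probabilities

pnorm(-40) underflows to exactly 0 in double precision, while plogis(-40) is still about 4e-18, so likelihood terms that stay finite under the logit link can become log(0) = -Inf under the probit link.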

Is there a better heuristic for a weakly informative prior for probit regression?
In one example I’ve found that prior=rstanarm::normal(scale=0.5, autoscale=TRUE) seems to work and gives results similar to a flat prior, but I don’t know whether that’s a reasonable thing to do in general.

That’s my guess too. In this case you can either modify the inits or scale the predictors to have a smaller range.

This has always been a problem when using the probit link. I believe that if you set the init to 0 it goes away. I’m sure there are more elegant solutions.
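
For what it’s worth, something along these lines should do it (assuming stan_glm forwards init through ... to rstan::sampling, which I believe it does):

# Start every chain at 0, where pnorm(eta) = 0.5 and nothing underflows
fit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                          family=binomial(link="probit"),
                          data=fake_data,
                          init=0)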


I want to follow up on this thread, as I’m wondering about the automatic scaling that rstanarm does for probit models. (I’m running version 2.21.1 on Mac.)

If I read help('priors') it states:
“If scale is not specified it will default to 2.5, unless the probit link function is used, in which case these defaults are scaled by a factor of dnorm(0)/dlogis(0), which is roughly 1.6.”
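
For reference, that factor is easy to compute:

dnorm(0) / dlogis(0)   # 0.3989423 / 0.25 = 1.595769, i.e. roughly 1.6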

Indeed, this can easily be checked by running a version of the code provided by Edward, using the default priors:

N <- 1000
fake_data <- data.frame(a=rbinom(N,1,0.5),
                        b=rbinom(N,1,0.5),
                        c=rbinom(N,1,0.5),
                        d=rbinom(N,1,0.5),
                        e=rbinom(N,1,0.5),
                        y=rbinom(N,1,0.5))

fit_probit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                                 family=binomial(link="probit"),
                                 data=fake_data)
rstanarm::prior_summary(fit_probit)

fit_logit <- rstanarm::stan_glm(y ~ a + b + c + d + e,
                                family=binomial(link="logit"),
                                data=fake_data)
rstanarm::prior_summary(fit_logit)

I get the following information on the priors:

Priors for model 'fit_probit' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 4)

Coefficients
 ~ normal(location = [0,0,0,...], scale = [1.6,1.6,1.6,...])
------

Priors for model 'fit_logit' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 2.5)

Coefficients
 ~ normal(location = [0,0,0,...], scale = [1,1,1,...])
------

This confirms that rstanarm scales the logit default of 2.5 by a factor of 1.6 so that it equals 4. So far so good, but I always thought the reason for scaling was to correct for the difference in scale between the logistic and the normal distribution. Consider the following toy example:

plot(density(plogis(rnorm(1e4,0,1))))
lines(density(pnorm(rnorm(1e4,0,1))),col="red")

Here I compare the inverse logit (black) to the inverse probit (red), assuming the same scale for the underlying normal draws; they are clearly not the same!

Now, I’m scaling up the probit inverse by a factor of 1.6, i.e.

plot(density(plogis(rnorm(1e4,0,1))))
lines(density(pnorm(rnorm(1e4,0,1.6))),col="red")

I obtain an even stronger mismatch. Now I’m scaling down the probit inverse by a factor of 1.6, i.e.

plot(density(plogis(rnorm(1e4,0,1))))
lines(density(pnorm(rnorm(1e4,0,1/1.6))),col="red")

Now we are talking! Based on this I believe that rstanarm should scale down rather than scale up the default logit prior when one specifies a probit model. So if I’m not mistaken (and that could well be the case), the scale for a probit model should default to roughly 1.57 (2.5 divided by the factor) rather than 4 (2.5 multiplied by it), which may explain the convergence problems reported by Edward.
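
For completeness, the two candidate defaults are:

2.5 / (dnorm(0) / dlogis(0))   # = 1.566..., roughly the 1.57 suggested above
2.5 * (dnorm(0) / dlogis(0))   # = 3.989..., roughly the 4 that prior_summary reports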

I would be glad if someone with more insights could confirm my hunch.