Interpretation of predictions from a Bayesian logistic/probit regression

Hi all,

I’m having some trouble understanding the proper way to make predictions from a Bayesian binary regression model (logistic or probit for the time being). Let’s say I have a joint posterior distribution for the model parameters of a logistic regression that I have simulated using Stan: f(\theta|x, y), where x and y are the predictor variables/binary responses, i.e. the labelled data. I want to propagate my uncertainty in the estimates of the parameters through to new predictions. My understanding of how to do this is to simulate the distribution of f(p_{new}|x_{new}) = E[y_{new}] = Pr(y_{new} = 1| x_{new}). I’d take the posterior draws of \theta, i.e. \theta^{(1)}, \theta^{(2)}, ..., and compute p_{new}^{(i)} = 1/(1+\exp(-\theta^{(i)T} x_{new})). This would then give me a probability distribution for p_{new}, which could be interpreted as the probability of the new input belonging to the positive class plus some uncertainty regarding the estimation of that probability.
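In code, what I have in mind is roughly the sketch below (purely illustrative, with made-up names: `theta_draws` would be the S x K matrix of posterior draws extracted from the Stan fit, and `x_new` the new predictor vector):

```python
import numpy as np

def inv_logit(z):
    """Logistic function, mapping the linear predictor to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_new_draws(theta_draws, x_new):
    """One draw of p_new per posterior draw of theta.

    theta_draws: (S, K) array of posterior draws of the coefficients
                 (placeholder name for whatever the Stan fit returns).
    x_new:       length-K vector of new predictor values.
    """
    return inv_logit(theta_draws @ x_new)

# The result is a sample from f(p_new | x_new): a whole distribution
# over [0, 1] rather than a single probability.
```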

I’ve seen some other work on the topic https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w8a_bayes_logistic_regression_laplace.pdf that seems to collapse this process into a single probability value. This seems to go against the process I’ve laid out above (which I previously thought to be correct). Am I misinterpreting this work, or am I wrong about something else?

In the same line of reasoning, previous work on my research question has formulated the binary regression in terms of a latent variable probit model. This is typically formulated as:

Given predictor variables X and binary responses Y, encoded such that x_i is the vector of predictor variables’ values and y_i the response for the i^{th} case, let \theta be the vector of model parameters associated with those predictor variables. Adopt a latent variable interpretation such that we assume y_i = 1 when y^*_i > 0 and y_i = 0 when y^*_i < 0, defining the latent variable as y^*_i = g(x_i, \theta) + \epsilon where \epsilon ~ Normal(0, \sigma_e) and g() is some combination of predictors/coefficients that allows the error term to be identified as a model parameter. Perform the Bayesian updating and determine the joint posterior distribution for our parameters, f(\theta, \epsilon | x, y).

They follow a similar line of reasoning for making predictions of class probabilities that seems to be at odds with my own:

To make a prediction of the probability that some new observation has y_{new} = 1 given some new data x_{new}, they write:

Pr(y_{new} = 1) = Pr(y^*_{new} > 0) = Pr(g(x_{new}, \theta) + \epsilon > 0). Viewing the \theta's and \epsilon as random variables with their joint posterior distribution, this probability becomes a multifold integral of that distribution over the region where g(x_{new}, \theta) + \epsilon is positive. This results in a single probability as the answer.
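If I’m reading their approach right, the Monte Carlo version of that integral would look something like the sketch below (names are placeholders, and I’m taking g(x, \theta) to be the linear predictor just for illustration):

```python
import numpy as np
from scipy.stats import norm

def pr_y_new_is_1(theta_draws, sigma_e_draws, x_new):
    """Monte Carlo estimate of Pr(y*_new > 0) in the latent-variable model.

    theta_draws:   (S, K) posterior draws of the coefficients (placeholder name).
    sigma_e_draws: (S,) posterior draws of the error scale.
    x_new:         length-K vector of new predictor values.
    """
    g = theta_draws @ x_new  # g(x_new, theta), one value per posterior draw
    # For Normal(0, sigma_e) errors, Pr(g + eps > 0 | theta, sigma_e) = Phi(g / sigma_e);
    # averaging over the posterior draws approximates the multifold integral.
    return np.mean(norm.cdf(g / sigma_e_draws))
```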

Again, am I missing something in my interpretation or is there a flaw in the above logic?

Please forgive any sloppiness of notation (although maybe said sloppiness might reveal where my understanding fails!)

Since nobody else answered, I will give it a try:

I think you are mixing together two things:

  1. the predictive distribution of the outcome of the linear model after applying the inverse-link transformation (which is indeed a distribution over the interval [0,1]). This distribution has density f(p_{new}|x_{new}), but we actually only have samples from this distribution p_{new}^{(1)}, ... , p_{new}^{(S)}
  2. quantities derived from the predictive distribution (E[y_{new}], Pr(y_{new} = 1| x_{new})) which are both a single number. To be precise, those result from integrating over the distribution, e.g.
E[y_{new}] = E[0 \times (1 - p_{new}) + 1 \times p_{new}] = E[p_{new}] = \int_0^1 p_{new} \, f(p_{new}|x_{new}) \, \mathrm{d}p_{new} \approx \frac{1}{S}\sum_{i=1}^S p_{new}^{(i)}
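In code terms (with a hypothetical array `p_draws` holding the samples p_{new}^{(1)}, ..., p_{new}^{(S)} from point 1), that last approximation is just the sample mean:

```python
import numpy as np

def pr_y_new(p_draws):
    """Collapse the samples from f(p_new | x_new) into the single number
    E[y_new] = Pr(y_new = 1 | x_new), as in point 2."""
    return np.mean(p_draws)
```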

The confusion here might arise because both E[y_{new}] and Pr(y_{new} = 1| x_{new}) take values in the same interval [0, 1] as the distribution they are derived from, and because E[y_{new}] = E[p_{new}] due to the properties of the Bernoulli distribution.

But the same distinction would apply to any kind of regression - in a linear regression, there is a whole posterior distribution of outcomes given x_{new}, but E[y_{new}] or Pr(y_{new} > 2) would still be single numbers derived from this distribution.

I believe the same distinction should resolve the second part of your question.

Does that make sense?

Yes, that clarifies things nicely. Consistent with the interpretations above, what do you think is the proper way to make fully Bayesian predictions from such a model? Some background: the purpose of these models is to predict future probabilities of failure based upon previously observed failures/successes. Originally I was thinking the appropriate way to do this would be the predictive distribution, which I interpreted as summarizing the uncertainty about future probabilities. But am I misinterpreting the predictive distribution in this sense? Is the forward prediction of probability just the expected value of this predictive distribution?

This would IMHO depend on your aims. Assuming the model is reasonably close to reality (which it almost certainly isn’t, but let’s run with it for a while), there are two basic ways to think about it:

  1. For a given x_{new}, there is some true underlying average success rate over many future observations. The predictive distribution expresses my uncertainty about this value.
  2. I am about to place a bet (or make a decision of any sort) on a single future observation. Then Pr(y_{new} = 1|x_{new}) is the relevant quantity.
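As a rough sketch of the two uses (again with a hypothetical `p_draws` array of posterior samples of p_{new} for the given x_{new}):

```python
import numpy as np

def rate_interval(p_draws, level=0.95):
    """Use 1: a credible interval for the long-run success rate at x_new."""
    tail = 100 * (1 - level) / 2
    return np.percentile(p_draws, [tail, 100 - tail])

def expected_payoff(p_draws, gain, cost):
    """Use 2: a decision about a single future observation -- only
    Pr(y_new = 1 | x_new) enters, e.g. the expected payoff of a bet
    that wins `gain` on success and loses `cost` on failure."""
    pr_success = np.mean(p_draws)
    return pr_success * gain - (1.0 - pr_success) * cost
```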

Hope that helps


Number 2 would be more in line with this predictive model I think, as we are trying to predict a probability of failure. I am still a little confused by the interpretations though. Am I correct in interpreting the predictive distribution of the outcome after applying the inverse link transformation as a collection of plausible Bernoulli distribution parameters/expected values of binary random variables, rather than as a distribution of actual predicted probabilities? Is this confusion possibly exacerbated by the fact that E[p_{new}] = p_{new} when using a MAP or MLE estimate of \theta? And to find the actual probability we have to consider the outcomes of all those Bernoulli variables?

Yeah, you could say that.

One more way of thinking about this:

  • You may predict the success of one event - there are only two possibilities, so the probability distribution of those can be described by a single number (this is case 2 in my previous post)
  • You may predict the success rate over 2 events - there are 3 possibilities (0, 0.5, 1), so to describe the probability distribution you need 2 numbers.
  • You may predict the success rate over 5 events - there are 6 possibilities (0, 0.2, 0.4, 0.6, 0.8, 1), so to describe the probability distribution you need 5 numbers.
  • You may predict the success rate over infinitely many events - there are infinitely many possibilities, and you need to describe the distribution via a continuous density function (this is case 1 in my previous post)
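A quick simulation sketch of that progression (hypothetical `p_draws` again holding posterior samples of p_{new}):

```python
import numpy as np

rng = np.random.default_rng(0)

def success_rate_draws(p_draws, n_events):
    """Posterior predictive draws of the success rate over n_events future trials."""
    # For each posterior draw of p_new, simulate n_events binary outcomes
    # and record the observed success rate.
    counts = rng.binomial(n_events, p_draws)
    return counts / n_events

# With n_events = 1 the draws can only be 0 or 1 (the single-event case 2);
# as n_events grows, the draws approach the continuous distribution of p_new
# itself (case 1).
```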

Hope that helps more than obscures :-)

I discuss this here:

For Bayesian logistic regression, we want to average over our uncertainty in the regression coefficients \beta. Suppose we have data y from which we get a posterior p(\beta \mid y) over regression coefficients. Now suppose we get a new predictor \tilde{x} and we want to predict the binary outcome \tilde{y}. It’s the usual posterior predictive inference formula p(\tilde{y} \mid y) (plus predictors), which averages by integrating,

\begin{array}{rcl} \textrm{Pr}[\tilde{y} = 1 \mid \tilde{x}, y, x] & = & \int \textrm{Bernoulli}(\tilde{y} = 1 \mid \textrm{logit}^{-1}(\tilde{x} \cdot \beta)) \cdot p(\beta \mid y, x) \, \textrm{d}\beta \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \textrm{Bernoulli}(\tilde{y} = 1 \mid \textrm{logit}^{-1}(\tilde{x} \cdot \beta^{(m)})) \end{array}

The second line gives the MCMC estimate, where \beta^{(m)} is the m-th draw from the posterior p(\beta \mid y, x).
