Hi all,

I’m having some trouble understanding the proper way to make predictions from a Bayesian binary regression model (logistic or probit for the time being). Say I have a joint posterior distribution for the parameters of a logistic regression, simulated using Stan: f(\theta | x, y), where x and y are the predictor variables and binary responses, i.e. the labelled data. I want to propagate my uncertainty in the parameter estimates through to new predictions. My understanding is that I should simulate the distribution of p_{new} = E[y_{new} | x_{new}, \theta] = Pr(y_{new} = 1 | x_{new}, \theta): take the posterior draws \theta^{(1)}, \theta^{(2)}, ... and compute p_{new}^{(i)} = 1/(1 + exp(-\theta^{(i)T} x_{new})). This gives me a probability distribution for p_{new}, which I interpret as the probability of the new input belonging to the positive class, plus uncertainty about the estimate of that probability.
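To make sure I'm describing the procedure concretely, here is a minimal sketch of what I mean in Python. The posterior draws here are simulated placeholders standing in for draws extracted from an actual Stan fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for illustration only: 4000 draws of a
# 3-dimensional theta (in practice, extracted from the Stan fit).
theta_draws = rng.normal(loc=[0.5, -1.2, 0.8], scale=0.3, size=(4000, 3))

# A new input, including an intercept term in the first position.
x_new = np.array([1.0, 0.4, -0.7])

# One p_new per posterior draw: the posterior distribution of
# Pr(y_new = 1 | x_new, theta) under the logistic link.
p_new = 1.0 / (1.0 + np.exp(-(theta_draws @ x_new)))

print(p_new.mean(), np.percentile(p_new, [2.5, 97.5]))
```

Each element of `p_new` is itself a valid probability, so summaries like the posterior mean or a 95% interval describe both the prediction and my uncertainty about it.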

I’ve seen some other work on the topic https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w8a_bayes_logistic_regression_laplace.pdf that seems to collapse this process into a single probability value. This seems to go against the process I’ve laid out above (which I previously thought to be correct). Am I misinterpreting this work, or am I wrong about something else?
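For reference, my reading of that collapse (which may itself be where I'm going wrong) is that the single value is the posterior predictive probability, i.e. my p_{new} draws integrated over the posterior, which Monte Carlo averaging approximates:

$$\Pr(y_{new} = 1 \mid x_{new}, x, y) = \int \sigma(\theta^{T} x_{new}) \, f(\theta \mid x, y) \, d\theta \approx \frac{1}{S} \sum_{i=1}^{S} p_{new}^{(i)},$$

where \sigma(\cdot) is the logistic function and S is the number of posterior draws.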

In the same line of reasoning, previous work on my research question has formulated the binary regression in terms of a latent variable probit model. This is typically formulated as:

We are given predictor variables **X** and binary responses **Y**, encoded such that x_i is the vector of predictor values and y_i the response for the i^{th} case. Let **\theta** be the vector of model parameters associated with those predictors. Adopt a latent variable interpretation: assume that y_i = 1 when y^*_i > 0 and y_i = 0 when y^*_i < 0, defining the latent variable as y^*_i = g(x_i, \theta) + \epsilon, where \epsilon ~ Normal(0, \sigma_e) and g() is some combination of predictors/coefficients that allows the error term to be identified as a model parameter. Perform the Bayesian updating and determine the joint posterior distribution for our parameters, f(\theta, \epsilon | x, y).

They follow a similar line of reasoning for making predictions of class probabilities that seems to be at odds with my own:

We want the probability that some new observation has y_{new} = 1, given some new data x_{new}:

Pr(y_{new} = 1) = Pr(y^*_{new} > 0) = Pr(g(x_{new}, \theta) + \epsilon > 0). Viewing the \theta's and \epsilon as random variables with their joint posterior distribution, this probability becomes a multifold integral over the region where g(x_{new}, \theta) + \epsilon is positive. This results in a single probability as the answer.
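To compare the two views concretely, here is how I understand their calculation as a Monte Carlo sketch in Python. The posterior draws are placeholders, and I'm assuming g(x, \theta) = \theta^T x purely for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Hypothetical posterior draws of (theta, sigma_e) for illustration;
# in practice these would come from the probit model's Stan fit.
theta_draws = rng.normal(loc=[0.2, 0.9], scale=0.25, size=(4000, 2))
sigma_draws = np.abs(rng.normal(loc=1.0, scale=0.1, size=4000))

x_new = np.array([1.0, -0.3])  # includes intercept term

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Per draw: Pr(y*_new > 0 | theta, sigma_e) = Phi(g(x_new, theta) / sigma_e),
# taking g(x, theta) = theta' x for concreteness.
g = theta_draws @ x_new
p_draws = np.array([phi(v) for v in g / sigma_draws])

# Averaging integrates out the parameter uncertainty, collapsing the
# distribution of p_draws into the single probability they report.
print(p_draws.mean())
```

So the spread of `p_draws` corresponds to my distribution-over-probabilities view, while their single number appears to be its posterior mean, which is why I'm unsure whether the two approaches genuinely conflict.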

Again, am I missing something in my interpretation, or is there a flaw in the above logic?

Please forgive any sloppiness of notation (although maybe said sloppiness might reveal where my understanding fails!)