Changing regression response variable from gaussian to logistic

I’m trying to adapt a complicated model from a continuous response variable to a binary one using a logit link, and I’m having some trouble (I believe I’m missing something).

I’ve made a simple linear regression model to try things out with:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma);  // likelihood
}

Ok, this fits toy data just fine with no issues:

library(rstan)

data(mtcars)

m <- 32
y <- sample(c(0, 1), size = m, replace = TRUE)
dat <- list(N = m, y = y, K = 3, x = as.matrix(mtcars[, c("disp", "hp", "wt")]))


fit <- stan(file = "regression.stan",
            data = dat, iter = 2012, chains = 4, cores = 4)

So I tried to edit this to a bernoulli_logit for the binary outcome:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  int y[N];      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
  
  vector[N] ystar;
}
model {
  // likelihood
  ystar ~ normal(x * beta + alpha, sigma);
  y ~ bernoulli_logit(ystar);  
}

I’m getting thousands of max tree depth warnings and pairs plots look like this:

I suspect I’m missing something with the model specification, but I can’t think what. Can anyone help?
Thanks

Edit. I tried adding priors as follows:

  //priors
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ inv_gamma(1,1);

which shows up some funnels, but I’m at a loss what to do about it:

Observation-level random effects are not well identified in logistic regression with binary outcomes. The traditional logistic regression would remove your parameter sigma and the normal sampling statement.

Edit: Alternatively, you need a strongly informative prior on sigma and a non-centered parameterization of ystar ~ normal(x * beta + alpha, sigma), but this only makes sense if you truly have quite strong prior information about the true value of sigma.
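
For concreteness, a minimal sketch of that alternative for the toy model above (the data block stays the same; the normal(0.5, 0.1) prior on sigma is purely a placeholder for whatever strong prior information you actually have):

parameters {
  real alpha;                // intercept
  vector[K] beta;            // coefficients for predictors
  real<lower=0> sigma;       // error scale
  vector[N] ystar_raw;       // standardized latent values (non-centered)
}
transformed parameters {
  // equivalent to ystar ~ normal(x * beta + alpha, sigma)
  vector[N] ystar = x * beta + alpha + sigma * ystar_raw;
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ normal(0.5, 0.1);  // placeholder: a strongly informative prior is essential here
  ystar_raw ~ std_normal();
  y ~ bernoulli_logit(ystar);
}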


Ok, interesting. Alas, I’m not sure how to use either approach with my full model, which is a form of GP regression.

Some toy data:

data1 <- list(n_obs = 20L, n_exp = 3L, n_cov = 1,
            y = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1), 
            Z = structure(c(0.235, -0.331, -0.312, -2.302, -0.171, 0.14, -1.497, -1.01, -0.948, 
                           -0.494, -0.174, -0.407, 1.846, 0.394, 0.798, -1.567, -0.086, 
                           -0.359, -1.194, 0.364, 0.362, 0.347, 0.19, -0.16, 0.327, 0.598, 
                           -1.842, 2.718, 0.191, -1.301, -3.113, -0.941, 1.4, -1.62, -2.266, 
                            1.163, -0.116, 0.334, -0.621, -1.31, -1.176, -1.121, -1.362, 
                            0.481, 0.742, 0.028, 0.331, 0.644, 2.486, 1.96, 0.192, 1.553, 
                            0.914, 0.359, 0.175, -0.847, 0.978, 1.806, 0.123, -0.13), dim = c(20L, 3L),
            dimnames = list(NULL, c("z1", "z2", "z3"))),
            X = structure(c(2.485, 5.73, 3.675, -0.182, 5.817, 2.208, 0.625, -0.017, 2.338, 5.451, 
                            5.002, 3.708, -2.155, 3.089, 1.33, 1.884, 1.726, 2.612, 3.169, 
                            3.579), dim = c(20L, 1L)))

Gaussian version of the model: gpgaussian.stan (1.1 KB)

That runs just fine (note I attempted a non-centered version, but it had a lot of problems). For my attempt at a binomial version, I changed the parameters and model blocks as follows:

parameters {
    real<lower=0.0> sigma;
    real<lower=0.0> lambda;
    vector[n_cov] beta;
    vector[n_obs] ystar;

    vector<lower=0.0>[n_exp] r;
}

model {
    matrix[n_obs, n_obs] V;

    V = add_diag(lambda * exp(-cov_Kzr(Z, r)), 1.0);

    // Priors
    sigma ~ inv_gamma(1, 0.25);
    lambda ~ inv_gamma(10, 10);
    r ~ cauchy(0, 0.1);
    beta ~ std_normal();

    // Likelihood
    ystar ~ multi_normal_cholesky(X*beta, sigma * cholesky_decompose(V));
    y ~ bernoulli_logit( ystar );
}

I wasn’t sure if/how I could drop the multi_normal_cholesky() without fundamentally altering the model.

Using a GP prior changes things because of the way it pools information across space. It would be worth taking a standard GP model and using that as the linear predictor in a logistic regression, and seeing how the result behaves. This is a rare situation where building up from a simpler model doesn’t work well.

Note that you can easily construct this model in brms using a gp term; I’d recommend giving that a try if your implementation continues to have problems, just in case the brms implementation works right out of the box. brms also gives you access to fast approximate GPs, in case that’s of interest.
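
Something along these lines, for example (a sketch only; my_data, x1, and x2 are placeholders for your data frame and inputs):

library(brms)

# Exact GP term in the linear predictor of a bernoulli model
fit_gp <- brm(y ~ gp(x1, x2), data = my_data, family = bernoulli())

# Approximate GP via a basis-function expansion; k sets the number of basis functions
fit_gp_approx <- brm(y ~ gp(x1, x2, k = 20), data = my_data, family = bernoulli())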


@jroon I might be missing something, but do you just want to fit a logistic regression to a known binary response variable? That’s what it looks like to me, rather than what you currently have, which imputes an unknown logit-scale parameter/variable called ystar and is a more complex model.

A simple logistic regression for your binary data would be:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  int y[N];      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
}
transformed parameters {
  vector[N] ystar = alpha + x * beta;  // linear predictor (deterministic given alpha, beta)
}
model {
  // likelihood
  y ~ bernoulli_logit(ystar);
}
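
If you don’t need ystar downstream, you could equivalently drop the transformed parameters block and write the likelihood with Stan’s bernoulli_logit_glm:

model {
  // likelihood, equivalent to the above
  y ~ bernoulli_logit_glm(x, alpha, beta);
}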

Apologies if you do actually want to estimate the latent sigma here, I just thought I’d clarify this!

EDIT: @jsocolar Sorry, just realised this was part of your original reply!


As I have a custom covariance function, I did not think brms would facilitate this model?

Yup I need the latent I’m afraid.


Yeah, that’s right.

Although, perhaps I could get some clues using a brms supported cov function. I’ll play with this later, thanks for the idea!

Edit: Ok I had some time to look at this as follows:

library(brms)

dat <- mgcv::gamSim(1, n = 30, scale = 2)
dat$ystar <- dat$y
dat$y <- rbinom(nrow(dat), 1, 1/(1 + exp(-dat$ystar)))   # pass through an inv-logit function

bf5 <- bf(ystar ~ gp(x1, x2), family = "gaussian") +
          bf(y ~ ystar, family = "bernoulli")
fit5 <- brm(bf5, dat, chains = 2)

This model threw a small number of divergences, but ok, I wanted to see the generated Stan code, and that proved interesting. I’ll just show the model block for brevity:

model {
  // likelihood including constants
  if (!prior_only) {
    vector[Nsubgp_ystar_1] gp_pred_ystar_1 = gp(Xgp_ystar_1, sdgp_ystar_1[1], lscale_ystar_1[1], zgp_ystar_1);
    // initialize linear predictor term
    vector[N_ystar] mu_ystar = rep_vector(0.0, N_ystar);
    mu_ystar += Intercept_ystar + gp_pred_ystar_1[Jgp_ystar_1];
    target += normal_lpdf(Y_ystar | mu_ystar, sigma_ystar);
    target += bernoulli_logit_glm_lpmf(Y_y | Xc_y, Intercept_y, b_y);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zgp_ystar_1);
}

Ok, so it used a bernoulli regression model. I’ll see if I can adapt this approach to my own model above.

I made some attempts at this, but no joy. I think that there is an identifiability problem here:

model {
  // likelihood
  ystar ~ normal(x * beta + alpha, sigma);
  y ~ bernoulli_logit(ystar);  
}

I was wondering: is it possible to analytically integrate, for example, y with respect to beta here? I am unsure if this is possible with one equation being a normal and the other a bernoulli_logit?

There is definitely an identifiability problem when there’s an unstructured observation-level random effect and a bernoulli response. However, when that random effect is spatial, it might be possible to identify it.

So it’s expected that this won’t work:

But it’s still possible that a gaussian process prior on the residual component of ystar will work.


I’m not sure what this means - can you elaborate?

This

needs to be replaced with something like this

bf6 <- bf(y ~ gp(x1, x2), family = "bernoulli")

So that the only latent residuals are the spatial residuals modeled by the gaussian process, without an unstructured observation-level random effect.

But then I’m not solving the same model. My original Gaussian model is
y ~ multi_normal_cholesky(X*beta, sigma * cholesky_decompose(V));

But this is itself an analytically integrated GP, and what I’m most interested in here is V, which I use to reconstruct a latent factor in the un-integrated version of the original math. If I do what you suggest, it seems to me I’m fundamentally changing that.

Maybe what I’m trying to do is just not doable.

Two final points:

  1. So far in this thread, I don’t think we’ve actually discussed what happens when you try to just fit your desired model, as given in your second post. It’s not obvious to me that that model should have problems or isn’t fittable. What happens if you try to fit it?

  2. You’re probably right; however, I don’t yet see why your desired model is different from the model given by bf(y ~ gp(x1, x2), family = "bernoulli"), up to the replacement of the gaussian process term constructed by brms with the gaussian process term that you wish to use. The problems arise when you separately introduce a scale parameter sigma that is unstructured. When you write bf(ystar ~ gp(x1, x2), family = "gaussian"), you have a gaussian process term and an unstructured residual term. It sounds like you are telling me that you need to retain the unstructured residual term for some reason, and I believe you, but I don’t yet see why. As long as you need to retain this unstructured logit-scale residual, then you are right that this isn’t doable, absent a very strong prior on the variance of that residual. The key intuition can be encapsulated as follows (see also the sketch just after this list). Suppose you have a model with binary response data consisting of about half zeros and half ones. How is the model supposed to infer whether the true observation-wise probabilities are all .5 (so the logit-scale residual variance is minimal) versus whether the true observation-wise probabilities are all concentrated near 0 and 1 (under extremely high logit-scale residual variance)? As long as the intercept corresponds to a probability of roughly 0.5, the resulting data stream is entirely consistent with any conceivable residual variance. BUT, if the residual variance is structured such that it’s not possible to adjust the marginal variances independent of the covariances, then the non-identifiability might go away. You can reject extremely high (or low) covariances when you see data like [0, 1] (or [1, 1]). Thus, you should be able to add some gaussian process to the linear predictor, and everything might work out ok provided that you avoid also adding an unstructured logit-scale residual.
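
To make that intuition concrete, here is a quick, purely illustrative simulation (plain R, nothing model-specific):

set.seed(1)
n <- 1e5

# Minimal logit-scale residual variance: every probability is exactly 0.5
p_small <- plogis(rep(0, n))

# Enormous logit-scale residual variance: probabilities pile up near 0 and 1
p_large <- plogis(rnorm(n, mean = 0, sd = 10))

# The marginal binary data look essentially the same in both cases
mean(rbinom(n, 1, p_small))   # roughly 0.5
mean(rbinom(n, 1, p_large))   # roughly 0.5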

Sorry, I should have said “perhaps” 😅. I designed the toy model to have the same problems. But here is the result using the data and model in my second post:

With your second point… eh I think I’m missing something maybe.

Ok, I think I observed this in the posterior draws. I extracted ystar posterior samples and pushed them through an inv-logit function to get probabilities. I saw that, averaging over all draws, the mean was close to the probability of y, but within a single posterior draw the probabilities were typically pushed to 0 or 1.
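
Roughly what I did (a sketch; fit here stands for the fitted stanfit object from the model above):

draws <- rstan::extract(fit, pars = "ystar")$ystar   # draws x n_obs matrix
p_draws <- plogis(draws)                             # inv-logit to probabilities

colMeans(p_draws)     # averaged over all draws: close to the observed y
hist(p_draws[1, ])    # a single draw: probabilities pushed towards 0 and 1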

This is where you’re losing me. I’m not following this point about the residual variance being structured or unstructured. Do you mean there’s a GP prior on sigma as well as mu?


So your model for ystar has the form ystar = \mu + G + \epsilon, where \epsilon is the residual error, G is a Gaussian process given by sampling from a big multivariate normal with mean zero, and \mu is the rest of the linear predictor (i.e. alpha + x * beta). Since ystar is the linear predictor in a binomial regression, \epsilon effectively is an observation-level normal random effect. G can likewise be thought of as an observation-level random effect with an interesting covariance structure. Thinking of G in this way leads to an interpretation of the model wherein the linear predictor contains a fixed-effect component and then two latent residual terms (i.e. observation-level random effects): an unstructured one corresponding to \epsilon, and a structured one corresponding to G (if the gaussian process is over space, then we might say it’s a spatially structured residual). We might also say that the latent ystar has two associated residuals, one (\epsilon) with a normal prior with null covariance, and the other (G) with a gaussian-process prior.
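
Written out as a generative model, that is:

ystar_i = \mu_i + G_i + \epsilon_i
\epsilon_i \sim normal(0, \sigma)
G \sim multi_normal(0, K), with K the gaussian-process covariance matrix
y_i \sim bernoulli_logit(ystar_i)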

Binary response data cannot in general identify the unstructured random effect \epsilon, because the data do not normally contain sufficient information about the variance of the random effect. However, the data DO contain information about the variances in G, because these variances control the covariance in the observed response. To see what I mean here, consider again the case where \mu is near zero. If the variances in the gaussian process are small, then the response will be zeros and ones in equal proportion with minimal spatial structure. If the variances are very large, then pairs of observations whose G's are expected to covary strongly and positively will overwhelmingly tend to have the same value (both zeros or both ones). So the extent to which the response data shows the signature of the covariance in G identifies the variance in the Gaussian process in a way that is not possible for the unstructured random effect \epsilon.

Is any of that clarifying?

So, do I understand correctly that ϵ is not an explicit parameter in the code, but the leftover residuals from fitting the ystar submodel, and that when it is fed into the bernoulli_logit it doesn’t meet the assumptions of the bernoulli logit?

Yes, I think so. Just to be totally clear, we can declare e as a parameter and ystar as a transformed parameter and write something like ystar = mu + G + e, or we can declare ystar as a parameter and write something like ystar ~ normal(mu + G, sigma), in which case if desired we could recover e as a transformed parameter with e = ystar - (mu + G). Your code follows the ystar ~ normal(mu + G, sigma) route. In either case, the problem is that the linear predictor in the bernoulli logit contains the e term and not just the mu + G term.
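
Just as an illustration of the first route, in the simple toy model from the top of the thread (which has mu = alpha + x * beta but no G term) it would look something like this; note it has exactly the same identifiability problem, since ystar still carries e:

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int y[N];
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
  vector[N] e;                          // unstructured residual, declared explicitly
}
transformed parameters {
  // a structured term G would be added here in the GP version
  vector[N] ystar = x * beta + alpha + e;
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ inv_gamma(1, 1);
  e ~ normal(0, sigma);
  y ~ bernoulli_logit(ystar);           // the linear predictor still contains e
}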

But as far as I can see, your code from your second post does NOT contain this extra e term, which is why I think it might be promising to tinker with that model to see if it will yield a reliable computation.


Ok, really great tips, thanks. Such considerations had not occurred to me. I will experiment and get back with learnings :)


Although… can I get a realization of G in the transformed parameters block? I can’t use _rng functions there, right?