Changing regression response variable from gaussian to logistic

I’m trying to adapt a complicated model from a continuous response variable to a binary one using a logit link, and I’m having some trouble (I believe I’m missing something).

I’ve made a simple linear regression model to try things out with:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma);  // likelihood
}

Ok, this fits toy data just fine with no issues:

library(rstan)

data(mtcars)

m <- 32
y <- sample(c(0, 1), size = m, replace = TRUE)
dat <- list(N = m, y = y, K = 3, x = as.matrix(mtcars[, c("disp", "hp", "wt")]))


fit <- stan(file = "regression.stan",
            data = dat, iter = 2012, chains = 4, cores = 4)

So I tried to edit this to a bernoulli_logit for the binary outcome:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  int y[N];      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
  
  vector[N] ystar;
}
model {
  // likelihood
  ystar ~ normal(x * beta + alpha, sigma);
  y ~ bernoulli_logit(ystar);  
}

I’m getting thousands of max tree depth warnings and pairs plots look like this:

I suspect I’m missing something with the model specification, but I can’t think what. Can anyone help?
Thanks

Edit. I tried adding priors as follows:

  //priors
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ inv_gamma(1,1);

which shows up some funnels, but I’m at a loss what to do about it:

Observation-level random effects are not well identified in logistic regression with binary outcomes. The traditional logistic regression would remove your parameter sigma and the normal sampling statement.

Edit: Alternatively, you need a strongly informative prior on sigma and a non-centered parameterization of ystar ~ normal(x * beta + alpha, sigma), but this only makes sense if you truly have quite strong prior information about the true value of sigma.
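
For concreteness, a minimal sketch of that alternative for the toy model above (the data block stays the same; the normal(0.5, 0.1) prior on sigma is purely a placeholder for whatever strong prior information you actually have):

parameters {
  real alpha;                // intercept
  vector[K] beta;            // coefficients for predictors
  real<lower=0> sigma;       // error scale
  vector[N] ystar_raw;       // standardized latent values (non-centered)
}
transformed parameters {
  // equivalent to ystar ~ normal(x * beta + alpha, sigma)
  vector[N] ystar = x * beta + alpha + sigma * ystar_raw;
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ normal(0.5, 0.1);  // placeholder: a strongly informative prior is essential here
  ystar_raw ~ std_normal();
  y ~ bernoulli_logit(ystar);
}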


Ok, interesting. Alas, I’m not sure how to use either approach with my full model, which is a form of GP regression.

Some toy data:

data1 <- list(n_obs = 20L, n_exp = 3L, n_cov = 1,
            y = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1), 
            Z = structure(c(0.235, -0.331, -0.312, -2.302, -0.171, 0.14, -1.497, -1.01, -0.948, 
                           -0.494, -0.174, -0.407, 1.846, 0.394, 0.798, -1.567, -0.086, 
                           -0.359, -1.194, 0.364, 0.362, 0.347, 0.19, -0.16, 0.327, 0.598, 
                           -1.842, 2.718, 0.191, -1.301, -3.113, -0.941, 1.4, -1.62, -2.266, 
                            1.163, -0.116, 0.334, -0.621, -1.31, -1.176, -1.121, -1.362, 
                            0.481, 0.742, 0.028, 0.331, 0.644, 2.486, 1.96, 0.192, 1.553, 
                            0.914, 0.359, 0.175, -0.847, 0.978, 1.806, 0.123, -0.13), dim = c(20L, 3L),
            dimnames = list(NULL, c("z1", "z2", "z3"))),
            X = structure(c(2.485, 5.73, 3.675, -0.182, 5.817, 2.208, 0.625, -0.017, 2.338, 5.451, 
                            5.002, 3.708, -2.155, 3.089, 1.33, 1.884, 1.726, 2.612, 3.169, 
                            3.579), dim = c(20L, 1L)))

Gaussian version of the model: gpgaussian.stan (1.1 KB)

That runs just fine (note I attempted a non-centered version, but it had a lot of problems). For my attempt at a binomial version, I changed the parameters and model blocks as follows:

parameters {
    real<lower=0.0> sigma;
    real<lower=0.0> lambda;
    vector[n_cov] beta;
    vector[n_obs] ystar;

    vector<lower=0.0>[n_exp] r;
}

model {
    matrix[n_obs, n_obs] V;

    V = add_diag(lambda * exp(-cov_Kzr(Z, r)), 1.0);

    // Priors
    sigma ~ inv_gamma(1, 0.25);
    lambda ~ inv_gamma(10, 10);
    r ~ cauchy(0, 0.1);
    beta ~ std_normal();

    // Likelihood
    ystar ~ multi_normal_cholesky(X*beta, sigma * cholesky_decompose(V));
    y ~ bernoulli_logit( ystar );
}

I wasn’t sure if/how I could drop the multi_normal_cholesky() without fundamentally altering the model.

Using a GP prior changes things because of the way it pools information across space. It would be worth taking a standard GP model and using that as the linear predictor in a logistic regression, and seeing how the result behaves. This is a rare situation where building up from a simpler model doesn’t work well.

Note that you can easily construct this model in brms using a gp term; I’d recommend giving that a try if your implementation continues to have problems, just in case the brms implementation works right out of the box. brms also gives you access to fast approximate GPs, in case that’s of interest.
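
Something along these lines, for example (a sketch only; my_data, x1, and x2 are placeholders for your data frame and inputs):

library(brms)

# Exact GP term in the linear predictor of a bernoulli model
fit_gp <- brm(y ~ gp(x1, x2), data = my_data, family = bernoulli())

# Approximate GP via a basis-function expansion; k sets the number of basis functions
fit_gp_approx <- brm(y ~ gp(x1, x2, k = 20), data = my_data, family = bernoulli())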


@jroon I might be missing something, but do you just want to fit a logistic regression to a known binary response variable? That’s what it looks like to me, rather than what you currently have, which imputes an unknown logit-scale parameter/variable called ystar and is a more complex model.

A simple logistic regression for your binary data would be:

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  int y[N];      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
}
transformed parameters {
  vector[N] ystar = alpha + x * beta;  // linear predictor (deterministic given alpha, beta)
}
model {
  // likelihood
  y ~ bernoulli_logit(ystar);
}
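
If you don’t need ystar downstream, you could equivalently drop the transformed parameters block and write the likelihood with Stan’s bernoulli_logit_glm:

model {
  // likelihood, equivalent to the above
  y ~ bernoulli_logit_glm(x, alpha, beta);
}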

Apologies if you do actually want to estimate the latent sigma here, I just thought I’d clarify this!

EDIT: @jsocolar Sorry, just realised this was part of your original reply!


As I have a custom covariance function, I did not think brms would facilitate this model?

Yup I need the latent I’m afraid.


Yeah, that’s right.

Although, perhaps I could get some clues using a brms supported cov function. I’ll play with this later, thanks for the idea!

Edit: Ok I had some time to look at this as follows:

library(brms)

dat <- mgcv::gamSim(1, n = 30, scale = 2)
dat$ystar <- dat$y
dat$y <- rbinom(nrow(dat), 1, 1/(1 + exp(-dat$ystar)))   # pass through an inv-logit function

bf5 <- bf(ystar ~ gp(x1, x2), family = "gaussian") +
          bf(y ~ ystar, family = "bernoulli")
fit5 <- brm(bf5, dat, chains = 2)

This model threw a small number of divergences, but ok, I wanted to see the generated Stan code, and that proved interesting. I’ll just show the model block for brevity:

model {
  // likelihood including constants
  if (!prior_only) {
    vector[Nsubgp_ystar_1] gp_pred_ystar_1 = gp(Xgp_ystar_1, sdgp_ystar_1[1], lscale_ystar_1[1], zgp_ystar_1);
    // initialize linear predictor term
    vector[N_ystar] mu_ystar = rep_vector(0.0, N_ystar);
    mu_ystar += Intercept_ystar + gp_pred_ystar_1[Jgp_ystar_1];
    target += normal_lpdf(Y_ystar | mu_ystar, sigma_ystar);
    target += bernoulli_logit_glm_lpmf(Y_y | Xc_y, Intercept_y, b_y);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zgp_ystar_1);
}

Ok, so it used a bernoulli regression model. I’ll see if I can adapt this approach to my own model above.

I made some attempts at this, but no joy. I think that there is an identifiability problem here:

model {
  // likelihood
  ystar ~ normal(x * beta + alpha, sigma);
  y ~ bernoulli_logit(ystar);  
}

I was wondering: is it possible to analytically integrate, for example, y with respect to beta here? I am unsure if this is possible with one equation being a normal and the other a bernoulli_logit?

There is definitely an identifiability problem when there’s an unstructured observation-level random effect and a bernoulli response. However, when that random effect is spatial, it might be possible to identify it.

So it’s expected that this won’t work:

But it’s still possible that a gaussian process prior on the residual component of ystar will work.


I’m not sure what this means - can you elaborate?

This

needs to be replaced with something like this

bf6 <- bf(y ~ gp(x1, x2), family = "bernoulli")

So that the only latent residuals are the spatial residuals modeled by the gaussian process, without an unstructured observation-level random effect.

But then I’m not solving the same model. My original Gaussian model is
y ~ multi_normal_cholesky(X*beta, sigma * cholesky_decompose(V));

But this is itself an analytically integrated GP, and what I’m most interested in here is V, which I use to reconstruct a latent factor in the un-integrated version of the original math. If I do what you suggest, it seems to me I’m fundamentally changing that.

Maybe what I’m trying to do is just not doable.

Two final points:

  1. So far in this thread, I don’t think we’ve actually discussed what happens when you try to just fit your desired model, as given in your second post. It’s not obvious to me that that model should have problems or isn’t fittable. What happens if you try to fit it?

  2. You’re probably right; however, I don’t yet see why your desired model is different from the model given by bf(y ~ gp(x1, x2), family = "bernoulli"), up to the replacement of the gaussian process term constructed by brms with the gaussian process term that you wish to use. The problems arise when you separately introduce a scale parameter sigma that is unstructured. When you write bf(ystar ~ gp(x1, x2), family = "gaussian"), you have a gaussian process term and an unstructured residual term. It sounds like you are telling me that you need to retain the unstructured residual term for some reason, and I believe you, but I don’t yet see why. As long as you need to retain this unstructured logit-scale residual, then you are right that this isn’t doable, absent a very strong prior on the variance of that residual. The key intuition can be encapsulated as follows (see also the sketch just after this list). Suppose you have a model with binary response data consisting of about half zeros and half ones. How is the model supposed to infer whether the true observation-wise probabilities are all .5 (so the logit-scale residual variance is minimal) versus whether the true observation-wise probabilities are all concentrated near 0 and 1 (under extremely high logit-scale residual variance)? As long as the intercept corresponds to a probability of roughly 0.5, the resulting data stream is entirely consistent with any conceivable residual variance. BUT, if the residual variance is structured such that it’s not possible to adjust the marginal variances independent of the covariances, then the non-identifiability might go away. You can reject extremely high (or low) covariances when you see data like [0, 1] (or [1, 1]). Thus, you should be able to add some gaussian process to the linear predictor, and everything might work out ok provided that you avoid also adding an unstructured logit-scale residual.
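
To make that intuition concrete, here is a quick, purely illustrative simulation (plain R, nothing model-specific):

set.seed(1)
n <- 1e5

# Minimal logit-scale residual variance: every probability is exactly 0.5
p_small <- plogis(rep(0, n))

# Enormous logit-scale residual variance: probabilities pile up near 0 and 1
p_large <- plogis(rnorm(n, mean = 0, sd = 10))

# The marginal binary data look essentially the same in both cases
mean(rbinom(n, 1, p_small))   # roughly 0.5
mean(rbinom(n, 1, p_large))   # roughly 0.5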

Sorry, I should have said “perhaps” 😅. I designed the toy model to have the same problems. But here is the result using the data and model in my second post:

With your second point… eh I think I’m missing something maybe.

Ok, I think I observed this in the posterior draws. I extracted ystar posterior samples and pushed them through an inv-logit function to get probabilities. I saw that, averaging over all draws, the mean was close to the probability of y, but within a single posterior draw the probabilities were typically pushed to 0 or 1.
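
Roughly what I did (a sketch; fit here stands for the fitted stanfit object from the model above):

draws <- rstan::extract(fit, pars = "ystar")$ystar   # draws x n_obs matrix
p_draws <- plogis(draws)                             # inv-logit to probabilities

colMeans(p_draws)     # averaged over all draws: close to the observed y
hist(p_draws[1, ])    # a single draw: probabilities pushed towards 0 and 1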

This is where you’re losing me. I’m not following this point about the residual variance being structured or unstructured. Do you mean there’s a GP prior on sigma as well as mu?


So your model for ystar has the form ystar = \mu + G + \epsilon, where \epsilon is the residual error, G is a Gaussian process given by sampling from a big multivariate normal with mean zero, and \mu is the rest of the linear predictor (i.e. alpha + x * beta). Since ystar is the linear predictor in a binomial regression, \epsilon effectively is an observation-level normal random effect. G can likewise be thought of as an observation-level random effect with an interesting covariance structure. Thinking of G in this way leads to an interpretation of the model wherein the linear predictor contains a fixed-effect component and then two latent residual terms (i.e. observation-level random effects): an unstructured one corresponding to \epsilon, and a structured one corresponding to G (if the gaussian process is over space, then we might say it’s a spatially structured residual). We might also say that the latent ystar has two associated residuals, one (\epsilon) with a normal prior with null covariance, and the other (G) with a gaussian-process prior.
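
Written out as a generative model, that is:

ystar_i = \mu_i + G_i + \epsilon_i
\epsilon_i \sim normal(0, \sigma)
G \sim multi_normal(0, K), with K the gaussian-process covariance matrix
y_i \sim bernoulli_logit(ystar_i)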

Binary response data cannot in general identify the unstructured random effect \epsilon, because the data do not normally contain sufficient information about the variance of the random effect. However, the data DO contain information about the variances in G, because these variances control the covariance in the observed response. To see what I mean here, consider again the case where \mu is near zero. If the variances in the gaussian process are small, then the response will be zeros and ones in equal proportion with minimal spatial structure. If the variances are very large, then pairs of observations whose G's are expected to covary strongly and positively will overwhelmingly tend to have the same value (both zeros or both ones). So the extent to which the response data shows the signature of the covariance in G identifies the variance in the Gaussian process in a way that is not possible for the unstructured random effect \epsilon.

Is any of that clarifying?

So, do I understand correctly that ϵ is not an explicit parameter in the code, but the leftover residuals from fitting the ystar submodel, and that when it is fed into the bernoulli_logit it doesn’t meet the assumptions of the bernoulli logit?

Yes, I think so. Just to be totally clear, we can declare e as a parameter and ystar as a transformed parameter and write something like ystar = mu + G + e, or we can declare ystar as a parameter and write something like ystar ~ normal(mu + G, sigma), in which case if desired we could recover e as a transformed parameter with e = ystar - (mu + G). Your code follows the ystar ~ normal(mu + G, sigma) route. In either case, the problem is that the linear predictor in the bernoulli logit contains the e term and not just the mu + G term.
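
Just as an illustration of the first route, in the simple toy model from the top of the thread (which has mu = alpha + x * beta but no G term) it would look something like this; note it has exactly the same identifiability problem, since ystar still carries e:

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int y[N];
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
  vector[N] e;                          // unstructured residual, declared explicitly
}
transformed parameters {
  // a structured term G would be added here in the GP version
  vector[N] ystar = x * beta + alpha + e;
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ inv_gamma(1, 1);
  e ~ normal(0, sigma);
  y ~ bernoulli_logit(ystar);           // the linear predictor still contains e
}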

But as far as I can see, your code from your second post does NOT contain this extra e term, which is why I think it might be promising to tinker with that model to see if it will yield a reliable computation.


Ok, really great tips, thanks. Such considerations had not occurred to me. I will experiment and get back with learnings :)


Although… can I get a realization of G in the transformed parameters block? I can’t use _rng functions there, right?