# GLM with custom link function

Hi everyone,

I am trying to fit an admittedly very simple model, but cannot get it to work. I have used Stan in the past, but don’t have much experience with Bayesian modeling, so I would be very happy to be pointed to the relevant sections of the Stan documentation / other literature.

The model I’m trying to fit can be described as an extended logistic regression. For a standard logistic regression, my model block would look like this (I’m aware that the bernoulli_logit function would probably be a better choice):

model {
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}


This is fine (in fact, it works as expected, i.e. produces results similar to MLE when using uniform priors), but now I want to introduce two more parameters into my link function, \gamma_l and \gamma_r. In the context of a (simple) logistic regression, these allow data points at the very ends of the sigmoid curve to fall in the “wrong” category with a certain probability (see figure).

The corresponding link function looks like this:

g(\mu,\gamma_l,\gamma_r) = \log{(\frac{\gamma_r-\mu}{\gamma_l + \mu - 1})},
g(\mu,\gamma_l,\gamma_r)^{-1} = (1 - (\gamma_l + \gamma_r)) \frac{1}{1 + e^{-\mu}} + \gamma_r,
which yields the following model:

model {
  for (n in 1:N)
    y[n] ~ bernoulli((1 - (gamma_l + gamma_r)) * 1/(1 + exp(-(alpha + beta * x[n]))) + gamma_r);
}
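As a quick sanity check on the two formulas above (this snippet is my own illustration, not from the original post), the link g and its inverse g^{-1} should undo each other, and the inverse link should map the whole real line into the interval (\gamma_r, 1 - \gamma_l):

```python
import numpy as np

def inv_link(eta, gamma_l, gamma_r):
    # g^{-1}: squashes the linear predictor into (gamma_r, 1 - gamma_l)
    return (1 - (gamma_l + gamma_r)) / (1 + np.exp(-eta)) + gamma_r

def link(mu, gamma_l, gamma_r):
    # g: maps a probability back to the linear predictor (same form as the post)
    return np.log((gamma_r - mu) / (gamma_l + mu - 1))

# Round trip: g(g^{-1}(eta)) should recover eta for illustrative gamma values
eta = np.linspace(-5, 5, 11)
mu = inv_link(eta, 0.05, 0.02)
roundtrip = link(mu, 0.05, 0.02)
```

The gamma values (0.05, 0.02) here are arbitrary; any pair with \gamma_l + \gamma_r < 1 behaves the same way.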


However, this does not work. During fitting I get many warnings of the form “Informational Message: The current Metropolis proposal is about to be rejected because of the following issue: Exception: bernoulli_lpmf: Probability parameter is x, but must be in the interval [0, 1]”, and when sampling finally finishes, the parameter estimates are way off compared to the same model fit by MLE.

1. Why does this not work? I realize that this is a rather unspecific question, so again, I would be happy to be pointed to the relevant sections in the Stan documentation.
2. What would be good priors for my two parameters here, considering that \gamma_l and \gamma_r should be constrained between [0,1]? (I could use <upper=0, upper=1> but the compiler says that it is not advised to do so…). For my data, I would expect them to be relatively close to 0.

My full model in pystan (with uniform priors) looks like this. I’m using the breast cancer dataset from sklearn here for illustrative purposes, because it produces the same problems as my actual data.

import stan
import sklearn.datasets as datasets

model_code = """
data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0,upper=1> y[N];
}

parameters {
  real alpha;
  real beta;
  real gamma_l;
  real gamma_r;
}

model {
  for (n in 1:N)
    y[n] ~ bernoulli((1 - (gamma_l + gamma_r)) * 1/(1 + exp(-(alpha + beta * x[n]))) + gamma_r);
}
"""
X, y = datasets.load_breast_cancer(return_X_y=True)
stan_data = {"N": len(y),
             "y": y,
             "x": X[:, 0]}

model = stan.build(model_code, data=stan_data, random_seed=1)
fit = model.sample(num_chains=4, num_samples=1000)


Any help would be appreciated!


I’ve outlined one approach below using an ordered vector that corresponds to logits of the bounds. When I tested this with some fake data, it had a lot of divergences, so you will probably need to put some sensible priors on the bounds to help stabilize the estimates. I suspect your results will be highly sensitive to how much data you have in approaching the bounds.

data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0,upper=1> y[N];
}
parameters {
  real alpha;
  real beta;
  ordered[2] gamma_raw;
}
transformed parameters {
  vector[2] gamma = inv_logit(gamma_raw);
}
model {
  real lb = gamma[1];
  real ub = gamma[2];
  real scale = 1 - lb - (1 - ub);

  for (n in 1:N) {
    real unscaled_prob = inv_logit(alpha + beta * x[n]);
    y[n] ~ bernoulli(scale * unscaled_prob + lb);
  }
}
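The reason the ordered-vector trick works can be seen in a couple of lines of Python (my own sketch, not part of the Stan model): inv_logit is strictly increasing, so an ordered gamma_raw always yields lb < ub, which rules out the label-swapped mode where the bounds trade places.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Any ordered pair, as Stan's ordered[2] type enforces
gamma_raw = [-3.0, 2.5]
lb, ub = (inv_logit(g) for g in gamma_raw)
scale = 1 - lb - (1 - ub)  # simplifies to ub - lb, the width of the admissible band
```

Because inv_logit maps into (0, 1) and preserves order, lb and ub are valid, ordered probabilities for any unconstrained ordered pair.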


An alternative is to use a simplex to divide the [0,1] space into three bins. Here, the first element of gamma_raw corresponds to the space below the lower bound; the second element corresponds to the space between the bounds; and the third element corresponds to the space beyond the upper bound. So the lower bound is located at gamma_raw[1] and the upper bound is located at gamma_raw[1] + gamma_raw[2]. You can then put priors directly on the sizes of those bins on the probability scale rather than on logits.

data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0,upper=1> y[N];
}
transformed data {
  vector[3] gamma_prior;
  gamma_prior[1] = 5;
  gamma_prior[2] = 10;
  gamma_prior[3] = 5;
}
parameters {
  real alpha;
  real beta;
  simplex[3] gamma_raw;
}
transformed parameters {
  vector[2] gamma;
  gamma[1] = gamma_raw[1];
  gamma[2] = gamma_raw[1] + gamma_raw[2];
}
model {
  real lb = gamma[1];
  real ub = gamma[2];
  real scale = 1 - lb - (1 - ub);
  gamma_raw ~ dirichlet(gamma_prior);

  for (n in 1:N) {
    real unscaled_prob = inv_logit(alpha + beta * x[n]);
    y[n] ~ bernoulli(scale * unscaled_prob + lb);
  }
}
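The simplex-to-bounds mapping can be sketched outside Stan as well (my own illustration; the names mirror the Stan code above, and the Dirichlet concentration matches gamma_prior):

```python
import numpy as np

rng = np.random.default_rng(1)
# Three bin sizes that sum to 1, like simplex[3] gamma_raw under dirichlet([5, 10, 5])
gamma_raw = rng.dirichlet([5, 10, 5])
lb = gamma_raw[0]                  # lower bound = size of the first bin
ub = gamma_raw[0] + gamma_raw[1]   # upper bound = first two bins combined
```

Any draw from the simplex gives bounds that are automatically ordered and inside (0, 1), so no extra constraint is needed.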


You definitely need these constraints (modified to read <lower=0, upper=1>; I assume this was a typo). I wouldn’t think that the compiler should warn about this (did the typo in your post make it into your Stan model?), and I’d be curious to know what warning you are seeing.

There will still be an identifiability issue since you could swap the bounds (i.e. let \gamma_r exceed 1-\gamma_l) and negate the slope to recover the same model. That’s what @simonbrauer’s excellent solutions are designed to avoid, by enforcing an ordering of the bounds. Another solution, given that you expect both gammas to be near zero, is to hard-constrain both to be less than 0.5 (i.e. <lower=0, upper=0.5>), so that there’s no way to negate the slope via the bounds.

Also a small aside: note that overly vague priors on logit-scale parameters tend to do weird things, and might be expected to give results that are nontrivially different from MLEs. The issue is that vague priors on the logit induce pushforwards on the inverse-logit (i.e. the probability) that are very heavily concentrated near zero and one.


Just wanted to let you guys know that your replies were very helpful - thanks!


Cool method! @marcpabst or @simonbrauer would you be able to recommend some literature on it?

Using the breast cancer example above, the simplex method worked well. Do I read the estimates of \gamma_l, \gamma_r as misclassification rates? For this example I got 1% and 2% rates on the lower and upper ends, respectively. Would I interpret that as 1% of the malignant tumors are misclassified?

data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  real beta;
  simplex[3] gamma_raw;
}
transformed parameters {
  vector[2] gamma = [gamma_raw[1], gamma_raw[1] + gamma_raw[2]]';
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  y ~ bernoulli((1 - gamma[1] - (1 - gamma[2])) * inv_logit(alpha + beta * x) + gamma[1]);
}
generated quantities {
  vector[N] probs = (1 - gamma[1] - (1 - gamma[2])) * inv_logit(alpha + beta * x) + gamma[1];
}


Unfortunately I don’t know of any literature on this method. It is a pragmatic solution I came up with a while back for another project.

I would interpret them as bounds on our confidence in the tails of the logit scale. If both are 0 (as in the standard logistic model), then we are assuming that we could asymptotically predict with 100% certainty if we had the right set of predictors. In your example, a lower bound of 1% suggests that the best we can do is 99% accuracy in the lower tail of the logit scale. They do not correspond to some constant change in misclassification.
