Force brms to run even with invalid response variables

Is there a way to force brm to ignore invalid response values when running logistic regression (or in general)?

library(brms)

data_df <-
  data.frame(y=c(0.1, 0.2, 0.3, 1.1), x=c(1, 2, 3, 0))

logit_post <- brm(y ~ x - 1,
                  family = bernoulli(link = "logit"),
                  data = data_df)

# Error: Family 'bernoulli' requires responses to contain only two different values.

For example, the above posterior is perfectly well-defined, and I would like brms to blithely put the y I gave it into the standata object and run posterior sampling as it normally would had had the y been binary.

It’s beyond the scope of the present question to explain why I want to do this, but I would ask that respondents give me the benefit of the doubt that I know what I’m doing and have good reasons.

Please also provide the following information in addition to your question:

  • Operating System: Ubuntu 22.04
  • brms Version: 2.22.0

bernoulli_lpmf(0.1 | theta)literally isn’t defined. See Binary Distributions from the stan functions reference.

If brms did this, you’d get a bunch of errors from Stan about invalid arguments to bernoulli_lpmf.

What behavior do you want from this model?

3 Likes

Sounds like you’re saying that due to type restrictions in Stan, this is not possible as implemented, which is understandable and fair.

Formally, the log probability of a y, x pair with parameter \\theta in a GLM with natural link function is given (up to a constant) by y \\theta^T x - A(\\theta^T x) where A(\\cdot) is the log partition function of the corresponding exponential family. This can produce valid posteriors over \\theta even when y is not in its nominal domain — formally this would require modification of the exponential family carrier density, but Stan doesn’t need to know about that, and could simply sample from a posterior whose log joint has the corresponding gradients. But I can see how brms might not be implemented in a way that would allow this.

Sorry, there seems to be a discord bug and I can’t edit the math errors in my last post. The equations should be, respectively, y \theta^T x - A(\theta^T x), A(\cdot), and \theta. (I don’t see a way to preview my latex rendering so I’ll just leave it out the dollar signs.)

The issue isn’t so much that Stan is strongly typed as that brms understands family = bernoulli to indicate a regression in which the response is distributed as a Bernoulli random variable.

Note that brms supports user-defined custom families, so if you want to define a family with a continuous response whose lpdf for observation y and parameter theta is as you desire, that would be straightforward to do.

Something like the following, which is untested and was created in part by ChatGPT

library(brms)

# Custom family: parameter is mu in (0,1), with a logit link to the linear predictor.
bern_pseudo_logit <- custom_family(
  "bern_pseudo_logit",
  dpars  = "mu",
  links  = "logit",   # linear predictor eta; mu = inv_logit(eta)
  type   = "real"     # allow y to be any real (pseudo-likelihood)
)

# Stan function: lpdf for one observation.
# Uses an equivalent, numerically stable form in terms of mu:
#   y*log(mu) + (1 - y)*log(1 - mu)
stan_funs <- "
functions {
  real bern_pseudo_logit_lpdf(real y, real mu) {
    // mu is in (0,1) because brms applies the inverse-logit to the linear predictor.
    return y * log(mu) + (1 - y) * log1m(mu);
  }
  // No RNG on purpose: not a proper density over R, so posterior_predict isn't defined.
}
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

# ---- Example usage ----
set.seed(1)
N  <- 100
x  <- rnorm(N)
b0 <- 0.3
b1 <- 1.2
eta <- b0 + b1 * x
p   <- plogis(eta)
y   <- p + rnorm(N, 0, 0.05)  # real-valued "responses" to demonstrate pseudo-lik

dat <- data.frame(y, x)

fit <- brm(
  bf(y ~ 1 + x),
  data     = dat,
  family   = bern_pseudo_logit,
  stanvars = stanvars,
  prior = c(
    prior(normal(0, 2.5), class = "b"),
    prior(normal(0, 2.5), class = "Intercept")
  ),
  chains = 4, iter = 2000, seed = 123
)

print(fit)

1 Like

Thank you, on the face of it this looks like it will do the trick! I’ll respond once I’ve tried it.

One quick question — does the brms automatic prior selection interact at all with the family selection? That is, if one does not specify the prior argument in your example, will you get the same priors as if you had specified family=bernoulli?

The following slight modification of the above code appears to run:

data_df <-
  data.frame(y=c(0.1, 0.2, 0.3, 1.1), x=c(1, 2, 3, 0), w=rep(1, 4))

bern_pseudo_logit <- custom_family(
  "bern_pseudo_logit",
  dpars  = "mu",
  links  = "logit",   # linear predictor eta; mu = inv_logit(eta)
  type   = "real"     # allow y to be any real (pseudo-likelihood)
)

stan_funs <- "
  real bern_pseudo_logit_lpdf(real y, real mu) {
    return y * log(mu) + (1 - y) * log1m(mu);
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

brm(y ~ x - 1,
    data=data_df,
    family=bern_pseudo_logit,
    stanvars=stanvars)

I’ll do some tests to make sure it’s computing the correct posterior and return to this thread. But thanks for the quick suggestion!

Also known as continuous Bernoulli? A thread and a blog post at Brms custom family continuous bernoulli

2 Likes

Nice connection! It happens that this is not what I mean in this case, since I do not want to use the continuous Bernoulli normalizing constant for y.

Maybe I should elaborate a little on the difference in case anyone else reads this. I do not mean to model the response y as being continuous-valued in [0,1], since if I did then my assertion that the carrier density H(y) in the exponential family doesn’t depend on the parameter would not be true. (I think this is described well in the chain of posts you linked to.) I actually mean for y to be dichotomous, but with values that are different for each observation (because they depend on the regressors), though with the same link function. In other words, in my model, each dichotomous observation’s mean function is modeled the same way by the GLM, though its domain is in fact different from observation to observation as a function of the regressors.

Again, going into why I would do this is beyond the scope of this thread, but this is a simple and clean way of approximating a regressor-aligned mean shift in binary data, without bothering to construct the corresponding randomized binary dataset.

In general no, with some exceptions. The prior depends on the parameter class (e.g. intercepts, coefficients, random effect covariances, etc), but not on the family. However, there are cases where priors on shape parameters for response distributions depend on the family, even when they are treated as being of the same internal brms parameter class. See for example Updated default prior for phi to inv_gamma(0.4, 0.3) by jan-joh · Pull Request #1822 · paul-buerkner/brms · GitHub

Edit: and there could well be other cases of parameter classes whose default priors depend on the family, that I’m not explictly aware of.

I’m not sure I follow what it is you’re doing, but Stan imposes tight restrictions on GLMs like Bernoulli in terms of what data input it accepts.

What I’d recommend is writing out a custom log-likelihood (which isn’t hard in Stan) that you can then overload as needed.

You could always add it to brms as a custom family or just model it in “pure” Stan.

1 Like