How to implement custom family with `integrate_1d`

I am trying to have a model that can estimate the sampling bias in a dataset. To do so, I need to create a custom PDF in brms, which is a combination of an exponential distribution and a logistic function:

p(obs) \sim Exp(\lambda) \cdot Logistic(\mu, \sigma)

This custom PDF needs to be normalized so that the density is 1. As I cannot find a close form solution, I’d like to use Stan numerical integration integrate_1d, but I struggle in implementing this routine.

My code, so far is:


biasexp <- custom_family(
  "biasexp",
  dpars = c("mu", "muBias", "sigmaBias"),
  links = c("log", "identity", "log"),
  lb = c(0, NA, 0),
  type = "real"
)

stan_funs <- "
  real biasexp_lpdf(real x, real mu, real muBias, real sigmaBias) {

    real log_exp_pdf = exponential_lpdf(x | mu);
    real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
    real log_normalizer = log(integral(mu, muBias, sigmaBias));

    return log_exp_pdf + log_logistic_cdf - log_normalizer;
  }

  real integrand(real x, real xc, real[] theta, real[] x_r, int[] x_i) {
    real mu = theta[1];
    real muBias = theta[2];
    real sigmaBias = theta[3];

    return exp(exponential_lpdf(x | mu)) * logistic_cdf(x, muBias, sigmaBias);
  }

  real integral(real mu, real muBias, real sigmaBias) {
    return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, 0.001);
  }
"

and I do not understand how to deal with x_r and x_i.

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  0
Semantic error in 'string', line 17, column 11 to column 92:
   -------------------------------------------------
    15:    }
    16:    real integral(real mu, real muBias, real sigmaBias) {
    17:      return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, 0.001);
                    ^
    18:    }
    19:  }
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_1d':
(<F1>, real, real, array[] real, real)
where F1 = (real, real, array[] real, array[] real, array[] int) => real
Available signatures:
(<F2>, real, real, array[] real, data array[] real, data array[] int,
 data real) => real
where F2 = (real, real, array[] real, data array[] real, data array[] int) => real
  Expected 7 arguments but found 5 arguments.
(<F2>, real, real, array[] real, data array[] real, data array[] int) => real
  Expected 6 arguments but found 5 arguments.

If I create empty, local variables for x_i and x_r

real integral(real mu, real muBias, real sigmaBias) {
    real x_r[0];
    int x_i[0];
    return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
  }

then I get

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  0
Semantic error in 'string', line 19, column 11 to column 102:
   -------------------------------------------------
    17:      real x_r[0];
    18:      int x_i[0];
    19:      return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
                    ^
    20:    }
    21:  }
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_1d':
(<F1>, real, real, array[] real, array[] real, array[] int, real)
where F1 = (real, real, array[] real, array[] real, array[] int) => real
Available signatures:
(<F2>, real, real, array[] real, data array[] real, data array[] int,
 data real) => real
where F2 = (real, real, array[] real, data array[] real, data array[] int) => real
  The 5th argument must be data-only. (Local variables are assumed to depend
  on parameters; same goes for function inputs unless they are marked with
  the keyword 'data'.)
(<F2>, real, real, array[] real, data array[

How could I solve the issue?


To give more context: I wish to estimate the full, partially observed distribution (light gray), from a observed sample (dark gray), assuming that the sampling bias is a logistic function (red curve)

library(tidyverse)
library(brms)

SEED <- 20241022
set.seed(SEED)

N  <- 1000
X_MAX <- 10
X  <- seq(0, X_MAX, by=0.01)
RATE  <- 1

# Complete data
df_all <- tibble(x = rexp(N, rate = RATE))

BIAS_MEAN  <- 3
BIAS_SD  <- 0.5

# Observed data
df_obs  <- tibble(
    x = df_all$x,
    w = plogis(x, location = BIAS_MEAN, scale = BIAS_SD), # probability of observing
    treshold = runif(N) # probability of sampling
    )  %>%
    filter(w > treshold) # observed the points that are also sampled

df_all %>%
    ggplot() +
    geom_histogram(aes(x = x, y=after_stat(count), fill = 'p(x)'), alpha = 0.5, position = "identity", breaks=seq(0, X_MAX, 0.5)) +
    geom_histogram(aes(x = x, y=after_stat(count), fill = 'p(obs)'), position = "identity", breaks=seq(0, X_MAX, 0.5), data = df_obs) +
    geom_line(aes(x=x, y=w*100, color='p(obs|x)'), linewidth=2, data=df_obs) +
    labs(color = NULL, fill = NULL) +
    scale_fill_manual(values=c("p(x)" = "gray55", 'p(obs)'="gray25")) +
    coord_cartesian(xlim = c(0, X_MAX)) +
    theme(text = element_text(size=20))

biasexp <- custom_family(
  "biasexp",
  dpars = c("mu", "muBias", "sigmaBias"),
  links = c("log", "identity", "log"),
  lb = c(0, NA, 0),
  type = "real"
)

stan_funs <- "
  real biasexp_lpdf(real x, real mu, real muBias, real sigmaBias) {

    real log_exp_pdf = exponential_lpdf(x | mu);
    real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
    real log_normalizer = log(integral(mu, muBias, sigmaBias));

    return log_exp_pdf + log_logistic_cdf - log_normalizer;
  }

  real integrand(real x, real xc, real[] theta, real[] x_r, int[] x_i) {
    real mu = theta[1];
    real muBias = theta[2];
    real sigmaBias = theta[3];

    return exp(exponential_lpdf(x | mu)) * logistic_cdf(x, muBias, sigmaBias);
  }

  real integral(real mu, real muBias, real sigmaBias) {
    real x_r[0];
    int x_i[0];
    return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
  }
"

fit <- brm(
  bf(x ~ 1),
  data = df_obs,
  family = biasexp,
  stanvars = stanvar(scode = stan_funs, block = "functions"),
    prior = c(
    prior(normal(0, 5), class = "Intercept"),
    prior(normal(0, 5), class = "muBias"),
    prior(normal(0, 1), class = "sigmaBias")
  ),
  control = list(adapt_delta = 0.9)
)

summary(fit)

Integrate_1d has a more rigid signature than some of the functions like the variadic ODE solvers. In particular, this error is complaining that the integrand function doesn’t accept enough arguments - you need to have them, even if they’re unused

Yes, that’s what the docs for integrate_1d says and so I have those (unused) variables in the signature of the integrand function. The problem is that those x_r and x_i are still required by integrand_1d it seems, and as those are unused I don’t know how to pass them around. Would you be able to point me in the right direction based on the snippet of code I posted?

This solution is closest to what I do, however you need to declare x_i and x_r inside the transformed data block, or put the data keyword in front of their declaration. Here’s an example.

I believe you also need to change the signature of the integrand to explicitly mark them as data

data array[] int is a distinct type from array[] int

Thank you for the hints @hhau and @WardBrian! I made some adjustment but still hitting some errors. I added the data marker and initialized x_r/x_i in the transformed data block:

stan_funs <- "
  real biasexp_lpdf(real x, real mu, real muBias, real sigmaBias) {

    real log_exp_pdf = exponential_lpdf(x | mu);
    real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
    real log_normalizer = - log(integral(mu, muBias, sigmaBias));

    return log_exp_pdf + log_logistic_cdf - log_normalizer;
  }

  real integrand(real x, real xc, array[] real theta, data array[] real x_r, data array[] int x_i) {
    real mu = theta[1];
    real muBias = theta[2];
    real sigmaBias = theta[3];

    return exp(exponential_lpdf(x | mu)) * logistic_cdf(x, muBias, sigmaBias);
  }

  real integral(real mu, real muBias, real sigmaBias) {
    return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
  }
"

stan_tdata  <- "
  array[0] real x_r;
  array[0] int x_i;
"
stanvars <- stanvar(scode = stan_tdata,  block = "tdata") +
  stanvar(scode = stan_funs, block = "functions")

fit <- brm(
  bf(x ~ 1),
  data = df_obs,
  family = biasexp,
  stanvars = stanvars,
    prior = c(
    prior(normal(0, 5), class = "Intercept"),
    prior(normal(0, 5), class = "muBias"),
    prior(normal(0, 1), class = "sigmaBias")
  ),
  control = list(adapt_delta = 0.9)
)

but got the error

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  0
Semantic error in 'string', line 17, column 86 to column 89:
   -------------------------------------------------
    15:    }
    16:    real integral(real mu, real muBias, real sigmaBias) {
    17:      return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
                                                                                               ^
    18:    }
    19:  }
   -------------------------------------------------

Identifier 'x_r' not in scope

so I guess it is not enough to declare those variables in the transformed data block? I used brms’ stanvar with the block tdata to do so:

stanvar(scode = stan_tdata,  block = "tdata") + stanvar(scode = stan_funs, block = "functions")

You will unfortunately need to add data array[] real x_r, data array[] int x_i to the other functions and pass them in for them to be properly seen as data. It’s definitely a bit clunky

I tried to add that signature to all the functions and yet the compiler complained. Maybe @paul.buerkner knows a neat way to deal with those variables from within brms?

Ok, I managed to do it! The missing bit was to define those x_r, x_i variables in the brms custom family too:

custom_family(
    name = "biasexp",
    dpars = c("mu", "muBias", "sigmaBias"),
    links = c("log", "identity", "log"),
    lb = c(0, NA, 0),
    type = "real",
    vars = c("x_r", "x_i"), # <--------
)

so that they are passed down the Stan functions:

stan_lpdf <- "
    real biasexp_lpdf(
        real x,
        real mu,
        real muBias,
        real sigmaBias,
        data array[] real x_r,
        data array[] int x_i) {

      real log_exp_pdf = exponential_lpdf(x | 1/mu);
      real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
      real log_normalizer = log(integrate_1d(
                                    integrand,
                                    0.0, positive_infinity(),
                                    {mu, muBias, sigmaBias},
                                    x_r, x_i));

      return log_exp_pdf + log_logistic_cdf - log_normalizer;
  }
"

stan_integrand <- "
    real integrand(
        real x,
        real xc,
        array[] real theta,
        data array[] real x_r,
        data array[] int x_i) {

      real mu = theta[1];
      real muBias = theta[2];
      real sigmaBias = theta[3];

      real exp_pdf = exp(exponential_lpdf(x | 1/mu));
      real logistic_cdf = exp(logistic_lcdf(x | muBias, sigmaBias));

      return exp_pdf * logistic_cdf;
  }
"

To me, having to carry around those x_r, x_i variable is a bit cumbersome as I am not even using them… Is there a way to make them optional in the integrate_1d?

2 Likes

Not currently, there is an open issue that would: [FR] Expose variadic signatures for `integrate_1d` (deprecate current?) · Issue #1352 · stan-dev/stanc3 · GitHub

1 Like

You can change this to

integrate_1d(integrand,
                        0.0, positive_infinity(),
                        {mu, muBias, sigmaBias},
                        {0}, {0})

and then there is no need to pass x_r and x_i around (the values in the passed arrays can be anything as you don’t use them).

1 Like

Thanks! This makes the code much leaner. I thought about it too but I was unsure whether passing actual values would interfere somehow—thank you for suggesting this!

EDIT: I can’t find the default value for relative_tolerance. Is there any recommendation on what value to set? 1e-8 according to Computing One Dimensional Integrals

EDIT2: This is the part in the User Guide that discouraged me to pass actual values to x_r, x_i

This may require passing in zero-length arrays for data or a zero-length vector for parameters if the integral does not involve data or parameters.

From Computing One Dimensional Integrals

Thanks for pointing out this. I made a PR to clarify that any array (e.g. {0}) can be passed if the values are not used in the integrand function.

2 Likes