I am trying to have a model that can estimate the sampling bias in a dataset. To do so, I need to create a custom PDF in brms, which is a combination of an exponential distribution and a logistic function:
p(obs) \sim Exp(\lambda) \cdot Logistic(\mu, \sigma)
This custom PDF needs to be normalized so that the density is 1. As I cannot find a close form solution, I’d like to use Stan numerical integration integrate_1d
, but I struggle in implementing this routine.
My code, so far is:
biasexp <- custom_family(
"biasexp",
dpars = c("mu", "muBias", "sigmaBias"),
links = c("log", "identity", "log"),
lb = c(0, NA, 0),
type = "real"
)
stan_funs <- "
real biasexp_lpdf(real x, real mu, real muBias, real sigmaBias) {
real log_exp_pdf = exponential_lpdf(x | mu);
real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
real log_normalizer = log(integral(mu, muBias, sigmaBias));
return log_exp_pdf + log_logistic_cdf - log_normalizer;
}
real integrand(real x, real xc, real[] theta, real[] x_r, int[] x_i) {
real mu = theta[1];
real muBias = theta[2];
real sigmaBias = theta[3];
return exp(exponential_lpdf(x | mu)) * logistic_cdf(x, muBias, sigmaBias);
}
real integral(real mu, real muBias, real sigmaBias) {
return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, 0.001);
}
"
and I do not understand how to deal with x_r
and x_i
.
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
0
Semantic error in 'string', line 17, column 11 to column 92:
-------------------------------------------------
15: }
16: real integral(real mu, real muBias, real sigmaBias) {
17: return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, 0.001);
^
18: }
19: }
-------------------------------------------------
Ill-typed arguments supplied to function 'integrate_1d':
(<F1>, real, real, array[] real, real)
where F1 = (real, real, array[] real, array[] real, array[] int) => real
Available signatures:
(<F2>, real, real, array[] real, data array[] real, data array[] int,
data real) => real
where F2 = (real, real, array[] real, data array[] real, data array[] int) => real
Expected 7 arguments but found 5 arguments.
(<F2>, real, real, array[] real, data array[] real, data array[] int) => real
Expected 6 arguments but found 5 arguments.
If I create empty, local variables for x_i
and x_r
real integral(real mu, real muBias, real sigmaBias) {
real x_r[0];
int x_i[0];
return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
}
then I get
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
0
Semantic error in 'string', line 19, column 11 to column 102:
-------------------------------------------------
17: real x_r[0];
18: int x_i[0];
19: return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
^
20: }
21: }
-------------------------------------------------
Ill-typed arguments supplied to function 'integrate_1d':
(<F1>, real, real, array[] real, array[] real, array[] int, real)
where F1 = (real, real, array[] real, array[] real, array[] int) => real
Available signatures:
(<F2>, real, real, array[] real, data array[] real, data array[] int,
data real) => real
where F2 = (real, real, array[] real, data array[] real, data array[] int) => real
The 5th argument must be data-only. (Local variables are assumed to depend
on parameters; same goes for function inputs unless they are marked with
the keyword 'data'.)
(<F2>, real, real, array[] real, data array[
How could I solve the issue?
To give more context: I wish to estimate the full, partially observed distribution (light gray), from a observed sample (dark gray), assuming that the sampling bias is a logistic function (red curve)
library(tidyverse)
library(brms)
SEED <- 20241022
set.seed(SEED)
N <- 1000
X_MAX <- 10
X <- seq(0, X_MAX, by=0.01)
RATE <- 1
# Complete data
df_all <- tibble(x = rexp(N, rate = RATE))
BIAS_MEAN <- 3
BIAS_SD <- 0.5
# Observed data
df_obs <- tibble(
x = df_all$x,
w = plogis(x, location = BIAS_MEAN, scale = BIAS_SD), # probability of observing
treshold = runif(N) # probability of sampling
) %>%
filter(w > treshold) # observed the points that are also sampled
df_all %>%
ggplot() +
geom_histogram(aes(x = x, y=after_stat(count), fill = 'p(x)'), alpha = 0.5, position = "identity", breaks=seq(0, X_MAX, 0.5)) +
geom_histogram(aes(x = x, y=after_stat(count), fill = 'p(obs)'), position = "identity", breaks=seq(0, X_MAX, 0.5), data = df_obs) +
geom_line(aes(x=x, y=w*100, color='p(obs|x)'), linewidth=2, data=df_obs) +
labs(color = NULL, fill = NULL) +
scale_fill_manual(values=c("p(x)" = "gray55", 'p(obs)'="gray25")) +
coord_cartesian(xlim = c(0, X_MAX)) +
theme(text = element_text(size=20))
biasexp <- custom_family(
"biasexp",
dpars = c("mu", "muBias", "sigmaBias"),
links = c("log", "identity", "log"),
lb = c(0, NA, 0),
type = "real"
)
stan_funs <- "
real biasexp_lpdf(real x, real mu, real muBias, real sigmaBias) {
real log_exp_pdf = exponential_lpdf(x | mu);
real log_logistic_cdf = logistic_lcdf(x | muBias, sigmaBias);
real log_normalizer = log(integral(mu, muBias, sigmaBias));
return log_exp_pdf + log_logistic_cdf - log_normalizer;
}
real integrand(real x, real xc, real[] theta, real[] x_r, int[] x_i) {
real mu = theta[1];
real muBias = theta[2];
real sigmaBias = theta[3];
return exp(exponential_lpdf(x | mu)) * logistic_cdf(x, muBias, sigmaBias);
}
real integral(real mu, real muBias, real sigmaBias) {
real x_r[0];
int x_i[0];
return integrate_1d(integrand, 0.0, positive_infinity(), {mu, muBias, sigmaBias}, x_r, x_i, 0.001);
}
"
fit <- brm(
bf(x ~ 1),
data = df_obs,
family = biasexp,
stanvars = stanvar(scode = stan_funs, block = "functions"),
prior = c(
prior(normal(0, 5), class = "Intercept"),
prior(normal(0, 5), class = "muBias"),
prior(normal(0, 1), class = "sigmaBias")
),
control = list(adapt_delta = 0.9)
)
summary(fit)