I have data from two different sources (one is raw data from an assay in a lab, the other is data reported in other studies that I am combining as a kind of meta-analysis).
I can model each of these comfortably in Stan (the assay data is quite complex, fitting a custom logistic curve).
I would then like to measure the relationship between parameters estimated from the different data sources. The structure of the data really doesn't allow a direct comparison of the raw observations (completely different numbers of observations, etc.).
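Concretely, the link I have in mind is a linear relationship between the per-level means, with the y-side mean on the logit scale - something like this, in Stan-style notation (y_mu, x_mu, m, b and pred_sigma match the model further down):

y_mu[k] ~ normal(m * x_mu[k] + b, pred_sigma);  // k indexes level; y_mu[k] is logit of the mean of y at level k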
My two questions are:
- Is the approach I have set out below reasonable? I feel like I might be committing some grave sins.
- Why do I get so many divergent transitions?
I’m going to illustrate using some dummy data.
library(tidyverse)
library(rstan)

# generate values between 0 and 1
y_data <- tibble(
  y = c(rbinom(n = 25, size = 100, prob = 0.5),
        rbinom(n = 25, size = 100, prob = 0.7),
        rbinom(n = 25, size = 100, prob = 0.8),
        rbinom(n = 25, size = 100, prob = 0.9)),
  level = rep(c(1, 2, 3, 4), each = 25)
) |>
  mutate(y = y / 100)

# generate some positive values
x_data <- tibble(
  x = c(rpois(n = 40, lambda = 3),
        rpois(n = 40, lambda = 6),
        rpois(n = 40, lambda = 9),
        rpois(n = 40, lambda = 12)),
  level = rep(c(1, 2, 3, 4), each = 40)
)
# package this up for Stan
model_data <- list(
  n_levels = max(x_data$level),
  y_n = nrow(y_data),
  y = y_data$y,
  y_level = y_data$level,
  x_n = nrow(x_data),
  x = x_data$x,
  x_level = x_data$level
)
m <- stan(
  file = "combining_models.stan",
  data = model_data,
  chains = 8,
  cores = 8,
  iter = 8000
)
With combining_models.stan as:
data {
  int<lower=0> n_levels;
  int<lower=0> y_n;
  real<lower=0,upper=1> y[y_n];
  int<lower=0> y_level[y_n];
  int<lower=0> x_n;
  real<lower=0> x[x_n];
  int<lower=0> x_level[x_n];
}
parameters {
  real y_mu[n_levels];           // average for each level of y (logit scale)
  real<lower=0> y_sigma;
  real<lower=0> x_mu[n_levels];  // average for each level of x
  real<lower=0> x_sigma;
  real<lower=0> pred_sigma;
  real<lower=0> m;               // gradient between x and y across levels
  real b;                        // y intercept
}
transformed parameters {
  real<lower=0,upper=1> y_temp[y_n];
  real<lower=0> x_temp[x_n];
  real pred_y[n_levels];
  for (i in 1:y_n) {
    y_temp[i] = inv_logit(y_mu[y_level[i]]); // inv_logit so that y_mu is logit transformed
  }
  for (j in 1:x_n) {
    x_temp[j] = x_mu[x_level[j]];
  }
  for (k in 1:n_levels) {
    pred_y[k] = x_mu[k] * m + b;
  }
}
model {
  // priors
  m ~ normal(0, 0.5);
  b ~ normal(0, 0.5);
  pred_sigma ~ normal(0.5, 0.2);
  // average across levels of y
  y ~ normal(y_temp, y_sigma);
  // average across levels of x
  x ~ normal(x_temp, x_sigma);
  // relationship between x_mu and y_mu
  for (i in 1:n_levels) {
    y_mu[i] ~ normal(pred_y[i], pred_sigma);
  }
}
This seems to fit reasonably - the posteriors look approximately like what they should if I compare them to a simple fit (appreciating this isn't the best comparison):
y_sum <- y_data |>
  group_by(level) |>
  summarise(y_mean = mean(y)) |>
  mutate(y_mean_logit = qlogis(y_mean))

x_sum <- x_data |>
  group_by(level) |>
  summarise(x_mean = mean(x))

df <- y_sum |>
  left_join(x_sum)

a <- lm(y_mean_logit ~ x_mean, data = df)
summary(a)
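To line this up against the Stan fit, I'm looking at the posterior summaries for the slope and intercept - roughly like this (a sketch, assuming the fit object m from the stan() call above):

# posterior summaries for the slope and intercept, to compare with the lm() coefficients
print(m, pars = c("m", "b", "pred_sigma"))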
BUT around 10% of my post-warmup samples are divergent, so I don't feel like Stan is super happy with me.
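For reference, this is roughly how I'm counting the divergent iterations (a sketch using rstan::get_sampler_params on the fit m above):

# proportion of post-warmup draws that ended in a divergence
divergent <- sapply(get_sampler_params(m, inc_warmup = FALSE),
                    function(chain) chain[, "divergent__"])
mean(divergent)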
Happy for any suggestions or thoughts. Please feel free to direct me to other posts - I wasn’t able to find something similar.