Identifying relationships between parameters estimated from different data sources

I have data from two different sources (one is raw data from a lab assay, the other is data reported in other studies that I am combining as a kind of meta-analysis).

I can model each of these comfortably in Stan (the assay data is quite complex, fitting a custom logistic curve).

I would then like to measure the relationship between parameters estimated from the different data sources. The structure of the data really doesn’t allow us to compare the data directly (completely different numbers of observations, etc.).

My two questions are:

  • Is the approach I have set out below reasonable? I feel like I might be committing some grave sins.
  • Why do I get so many divergent transitions?

I’m going to illustrate using some dummy data.

library(tidyverse)
library(rstan)

# generate values between 0 and 1
y_data <- tibble(y = c(rbinom(n = 25, 
                            size = 100, 
                            prob = 0.5),
                       rbinom(n = 25, 
                              size = 100, 
                              prob = 0.7),
                       rbinom(n = 25, 
                              size = 100, 
                              prob = 0.8),
                       rbinom(n = 25, 
                              size = 100, 
                              prob = 0.9)),
                 level = rep(c(1,2,3,4), each = 25)) |> 
  mutate(y = y/100)

# generate some positive values
x_data <- tibble(x = c(rpois(n = 40, lambda = 3),
                       rpois(n = 40, lambda = 6),
                       rpois(n = 40, lambda = 9),
                       rpois(n = 40, lambda = 12)),
                 level = rep(c(1,2,3,4), each = 40))

# package this up for stan
model_data <- list(
  n_levels = max(x_data$level), 
  y_n = nrow(y_data),
  y = y_data$y,
  y_level = y_data$level,
  x_n = nrow(x_data),
  x = x_data$x,
  x_level = x_data$level
)

m <- stan(
  file = "combining_models.stan", 
  data = model_data, 
  chains = 8, 
  cores = 8, 
  iter = 8000
)

With combining_models.stan as:


data {
  int<lower=0> n_levels;
  int<lower=0> y_n;
  real<lower=0,upper=1> y[y_n];
  int<lower=0> y_level[y_n];
  
  int<lower=0> x_n;
  real<lower=0> x[x_n];
  int<lower=0> x_level[x_n];
}


parameters {
  real y_mu[n_levels]; // average for each level of y
  real<lower=0> y_sigma; 
  
  real<lower=0> x_mu[n_levels]; // average for each level of x
  real<lower=0> x_sigma;
  
  real<lower=0> pred_sigma;
  real<lower=0> m; // gradient between x and y across levels
  real b; // y intercept
}

transformed parameters {
  real<lower=0,upper=1> y_temp[y_n];
  real<lower=0> x_temp[x_n];
  
  real pred_y[n_levels];
  
  for(i in 1:y_n){
    y_temp[i] = inv_logit(y_mu[y_level[i]]); // inv_logit so that y_mu is logit transformed
  }
  
  
  for(j in 1:x_n){
    x_temp[j] = x_mu[x_level[j]];
  }
  
  for(k in 1:n_levels){
    pred_y[k] = x_mu[k] * m + b;
  }
  
}

model {
  
  // priors
  m ~ normal(0, 0.5);
  b ~ normal(0, 0.5);
  pred_sigma ~ normal(0.5, 0.2);
  
  // average across levels
  y ~ normal(y_temp, y_sigma);
  
  // average across levels
  x ~ normal(x_temp, x_sigma);
  
  // relationship between x_mu and y_mu
  for(i in 1:n_levels) {
    y_mu[i] ~ normal(pred_y[i], pred_sigma);
  }
  
}

This seems to fit reasonably: the posteriors look approximately like what they should be if I compare them to a simple fit using the following (which I appreciate is not the best comparison):

y_sum <- y_data |> 
  group_by(level) |> 
  summarise(y_mean = mean(y)) |> 
  mutate(y_mean_logit = qlogis(y_mean))


x_sum <- x_data |> 
  group_by(level) |> 
  summarise(x_mean = mean(x))

df <- y_sum |> 
  left_join(x_sum, by = "level")

a <- lm(y_mean_logit ~ x_mean, 
   data = df)

summary(a)

BUT around 10% of my post-warmup samples are divergent, so I don’t feel like Stan is super happy with me.
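(For reference, this is how I’m counting the divergences; m is the stanfit object from the stan() call above.)

# post-warmup divergent transitions per chain
sp <- get_sampler_params(m, inc_warmup = FALSE)
sapply(sp, function(chain) sum(chain[, "divergent__"]))

# as a proportion of all post-warmup draws
sum(sapply(sp, function(chain) sum(chain[, "divergent__"]))) / sum(sapply(sp, nrow))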

Happy for any suggestions or thoughts. Please feel free to direct me to other posts - I wasn’t able to find something similar.


I wonder if @hhau has any insight into this.


There’s no real sin in trying to jointly model multiple sources of data to get a better understanding of some latent quantity / parameter. This is a sensible form of Bayesian inference. The only conceptual challenge is that usually you know something about this parameter, and it’s difficult to tweak the joint prior to match your domain knowledge. This is not helped by the fact that you have “two priors” when considering each of the separate submodels.
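For example, a quick way to see what the joint prior implies is to simulate from the priors in the model you posted and look at the implied level means on the probability scale (a rough sketch; x_mu has no explicit prior in your model, so the range I use for it below is an assumption):

n_sims <- 1000

# half-normal because of the <lower=0> constraint on m
m_prior <- abs(rnorm(n_sims, 0, 0.5))
b_prior <- rnorm(n_sims, 0, 0.5)

# x_mu has no explicit prior in the model, so this range is made up
x_mu_prior <- runif(n_sims, 0, 15)

# implied level mean for y on the probability scale
hist(plogis(m_prior * x_mu_prior + b_prior))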

Some questions:

  • How many shared parameters are there?
  • What do the prior/posterior predictive distributions for the shared quantities look like under each submodel separately? That is, if we say the first submodel is p_{1} with data Y_{1} (and likewise for the second submodel), with the shared parameter vector denoted \phi, what do p_{1}(\phi), p_{2}(\phi), p_{1}(\phi \mid Y_{1}), p_{2}(\phi \mid Y_{2}) look like? I am wondering if there is some kind of conflict between some number of these distributions (see the sketch below).
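For example (a rough sketch, with hypothetical file names for the two separate submodels and a hypothetical scalar shared parameter called phi declared in both), you could fit the submodels separately and overlay the resulting posteriors:

library(rstan)

# fit each submodel on its own data (file and data names are hypothetical)
fit_1 <- stan(file = "submodel_1.stan", data = submodel_1_data)
fit_2 <- stan(file = "submodel_2.stan", data = submodel_2_data)

# posterior draws for the shared parameter phi (a scalar, for illustration)
phi_1 <- as.matrix(fit_1, pars = "phi")[, 1]
phi_2 <- as.matrix(fit_2, pars = "phi")[, 1]

# overlay p_1(phi | Y_1) and p_2(phi | Y_2)
plot(density(phi_1), main = "Posterior for phi under each submodel")
lines(density(phi_2), lty = 2)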

If you have parameters theta and two different data sets y1 and y2 and two likelihoods p(y1 | theta) and p(y2 | theta), then it’s totally OK to combine them into a joint model. We do this all the time when we have different observations of the same process (for instance, we might measure waste water and look at hospitalizations as y1 and y2 for parameters theta representing infection levels at a given time for Covid).
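As a toy illustration of the structure (made-up distributions, nothing to do with your model), the two log likelihoods simply add for the shared theta:

# one shared parameter theta, two different likelihoods / data sets
joint_log_lik <- function(theta, y1, y2) {
  sum(dnorm(y1, mean = theta, sd = 1, log = TRUE)) +   # log p(y1 | theta)
    sum(dpois(y2, lambda = exp(theta), log = TRUE))    # log p(y2 | theta)
}

joint_log_lik(theta = 1.5, y1 = rnorm(10, mean = 1.5), y2 = rpois(20, lambda = exp(1.5)))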

To help with that, we’re going to need to see the rest of your model. It’s usually a problem with parameterizations. But if both models fit separately OK, then it could be a data inconsistency problem. If the two data sets aren’t consistent with the same values of the parameters, the model is misspecified, and misspecification can make it very hard for HMC to sample.
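In the meantime, a pairs plot of the parameters you suspect (rstan’s pairs method for stanfit objects flags divergent transitions) is often the quickest way to see where the geometry goes wrong. For the model you’ve shown so far, something like:

# look for where the divergent transitions concentrate
pairs(m, pars = c("m", "b", "pred_sigma", "y_sigma", "x_sigma"))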

Thanks both for your replies.

I think perhaps I have given the wrong impression that there is a shared parameter between the two data sources, when really I want to estimate a relationship (likely linear) between parameters estimated from two different data sources.

In the model above I am estimating parameters m and b to find the linear relationship between x_mu and y_mu. Here are the relevant bits:

// inside transformed parameters
  for(k in 1:n_levels){
    pred_y[k] = x_mu[k] * m + b;
  }
.
.
.
// inside model
  for(i in 1:n_levels) {
    y_mu[i] ~ normal(pred_y[i], pred_sigma);
  }

I suppose m and b can be considered shared between the two data sources but there is not a common theta in the sense that @Bob_Carpenter suggested (I don’t think).

To answer your specific questions/comments:

  • How many shared parameters are there?

As discussed above, I think m and b.

Could you please write this in terms of the model I have specified? I’m struggling to follow (I understand what prior and posterior predictive distributions are).

I’m getting 10% of samples as divergent using the dummy data and the model I’ve specified here, which I would expect to fit reasonably well. I’m still planning how to build this into the fuller model.

Here are some plots of the fit: the posteriors for y_mu (note: logit transformed), x_mu, m, and b [plots omitted].

(Note: I set quite tight priors to get these fits around what I expect m and b to be. Cheating, I know.)

  m ~ normal(0.27, 0.1);
  b ~ normal(-0.9, 0.2);

To try and make this more concrete, we could consider:

  • level to be the day of the week (I’ve only done 4 above, but it could be 7), say: Tuesday, Wednesday, Thursday, Friday
  • y_mu to be the average percentage of flights that are cancelled on that day of the week. I’ve logit transformed this so that we can fit a linear relationship to x_mu
  • x_mu to be the average number of pilots who call in sick on that day of the week

We expect y_mu to be higher when x_mu is higher. The days of the week become our “observations” in measuring this relationship. We have to estimate y_mu and x_mu from different data sources.
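To eyeball whether the fitted relationship behaves like that (higher y_mu where x_mu is higher), I can plot the posterior means of x_mu against the posterior means of y_mu from the fit above (a quick sketch using the fit object m):

# posterior means of the level-specific parameters
x_mu_hat <- colMeans(as.matrix(m, pars = "x_mu"))
y_mu_hat <- colMeans(as.matrix(m, pars = "y_mu"))  # logit scale

# posterior means of the slope and intercept
m_hat <- mean(as.matrix(m, pars = "m"))
b_hat <- mean(as.matrix(m, pars = "b"))

plot(x_mu_hat, y_mu_hat,
     xlab = "x_mu (posterior mean)",
     ylab = "y_mu (posterior mean, logit scale)")
abline(a = b_hat, b = m_hat)  # fitted linear relationship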

I think I was thinking about this the wrong way.

I think the best way to do something like this is the following:


data {
  int<lower=0> n_levels;
  int<lower=0> y_n;
  real<lower=0,upper=1> y[y_n];
  int<lower=0> y_level[y_n];
  
  int<lower=0> x_n;
  real<lower=0> x[x_n];
  int<lower=0> x_level[x_n];
}


parameters {
  // real y_mu[n_levels]; // average for each level of y
  real<lower=0> y_sigma; 
  
  real<lower=0> x_mu[n_levels]; // average for each level of x
  real<lower=0> x_sigma;
  
  // real<lower=0> pred_sigma;
  real<lower=0> m; // gradient between x and y
  real b; // y intercept
}

transformed parameters {
  real<lower=0,upper=1> y_temp[y_n];
  real<lower=0> x_temp[x_n];
  
  // real pred_y[n_levels];

  for(j in 1:x_n){
    x_temp[j] = x_mu[x_level[j]];
  }
  
  for(i in 1:y_n){
    y_temp[i] = inv_logit(m*x_mu[y_level[i]] + b); // inv_logit so that y_mu is logit transformed
  }
  
}

model {
  
  // priors
  m ~ normal(0.27, 0.1);
  b ~ normal(-0.9, 0.2);
  // pred_sigma ~ normal(0.5, 0.2);
  
  // fit y
  y ~ normal(y_temp, y_sigma);
  
  // average across levels
  x ~ normal(x_temp, x_sigma);
  
}

Note

  for(i in 1:y_n){
    y_temp[i] = inv_logit(m*x_mu[y_level[i]] + b); // inv_logit so that y_mu is logit transformed
  }
.
.
.
  y ~ normal(y_temp, y_sigma);
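To check that this version behaves better, I refit and re-count the divergences (combining_models_v2.stan is just the name I’ve saved the revised model above under):

m2 <- stan(
  file = "combining_models_v2.stan",  # the revised model above
  data = model_data,
  chains = 8,
  cores = 8,
  iter = 8000
)

# post-warmup divergent transitions per chain
sapply(get_sampler_params(m2, inc_warmup = FALSE),
       function(chain) sum(chain[, "divergent__"]))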

I will close this, thanks for your help.