Generating data for meta-analysis

tea_taul · March 1, 2024, 2:31pm

Hi STAN community! I am a total beginner to STAN trying to specify a model to generate data from. The idea is to generate data for a meta-analysis with specific effect and heterogeneity parameters that I can then reproduce using the rma function from the metafor package in R. Below is my model.

data{
  int N; // number of studies
}

parameters{
  real mu;
  real <lower=.0001>tau;
}
                    
model{
  mu ~ normal(.5, .1);    // mean of true effects(d = .5)
  tau ~ gamma(.5, 1);     // heterogeneity parameter(tau = .5)
}

generated quantities{
vector[N] theta;      
vector[N] sigma;
vector[N] y;

for(i in 1:N){
  theta[i] = normal_rng(mu, tau);  // drawn true population effect
  sigma[i] = normal_rng(1, .1);   // population deviation
  
  y[i] = normal_rng(theta[i], sigma[i]); // observed effects
  }
}

My thinking is that this represent something like this:
\mu \sim N(0.5, 0.1)
\tau \sim \gamma(0.5, .1)

\theta_i \sim N(\mu, \tau)
\sigma_i \sim N(1, 1.1)

y_i \sim N(\theta_i, \sigma_i)

The issue is that when I simulate data from this model in R and try to estimate tau and mu the estimations are horribly off. For example, if I sample from this model for 5 studies with n = 300 like so:

n <- 300
iter <- 1000

set.seed(58)
sim_outcome <- sampling(model,
                        iter = iter, 
                        warmup = iter - n,
                        chains = 1,
                        refresh = 0,
                        data = list(N = 5))

meta_data <- data.frame(extract(sim_outcome))
yi <- (unlist(lapply(meta_data[12:16],
                     FUN = mean)))

vi <- (unlist(lapply(meta_data[12:16], 
                     FUN = var)))

rma(yi = yi, vi = vi, method = 'DL')

The rma results return a tau estimate of 0, which is confusing me. I have tried different variance estimators (e.g REML, ML) and also checked the model on big samples with many studies but it still does not work as I want (no heterogeneity). I can do this in base r without much issue, but I would really prefer to do it in STAN. Any guidance/help would be greatly appreciated!

wpetry · March 1, 2024, 3:26pm

Welcome, @tea_taul.

A couple of thoughts:

The sampler throws warning messages with your lower bound on tau of .0001. I get much better behavior with it set to zero, which is a typical way of bounding standard deviations. I’m not sure why it does this. Maybe others can offer insight?

parameters{
  real mu;
  real <lower=0> tau;
}

I’m confused why you take the mean and variance across your simulated draws.

(Note too there’s a small indexing typo here. It should be 13:17, but this isn’t the cause of the zero heterogeneity estimate.)
I think your Stan code is generating iter-n=700 datasets of 5 studies each. Each row of meta_data therefore has its own ‘true’ parameter values from which y.1 through y.5 are realized. Since each row has it’s own ‘true’ parameters, you are aggregating across this higher-level of heterogeneity that your rma model doesn’t know about.

I may be missing something here, but I think you want something like this below. I think too that you need to square the sigma values that come out of your Stan simulation because vi is a variance:

# dataset 1
rma(yi = unlist(meta_data[1, 13:17]), vi = unlist(meta_data[1, 8:12])^2,
    method = "DL")
meta_data[1, c("mu", "tau")]

# dataset 2
rma(yi = unlist(meta_data[2, 13:17]), vi = unlist(meta_data[2, 8:12])^2,
    method = "DL")
meta_data[2, c("mu", "tau")]

# ... and so on ...

For each of these, I get what seem like reasonable estimates for mu and tau, but it’d be best to check this against a broader set of parameter values.

tea_taul · March 1, 2024, 4:46pm

Thanks the answer @wpetry! I think I have figured it out.

The reason why I chose a non zero lower bound was because the model did not initialize with a zero value for some reason - the updated model did not have that problem.
I used the same estimation procedure for the simulation I used in base R which worked fine. I realize from your reply that I misunderstood the output I received from STAN and I adjusted the generated quantities block to better represent the situation of “drawing population effects” that I aimed to simulate. I think this was the cause behind why the indexing looked off to you as well - it made sense in how I interpreted the sampling.

Again, thanks for the help!

Topic		Replies	Views
Generating mock data from estimated quantities Modeling fitting-issues	3	410	October 19, 2020
Generated Quantities for prediction data General	6	1080	July 28, 2020
Sampling from a matrix in STAN Modeling rstan , techniques , performance	28	1314	September 1, 2021
"generated quantities" does not work Modeling rstan , techniques , fitting-issues	3	1069	April 29, 2021
Problem Simulating Data with Generated Quantities (Dimension mismatch in assignment; type = real; right-hand side type = real[ ]) RStan rstan , techniques , specification	3	857	April 29, 2021

Generating data for meta-analysis

Related topics