Hierarchical Model for Publication Bias

Hi everyone,
I am trying to code a relatively simple model to detect publication bias. The model seems to run without issues (no error messages, R-hat = 1, …); however, when I test it on a simulated dataset it does not detect a very simple publication-bias process.


This is the very simple model I use. It is a hierarchical random-effects model where the true effects come from a Student-t distribution. The publication bias is a function of the t-statistic: papers with a t-statistic below 1.96 have a publication probability less than one. I simulated a dataset from the same model (code below). To my surprise, the model does NOT detect the publication bias, i.e. \delta \approx 1 and no change in the estimate of \beta relative to the same model without the publication-bias term. I am not sure what I am doing wrong; any advice?

data {
  int<lower=0> N;              // num observations (published studies)
  int<lower=0> K;              // num of covariates
  vector[N] y;                 // observed outcome (published)
  vector[N] se;                // vector of standard error estimates (published)
  matrix[N, K] X;
}

parameters {
  vector[N] theta;                    // latent true effect sizes
  real<lower=0, upper=1> sigma_unif;  // Uniform variable
  vector[K] beta;                           // coefficients for covariates
  real<lower=0, upper=1> delta;                // effect of publication bias (strength of bias)
}

transformed parameters {
   real<lower=0> sigma_theta;
   sigma_theta = 2 * tan(sigma_unif * pi() / 2); // Transform sigma_unif to half-Cauchy
}

model {
  // Likelihood for observed (published) outcomes
  for (i in 1:N) {
    // Model the observed data conditional on publication
    target += normal_lpdf(y[i] | theta[i], se[i]);  // Likelihood of the observed data
    
    // Add the publication bias adjustment to the target
    if (y[i] / se[i] < 1.96) {
      target += log(delta);  // Probability of publication when the t-stat is below 1.96
    } 
  }

  // 2nd hierarchical layer
  theta ~ normal(X*beta, sigma_theta);
  
  // 3rd hierarchical layer
  beta ~ normal(0, 2);
  sigma_unif ~ uniform(0, 1);
  
  // Prior on publication bias effect
  delta ~ beta(3, 7);  // Prior for delta
}

This is how I simulate the dataset:

library(data.table)

# Step 1: Define parameters
N_total <- 120   # Total studies before publication bias
beta_A <- 0.25   # True effect for A studies
beta_B <- 0.10   # True effect for B studies
tau <- 0.1       # Between-study standard deviation
df_t <- 5        # Degrees of freedom for Student's t (heavier tails if lower)

# Step 2: Assign covariates randomly
covariate <- sample(c("A", "B"), size = N_total, replace = TRUE)

# Create indicator variables
I_A <- ifelse(covariate == "A", 1, 0)
I_B <- ifelse(covariate == "B", 1, 0)

# Step 3: Simulate true treatment effects from a Student-t distribution
theta <- beta_A * I_A + beta_B * I_B + rt(N_total, df = df_t) * tau

# Step 4: Simulate standard errors
SE <- runif(N_total, min = 0.05, max = 0.2)  # Study precision varies

# Step 5: Generate observed effects
treatment_effect <- theta + rnorm(N_total, mean = 0, sd = SE)

# Step 6: Compute t-statistics
t_stat <- treatment_effect / SE

# Step 7: Apply publication bias: studies with |t| < 1.95 are less likely to be published
publication_prob <- ifelse(abs(t_stat) < 1.95, 0.3, 1)  # Lower prob if t-stat is small
selected <- sample(1:N_total, size = 90, prob = publication_prob, replace = FALSE)

# Step 8: Create final dataset
dt <- data.table(
  treatment_effect = treatment_effect[selected],
  standard_error = SE[selected],
  covariate = covariate[selected]
)
dt[, A := ifelse(covariate == "A", 1, 0)]
dt[, B := ifelse(covariate == "B", 1, 0)]
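
For completeness, a minimal sketch of how the simulated data can be packaged into the list expected by the Stan data block (the names X and stan_data here are just illustrative):

# Package the simulated data for the Stan model above
X <- as.matrix(dt[, .(A, B)])     # covariate matrix, K = 2 columns
stan_data <- list(
  N  = nrow(dt),                  # number of published studies
  K  = ncol(X),                   # number of covariates
  y  = dt$treatment_effect,       # observed effects
  se = dt$standard_error,         # standard errors
  X  = X
)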

delta is going to be drawn from its prior distribution on every iteration, with sample values that respect the beta(3,7) distribution (mean = 3/10, median ≈ 2/7, mode = 1/4), and, on occasion, the resulting value of log(delta), which lies in (-∞, 0], is also added to the log density.

So the delta values are only constrained by beta(3,7) when the conditional statement is false, and heavily driven toward 1 when the conditional is true. The more instances where the conditional is true, the more likely delta’s value is to approach 1, and the less relevant beta(3,7) will be to the posterior.

Is this the desired behavior?

I tried several parametrizations of the beta prior and the results do not differ. The idea is that we are observing a truncated process and delta is Pr(Study_i is observed). I am assuming for simplicity that when the t-statistic is greater than or equal to 1.96 the study is observed with probability 1, while when it is less the study is observed with probability delta, which is why the log-likelihood is incremented by log(delta) in that case.

Do you suggest changing something? Again, I simulated my dataset using the same model, so I am surprised the model itself cannot correctly estimate it.

I’m not surprised that this is the case because log(delta), as a likelihood function, will always maximally prefer that delta is equal to 1. With each instance that instantiates that likelihood (i.e. wherever the conditional is true), delta is going to be strongly driven toward that value of 1, making the prior increasingly inconsequential.

The model says: assuming no data, we expect our bias measure to be 3/10 with some degree of uncertainty, but when there are any instances that exhibit strong bias in the dataset, our bias measure should be heavily driven toward 1, even though our prior knowledge prescribes that almost no bias measures should be, a priori, equal to 1.

beta helps the model choose when the dataset meets the conditions for delta to be heavily driven toward 1, but it might be that it is largely unconstrained by delta because delta is always essentially fixed at 1 by the dataset. It may be that very few instances of the conditional being met are necessary to pull delta to 1, so almost any beta would result in that outcome. beta then is largely constrained by the normal_lpdf and not by the downstream effects of y and se on the conditional.

Plotting the normalized density of prior + likelihood for different numbers of biased instances (parameter c) shows how quickly the mode shifts:

After a single instance, the mode goes from 0.25 to over 0.33. After a second, it goes to 0.4. By five, it’s over 0.5. By ten it’s over 0.66. By fifty, it’s almost 0.9.
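
You can reproduce those modes directly: a beta(3,7) prior combined with c increments of log(delta) gives an unnormalized posterior proportional to a beta(3 + c, 7) density, whose mode is (2 + c)/(8 + c). A quick check in R:

beta_mode <- function(a, b) (a - 1) / (a + b - 2)   # mode of a Beta(a, b) density
c_vals <- c(0, 1, 2, 5, 10, 50)                     # number of "biased" instances
round(beta_mode(3 + c_vals, 7), 3)
# 0.250 0.333 0.400 0.538 0.667 0.897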

The Stan model doesn’t match the simulation, because the simulation generates a lot of data points that are “nonpublications” but the model never sees those. Following your data simulation, if you passed the model all of the nonpublications as data and incremented the target by log1m(delta) for each nonpublication, then you’d get correct inference.

What you’ve done here is essentially equivalent to saying that you want to estimate a binomial proportion while never giving the model access to any knowledge of how many zeros there are–only how many ones.

To make progress, you should think carefully about: what would be the telltale signal of publication bias in the data, and in what way should the likelihood be penalized for estimating a too-high value of delta in the presence of this signal? Currently the likelihood will always be maximized by estimating delta equal to one.
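
As a toy check with your own simulation (a sketch reusing the objects from your R code; not a full model): delta only enters the observed-data log likelihood through k * log(delta), where k is the number of published non-significant studies, so it is maximized at the boundary no matter what k is. Adding log(1 - delta) (log1m(delta) in Stan) for every non-significant study that was not published moves the maximum to the realized publication rate:

nonsig <- abs(t_stat) < 1.95                 # studies at risk of suppression (per your simulation)
pub    <- seq_len(N_total) %in% selected     # which studies were published
k <- sum(nonsig & pub)                       # published non-significant studies
m <- sum(nonsig & !pub)                      # unpublished non-significant studies

delta_grid  <- seq(0.01, 0.99, by = 0.01)
ll_observed <- k * log(delta_grid)                             # what the current Stan model sees
ll_complete <- k * log(delta_grid) + m * log(1 - delta_grid)   # with nonpublications included

delta_grid[which.max(ll_observed)]  # 0.99: pushed to the boundary, i.e. delta -> 1
delta_grid[which.max(ll_complete)]  # close to k / (k + m), the realized publication rate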

Thank you @jsocolar and @Corey.Plate for your answers. Let me be clear about what my model is: let’s assume that we have a population of studies from a hierarchical random effect model of the sort:

\begin{align*}
&\textbf{Likelihood (First Hierarchical Layer)} \\
& \hat{TE}_i \mid \theta_i, \hat{se}_i \sim \mathcal{N}(\theta_i, \hat{se}_i), \quad i = 1, \dots, N. \\[10pt]
&\textbf{Central Parameters (Second Hierarchical Layer)} \\
&\theta_i \mid X_i, \beta, \sigma_{\theta}, \nu \sim \mathcal{T}(X_i \beta, \sigma_{\theta}, \nu), \quad i = 1, \dots, N. \\[10pt]
&\textbf{Prior Distributions (Third Hierarchical Layer)} \\
&\beta_k \sim \mathcal{N}(0, 4), \quad k = 1, \dots, K. \\
&\sigma_{\theta} \sim \text{Half-Cauchy}(0,4). \\
&\nu \sim \text{Gamma}(2, 0.1).
\end{align*}

However, because of the publication process we do not observe the part of the distribution that is not published (i.e. Z_i = 0), so the real (complete-data) model likelihood is:

\begin{align*} p(\hat{TE}_i, Z_i \mid \theta_i, \psi, se_i, \delta) &= f(\hat{TE}_i \mid \theta_i, se_i) g(\theta_i \mid \psi) \mathbb{P}(Z_i = 1 | \delta, T_i)^{Z_i} (1 - \mathbb{P}(Z_i = 1 | \delta, T_i))^{1 - Z_i }. \end{align*}

But since we only observe (\hat{TE}_i, \hat{se}_i, Z_i = 1), then the conditional likelihood is:

\begin{align*} p(\hat{TE}_i \mid \theta_i, \psi, se_i, \delta, Z_i = 1) &= \frac{f(\hat{TE}_i \mid \theta_i, se_i) g(\theta_i \mid \psi) \mathbb{P}(Z_i = 1 | \delta, T_i)}{\mathbb{P}(Z_i = 1 \mid \delta)}. \end{align*}

Hence:

log(p(\hat{TE}_i \mid \theta_i, \psi, se_i, \delta, Z_i = 1)) \propto log(f(\hat{TE}_i \mid \theta_i, se_i) g(\theta_i \mid \psi)) + log(\mathbb{P}(Z_i = 1 | \delta, T_i))

where \mathbb{P}(Z_i = 1 | \delta, T_i) = \delta in my model.

So I am not sure how to change the model. I also tried other specifications for \mathbb{P}(Z_i = 1 | \delta, T_i), such as logit and spline versions, but the results do not change. What do you suggest?

Hi, I have a book coming out in June with Prof Gian Luca Di Tanna (SUPSI, CH) on Bayesian meta-analysis. We have a chapter on publication bias, and it’s a topic I teach at RSS courses. In general, without delving into your example and code, “detecting” publication bias (which sounds rather binary) is hard. I have had realistically subtle datasets where it just doesn’t work (in a Bayesian way, with the bias as an unknown). In the book, we had to adopt a rather extreme example to make it clear what’s going on, but in reality… it’s hard. There is very little study-level information in most real evidence bases with which to estimate a distribution (density function).

It’s not an accident that fitting the conditional likelihood, conditional on Z_i = 1, yields a very high estimate for \delta. This is literally telling the likelihood to ignore every case where Z = 0. We can strip away most of the model structure and see why this fails with a simple model for a bernoulli outcome y, given as y_i \sim \textrm{bernoulli}(p). If we first throw out every case where y_i = 0, and then show the model only this conditional likelihood (conditional on y_i = 1), of course it will estimate a very high p. To get the right estimate for p we need to show the model the unconditional likelihood–the one that also includes the y = 0 outcomes.

In your case, those outcomes are unobserved, so you need to figure out what the telltale sign in your data of missing outcomes would be, and show the model that. Given some distribution of observed effect sizes and t statistics, how should the model know whether that is actually the distribution of effect sizes and t statistics under study or whether that distribution is truncated by publication bias?

Hi @robertgrant and @jsocolar, do you have some references I could check out? I was trying to come up with an approach similar to this one, but from a Bayesian rather than a frequentist perspective.

@robertgrant my aim is not to “detect” publication bias, but rather to reasonably control for it in the hypermean of the model, given the available data. In this particular case, most of the information on the bias comes from the discontinuity in the distribution of the t-statistics around 1.96. I have other versions of the model where I control for it using a logit or a spline approach, and the issue persists.

People seem to have their preferences on this topic. Some like to adjust with an expert or exogenous empirical prior. Others like selection models (I’m in that camp). Others like weighting, though making it Bayesian is not so simple. Some refs: Using selection models to assess sensitivity to publication bias: A tutorial and call for more routine use - PMC
https://journals.sagepub.com/doi/full/10.1177/25152459221109259
Modeling publication bias using weighted distributions in a Bayesian framework - ScienceDirect
I don’t know how many “studies” there are in your evidence base (I think this is your N). If small, you are in a near hopeless situation. We will have a simple selection model example online between now and June.


FWIW: I implemented a Bayesian hierarchical meta-analysis model with publication bias in this (non-gated) article.

The logic is described in the subchapter “Correcting for publication bias”

All model code is in the supplementary materials here


Hi @Ole_Rogeberg, thank you very much for sharing your work with the attached code, it is of great help. I looked at the code and I have a couple of questions:

  1. I am not sure that sampling from the original Rubin RE model:
TE \sim \mathcal{N}(\theta,se) \\ \theta \sim \mathcal{N}(\mu, \sigma_{\theta})

is the same as sampling from:

TE \sim \mathcal{N}(\mu, \sqrt{se^2 + \sigma^2_{\theta}})

as you do in your paper? Is it because you integrate out \theta? I guess in the Normal-Normal model it works. Do you know if it makes a computational difference in Stan?

  2. Another question refers to your derivation in the paper of the probability of being published depending only on the standard error. I think the general assumption is that it depends on statistical significance, hence on both the standard error and the effect magnitude itself. Would this change the way you code the model?
  1. This should be mathematically equivalent - integrating out \theta just adds the two independent normal variances, so TE \sim \mathcal{N}(\mu, \sqrt{se^2 + \sigma_{\theta}^2}). You can see it with R using simulated numbers - it gives the same mean and sd:
type1_a <- rnorm(1000, 1, 0.5)        # theta ~ N(mu, sigma_theta)
type1_b <- rnorm(1000, type1_a, 0.4)  # TE ~ N(theta, se), hierarchical version

type_2 <- rnorm(1000, 1, sqrt(0.5^2 + 0.4^2))  # TE drawn directly from the marginal

mean(type1_b)
sd(type1_b)

mean(type_2)
sd(type_2)
  2. You are right that publication depends on significance - but that means that the publication probability of a study depends on its realized outcome. Before we know the outcome there is a probability that it will fall into each one of three ranges: significant negative, non-significant, significant positive. I calculate the probability that it will fall into each range (the probability of the actual results), where the relative publication probability differs by range (lower if the result falls in the non-significant range); see the numerical sketch below.
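
Roughly, the calculation for a single study looks like this (an illustrative R sketch with made-up values for mu, tau, se, and the relative probabilities, using plain pnorm instead of the normal_lcdf/softmax construction in the Stan code):

mu  <- 0.25    # global average effect (illustrative)
tau <- 0.10    # genuine between-study sd (illustrative)
se  <- 0.15    # standard error of this particular study
sd_marg <- sqrt(tau^2 + se^2)   # sd of the coefficient before the study is run
cut <- 1.96 * se                # significance threshold for this study

p_range <- c(
  s_minus = pnorm(-cut, mu, sd_marg),                           # significant negative
  ns      = pnorm(cut, mu, sd_marg) - pnorm(-cut, mu, sd_marg), # non-significant
  s_plus  = 1 - pnorm(cut, mu, sd_marg)                         # significant positive
)

rel_obsprob   <- c(s_minus = 0.2, ns = 0.2, s_plus = 1)  # relative publication probabilities
type_pub_prob <- p_range * rel_obsprob / sum(p_range * rel_obsprob)
round(type_pub_prob, 3)  # probabilities that this study, conditional on being published, is of each type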

I thought I’d seen this discussion before (What is wrong with this marginalize-out trick – Yuling Yao's Blog) – though I am not sure if the setup is exactly the same.


Hi @Ole_Rogeberg, I do agree with @js592 that the equivalence between the two models is not entirely trivial, but it is not my major concern at the moment. I wanted to say that I updated my model in a way similar to yours, where I also include the normalizing constant in the likelihood, but it now performs really badly, with lots of divergences, low ESS, and poor R-hat.

My model is a simpler version of yours, where I assume that studies with |t_i| \geq 1.96 have publication probability \delta_{p} (which I initially assumed equal to one, for simplicity), while studies with |t_i| < 1.96 have publication probability \delta_{np}.

Thus, if |t_i| < 1.96, the likelihood becomes:

p(\hat{TE}_i \mid \theta_i, \hat{se}_i, Z_i = 1)
= \frac{\mathbb{P}(Z_i = 1 \mid |t_i| < 1.96)\, \mathcal{N}(\theta_i, \hat{se}_i)}{\sum \limits_{i=1}^{N}\big[ \mathbb{P}(Z_i = 1 \mid |t_i| < 1.96)\,\mathbb{P}(|t_i| < 1.96) + \mathbb{P}(Z_i = 1 \mid |t_i| \geq 1.96)\,\mathbb{P}(|t_i| \geq 1.96) \big]} \\
= \frac{\delta_{np}\, \mathcal{N}(\theta_i, \hat{se}_i)}{\sum \limits_{i=1}^{N}\big[\delta_{np}\,\mathbb{P}(|\frac{\hat{TE}}{\hat{se}}| < 1.96) + \delta_p\,\mathbb{P}(|\frac{\hat{TE}}{\hat{se}}| \geq 1.96) \big]}

A similar reasoning holds for the case when |t_i| \geq 1.96. Since \hat{TE} \mid \theta, \hat{se}_i \sim \mathcal{N}(\theta_i, \hat{se}_i), then \frac{\hat{TE}}{\hat{se}} \sim \mathcal{N}(\theta_i, 1) and the normalizing constant becomes:

\sum \limits_{i=1}^{N}\big[\delta_{np}\big(CDF_{\mathcal{N}(\theta_i, 1)}(1.96) - CDF_{\mathcal{N}(\theta_i, 1)}(-1.96)\big) + \delta_p\big(2 \cdot CDF_{\mathcal{N}(\theta_i, 1)}(-1.96)\big) \big]

This is how I implement it in the model block:

model {
  // Calculating the normalization constant
  vector[N] log_normalization_individual_1;
  vector[N] log_normalization_individual_2;
  vector[N] log_normalization_individual;
  
for (i in 1:N) {
    log_normalization_individual_1[i] = log(delta_np) + log_diff_exp(normal_lcdf(1.96 | theta[i], 1), normal_lcdf(-1.96 | theta[i], 1)); 
    log_normalization_individual_2[i] = log(delta_p) + log(2) + normal_lcdf(-1.96 | theta[i], 1);
    log_normalization_individual[i] = log_sum_exp(log_normalization_individual_1[i], log_normalization_individual_2[i]);
  }
  
  real log_normalization = sum(log_normalization_individual);

  // Likelihood for observed (published) outcomes
  for (i in 1:N) {
    // Compute the t-statistic
    real t_stat = y[i] / se[i];
    real prob_pub_cond;
    if (fabs(t_stat) < 1.96) {
      prob_pub_cond = delta_np;  // Lower probability of publication for non-significant results
    } else {
      prob_pub_cond = delta_p;    // Higher publication probability for significant results
    }

    // Adjust the likelihood for publication bias
    target += normal_lpdf(y[i] | theta[i], se[i]) + log(prob_pub_cond) - log_normalization;
  }

  // 2nd hierarchical layer: latent true effects
  theta ~ normal(X * beta, sigma_theta);
  
  // 3rd hierarchical layer: priors
  beta ~ normal(0, 2);
  sigma_unif ~ uniform(0, 1);
  
  // Prior on publication bias effect
  delta_np ~ beta(2, 2);  // Prior for delta_np
  delta_p ~ beta(10, 2);  // Prior for delta_p, concentrated towards 1
}

I am not sure what is wrong, though, any idea?

Hi. I am a bit stretched on time these days, but I asked ChatGPT (o3-mini-high) to compare the original and revised Stan code and explain potential issues:

The issue is twofold:

  1. The revised code computes a separate normalization constant for each observation but then sums these into one global constant and subtracts it in every likelihood term. This over-corrects the likelihood by effectively applying the normalization N times instead of once per observation.

  2. There’s an inconsistency in parameter naming. The likelihood uses a parameter called sigma_theta while the prior is assigned to sigma_unif. This mismatch can cause further problems in model estimation.

To fix the model, subtract the individual normalization term for each observation (i.e., use log_normalization_individual[i] in the likelihood for observation i) and ensure that the parameter name is consistent (e.g., use sigma_theta throughout).

Hope this is helpful.
PS: Does my code work if you just run it on your data? Is there something specific you need from your model that does not come out of this older one, or could you potentially just use that code?

Hi @Ole_Rogeberg, sigma_theta is a transformed parameter used to sample from a half-Cauchy, so that is not an issue. I am also pretty sure the normalization constant should be calculated by summing over all i’s.

I need to adapt your code to my model, since in general I am not using a Normal-Normal model, and there is also the issue raised by @js592.

Let me know whenever you have more bandwidth and thanks again for the help.

I don’t quite follow your model, but you seem to have a parameter for the publication probability of both significant and non-significant studies. I don’t think that makes sense. We only have data on published papers and cannot estimate the absolute probability of publication - but we can estimate how much less likely non-significant (or significant-but-negative) results are to be published, relative to the (unknown) publication probability of significant-and-positive results; see the small check below.
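
A quick way to see the identifiability point (my own toy illustration, not from the paper's code): scaling both publication probabilities by the same constant leaves the normalized observation probabilities, and hence the likelihood, unchanged, so only the ratio is estimable.

p_range <- c(s_minus = 0.05, ns = 0.45, s_plus = 0.50)  # made-up range probabilities
w1 <- c(0.2, 0.2, 1.0)       # one set of publication probabilities
w2 <- 0.5 * w1               # everything half as likely to be published

all.equal(p_range * w1 / sum(p_range * w1),
          p_range * w2 / sum(p_range * w2))   # TRUE: the data cannot tell these apart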

It’s been some time since I worked on the model I shared, so I needed to go through my own logic stepwise here:

  1. Data block:
    1. We start with study estimates (est_pars) and standard errors (est_se)
  2. Transformed data block. For each study, we
    1. use the standard error to define the thresholds where that study would become significant or not, and
    2. Class the study as being of type 1 (negative coefficient and significant), 2 (non-significant) or 3 (positive coefficient and significant)
  3. Parameters block. Define three parameters:
    1. Global mu: Average effect parameter across all study contexts
    2. mu_sigma: genuine parameter heterogeneity across study contexts
    3. rel_obsprob: taking the publication probs of a study with positive significant results as the reference - the relative probability that a study of type 1 or 2 would be published.
  4. Transformed parameters
    1. type_pub_prob and temp_pub_raw: When a study is initiated, the coefficient is the result of two conceptual draws: first you draw a parameter that gives you the true (but unobserved) effect parameter valid in your current context, and then you draw an estimate of this parameter that also reflects the standard error of your estimate. For any given standard error, we can thus calculate the probability that a new study with that standard error will have a coefficient falling into each of the three ranges that we identified in the transformed data block (that is: significant negative or s-, non-significant or ns, significant positive or s+). We use normal_lcdf to calculate these probabilities and shift the probability of s- and ns being published down. These are on the log scale and are kept as temp_pub_raw (three numbers for each study), and they are transformed into probabilities using softmax, which gives us type_pub_prob. These are the probabilities that a randomly initiated study with a given standard error will be observed as a publication of type 1, 2 or 3 respectively.
  5. Model block. What we want to do in the model block is to express the probability of each observed study using the model parameters. For a study of type 1 (negative and significant) this is the probability of drawing a study context and parameter estimate that gives the observed coefficient - but scaled down by the relative observation probability of the s- type of results. We also normalise by temp_norm so that the sum of the probabilities across s-, ns and s+ jointly sum to 1. To do this we divide by the un-normalized sum of the temp_pub_raw for that study (using log_sum_exp).

Now - looking at my code this is NOT WHAT I DO. Instead I normalize in the model block using the log_sum_exp of temp_pub_raw across all studies. This is not correct and since the paper is published that is super annoying….

The normalisation factor needs to be study specific. To see this, let’s take an extreme example and assume no ns or s- studies are ever published. If I have a super-precise study (very low standard error) and the overall global_mu is non-zero and positive and there is little between-study genuine parameter variation, I will have almost zero probability of drawing a significant negative result. In the limit, I may be almost certain to have a significant and positive result. In that case, the study is not at risk of publication bias even before the estimate is observed. If I have a very imprecise study, on the other hand, and there is substantial parameter variation across contexts, then the study has a non-negligible probability of being of all three types (s-, ns, s+). We would only observe it if the realised coefficient was s+, however, and the probability of the observation would reduce to the conditional probability of taking the observed value conditional on being > 1.96 * est_se.
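
To put rough numbers on that extreme example (an illustrative calculation with made-up mu and tau, not output from the model): the probability that a study comes out significant-positive before its estimate is observed depends strongly on its standard error, which is why the normalization has to be study specific.

# P(significant positive) before the estimate is observed, for a given standard error
p_splus <- function(se, mu = 0.25, tau = 0.05) {   # illustrative mu and tau
  1 - pnorm(1.96 * se, mean = mu, sd = sqrt(tau^2 + se^2))
}
p_splus(se = 0.02)  # ~ 1: a very precise study is essentially never at risk of nonpublication
p_splus(se = 0.30)  # ~ 0.13: an imprecise study will often land in the suppressed ranges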

I went back to the data and ran the corrected model, and luckily it doesn’t change much (in my case, nothing substantive) - but still annoying.

Anyway - I hope that helps to make my model a bit more understandable.

I just want to second what @Corey.Playte and @jsocolar pointed out—this is the problem with the model.

Imagine you’re trying to estimate a rate and you only observe successes. Then you’re going to drive that rate to 1 with every observation and the only thing keeping it away is the prior. That means as your data size goes to infinity, your estimate goes to 1, which is not what you want. You want a model that converges to the true value as data size goes to infinity.

Thanks for the contribution—we really do appreciate people engaging on Discourse.

Going forward, please don’t just cut-and-paste GPT into replies if you can’t independently verify its answers. If people want unedited LLM output, they can generate it themselves.
