Data augmentation model possible in Stan?


I have been trying to work out whether it is possible to write a data augmentation model for an unknown number of missing studies in Stan - specifically the one in Givens, Smith, and Tweedie here: The essence of the model is to simulate missing studies to correct for publication bias in meta-analysis. The approach is to iteratively simulate studies from the model, assuming the number of missing studies across a set of categories is distributed negative-binomially and then drawing a simulated study mean and S.D. from their respective distributions, and then use those simulated studies as data in the model. I have searched the forums and manual for ways of potentially doing this. I understand that it is not possible to use posterior simulations as data. And Stan can’t sample latent discrete variables. The examples of marginalization in the manual and online relate to latent discrete variables to select among a known number of categories. I was wondering if there might be any other possible solution to this with Stan. Any help appreciated!


To follow up - perhaps I’ve not got my head around the marginalisation of discrete latent variables. Does it make sense to treat the unknown number of unobserved studies, ‘n’, in the same way as categorical variables are treated in the manual? That is, to define a reasonable upper limit to the number of studies, then sum over all the values of ‘n’ up to the defined limit and then sum the log probabilities for each value of ‘n’?


This isn’t a direct answer to your question, but Kery and Schaub’s Bayesian Population Analysis book contains a number of data augmentation examples. And then you can find many of these examples worked up in Stan over on Github. That’s an interesting paper you reference on “publication bias.” And the topic has become even more timely lately with growing awareness of the replication crisis, so I sure hope you or someone comes back with this example worked up.


This is what’s typically done for a similar class of models in ecology (N-mixture models), e.g.,

For these types of models, the number of individuals is unknown and typically given a Poisson or negative binomial prior, and the observed number is a subset, with a binomial detection process (where the number of trials is the number of individuals).

Maybe that provides a useful starting point?


Thanks for the responses they’re very helpful. The problems you cite do seem very similar. The unknown number of studies is actually a vector of the number of missing studies across categories rather than a single integer. My strategy has been to generate all permutations of vectors up to a given number of missing studies and loop through these in Stan as you would values of a discrete random variable. I’ve copied the code below if it is of interest to anyone - it runs (very slowly) but gives seemingly sensible answers but I’m sure there’s a way to improve it - I’ll keep working on it.

  int N; //number of observed studies
  int K; //number of p-value categories
  int len; //length of permutations matrix
  vector[N] y; //observed log odds
  vector[N] se; //observed SEs
  vector[N] p; //observed p-values
  vector[K+1] thresh; //threholds of categories
  real Nk[K]; //numbers of observed studies in each category
  int perms[len,K]; //matrix of possible permutations of missing studies
  int m_max; //maximum number of missing studies in a category
transformed data{
  int m_total;
  m_total = m_max * K;
  real theta; //pooled mean
  real<lower=0> tau; //variance
  vector<lower=0.3,upper=1>[K] w; //probabilities for categories
  vector<lower=0>[m_total] sig_z;
  vector<lower=0,upper=1>[m_total] p_z_raw; //p-values different limits
transformed parameters{
  real<lower=0> mu[K]; //neg binomial scale parameter - convert from (n,p) parameterisation
  vector[m_total] p_z;
  for(k in 1:K)
    mu[k] = Nk[k]*(1-w[k])/w[k];
  for(k in 1:K){
    for(i in 1:m_max){
      p_z[(k+K*(i-1))] = thresh[k] + (thresh[k+1]-thresh[k])*p_z_raw[(k+K*(i-1))];
  // holders for obesrved plus missing data
  vector[N+m_total] X;
  vector[N+m_total] V;
  vector[N+m_total] P;
  vector[len] lp;
  lp = rep_vector(0, len);
  for(n in 1:N){
    X[n] = y[n];
    V[n] = se[n];
    P[n] = p[n];
  // place priors on missing data values data
  for(k in 1:K){
    for(i in 1:m_max){
      sig_z[(k+K*(i-1))]~ gamma(0.57415,5.647803);
      V[(N+k+K*(i-1))] = sig_z[(k+K*(i-1))] ;
      P[(N+k+K*(i-1))] = p_z[(k+K*(i-1))];
      X[(N+k+K*(i-1))] =  V[(N+k+K*(i-1))]*inv_Phi(P[(N+k+K*(i-1))]);
  //priors on parameters
  theta ~ normal(0,1);
  tau ~ normal(0,1);
  for(m in 1:len){
    for(j in 1:K){
      for(k in 1:K){
        for(i in 1:N){
          if(P[i]>=thresh[j] && P[i]<thresh[j+1]){
            lp[m] = lp[m] + normal_lpdf(X[i] | theta, V[i] + tau) + neg_binomial_2_lpmf(perms[m,k]|mu[k],Nk[k]);
        for(i in 1:perms[m,j]){
          lp[m] = lp[m] + normal_lpdf(X[(N+j+K*(i-1))] | theta, V[(N+j+K*(i-1))] + tau) + neg_binomial_2_lpmf(perms[m,k]|mu[k],Nk[k]);
  target += log_sum_exp(lp);


Was that mixture really intended to be up at the top level?

The main thing you want to do for efficiency is find repeated calculations and vectorization opportunities. When you have those conditionals on P[i] in the loops, this may be impossible—I don’t understand why some data points i are being excluded, but this kind of conditioning on parameters can lead to a non-differentiable density, which can reduce HMC to a random walk.

For instance, the priors on sig_z can just be a single statement and then you can assign with slicing syntax (no fater, but a lot less error prone).

inv_Phi is really inefficient. Yet it seems to be somehow modifying your data based on parameters.


I’d also recommend trying to lay out data structures in two dimensions rather than creating your own 2D indexing. You can always flatten or convert back to matrices using the built-ins.