A collaborator of mine has run an imputation model on a huge dataset of individual-level trait observations. The output of this model is 600k trait imputed trait values + posterior standard deviations.
I wish to post-process these imputed values with a hierarchical model to calculate species-level trait distributions that takes into account imputation uncertainty. I know I can achieve what I want (given infinite time) using a measurement error model such as the following (simplified) example:
data{
int<lower=0> N; // number of observations
int<lower=0> N_spec; // number of species
vector[N] trait; // trait values
vector[N] trait_se; // posterior standard deviation
int<lower=1,upper=N_spec> spec_ind[N]; // species indicator
}
parameters{
vector[N] true_trait;
vector[N_spec] species_mean;
real<lower=0> sigma;
}
model {
// Priors
species_mean ~ normal(0,1);
sigma ~ exponential(1);
// Measurement model
trait ~ normal(true_trait,trait_se);
// Species means
for(i in 1:N) true_trait ~ normal(species_mean[spec_ind[i]],sigma);
}
But the problem with the above specification is that it creates 600k latent variables that need to be sampled and stored in memory. Sampling takes forever and my machine frequently crashes, even with a reduced dataset and even if I tell rstan not to save the latent variables. With finite computational resources and time, that specification doesn’t seem like a good option for me.
I was thinking of using a random number generator to replace the original trait observations with draws from the normal approximated imputed posterior like this:
parameters{
vector[N_spec] species_mean;
real<lower=0> sigma;
}
model {
vector[N] true_trait = normal_rng(trait,trait_se); // Sample observations from the approximate imputed posterior;
// Priors
species_mean ~ normal(0,1);
sigma ~ exponential(1);
// Species means
for(i in 1:N) true_trait ~ normal(species_mean[spec_ind[i]],sigma);
}
But stan doesn’t allow using random number generators inside the model block. Is there any other way to incorporate uncertainty in the response value to such a model? Am I hopelessly stuck? Must I abandon the fully bayesian way?