Response variable with known error, alternatives for latent variable model?

A collaborator of mine has run an imputation model on a huge dataset of individual-level trait observations. The output of this model is 600k trait imputed trait values + posterior standard deviations.

I wish to post-process these imputed values with a hierarchical model to calculate species-level trait distributions that takes into account imputation uncertainty. I know I can achieve what I want (given infinite time) using a measurement error model such as the following (simplified) example:

data{
int<lower=0> N; // number of observations
int<lower=0> N_spec; // number of species
vector[N] trait; // trait values
vector[N] trait_se; // posterior standard deviation
int<lower=1,upper=N_spec> spec_ind[N]; // species indicator
}

parameters{
vector[N] true_trait;
vector[N_spec] species_mean;
real<lower=0> sigma;
}

model {
// Priors
species_mean ~ normal(0,1);
sigma ~ exponential(1);

// Measurement model
trait ~ normal(true_trait,trait_se);

// Species means
for(i in 1:N) true_trait ~ normal(species_mean[spec_ind[i]],sigma);
} 

But the problem with the above specification is that it creates 600k latent variables that need to be sampled and stored in memory. Sampling takes forever and my machine frequently crashes, even with a reduced dataset and even if I tell rstan not to save the latent variables. With finite computational resources and time, that specification doesn’t seem like a good option for me.

I was thinking of using a random number generator to replace the original trait observations with draws from the normal approximated imputed posterior like this:

parameters{
vector[N_spec] species_mean;
real<lower=0> sigma;
}

model {
vector[N] true_trait = normal_rng(trait,trait_se); // Sample observations from the approximate imputed posterior;

// Priors
species_mean ~ normal(0,1);
sigma ~ exponential(1);

// Species means
for(i in 1:N) true_trait ~ normal(species_mean[spec_ind[i]],sigma);
} 

But stan doesn’t allow using random number generators inside the model block. Is there any other way to incorporate uncertainty in the response value to such a model? Am I hopelessly stuck? Must I abandon the fully bayesian way?

I opted to weight the log posterior by the inverse of the posterior standard deviations of the imputed trait values instead.

transformed data {
  vector[N] weights;
  for(i in 1:N) weights[i] = 1/trait_se[i];
  }  
  for(i in 1:N){
    real linpred = mu + family_mu[family_ind[i]] + genus_mu[genus_ind[i]];
    target += normal_lpdf(trait[i]|linpred,sigma) * weights[i];
  }
1 Like

On further speculation, the actual weights should probably be inverses of the variances 1/trait_se[i]^2.

On even further thought, maybe it’s not a good idea to weight the log posterior by factors whose absolute values are effectively arbitrary. Instead, I checked how brms handles meta regression and am now using target += normal_lpdf(trait[i]|linpred,sqrt(square(sigmas[species_ind[i]]) + square(trait_se[i])));, so that the variance of each observation is the sum of imputation variance + intraspecific variance.