Measurement error for predictors in GLM

I am trying to construct a linear model with a poisson measurement error for both predictors and outcomes. The model currently looks like this:

X ~ Poisson(X_true)
Y_mean = X_true * beta
Y ~ Poisson(Y_mean)
beta ~ Normal(0, sigma)
X_true ~ Lognormal(a,b)
X_true, beta >= 0

where X and Y are both matrices of measured count data (single cell RNA-seq), X_true and Y_mean are parameters, a and b are guessed separately from data and sigma is a user-specified constant, currently set to 0.05 (I expect most coefficients to be close to 0).

I have two qeustions about this model:

  1. Is the prior for X_true necessary? The actual distribution of X is quite complex and hard to capture in a reasonable prior - my guess is that bad prior may hinder convergence. I also noticed than in some examples on the forums as well as in McElreath’s book (Chapter 14.1.), no prior is given for parameters corresponding to measurement error on predictors.

  2. The number of parameters grows quickly - it is dominated by the size of X_true - and it takes prohibitively long to fit real datasets (the smallest I am really interested in has X of dimension 20286 * 124, currently at 5% after 20hours of computation). Initializing X_true with X greatly sped up convergence for small simulated datasets, but seems to have made little difference for the real data. Since I am not really interested in the latent X_true, I tried to integrate it out, but failed. My calculus is however rusty - are there any examples of models where it is possible to integrate out non-normal measurement error on predictors? Or is it known to be impossible to integrate out the measurement error in some classes of models?

Thanks for any hints.

Below is the Stan model for completeness:

data {
  int num_cells;
  int num_regulators;
  int num_targets;
  int<lower=0> regulator_expression[num_regulators,num_cells];
  int<lower=0> target_expression[num_targets,num_cells];
  real<lower=0> w_prior_sigma;

  real<lower=0> regulator_prior_sigma;
  real<lower=0> regulator_prior_mean;
}

parameters {
  matrix<lower=0>[num_regulators, num_targets] w;
  matrix<lower=0>[num_cells, num_regulators] regulator_expression_mean;
}


model {
  matrix[num_cells, num_targets] target_expression_mean = regulator_expression_mean * w;
  for(reg in 1:num_regulators) {
    regulator_expression[reg,] ~ poisson(regulator_expression_mean[,reg]);
  }

  for(target in 1:num_targets) {
    target_expression[target,] ~ poisson(target_expression_mean[,target]);
  }

  for(reg in 1:num_regulators) {
    regulator_expression_mean[,reg] ~ lognormal(regulator_prior_mean, regulator_prior_sigma);
  }

  for(target in 1:num_targets){
    w[,target] ~ normal(0, w_prior_sigma);
  }
}

You can’t have Poisson measurement error in predictors because Stan doesn’t support discrete parameters. But you can declare X_true as continuous, as you seem to have done.

Priors are only necessary insofar as the posterior must be proper. If you want a generative joint model, then X_true has to be given a prior distribution because the flat prior on positive real values is improper. Gelman et al.'s BDA also leaves out a lot of priors, but I believe Gelman’s changing his recommended practice going forward to include priors everywhere (publishers already nagging him for BDA 4!).

The speed will depend on how efficiently you coded your model as a Stan program (scale, identification, vectorization, shared computation, initialization, etc.) as well as on how well the model fits the actual data.

How many iterations are you trying?

That’s the right question, but I don’t know the answer. It’d be more efficient to integrate out parameters (marginalize) due to Rao-Blackwell. Obviously, MCMC is integrating out the measurement error in the explicit encoding, but it’s not very efficient.

Thanks for the input. So I guess the message is that if you cannot formulate a prior for your noisy predictors then you should learn more about your data-generating process :-)

I am aiming for the default 4000.

You also know something from the scale of your predictors and output, since you know the form of your likelihood.

That’s the default number of post-warmup iterations with four chains (2000 iterations equally divided between warmup and sampling).