Error-in-variables regression with unobserved discrete predictors

Consider the following error-in-variables regression where we are interested in the linear effect of an integer-valued predictor with known misclassification rates

Y ~ Normal (a + b*X, sigma)
W|X=x_j ~ Discrete with support 1:k, probabilities P_j \in [0,1]^k s.t. \sum_i P_{ji} = 1

Y and W are both observed, but X is not. However, the unconditional (discrete) distribution of X is known, as are the k vectors of probabilities P_1,…,P_k for each possible value of W conditioned on X.

I’m struggling to implement this model because 1. including X as an integer-valued parameter is illegal and 2. I can’t see how to estimate the slope b if I marginalize over the possible values of X as discussed in chapter 13 of the Stan manual.

I.e., the model I’d like to be able to fit is

data {
    int<lower=0> J;
    real y[J];
    int w[J];
}

parameters {
    int x[J]; // uncontaminated predictor
    real mu_y; // prior on mean of Y
    real<lower=0,upper=1> maf; // minor allele frequency
    real<lower=0> sigma_y; // residual SD of Y
    real alpha; // intercept
    real beta; // slope
}

model {
    for (i in 1:J) {
        w[i] ~ imputationError(x[i], .2, .3, .5, .2, .4, .4, .1, .8, .9);
        x[i] ~ binomial(2, maf);
        y[i] ~ normal(alpha + beta*x[i], sigma_y);
    }
    alpha ~ cauchy(0,10);
    beta ~ cauchy(0,3);
}

Where imputationError is a user-defined integer-valued mass function.
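For concreteness, I was imagining declaring it in a functions block, something along these lines (just a sketch; here the misclassification probabilities are packed into a matrix rather than passed as the separate reals above):

functions {
    // log Pr(W = w | X = x): rows of P are indexed by the true value x
    // (x runs over 0:2 under the binomial, hence the +1), columns by w
    real imputationError_lpmf(int w, int x, matrix P) {
        return log(P[x + 1, w]);
    }
}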

Am I missing something? Would love to know if this is within Stan’s capabilities!

Maybe. I’m not quite sure what your imputationError distribution is supposed to be. Is there a lower or upper bound on that somehow? If not, you won’t be able to marginalize. Assuming you know what your imputationError distribution is supposed to do, you can code this as:

for (j in 1:J) {
  vector[3 * K] lp;
  int pos = 1;
  for (i in 0:2)
    for (k in 1:K) { // or whatever the support of x[i] is
      lp[pos] = binomial_lpmf(i | 2, maf)
                + normal_lpdf(y[j] | alpha + beta * k, sigma_y)
                + imputationError_lpmf(k | ...);
    }
  target += log_sum_exp(lp);
}

But this critically relies on being able to enumerate the possible w[i] values. Here I’ve assumed w[i] takes on values 1:K, so there are K of them (that number’s used in the declaration of lp). If you have a more general set of possible w[i] values and there are K of them, then you can replace the k in beta * k and k | ...) with w_val[k], where w_val is the vector or array of possible values for w.
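For instance, assuming a data declaration along these lines (K and w_val are just illustrative names):

data {
  int<lower = 1> K;      // number of possible values
  int w_val[K];          // the possible values themselves
  // ... rest of the data block as before
}

and then the loop body uses beta * w_val[k] in place of beta * k.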

Fortunately w[i] has finite support, so I’m hopeful that should work; will let you know. Thanks

I think your solution will work for my situation, but I’m now struggling with my lack of Stan knowledge (long-time frequentist frustrated with the lack of flexibility of available software for dealing with unusual measurement error models). In particular, I’m unsure how the integer pos functions in the code you posted.

I’ve created a simplified scenario with a reproducible example: We want to estimate beta in

Y|X=x ~ N(alpha + beta * x, sigma_Y)
W|X=x ~ Categorical(support = 1:3, probabilities = pi_x)

where W and X are both supported on {1,2,3}, only W and Y are observed, but the unconditional distribution of X and the conditional distributions of W given X = x for x = 1,2,3 are known.

Simulated data (in R):

set.seed(7571)
J     <- 1e3                ## sample size
pX    <- runif(1,.2,.5)     ## binomial p for unconditional X distr
pVec  <- c(pX^2,            ## for use with categorical()
           2*(1-pX)*pX, (1-pX)^2)
X     <- rbinom(J,2,pX)     ## true values of X (without the error u)
W     <- ((X + rbinom(J,2,  ## misclassified values of X, plus 1 for use
                    runif(1,.01,.1))) %% 3) + 1 ## with categorical()
piMat <- table(X,W)/        ## categorical probs Pr(W = j | X = i)
    rowSums(table(X,W))     ## for i,j in 1:3
Y     <- 2*X + rnorm(J,0,12) ## observed outcome

(lm(Y~I(1+X)))$coef         ## benchmark estimate using the true X
(naive <- lm(Y~W))$coef     ## naive estimate of beta is attenuated

## attempt to run
m_norm <- stan(file="imputation_error3.stan",
               data = list(J=J, y=Y, w=W,
                           pVec=pVec, piMat = piMat),
               pars = c("alpha","beta","sigma_y"))

Model definition:

data {
    int<lower=0> J;
    real y[J];
    int  w[J];
    simplex[3] pVec;
    matrix[3,3] piMat;
}

parameters {
    real<lower=0> sigma_y; // residual SD of Y
    real alpha; // intercept
    real beta; // slope
}

model {
    for (j in 1:J) {           // for each observation
        vector[3 * 3] lp;      // will contain marginal probabilities
        int pos = 1;
        for (i in 1:3) {       // support of w
            for (k in 1:3) {   // support of x
                pos = i*k;
                lp[pos] = categorical_lpmf(i | pVec) // Pr(X = i)
                + normal_lpdf(y[j] | alpha + beta * k, sigma_y) // Pr(Y | X = k)
                + bernoulli_lpmf(1 | piMat[k,w[j]]); // Pr(X = k | W = j)
            }
        }
        target += log_sum_exp(lp);
    }

    alpha ~ cauchy(0,5);
    beta ~ cauchy(0,5);
    sigma_y ~ normal(0,100);   // truncated to [0,infty)
}

Any insight anyone might have as to why the above generates -infty log likelihoods and fails would be greatly appreciated!

Alternatively, if this is better suited for cross validated/stack overflow, let me know!

P.S., here’s a working version of the model in JAGS in case what I’m trying is still unclear, but would love to get this working in Stan as it will be applied to datasets with hundreds of thousands of observations:

model {
    ## Priors
    alpha ~ dnorm(0, 1)
    beta ~ dnorm(0, 1)
    sdy ~ dunif(0, 100)
    tauy <- 1 / (sdy * sdy)
    tauw ~ dunif(.03, .05)

    ## Likelihood
    for (i in 1:J) {
        y[i] ~ dnorm(mu[i], tauy)
        mu[i] <- alpha + beta*x[i]
        w[i] ~ dcat(piMat[x[i], ])
        x[i] ~ dcat(pVec)
    }
}

I now have Stan code that runs but gives me the wrong answer (the correct answer is checkable as the data are simulated). But for the life of me I cannot figure out how it differs from the above JAGS code:

data {
    int<lower=0> J;
    real y[J];
    int  w[J];
    simplex[3] pVec;
    vector[3] piMat[3];
}

parameters {
    real<lower=0> sigma_y; // residual SD of Y
    real alpha; // intercept
    real beta; // slope
}

model {
    for (j in 1:J) {       // for each observation
        vector[9] lp;      // will contain marginal probabilities
        int pos;
        for (k in 1:3) {   // support of x
                lp[k] = normal_lpdf(y[j] | alpha + beta * k, sigma_y);
                pos = 3+k;
                lp[pos] = categorical_lpmf(w[j] | piMat[k]);
                pos = 6+k;
                lp[pos] = categorical_lpmf(k | pVec);
        }
        target += log_sum_exp(lp);
    }

    alpha ~ normal(0,1);
    beta ~ normal(0,1);
    sigma_y ~ normal(0,4);   // truncated to [0,infty)
}

Sorry, there should’ve been a pos += 1 in the loop to increment it! It’s keeping track of position.

That’s not going to work. The way you first had it a few posts ago was closer, but it should just have pos += 1 at the end of the for loop for k. We’re just marginalizing using the law of total probability:

p(a) = SUM_b p(a, b)

and doing it on the log scale, where

log(a + b) = log_sum_exp(log(a), log(b))

The manual chapter on latent discrete parameters should provide some more guidance.

What I’m missing is why there was a nested loop in the example you posted when W is observed. Following the LoTP, shouldn’t we have

Pr(Y = alpha + beta X)
    = sum_i sum_k Pr(Y = alpha + beta X | W = i, X = k) Pr(W = i, X = k)

and since Pr(W = i, X = k) = 0 whenever i \neq w[j], we then have

    = sum_k Pr(Y = alpha + beta X | W = w[j], X = k) Pr(W = w[j], X = k)
    = sum_k Pr(Y = alpha + beta X | W = w[j], X = k) Pr(W = w[j] | X = k) Pr(X = k)

which I think is translated into Stan as

for (j in 1:J) {       // for each observation
    vector[3] lp;      // will contain marginal probabilities
    int pos = 1;
        for (k in 1:3) {   // support of x
            lp[pos] = categorical_lpmf(w[j] | piMat[k]) // Pr (W = w[j] | X = k)
                + normal_lpdf(y[j] | alpha + beta * k, sigma_y) // f_Y(alpha + beta*X | X = k, sigma_y)
                + categorical_lpmf(k | pVec); // Pr(X = k)
            pos += 1;
        }
    target += log_sum_exp(lp);
}

though I’m clearly mistaken as this doesn’t produce the right answer.

In the chapter on latent discrete variables there’s a similar example (noisy misclassification error) where W (as well as X) is unobserved and a nested loop is used to marginalize over both W and X, but I don’t see how that would work in my example. (Just to refresh your memory, W is an observed measurement of X with known misclassification probabilities conditional on X, and the unconditional distribution of X is also known.)

Sorry for the volume of posts, I really appreciate your help so far.

Pr(Y = alpha + beta X) ???

That’ll be zero, because the exact identity will be a measure-zero event. It’s going to be equal to those things plus some epsilon.

You do have the right general form for marginalizing out two discrete variables.

I’m getting a bit lost in the chain of details for your model. If you can write it down in generative form with discrete parameters and sampling, as you do in BUGS, I can help with the marginalization.

LoTP = Law of Total Probability

Sure, of course, I mean the normal density evaluated at alpha + beta X. Sub “f” for “Pr” as in my code above.

Sorry for the confusion. The generative form is

  • Y | X = k ~ Normal(alpha + beta k, sigma_y)
  • W | X = k ~ Discrete with support {1,2,3}, probabilities pi[k] (pi is an array of simplexes, pi[k] is a simplex)
  • X ~ Discrete with support {1,2,3}, probabilities theta (also a simplex)
  • plus some priors on alpha, beta, sigma_y

Y and W are observed. X is latent. The parameters pi[k] for k = 1,2,3 and theta are known.

The likelihood for Y = y[j], W = j given everything else should be:

sum over the support of X (indexed by k) of f(alpha + beta * k, sigma_y) * pi[k][j] * theta[k]

where f is the normal density, pi[k][j] = Pr(W = j | X = k), and theta[k] = Pr(X = k).

In Stan, I think this should be

for (j in 1:J) {       // for each observation
    vector[3] lp;      // will contain marginal probabilities
    int pos = 1;
        for (k in 1:3) {   // support of x
            lp[pos] = categorical_lpmf(w[j] | piMat[k]) // Pr (W = w[j] | X = k)
                + normal_lpdf(y[j] | alpha + beta * k, sigma_y) // f_Y(alpha + beta*X | X = k, sigma_y)
                + categorical_lpmf(k | pVec); // Pr(X = k)
            pos += 1;
        }
    target += log_sum_exp(lp);
}

Or, more simply:

for (j in 1:J) {       // for each observation
    vector[3] lp;      // will contain marginal probabilities
    int pos = 1;
        for (k in 1:3) {   // support of x
            lp[pos] = log(piMat[k][w[j]])
                + normal_lpdf(y[j] | alpha + beta * k, sigma_y) // f_Y(alpha + beta*X | X = k, sigma_y)
                + log(pVec[k]);
            pos += 1;
        }
    target += log_sum_exp(lp);
}

Note that pVec is theta. I know this is a simple model and shouldn’t be so hard to implement, but I can’t seem to get the right answer. Thanks again

Thanks for the patient explanation. You’re right as to the form. I didn’t catch that W was observed. I think your indexing is getting messed up—too many variables named j for example. Did you intend for y and w to be conditionally independent given x?

It’s unusual to see a latent discrete parameter being used as a predictor.

OK, I figured it’d be easier to just do it than to keep going back and forth, so here’s the R simulation:

N <- 1000
alpha <- 1
beta <- 1
sigma <- 1
theta <- c(0.2, 0.3, 0.5)
pi <- matrix(c(0.2, 0.2, 0.6,
               0.7, 0.2, 0.1,
               0.25, 0.5, 0.25),
             nrow = 3, ncol = 3, byrow = TRUE)
y <- rep(NA, N);
w <- rep(NA, N)
for (n in 1:N) {
  x <- sample(1:3, size = 1, replace = FALSE, prob = theta)
  w[n] <- sample(1:3, size = 1, replace = FALSE, prob = pi[x, ])
  y[n] <- rnorm(1, alpha + beta * x, sigma)
}

data <- list(N = N, y = y, w = w)
fit <- stan("marg.stan", data=data, chains=1)

Sorry my R is so primitive.

And here’s the Stan program:

data {
  int<lower = 0> N;
  vector[N] y;
  int<lower = 1, upper = 3> w[N];
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  simplex[3] theta;
  simplex[3] pi[3];
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 10);
  for (n in 1:N) {
    real lp[3];
    for (k in 1:3)
      lp[k] = categorical_lpmf(k | theta)
          + categorical_lpmf(w[n] | pi[k])
          + normal_lpdf(y[n] | alpha + beta * k, sigma);
    target += log_sum_exp(lp);
  }
}

And here’s the fit:

             mean se_mean   sd      10%      50%      90% n_eff Rhat
alpha        0.71    0.04 0.58    -0.09     0.71     1.44   171    1
beta         1.09    0.02 0.23     0.81     1.07     1.41   163    1
sigma        1.05    0.01 0.08     0.94     1.05     1.15   173    1
theta[1]     0.14    0.01 0.08     0.06     0.13     0.25   251    1
theta[2]     0.38    0.01 0.11     0.23     0.38     0.52   162    1
theta[3]     0.48    0.01 0.11     0.36     0.48     0.61   246    1
pi[1,1]      0.21    0.01 0.12     0.06     0.21     0.37   453    1
pi[1,2]      0.17    0.00 0.09     0.06     0.16     0.28   562    1
pi[1,3]      0.62    0.01 0.13     0.46     0.62     0.78   570    1
pi[2,1]      0.60    0.01 0.13     0.46     0.58     0.78   341    1
pi[2,2]      0.15    0.01 0.09     0.03     0.15     0.26   310    1
pi[2,3]      0.25    0.01 0.11     0.10     0.25     0.38   382    1
pi[3,1]      0.32    0.00 0.06     0.25     0.33     0.39   502    1
pi[3,2]      0.48    0.00 0.07     0.43     0.48     0.56   333    1
pi[3,3]      0.19    0.00 0.06     0.14     0.19     0.24   291    1
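And if you want to plug in your known pVec and piMat rather than estimating theta and pi, one option (a sketch, not tested) is to move those declarations from the parameters block to the data block and pass the known values in:

data {
  int<lower = 0> N;
  vector[N] y;
  int<lower = 1, upper = 3> w[N];
  simplex[3] theta;    // known unconditional distribution of x (your pVec)
  simplex[3] pi[3];    // known rows of Pr(w | x) (your piMat)
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 10);
  for (n in 1:N) {
    real lp[3];
    for (k in 1:3)
      lp[k] = categorical_lpmf(k | theta)
          + categorical_lpmf(w[n] | pi[k])
          + normal_lpdf(y[n] | alpha + beta * k, sigma);
    target += log_sum_exp(lp);
  }
}

The marginalization over x is unchanged; only the source of theta and pi moves.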

No, thank you! This is very helpful!

I did. The application is a bit unusual: we’ve imputed genetic polymorphisms (W) and have some external validation data in which both the imputed polymorphisms (W) and the actual polymorphisms (X) are measured. We then want to make inferences about the relationship between some trait Y and X in a sample where only Y and W are available. But the imputation is totally independent of Y, so Y and W should be conditionally independent given X.

The samples we’ll be applying these analyses to have hundreds of thousands of observations, so getting this working in Stan rather than JAGS/BUGS is appealing.

Thanks again for your help!