Measurement error - manual example on pp. 203-4

Dear Stan community,

I am interested in measurement error models, in particular conditional logit models, when some of the explanatory variables are measured with error. In order to get there, I have started from a far more simple example, planning to build up the complexity of the model. This simple example is a linear model with measurement error in x, when the variance of the measurement error is known. This is included in the manual on page 203.

The model is:
y_i = \alpha + \beta * x_i + \epsilon_i

i = 1, ..., N, where N is the number of observations

I am interested in estimating \alpha and \beta when x_i is measured with error. The most simple situation is that x_i^{meas} \sim N(x_i, \tau). Following Stan conventions, \tau is the standard deviation.

I have simulated data and then tried to recover the true parameters using the code in the manual on page 204, after minor adjustments to variables type to solve compile errors:

// y, x_meas, and x are changed to vector as compared "real" in the manual due to compile errors
data {
    int<lower=0> N;      // number of observations
    vector[N] y;         // outcome
    vector[N] x_meas;   // measurements of x
    real<lower=0> tau;  // measurement noise
}

parameters {
    real alpha;          // intercept
    real beta;           // slope
    real<lower=0> sigma; // variance of the error term in y = alpha + beta*x_true + eps
    vector[N] x;           // unknown true value
    real mu_x;             // prior location
    real<lower=0> sigma_x; // prior scale
}

model {
    alpha ~ normal(0, 10);
    beta  ~ normal(0, 10);
    sigma ~ cauchy(0, 5);
    mu_x  ~ normal(0, 10);    
    sigma_x ~ cauchy(0, 5);   

    x ~ normal(mu_x, sigma_x); // prior
    x_meas ~ normal(x, tau); // measurement model   
    y ~ normal(alpha + beta * x, sigma);      
}

I seem to have some problems retrieving the parameters:

  • there are divergent transitions after warmup
  • n_eff is very low and Rhat is too high
Warning messages:
1: There were 42 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 58 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
4: Examine the pairs() plot to diagnose sampling problems
 
> print(res_LM_stan_example_V1, pars = c("alpha", "beta", "sigma", "mu_x", "sigma_x"))
Inference for Stan model: LM_stan_example_V1.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha   0.87    0.02 0.17 0.54 0.77 0.87 0.98  1.21    57 1.04
beta    4.04    0.00 0.03 3.97 4.01 4.04 4.06  4.10    65 1.04
sigma   0.65    0.07 0.23 0.25 0.47 0.65 0.80  1.15    13 1.57
mu_x    4.05    0.00 0.07 3.92 4.01 4.05 4.10  4.19  1573 1.00
sigma_x 2.98    0.00 0.05 2.89 2.95 2.98 3.02  3.09   813 1.00

Samples were drawn using NUTS(diag_e) at Wed Nov 14 18:29:56 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

> true_alpha
[1] 1
> true_beta
[1] 4
> true_sigma
[1] 1
> true_mu_x
[1] 4
> true_sigma_x
[1] 3

Based on the example on page 152 in the manual I have reparametrized the cauchy and moved x ~ normal(mu_x, sigma_x) to the transformed parameters block. These changes are included in LM_stan_example_V2.stan. The results don’t improve that much. I still get divergent transitions and now n_eff is very low for sigma, mu_x, and sigma_x, even though it improves for alpha and beta.

> print(res_LM_stan_example_V2, pars = c("alpha", "beta", "sigma", "mu_x", "sigma_x"))
Inference for Stan model: LM_stan_example_V2.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha   0.86    0.00 0.16 0.55 0.75 0.86 0.97  1.16  1434 1.00
beta    4.04    0.00 0.03 3.98 4.02 4.04 4.06  4.10  1112 1.00
sigma   0.62    0.04 0.18 0.25 0.49 0.62 0.75  0.97    18 1.18
mu_x    4.04    0.01 0.07 3.91 4.00 4.04 4.09  4.20    42 1.10
sigma_x 2.97    0.01 0.06 2.87 2.94 2.97 3.01  3.09    65 1.06

Samples were drawn using NUTS(diag_e) at Wed Nov 14 18:32:51 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1). 

There’s probably a silly error in my code, but I can’t seem to find it. Or maybe I should use some other priors. It can also be that I am missing out something more fundamental. Either way, I would really appreciate some suggestions on what I can improve. I have attached the R and Stan files.

Best,
Ana

LM_stan_example.R (2.0 KB)
LM_stan_example_V1.stan (896 Bytes)
LM_stan_example_V2.stan (1.1 KB)

1 Like

One solution might be to use stronger priors in place of mu_x ~ normal(0, 10); and sigma_x ~ cauchy(0, 5);. Another might be abandoning the hierarchical structure like in this post. So instead of:

x^{meas} = \mu_x+\epsilon_{\tau}+\epsilon_{\sigma_x}

you would just have

x^{meas} = \mu_x+\epsilon_{\sigma_x}

with \tau potentially informing the prior on \sigma_x

Edit: even with stronger priors mu_x ~ normal(4, 1); and sigma_x ~ cauchy(3, 1); I still got poor results with 100 divergent transitions after warmup.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5%
alpha   0.95    0.01 0.16 0.64 0.84 0.94 1.06  1.27
beta    4.02    0.00 0.03 3.96 4.00 4.02 4.05  4.09
sigma   0.83    0.05 0.24 0.41 0.65 0.82 0.98  1.37
mu_x    4.00    0.00 0.07 3.86 3.96 4.00 4.05  4.14
sigma_x 2.98    0.00 0.05 2.88 2.95 2.98 3.02  3.09
        n_eff Rhat
alpha     113 1.04
beta      111 1.04
sigma      26 1.18
mu_x     8000 1.00
sigma_x  1753 1.00

Hi Scott,

Thank you for taking the time to check this.
I have implemented your first suggestion (to use stronger priors) and indeed, the results are still poor (_V3.stan).

Regarding the post you refer to:

  • \epsilon_{\sigma_x} in your example is the same as \tau in the Stan manual example. They are both the variance of the measurement error. In the Stan manual, \tau is known and provided as data in the estimation.
  • if I understand correctly, in Maxwell’s example there is no explicit prior on the true/latent value of x. Which means that Stan assigns a uniform prior to it. So instead of:
parameters {
...
vector[N] x;
real mu_x;
real<lower=0> sigma_x;
...
}

model {
...
mu_x ~ normal(0, 10);
sigma_x ~ cauchy(0, 5);
x ~ normal(mu_x, sigma_x);
...
}

he uses:

parameters {
...
real x[N];
...
}

model {
...
x ~ uniform; this declaration is implicit since Stan assigns uniform priors whenever no explicit priors are provided
...
}

I am not sure a uniform prior would help in this situation, but what I have used from the post is the idea to use repeated measurements. As this should provide more information about the true/latent value of x. New version of the file is _V4.stan. The results are better, but not that great given the low complexity of the model:

> print(res_LM_stan_example_V4, pars = c("alpha", "beta", "sigma", "mu_x", "sigma_x"))
Inference for Stan model: LM_stan_example_V4.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha   1.00    0.00 0.08 0.83 0.94 1.00 1.06  1.17  8000 1.00
beta    4.01    0.00 0.02 3.98 4.00 4.01 4.03  4.05  8000 1.00
sigma   1.00    0.01 0.08 0.83 0.94 1.00 1.05  1.15   181 1.01
mu_x    3.80    0.01 0.07 3.67 3.76 3.80 3.85  3.92   128 1.04
sigma_x 2.92    0.00 0.05 2.83 2.89 2.92 2.96  3.02   197 1.01

Samples were drawn using NUTS(diag_e) at Thu Nov 15 10:59:39 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> true_alpha
[1] 1
> true_beta
[1] 4
> true_sigma
[1] 1
> true_mu_x
[1] 4
> true_sigma_x
[1] 3

I am still trying out some other things, I will update once I get to better results.

LM_stan_example.R (3.9 KB)
LM_stan_example_V3.stan (1.1 KB)
LM_stan_example_V4.stan (1.2 KB)

If I am not mistaken, if the relation between x and y is something like this

linea

Stan will have problems. My attempts with pymc3

traza1

and

traza2

Very, very ugly. A reparametrization is needed, the question is how.

I think the worst part is that the code is in the Stan manual…

Hi,

In this analysis divergences happen at very small values of sigma.

If you use these priors:

model {
    alpha ~ normal(0, 3);
    beta  ~ normal(0, 3);
    sigma ~ gamma(2, 1);
    mu_x  ~ normal(0, 3);    
    sigma_x ~ normal(0, 1);   

    x ~ normal(mu_x, sigma_x); 
    x_meas ~ normal(x, tau);   
    y ~ normal(alpha + beta * x, sigma);      
}

and 4000 iterations and set control = list(adapt_delta = .99) the divergent iterations go away. (low BFMI remains)

Keep in mind however, that this solution is kind of a hack: I figured out where the problem is (using shinystan, you can also use bayesplot) and just chose a prior that avoids very low sigma values. (If one also puts tight priors on the other parameters, a normal prior for sigma also works.) I say this is a hack, because it is not based on a good understanding of the parameters effect in the model. Optimally, one should do a prior predictive check to better understand the implication of different prior specification.

For example, if you run your first model and just comment out this in the model section:

// x_meas ~ normal(x, tau); // measurement model   
// y ~ normal(alpha + beta * x, sigma); 

You’ll see that the x values implied by you priors have a standard deviation of 17 and a range from -200 to +200 (1000 iterations). This here compares the densities of the true_x and the x implied by only the original priors:

curve(dnorm(x,mean = 3, sd = 17), from = -250, to = 250, add = T, col = "red")
curve(dnorm(x,mean = 4, sd = 3), from = -250, to = 250, log = "y")

So what might be going on is that your priors allow very large (positive or negative) values of x, which, if combined with a small sigma, lead to zero densities (-Inf lpdf) for at least some y.

Indeed, if one checks the y implied by the priors by adding this to the generated quantities block:

generated quantities {
  vector[N] mus = alpha + beta * x;
}

One can see that the y implied by the original priors range from ca -1000 to +1000. Now if one assume a true y of 60 (which is at the upper limit of possible y values given the data generating parameters and process) and an sd of .1 (which is entirely possible give the original priors) , then one finds that dnorm(60, mean = -1000, sd = .1) = 0.

So it looks as if the origin of the problem is not how the model is specified in the Stan Reference. Instead, there is a mismatch between the parameters governing the data generating process for the simulated data (in the R script above) and the priors shown in the Stan reference.

2 Likes

I’m not sure what the code in the Stan manual is trying to do, but I posted a simple regression model with measurement errors in both variables along with a reproducible example a while back: https://discourse.mc-stan.org/t/estimating-slope-and-intercept-linear-regression-for-data-x-vs-y-with-uncertainties-in-both-x-and-y/5457/4

Here are the parameter and model blocks:

parameters {
  vector[N] x_lat;
  vector[N] y_lat;
  real beta0;
  real beta1;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu_yhat = beta0 + beta1 * x_lat;
}
model {
  beta0 ~ normal(0., 2.);
  beta1 ~ normal(0., 5.);
  sigma ~ normal(0., 2.);
  
  xhat ~ normal(x_lat, sd_xhat);
  y_lat ~ normal(mu_yhat, sigma);
  yhat ~ normal(y_lat, sd_yhat);
  
} 

In this model the latent values of y are connected to the latent values of x through a simple linear regression, while the measured values are generated from the latent ones with known variances. The transformed parameter mu_yhat could have been left out. Here’s a traceplot of a few parameters:

trace_memodel

The variance parameter sigma has a slightly low n_eff and is biased on the low side (the input was 0.1), but no warnings are triggered.

I think there’s an identification issue with the model in the manual. It would help if authors were asked to produce real or fake data examples.

As far as I can tell, the model in the Stan Reference implements an errors in variables model as described here: Errors-in-variables models - Wikipedia.

y = \alpha + x^*\beta + \epsilon
x = x^* + \eta

Where y and x are observed outcomes and variables, and x^* is the modeled “true” unobserved predictor.

My intuition is that this is how a model that incorporates measurement error for predictors should be specified: y is not the directly predicted with the observed measurement x, but with the modeled “true” x^*.

In contrast, I do not understand how the model posted by @Michael_Peck incorporates uncertainty about the true value x^* into the regression of y on x.

The model I posted differs in two ways from the one in the manual. I have latent real values for the variate and covariate x* and y* that are related through a simple regression. There are measurement errors in both x and y with known and in general unequal uncertainties for each observation. So the full model looks something like:

y* = α + x*β + ε
x = x* + η
y = y* + ζ

The example in the manual also has a hierarchical model for how the latent predictor is generated, which is lacking in mine. I probably should have an explicit prior for the latent predictor, but in the real data examples I’m working with the data certainly aren’t generated like the manual example.

My lack of understanding of the manual example was mostly because of coding inconsistencies between the simple regression example and the measurement error example, along with at least one outright error. Actually coding it myself and running some fake data examples helped.

Hm, are wae talking about different examples in the manual?
The one I am referring to (page 203 in the Stan references 2.17.1, which is also the model referred to in the title of this thread) has no hierarchical structure.

data {
  ...
  real x_meas[N]; // measurement of x
  real<lower=0> tau; // measurement noise
}
parameters {
  real x[N]; // unknown true value
  real mu_x; // prior location
  real sigma_x; // prior scale
  ...
}
model {
  x ~ normal(mu_x, sigma_x); // prior
  x_meas ~ normal(x, tau); // measurement model
  y ~ normal(alpha + beta * x, sigma);
  ...
}

(I haven’t tried to run this model, but I don’t see an obvious error. I did run the first model attached by @anamartinovici, which as far as I can tell is basically identical with this, without any troubles when I used the right priors.)

I think I understand your model now better. The differences between your model and the one on page 203 in the Stan reference is that you model 2 error components for y^*: The fixed, known error (sd_hat, \zeta) for the estimation of the of y = y^* + \zeta, and the estimated error variance \epsilon for the estimation of the linear model y^* = \alpha + x^*\beta + \epsilon. I guess if one want’s to model these error sources separately depends on if one is interested in the size of \epsilon specifically, but I am not sure.

Hi,

Thank you for your answer. The prior predictive check suggestion helps a lot! I have used it also for some other models I am estimating and it really helped me understand the models better.

Best,
Ana