I am trying to model the difference between two variables, y1 and y2:
y1 - y2 ~ N(\mu, \sigma)
What makes this tricky for me is that both y1 and y2 are measured variables, but the measurements are proxies. I don’t have the “true” measurements, but I do have data from a previous study that tells me how the proxy relates to the “true” measurement:
proxy = \alpha + \beta * measured + \rho
I wrote a model (below) that treats the “true” measurement as a parameter whose mean is the proxy value and whose standard deviation is \rho from above.
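Written out for observation i, the structure I have in mind is roughly the following. The "true"/"obs" superscripts are just my notation for the latent and proxy values, and I am assuming the same \rho applies to both variables:

```latex
\begin{aligned}
y_{1,i}^{\text{true}} &\sim N\left(y_{1,i}^{\text{obs}},\ \rho\right) \\
y_{2,i}^{\text{true}} &\sim N\left(y_{2,i}^{\text{obs}},\ \rho\right) \\
y_{1,i}^{\text{true}} - y_{2,i}^{\text{true}} &\sim N(\mu,\ \sigma)
\end{aligned}
```

Here the second argument of N is a standard deviation, and \alpha and \beta from the calibration equation do not appear.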
The model compiles and samples, but I get several warnings related to sampling efficiency, which makes me think I specified the model incorrectly. Here’s the model:
data {
  int<lower=0> N; // length of both y1_obs and y2_obs
  vector[N] y1_obs;
  vector[N] y2_obs;
  // rho from the above model relating proxy measure to a true measure
  real<lower=0> meas_error;
}
parameters {
  vector[N] y1_true; // the unobserved true measure of y1
  vector[N] y2_true; // the unobserved true measure of y2
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] y_diff = y1_true - y2_true;
}
model {
  for (n in 1:N) {
    y1_true[n] ~ normal(y1_obs[n], meas_error);
  }
  for (n in 1:N) {
    y2_true[n] ~ normal(y2_obs[n], meas_error);
  }
  y_diff ~ normal(mu, sigma);
}
These are the warnings:
Warning messages:
1: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low
2: Examine the pairs() plot to diagnose sampling problems
3: The largest R-hat is 1.06, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess
And this is the R code I used with some reproducible data:
library(rstan)

set.seed(1) # arbitrary seed so the simulated data are the same on every run

data <- list(
  N = 200,
  meas_error = 2.58,
  y1_obs = rnorm(200, 8.88, 7.21),
  y2_obs = rnorm(200, 10.33, 8.32)
)

fit <- stan(
  file = "diff-model-meas-err.stan",
  data = data,
  iter = 5000,
  warmup = 2000,
  chains = 3,
  control = list(adapt_delta = 0.99, max_treedepth = 15)
)
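For a check against known values, the data could instead be simulated from the process the model assumes. This is only a sketch: `mu_true`, `sigma_true`, the seed, and the distribution of the latent `y2_true` values are hypothetical choices of mine, not numbers from the previous study.

```r
set.seed(123) # arbitrary seed

N <- 200
meas_error <- 2.58 # same measurement-error SD as above

# Hypothetical "ground truth" for the difference y1 - y2
mu_true <- -1.5
sigma_true <- 3

# Latent true values: y1 is y2 plus a normally distributed difference
y2_true <- rnorm(N, 10.33, 8.32)
y1_true <- y2_true + rnorm(N, mu_true, sigma_true)

# Observed proxies: true values plus measurement noise
y1_obs <- rnorm(N, y1_true, meas_error)
y2_obs <- rnorm(N, y2_true, meas_error)

sim_data <- list(
  N = N,
  meas_error = meas_error,
  y1_obs = y1_obs,
  y2_obs = y2_obs
)
```

Passing `sim_data` as the `data` argument of the `stan()` call would let me compare the posteriors for mu and sigma against `mu_true` and `sigma_true`.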
Does anything stand out as a clear error in how I’m constructing the “true” variables, where each value gets its own distribution centered on the measured proxy value with standard deviation rho?
Thank you.