I’m new to Stan and Bayesian modeling.
I want to fit a linear mixed model on a dataset containing repeated measurements of multiple items by different systems.
I want to estimate the true measure of each item, taking into account the bias and deviation of the different systems.
My problem is that the different systems report results with (very) different rounding policies, and I would like to take this information into account in my model.
I found this discussion (https://groups.google.com/forum/#!topic/stan-users/KIH26xC_3UM) about incorporating rounding in Stan, but I’m wondering if there is a way to mix rounded and non-rounded observations in the same model.
Yeah, but you probably want to start by building models that estimate your parameter from each system separately. There’s a section in the manual, “Rounding” (page 204 in version 2.16), that has some info on how you might do this. It’s a little bit manual.
Once you think your estimates for each system are reasonable, then you can just make it hierarchical (so they’ll inform each other).
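For reference, the manual’s approach treats a rounded report y as saying only that the latent value lies somewhere in the rounding interval, and increments the target by the log probability of that interval. A minimal sketch, assuming rounding to the nearest integer (the priors here are placeholders, not recommendations):

```
data {
  int<lower=0> N;
  vector[N] y;  // measurements rounded to the nearest integer
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 1);
  // marginalize the rounding: Pr[latent value in (y - 0.5, y + 0.5)]
  for (n in 1:N)
    target += log_diff_exp(normal_lcdf(y[n] + 0.5 | mu, sigma),
                           normal_lcdf(y[n] - 0.5 | mu, sigma));
}
```

This marginalized version avoids adding a latent parameter per rounded observation, at the cost of evaluating two CDFs per data point.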
Something like:
model {
  theta ~ normal(0, 1);        // prior on the shared mean
  tau ~ normal(0, 1);          // between-system scale
  sigma ~ normal(0, 1);        // per-system noise scales (vector[2])
  mu ~ normal(theta, tau);     // per-system means (vector[2])
  measurements1 ~ normal(mu[1], sigma[1]);
  measurements2 ~ normal(mu[2], sigma[2]);
}
The hope in such a model is that tau tells you about the scale of the difference in accuracy between the two systems, and the sigmas tell you about their precision.
Just keep in mind the relative volumes of data you have for each method. That’ll be important in interpreting the results.
That’s a standard meta-analysis problem. There’s a chapter in the manual explaining the basics, but it sounds like you won’t need that.
That’s the usual approach.
That just means there are different likelihood functions for the different systems, which is also typical. The rounding model (follow the links Ben provided) can then just be combined as one of the systems in the meta-analysis.
Be careful to use the non-centered parameterization; that’d be mu = theta + tau * mu_raw with mu_raw ~ normal(0, 1), rather than mu ~ normal(theta, tau) directly.
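A sketch of what the non-centered version of the two-system model above could look like (data declarations omitted; `mu_raw` is an assumed name for the standardized means):

```
parameters {
  real theta;                 // shared mean
  real<lower=0> tau;          // between-system scale
  vector<lower=0>[2] sigma;   // per-system noise scales
  vector[2] mu_raw;           // standardized system means
}
transformed parameters {
  vector[2] mu;
  mu = theta + tau * mu_raw;  // implies mu ~ normal(theta, tau)
}
model {
  theta ~ normal(0, 1);
  tau ~ normal(0, 1);
  sigma ~ normal(0, 1);
  mu_raw ~ normal(0, 1);      // non-centered: sample on the standardized scale
  measurements1 ~ normal(mu[1], sigma[1]);
  measurements2 ~ normal(mu[2], sigma[2]);
}
```

The non-centered form breaks the dependence between mu and (theta, tau) in the prior, which usually samples much better when there is little data per system.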
I came up with this piece of code implementing the centered parameterization as a starting point:
data {
  int<lower=0> N;                                // number of observations
  int<lower=0> n_items;                          // number of items
  int<lower=0> n_systems;                        // number of systems
  int<lower=0> n_rounding;                       // number of rounding policies
  int<lower=1, upper=n_items> item_id[N];        // item index per observation
  int<lower=1, upper=n_systems> system_id[N];    // system index per observation
  int<lower=1, upper=n_rounding> rounding_id[N]; // rounding index per observation
  vector[N] y;
}
parameters {
  vector[n_items] gamma;       // item deviations from the average
  real<lower=0> sigma_gamma;   // standard deviation of the gamma
  vector[n_systems] delta;     // system deviations from the average
  real<lower=0> sigma_delta;   // standard deviation of the delta
  real<lower=0> mu;            // average item size
  real<lower=0> sigma_z;       // standard deviation of the observations
  vector<lower=0, upper=1>[n_rounding] error;  // rounding errors
}
transformed parameters {
  vector[N] y_hat;
  vector[N] z;
  for (i in 1:N)
    y_hat[i] = mu + gamma[item_id[i]] + delta[system_id[i]];
  for (i in 1:N)
    z[i] = y[i] - error[rounding_id[i]];  // fixed: was error[rounding_id[N]]
}
model {
  // priors on scales
  sigma_gamma ~ cauchy(0, 2.5);
  sigma_delta ~ cauchy(0, 2.5);
  sigma_z ~ gamma(2, 0.1);  // fixed: was sigma_y, which is declared nowhere
  // deviations
  gamma ~ normal(0, sigma_gamma);
  delta ~ normal(0, sigma_delta);
  // likelihood
  z ~ normal(y_hat, sigma_z);
}
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<lower=-0.5, upper=0.5>[N] error;
}
transformed parameters {
  vector[N] z;
  z = y - error;
}
model {
  z ~ normal(mu, sigma);
}
I understand that, unlike other common examples, the likelihood is not explicitly defined here, so Stan must sample the posterior without a parametric specification for the likelihood.
But another solution is :
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[N] z;
}
model {
  z ~ uniform(y - 0.5, y + 0.5);
  z ~ normal(mu, sigma);
}
In this case it’s not clear to me how the sampler takes into account the fact that two different distributions are defined on z.
What we know is that the true value z being measured is somewhere in (y - 0.5, y + 0.5). Each sampling statement just adds its log density to the target, so the uniform and normal statements together define a single (unnormalized) posterior for z: the uniform restricts its support around y, and the normal contributes the population density.
The nice thing about @dlakelan’s coding here is that you don’t need varying bounds, as you would if you did it directly. And you don’t even need the uniform sampling statement: that follows from declaring z with bounds matching the uniform.
You’ll probably get a Jacobian warning, but you can ignore it here, as the Jacobian of the transform (z, y) -> (z, y - z) is constant.
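Putting the pieces together, here is a hedged sketch of a model mixing exact and rounded observations, using the latent-error trick above for the rounded ones (the variable names and the half-integer rounding width are assumptions, to be adapted to each system’s actual policy):

```
data {
  int<lower=0> N_exact;
  int<lower=0> N_round;
  vector[N_exact] y_exact;  // measurements reported at full precision
  vector[N_round] y_round;  // measurements rounded to the nearest integer
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<lower=-0.5, upper=0.5>[N_round] error;  // latent rounding errors
}
transformed parameters {
  vector[N_round] z;  // latent true values behind the rounded reports
  z = y_round - error;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 1);
  y_exact ~ normal(mu, sigma);  // exact observations: ordinary likelihood
  z ~ normal(mu, sigma);        // rounded observations: likelihood on latent z
}
```

Each rounding policy in the full hierarchical model would get its own error bounds, and the exact-precision system simply skips the latent-error layer.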