I suspect this is not a new question, but maybe one of you can point me to further information on this.
I’m looking for coding a model that includes parameters that are uncertain but not updated based on data (typically including literature values). A minimum example could be like this:
I have an instrument that measures variable x and and gives a signal y with interference i.
we know that y_i = x_i * \alpha + i_i + \epsilon_i
where \epsilon_i \sim normal(0,\sigma)
The manufacturer told us that i_i \sim normal(\gamma, \delta) and the values of \gamma and \delta, and we have no way to improve that knowledge with our data.
My goal is to infer \alpha and \sigma, taking into account the uncertainty in i_i.
I’ve seen that this is solved by using code similar to
target += normal_lpdf(i|\gamma, \delta)
but I don’t fully understand that syntax yet and would appreciate if one of you could point me to more information.
The most straightforward solution to the general problem is to multiply impute the values of i, and fit the model many times, once for each imputed dataset, and then combine the posterior chains for inference. Note that this is computationally expensive–you’ll probably want to impute and fit hundreds of times.
BUGS had a cut function that was designed to achieve the same thing in a single fit, but to the best of my vague knowledge it didn’t reliably work properly.
~ is shorthand notation telling Stan to add the log density of a distribution to the model’s total log probability but drops additive constants, while target += keeps them and shows the actual math Stan adds internally to build the log posterior density.
To put the manufactures information into your model, you can just write in your model block exactly what you’ve typed where you insert “the values of 𝛾 and 𝛿” that’s your prior information and just do not have new data to update them.
Wait, why would you not just write something like this?
data {
int I;
vector[I] X, Y;
real gamma, delta;
}
parameters {
real alpha;
real<lower=0> sigma;
vector[I] i;
}
model {
alpha ~ std_normal(); // some prior for alpha
sigma ~ exponential(1); // some prior sigma
i ~ normal(gamma, delta); // impute i
y ~ normal(X * alpha + i, sigma);
}
Because (I think) the OP doesn’t want the likelihood to inform the values of i.
I’m looking for coding a model that includes parameters that are uncertain but not updated based on data
If the OP merely means that the likelihood shouldn’t inform the values of \gamma and \delta, then I agree, this whole thing can just be written as you suggest, or more efficiently as
model {
alpha ~ std_normal(); // some prior for alpha
sigma ~ exponential(1); // some prior sigma
y ~ normal(X * alpha + gamma, sqrt(sigma^2 + delta^2));
}
I actually had to do something like this not too long ago and had some code for it, so I adapted it to your example (but I really didn’t have time to check it much at all, so definitely double check it for mistakes!).
If the model really is as simple as the one in your original post, then you can do the imputation approach that @jsocolar suggested easily just in R using rstanarm (I agree that imputation is the right approach if you don’t want the likelihood to inform i, otherwise you can use the other suggestions) :
library(rstanarm)
library(posterior)
library(pbapply) # for optional progress bar (does add a bit of overhead)
# simulate some fake data to test everything
N <- 100
x <- rnorm(N)
alpha_true <- 1.5
sigma_true <- 1
gamma <- 0.3
delta <- 0.4
i_true <- rnorm(N, mean = gamma, sd = delta)
epsilon <- rnorm(N, mean = 0, sd = sigma_true)
y <- alpha_true * x + i_true + epsilon
fake_data <- data.frame(y = y, x = x)
fit_one_imputation <- function(m, data, gamma, delta) {
data$i_imp <- rnorm(n = nrow(data), mean = gamma, sd = delta)
# assuming x_i*alpha + i_i + epsilon_i as written in the OP
fit <- stan_glm(y ~ 0 + x + offset(i_imp), data = data, refresh = 0)
as_draws_matrix(fit) # combines the draws from all chains
}
# number of imputations to do
M <- 100
# this will give you a list of M matrices of posterior draws that we then
# bind together into a single draws object
list_of_posterior_draws <- pblapply(
seq_len(M),
fit_one_imputation,
data = fake_data,
gamma = gamma,
delta = delta
)
posterior_draws <-
do.call(bind_draws, list(list_of_posterior_draws, along = "draw")) |>
rename_variables(alpha = "x")
summarize_draws(posterior_draws, default_summary_measures())
I simulated data with N=100 and then did M=100 rounds of imputation. Running everything took about 1 minute on my laptop. I’m guessing your N is bigger though?
I didn’t use multiple cores when running rstanarm because with N=100 it’s faster to run chains sequentially then to start up the multiple processes the way rstanarm does it. If your N is really large it could be faster to use multiple cores.
You can probably use fewer than the default number of warmup and post-warmup draws per chain if you really just have a single-level linear regression model. I just left the defaults for now.
Using something like parallel::parLapply or similar, you could parallelize the whole procedure, I just ran the fit_one_imputation calls sequentially.
Just hypothetically, what’s the big deal with estimating i while treating \delta, \gamma as known? It seems a lot more straightforward to get the answer and I don’t immediately understand why it matters to condition i on data.
I agree it’s subtle, not immediately obvious. But if we let the model estimate i then instead of forcing the model to learn the relationship between x and y through alpha, we’re also allowing it to explain the residual variation by changing the i values. So it adds extra flexibility compared to the imputation approach. If we use imputation then each imputed i is fixed when fitting the Stan model but we account for the uncertainty in i by doing it many times.
Thanks for all your help! The implementation of a ‘cut’ model was exactly what I was looking for. Jonah’s point is quite important: in my case both i and alpha can be adjusted to fit the data, and I want only alpha to be fit to the data.
I’m a bit surprised this is not a more common problem (propagating the uncertainty of a constant-with-error-term into a stan model) - if someone has suggestions how to best do this computationally efficient in more complex models it would be nice to hear.
My actual model is a bit more complicated, but not by much. In the end, I wrote this as a transformed data statement:
data {
int<lower=0> N; // number of measurements
int<lower=0> T; // number of treatments
array[N] int <lower=1, upper=T> t; //assigning measurement to treatments
vector[N] d; // depth
vector[N] c; // dependent variable
vector[T] BD; // a constant
real c_a; // a constant
real Da; // a constant
vector[T] R_mu; // mean of the variable that is sampled before the cut
vector[T] R_se; // SD of the variable that is sampled before the cut
int<lower=0> Nrandom; number of samples before the cut
}
transformed data{ // samples taken before the cut
matrix[T,Nrandom] Rrandom;
for(i in 1:T) {
for (j in 1:Nrandom) {
Rrandom[i,j] = normal_rng(R_mu[i], R_se[i]);
if(Rrandom[i,j]<0) {
Rrandom[i,j]=1e-30;
}
}
}
}
parameters {
vector<lower=0, upper=1>[Nrandom] DsD0;
vector<lower=0>[Nrandom] sigma;
}
model {
matrix[N,Nrandom] mu; //estimated dependent variable
// likelyhood
for (i in 1:N) {
for (j in 1:Nrandom) {
mu[i,j]=c_a*exp(-1*sqrt(Rrandom[t[i],j] * BD[t[i]]/ c_a / Da / DsD0[j])*d[i]); // depth dependence of concentration: c = c0 * exp (-sqrt(k) * z)
}
}
for (i in 1:N) {
for (j in 1:Nrandom) {
c[i] ~ normal(mu[i,j], sigma[j]);
}
}
//priors
for (j in 1:Nrandom) {
DsD0[j] ~ beta(1,1);
sigma[j] ~ lognormal(log(1e-5),10);
}
}
Credit to @jsocolar for pointing it out initially.
I don’t have my computer so I’m typing on my phone so pardon this informal explanation of the issue.
It’s a common enough situation that, like @jsocolar said, BUGS tried to provide a solution to this with cut(). But cut() is an algorithmic hack that is known to have problems. It basically tries to run a sampler where half the model says “pretend this likelihood term isn’t there!” but the other half still uses the consequences of that term. This doesn’t necessarily fit together into one coherent posterior (I don’t think it’s guaranteed to preserve a single well-defined stationary distribution).
Just a quick comment about this (I don’t have my computer or much time so I haven’t looked at your code thoroughly so I could be wrong about this). By doing the random number generation in transformed data I think you’re essentially saying that each Markov chain is doing a different imputed data set (worth double checking the documentation on RNGs and transformed data to make sure I’m remembering correctly, I could be misremembering). If you want all the chains using the same imputed data set you can provide the imputations in the data block and generate them in R or python or Julia or whatever. I guess if the target distribution is easy to sample from and so one chain per imputation is sufficient (not so worried about multi chain convergence diagnostics) then this approach isn’t necessarily invalid, but it’s something to keep in mind (and may be hard to know in advance) and it’s avoidable by doing the imputations outside of Stan.
I would also be interested to know if anyone has suggestions for a more computationally efficient approach.
But we “know” or have been told what the distribution of i is, and we know i is part of the data we observe. So why would it not make sense to just model y with unknown i? I do understand statistically how it’s different but not why the imputation of many datasets would be preferred. It’s a lot of effort and I suppose I don’t totally see why you wouldn’t consider i as an unknown with known hyper parameters.
I haven’t tested this, but here’s my guess. I would guess that the point estimate of alpha would probably be similar if we limit ourselves to simple linear models and i is independent of x, but it could be different in nonlinear or constrained models and posterior uncertainty could also be different. I also think we’d probably see a bigger difference in sigma compared to alpha. If we put i in the parameters block then the model can use i to absorb residual variation, so I’d guess that sigma would tend to be smaller than if we do the imputation approach where i can’t be adjusted to get a better fit. But I’d be interested to see if that intuition is right or not!
You’re right that fixing i isn’t necessarily admissible from a strict Bayesian perspective. If the data generating process is truly “draw i from the normal, then draw y from the likelihood”, then clearly yis informative about i.
However, when we aren’t confident that our statistical model has adequately captured the data generating process, we might want to approach this cautiously. For example, suppose we are investigating a speculative relationship between y and some covariate x that is measured with uncertainty. How comfortable should we be using a linear relationship between x and y to pull around the values of x to fit the assumed linear relationship, which itself is speculative in the first place? If we do this, we’ll get out a relationship that:
Tends to look like it’s well approximated by the homoskedastic linear model that we’ve assumed, even if that’s wrong.
Tends to think that the relationship is strongly predictive by putting some of the unexplained variance into x rather than into the errors-in-y.
Both in the intro and near the end of the discussion of his paper explaining why cut doesn’t work, Martyn Plummer has a short discussion of why people are interested in these cut models.
As an aside, that paper is very math-y, focused on the stationary distributions that these BUGS models converge to as a function of the choice of sampler and tuning parameters. But the fundamental intuition of what is going on is quite simple (and IMO not readily apparent from the paper–at least not to me). The issue is that cut coherently updates the model upstream of the cut (in this case i) while ignoring the downstream part. But then when it comes time to update the downstream part, cut only draws one single iteration with no burn-in. What’s needed is a burn-in to converge to the correct posterior in the downstream part after the update in the upstream part. This is why the details of the sampler and tuning matter crucially.
The solution–run a burn-in after each update to i–is exactly the multiple imputation solution. Plummer also proposes a couple of other solutions, but all of them are variations on finding a good burn-in algorithm to yield samples from the updated target that don’t retain memory of the previous upstream parameters.
A contrasting pair of real-world examples, knowing that you work on wildlife stats @mhollanders:
In this paper, I had extensive survey data of birds that allowed me to estimate with uncertainty, which species were “winners” or “losers” across an agricultural land-use transition. I then wanted to regress counts of “loser” species (those that decline across the transition) that nevertheless persist in agricultural mosaics against landscape-scale covariates (e.g. distance to primary forest, etc). As this analysis was highly speculative, I didn’t trust these covariate relationships to actually be informative about which species were “losers.” So I multiply imputed the “loser” status. NB: In many other respects, I would not today uphold this paper as an example of best statistical practice.
Here’s a look at the other side of the coin: In this paper, we had an uncertain measure of how bird populations responded to winter temperature on a grid across eastern North America. We placed a prior on the true underlying response consisting of an ICAR component plus a nonspatial residual term. And the model was very happy to assume that true response varied very smoothly in space, effectively lining up the uncertain cell-level responses along a smooth surface. Look how smooth the outputs in figure 4 are! In presenting results that are this smooth, we are really trusting the ICAR process to be the appropriate model for the underlying response (maybe trusting too much?). Alternatively, we could have multiply imputed the cell-specific values and used the ICAR model merely to understand and quantify the amount of spatial structure inherent in the multiply imputed responses.