This question is about intuition more than the actual lines of code.
Consider the normal linear regression setup as follows:
data {
  int<lower=1> D;
  int<lower=1> N;
  matrix[N, D] X;
  vector[N] Y;
}
parameters {
  real a;
  vector[D] b;
  real<lower=0> s;
}
model {
  vector[N] mu = a + X * b;
  Y ~ normal(mu, s);  // Y matches the data declaration above
}
For a particular problem I'm working on, there is a strong belief that the model will be most useful if it gives more weight to recent observations when fitting its parameters. Outside the Bayesian context, it would be natural to do weighted linear regression with an exponentially decaying weight on each observation as a function of its age, with a given half-life. I understand that one approach to this in Stan (let's call it "target hacking") is to replace the simple sampling statement above with a snippet like this:
for (n in 1:N)
  target += w[n] * normal_lpdf(Y[n] | mu[n], s);
where `w` is a vector of weights specifying the importance of each data point.
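For concreteness, here is a minimal sketch of that approach with the exponential-decay weights built in `transformed data`. It assumes the observation ages arrive as a data vector `age` and the half-life as a data scalar `half_life` (both hypothetical names, not part of the model above):

data {
  int<lower=1> D;
  int<lower=1> N;
  matrix[N, D] X;
  vector[N] Y;
  vector<lower=0>[N] age;   // hypothetical: age of each observation
  real<lower=0> half_life;  // hypothetical: decay half-life, in the same units as age
}
transformed data {
  // w[n] = 2^(-age[n] / half_life): the weight halves every half_life units of age
  vector[N] w = exp(-log2() * age / half_life);
}
parameters {
  real a;
  vector[D] b;
  real<lower=0> s;
}
model {
  vector[N] mu = a + X * b;
  for (n in 1:N)
    target += w[n] * normal_lpdf(Y[n] | mu[n], s);
}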
Another alternative would be to say that I have a new data object
vector[N] s_weight;
that encodes how much relative uncertainty we have about each point. In that context, the model block might go something like this:
vector[N] mu = a + X * b;
vector[N] s_extended = s * s_weight;  // common scale s, stretched per observation
Y ~ normal(mu, s_extended);
So really there are three approaches here:
- Non-Bayesian weighted linear regression, with weights on the squared-error terms
- Bayesian, with stuffed-in multipliers `w[n]` on the target log-probability increments
- Bayesian, with stuffed-in per-observation multipliers `s_weight[n]` on the scale of the sampling statement, as above
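For concreteness, the per-observation log-density contributions under the two Bayesian variants work out to (writing $sw_n$ for `s_weight[n]`, and absorbing everything that does not depend on the parameters into the constants):

$$w_n \log \operatorname{normal}(Y_n \mid \mu_n, s) = -\frac{w_n}{2}\left(\frac{Y_n - \mu_n}{s}\right)^2 - w_n \log s + \text{const}$$

$$\log \operatorname{normal}(Y_n \mid \mu_n, s \, sw_n) = -\frac{1}{2}\left(\frac{Y_n - \mu_n}{s \, sw_n}\right)^2 - \log(s \, sw_n) + \text{const}$$

whereas classical weighted least squares minimizes $\sum_n w_n (Y_n - \mu_n)^2$.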
Can anyone offer an intuitive sense of the extent to which these three approaches are doing fundamentally the same thing or fundamentally different things? If you wanted to do something in the Stan/Bayes context that is most intuitively similar to weighting squared errors by a vector `w`, which approach is most appropriate? Multiplying the target log-probability increments by something (if so, is it just `w` or some function of `w`)? Multiplying the scale of each observation by something (if so, is it just `w`, or `sqrt(w)`, or something else)? Or are these three things fundamentally different? If so, intuitively, how so?
Thanks!