Bayesian parallels of weighted regression

This question is about intuition more than the actual lines of code.

Consider the normal linear regression setup as follows:

data {
    int<lower=1> D;
    int<lower=1> N;
    matrix[N, D] X;
    vector[N] Y;
}

parameters {
    real a;
    vector[D] b;
    real<lower=0> s;
}

model {
    vector[N] mu = a + (X * b);
    y ~ normal(mu, s);
}

For a particular problem that I’m working on, there is a strong belief that the model will be most useful if it gives more weight to recent observations when fitting its parameters. Outside of the Bayesian context, it’d be quite natural to talk about doing weighted linear regression and then stick an exponentially decaying weight on the observations as a function of their age with a certain half-life. I understand that one approach (let’s call this “target hacking”) to this in STAN is to simply add in a snippet like this instead of the simple sampling statement above:

for (n in 1:N) 
    target += w[n] * normal_lpdf(Y[n] | mu[n], s); 

where w is a vector of weights specifying the importance of each data point.

Another alternative would be to say that I have a new data object

vector[N] s_weight;

that encodes how much relative uncertainty we have about each point. In that context you might have a modeling section that goes something like:

vector[N] mu = a + (X * b);
vector[N] s_extended = s .* s_weight;
y ~ normal(mu, s_extended);

So really there are three approaches here:

  1. Non-bayesian, weighted linear regression with weights on the square error terms
  2. Bayesian with stuffed-in multipliers on the target log-probability-increments of w[n]
  3. Bayesian with stuffed-in multipliers on the scale of the sampling statement per observation s_weight[n] above

Can anyone offer an intuitive sense of the extent to which these three approaches are doing fundamentally the same thing or fundamentally different things? If you wanted to do something in the STAN/Bayes context that is most intuitively similar to weighting square errors by a vector w, which approach is most appropriate? Multiplying the target probabilities by something (if so, is it just w or some function of w)? Multiplying the scale of each observation by something (if so, is it just w or sqrt(w) or other? Or are these three things just very fundamentally different? If so, intuitively, how so?

Thanks!

2 Likes

Seems up @jonah, @lauren and @bgoodri’s alleys.

Most of those weighting schemes are intended to make point estimates have known properties across datasets. My view is the only valid weights for a Bayesian analysis is like in a binomial when you actually observe N_j cases with the same success probability in the j-th group. Multiplying contributions to target by other fixed quantities isn’t generative. If you have a time-series where you think the parameters vary over time, then you have to model those dynamics.

2 Likes

So say, in my context, I think that at each time step there is a probability that the process that generates my data undergoes some kind of parameter shift. I.e. that the parameters aren’t really constant through time but some appropriately-parameterized gaussian process or random-walk type thing. Then you’re saying the right thing to do is to model b (and a and s) not as a constant but actually to write it out as what I just said (full on Gaussian process or other). And maybe even to do so as a multidimensional random field that handles the fact that large jumps in one parameter are probably coincident with large jumps in other parameters?

That seems beautiful and cool from a theoretical point of view. But a pain in the butt from a programming point of view. And likely to be huge downer from a how-long-does-it-take-to-run point of view.

I was hoping there was some way in which that kind of intuition would translate into a relatively-light-to-code and relatively-fast-to-run model. Is there a happy middle ground?

1 Like

You could remove all but the last few observations from a regression model and accept the fact that there is not enough information to estimate anything precisely.

1 Like

If more recent observations are more useful for inference, then it seems necessarily the case that the parameters themselves should be thought of as changing over time. If b is static, what makes more recent observations more useful?

If b is static, then it seems it must be either that the noise in y is decreasing up to the present (your 3rd option) or that there is measurement error on X which is decreasing (errors in variables).

For your second suggestion, scaling the log-likelihood is the same as raising the likelihood to a power which might have unexpected effects since you’re changing the curvature of the distribution.

Just wanted to quickly add: weighting the target by an integer w[n] (what you call “target hacking” and also how weights are implemented in brms) is exactly the same as putting the same datapoint w[n] times in the data. I think this intuition generalizes to any non-negative real w[n], so if you reduce the weight on old observation you take them as “half a data point”. I can imagine you could find theoretical justification to choose your weighting scheme with that in mind.

I agree with @bgoodri that a fully generative approach with parameters changing over time should be your first consideration, but I have sympathy for @8one6’s concerns about model complexity and runtime. (though in my experience, splines can be very efficient). I don’t think weights are necessarily a terrible solution - I would actually guess that some forms of weights increasing over time could yield identical inferences about the parameters at the latest time point as some time series structure of the parameters with a fixed “flexibility”. (though this is just a conjecture and I don’t pretend to posses the math-fu necessary to even begin to tackle this question).

Practically, going forward, I think you can either try brms which has a ton of features to model splines/gps/time series, reducing the coding burden substantially. Alternatively, you may want to stick to weights, but you should be aware that this is a tricky territory and extra caution should be taken - at the very least, I would do some leave-future-out cross-validation and see how the inferences change under multiple plausible weighting schemes.

Best of luck with your model!

2 Likes

If you have k observations your are trying to model, each with t temporally ordered predictors p, you could calculate a weighted average of p to obtain only one predictor.

If you use e.g. exponentially decaying weights, you can parameterize the weighting scheme and make the decay rate part of the model.

This gives you a fully Bayesian model where you weight predictors, but not the log-pdf.

1 Like

I have a similar problem and I think option 3 might make the most sense.

The reason is that changing the standard deviation/variance will actually implicitly weight each observation. To give an intuition, suppose you have one observation y and you put n different sampling statements on y.

So then, you would have y ~ N(\mu_i, \sigma_i) where i = 1,2,...,n and \sigma_i is the variance of the i^{th} sampling statement. Then let g_i = \sigma_1 * \sigma_2 * ... *\sigma_{i-1} * \sigma_{i+1} *...*\sigma_n. Then, the relative weighting of the i^{th} observation is \frac{g_i}{g_1 + g_2 + .... + g_n}. Thus, increasing the variance of a sampling statement will increase the denominator but not increase the numerator and hence reduce its ‘impact’. You can easily create toy examples in STAN that mimic this setup or you can do the math yourself and you will get this.

I’m sure the dynamics/exact formula will be slightly different with n different observations, but the logic should still follow.