Using weights to handle varying regret across data points

@martinmodrak what if we want to encode that some data points are more important than others, i.e. a wrong prediction from the model for those data points incurs a big cost? In this use case, does it make sense to weight the log-likelihood? Does weighting the target imply the same thing and ensure that the parameters for the more important data points are inferred more accurately?

To give more context, in my use case, my regression should map x to y (i.e. x → y). However, I want to maximize A * y where A is constant per data point. If my prediction for y is off by \delta, the regret would be A\delta, so the bigger the A, the bigger the regret. For various reasons, I am not supposed to model A*y directly, even though A is defined per data point. So I was thinking about weighting the log-likelihood to make sure data points with big A are inferred more accurately, to bring down the regret. But I am not sure whether weighting serves that need or not.


Unless I am misunderstanding your question, I think you are mixing modelling and decision theory; maybe disentangling them could clarify the issue. When you are doing modelling, you are trying to capture the data generating process as well as you can. The loss function (or regret, …) enters only once you need to make a decision based on your model.

If you have good reasons to assume that there is an (approximately) linear relationship with Gaussian noise between x and y, then modelling y \sim N(a + bx, \sigma) makes the most sense, even if your loss function depends on Ay. You would fit the linear model and then make the prediction(s) that minimize the expected loss, where the expected loss can be computed by taking each sample in turn, computing the loss assuming this sample represents “the truth”, and then averaging over those losses. This means that the same fit can be used to make decisions under many different loss functions (and those decisions will be different).
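To make this concrete, here is a minimal sketch of that recipe in Python (toy stand-ins for the posterior draws; in practice a, b, sigma would come from your Stan fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for S posterior draws of the model parameters
# (these would normally come from the Stan fit).
S = 4000
a = rng.normal(1.0, 0.1, S)
b = rng.normal(2.0, 0.1, S)
sigma = np.abs(rng.normal(0.5, 0.05, S))

x_new, A_new = 2.0, 3.0
# One posterior predictive draw of y per posterior sample.
y_pred = rng.normal(a + b * x_new, sigma)

def expected_loss(y_hat, loss):
    """Average the loss over draws, treating each draw as 'the truth'."""
    return loss(y_hat, y_pred).mean()

# The same fit serves different loss functions - and the resulting
# decisions differ.
abs_regret = lambda y_hat, y: A_new * np.abs(y_hat - y)
sq_regret = lambda y_hat, y: A_new * (y_hat - y) ** 2

grid = np.linspace(y_pred.min(), y_pred.max(), 501)
best_abs = grid[np.argmin([expected_loss(y, abs_regret) for y in grid])]
best_sq = grid[np.argmin([expected_loss(y, sq_regret) for y in grid])]
```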

If, however, you believe that there is in fact an (approximately) linear relationship with constant noise between Ay and x, you should definitely model it this way - even if your loss function were different. I think (but please double check my math) you can build an equivalent model by either regressing Ay on x with fixed variance or by regressing y on x with the variance scaled by A.
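For what it's worth, a quick numeric check of that equivalence (my own sketch with made-up numbers): if Ay \sim N(\mu, \sigma), then y \sim N(\mu/A, \sigma/A), and the two log-densities differ only by the constant Jacobian term \log A, which does not affect the fit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
A, mu, sigma = 4.0, 2.5, 0.7
y = rng.normal(mu / A, sigma / A, size=10)

# Density of y when A*y ~ N(mu, sigma), including the Jacobian log(A)...
lp_regress_Ay = norm.logpdf(A * y, loc=mu, scale=sigma) + np.log(A)
# ...equals the density of y ~ N(mu / A, sigma / A) exactly.
lp_scaled_sigma = norm.logpdf(y, loc=mu / A, scale=sigma / A)

assert np.allclose(lp_regress_Ay, lp_scaled_sigma)
```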

Hope that helps at least a bit.


Thanks for your reply @martinmodrak and your great points.

make the prediction(s) that minimize the expected loss, where the expected loss can be computed by taking each sample in turn, computing the loss assuming this sample represents “the truth”, and then averaging over those losses.

Can you elaborate on this? I cannot quite follow it.

Note that in my use case the predictions come with HIGH uncertainty bounds. If the 95% CI is [a, b] and a << b, then what value should I pick between a and b to compute my loss? The posterior mean is really far from the lower and upper bounds. So I thought I should focus on reducing this bound for the data points that are more important, i.e. have bigger A.

To make it clear, let's assume I have two items with this info: (x_i, A_i, y_{i}^{gt}) and (x_j, A_j, y_{j}^{gt}). If I train to find the x → y relation, then the regret for each of them would be (y_{i} - y_{i}^{gt}) \times A_i and (y_{j} - y_{j}^{gt}) \times A_j, where y_{i}^{gt} and y_{j}^{gt} are the ground truth values for these items and y_{i} and y_{j} are the posterior means for these two items.
Now assume that 1) A_{i} << A_{j} (note that A values are given per item and are not random) and 2) the predictions for both data points are equally good, i.e. (y_{i} - y_{i}^{gt}) = (y_{j} - y_{j}^{gt}). In this case the loss for item j is much bigger, meaning that I should focus on making a better prediction for j than for i.

I came across this paper that has a somewhat similar use case and tries to incorporate the importance of items into training, but it is not quite working for me.

As this is getting a bit away from the original topic, I moved this into a new one - hope the title is OK with you.

I am not completely sure I follow you exactly. It seems like you are only trying to make “predictions” for the already observed points. Is that correct? Or will there be some new (A, x) pairs coming where you will want to predict y? If you are only trying to retrodict the past, and the only metric you are trying to optimize is the regret, then the crucial question is: why wouldn't you just use the observed y_i^{gt} as the output (as this will have zero regret)? Probably what you are doing is trying to do some smoothing and balance the simplicity of the model against the regret on the observed data, so there is something else you optimize beyond the fit to the old data. What would that be? What is the final goal?

The following only makes sense if you are trying to make predictions for new data. Let us assume the model is of the form y \sim N(a + bx, \sigma), we observed a new value x_{new}, A_{new} and need to make a prediction \hat y that minimizes the expected value of |\hat y - y_{true}| \times A_{new} (or any other loss function). Using Stan to fit the model to the previously observed data will give you S samples (by default 4000). For each sample s, you'll get a specific value of a_s, b_s, \sigma_s. Assuming the model is correct, the expected regret for this sample will be R_s = E(|\hat y - y_{pred,s}| \times A_{new}) where y_{pred,s} \sim N(a_s + b_s x_{new}, \sigma_s). The expected overall regret is then the mean across all samples, R = \frac{1}{S} \sum_s R_s. You then need to find the \hat y that minimizes R.

The absolute value in the loss function complicates things a bit and I am honestly unsure (and cannot do the maths properly) whether there is an analytic expression or whether you would need to find the minimum numerically (if the loss were squared error, I think the best prediction would be just the posterior mean, but I am once again not sure). Interestingly, you'll notice that the prediction that minimizes R does not depend on A_{new}, as that's a constant multiplicative factor.
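Here is a sketch of that computation (again with toy stand-ins for the posterior draws, scipy's scalar minimizer doing the numeric search). For the absolute-value loss the minimum should land on the posterior predictive median - the standard result for L1 loss - and, as noted above, it does not move when A_{new} changes:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Stand-ins for S posterior draws of a, b, sigma (normally from Stan).
S = 4000
a = rng.normal(1.0, 0.1, S)
b = rng.normal(2.0, 0.1, S)
sigma = np.abs(rng.normal(0.5, 0.05, S))

x_new, A_new = 1.5, 10.0
y_pred = rng.normal(a + b * x_new, sigma)  # one predictive draw per sample

def R(y_hat):
    """Expected regret, averaged over the posterior predictive draws."""
    return np.mean(np.abs(y_hat - y_pred) * A_new)

res = minimize_scalar(R, bounds=(y_pred.min(), y_pred.max()), method="bounded")
print(res.x, np.median(y_pred))  # these two should agree closely
```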

Does that make sense?


Thanks @martinmodrak for taking the time to reply to me. I really appreciate it.

It seems like you are only trying to make “predictions” for the already observed points. Is that correct?

No, that is not correct. I need to predict over a set of new data points (let's call it X^{*}), where each of them comes as (x^{*}, A^{*}). Note that A^{*} is given and not random. Then I need to pick the data point that provides the highest reward in that set of unobserved data X^{*}. For data point x^{*}, the reward is defined as y^{*}A^{*} where y^{*} is its predicted value. Here is the sketch of the problem. I guess it makes it clearer why this problem is not an actual Bayesian optimization problem.
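To illustrate, here is roughly what I mean by the selection step (a toy sketch, not my real model; I read the reward at decision time as A^{*} times the unknown true y^{*}, so the expected reward is A^{*} E[y^{*}]):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for posterior draws of a linear x -> y model.
S = 4000
a = rng.normal(1.0, 0.1, S)
b = rng.normal(2.0, 0.1, S)
sigma = np.abs(rng.normal(0.5, 0.05, S))

# Candidate (x_star, A_star) pairs; A_star is fixed, not random.
candidates = [(0.5, 8.0), (2.0, 1.5), (3.0, 0.7)]

expected_rewards = []
for x_star, A_star in candidates:
    y_star = rng.normal(a + b * x_star, sigma)  # predictive draws
    expected_rewards.append(A_star * y_star.mean())

chosen = int(np.argmax(expected_rewards))  # only this one gets revealed
```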

Also note that I never get the actual y^{gt} for (x^{*}, A^{*}) unless it has the highest reward among all predicted rewards over X^{*}; in that case I choose it and its ground truth is revealed to me.

Your explanation for finding the expected regret was great and very detailed. Thanks for that. I also have another question. If I incorporate the A value into training and prediction (by multiplying sqrt(A) into the feature set and also the observed value, like the suggestion in Sec 4.3 of this paper), the uncertainty bounds over data points with high A become smaller, i.e. the predictions are more accurate for them. To be precise, this modification predicts sqrt(A)*y^{*}, so I divide the predictions by sqrt(A) to get the predicted y^{*}, and from that I calculate the reward and pick the data point with the highest reward. So my point is, it seems incorporating A is really necessary, and the problem cannot be decoupled into 1) finding the x → y relation, 2) finding the reward, and 3) picking the point with the highest reward. It seems to me that 1 and 2 should be done together. Does that make sense?
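Roughly, the transformation I mean looks like this (a toy sketch with ordinary least squares standing in for my actual model):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy training data: (x, A, y) triples.
n = 200
x = rng.uniform(0, 5, n)
A = rng.uniform(0.5, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# Scale both the design matrix and the outcome by sqrt(A) and fit;
# rows with large A then dominate the least-squares objective.
w = np.sqrt(A)
X = np.column_stack([np.ones(n), x])  # intercept + slope
beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

# The weighted fit predicts sqrt(A_star) * y_star, so undo the scaling.
x_star, A_star = 2.5, 6.0
y_star_hat = (np.sqrt(A_star) * np.array([1.0, x_star])) @ beta / np.sqrt(A_star)
```

(In this linear toy version the sqrt(A^{*}) cancels in the point prediction, so the weighting changes which training points dominate the fit - and hence the uncertainty - rather than the prediction formula itself.)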

Sorry for the lengthy post, and thanks for helping me understand my problem better.


Hi, sorry I was busy and dropped the ball on this.

For data point x^{*}, the reward is defined as y^{*}A^{*} where y^{*} is its predicted value.

This can’t be true, as it would imply the reward does not depend on the true value, only on the predicted value, and I could thus choose any reward I want.

If I incorporate the A value into training and prediction …, the uncertainty bounds over data points with high A become smaller

I think this only makes sense if you believe that there is some systematic relationship between x and A. If you believe that there is a (potentially complex) function A = g(x), then you indeed may want to make the fit “tighter” at x values where you observe large A. If A and x are not coupled, why should the past observed (x, A) pairs tell you anything about where you want to have good predictions in the future?

Or are you referring to using the newly observed A^{*} as weights for the previously observed x and ignoring the previously observed A? I.e. I want to force tighter predictions at points that matter for my current decision? Maybe this could make some sense, but I am not sure - it still feels like this can improve your predictions only if your model for f is substantially wrong (e.g. if you are fitting a linear model to a non-linear relationship, then the weighting is more likely to help). However, I can’t think clearly about this case right now.

I never get the actual y^{gt} for (x^{*}, A^{*}) unless it has the highest reward among all predicted rewards over X^{*}

This probably makes the problem substantially more complex: if this is an iterated scheme, you may want to choose an option that does not have the highest expected value but will reduce your uncertainty about some part of the parameter space and thus improve the expected value of your future choices (as in multi-armed bandit problems). I will ignore this aspect for now and assume you are making just a single choice.

Putting more weight on certain data points will not magically make the predictions more accurate. The predictions will be more accurate only if your model is not wrong. If it is somehow importantly wrong, tighter predictions may in fact be misleading.

And this circles back to my original point - if an oracle told you the exact form of f, there would be no reason to weight anything by A - you would just know how the system behaves and would be able to act accordingly. If you knew f exactly, A would enter only when making decisions. Right? Or if that is not the case, why? So why would you want anything other than approximating the true form of f, regardless of A?

I guess I might be missing something very basic about the setup?