Weighted regression using number of observations

Hi all,

I am running a model predicting mean_y of an individual, calculated from multiple observations y of that individual.

for (i in 1:N_obs) {
  pred[i] = a[i] + b[i] * x[i];
  mean_y[i] ~ normal(pred[i], sigma[i]);
}

Every individual is measured a certain number of times. I would now like to weight each individual according to its number of observations, using some kind of weight[i]. I was thinking of adapting the model like this:

for (i in 1:N_obs) {
  pred[i] = a[i] + b[i] * x[i];
  target += normal_lpdf(mean_y[i] | pred[i], sigma[i]) * weight[i];
}

However, I read in the linked discussion that using weights is generally not recommended, as it ‘is not a generative model’:
https://groups.google.com/forum/#!topic/stan-users/v4CoBWUehwU

The same discussion says that this can be modeled if the variances differ between observations, which, if I understand correctly, is the case for my individuals: the more observations there are per individual, the smaller the variance of that individual's mean generally is.

Now, I was wondering:

  1. whether I understood the discussion correctly and what I want to model is possible and appropriate,
  2. if so, how to implement this in my model.

Note that my individuals are indeed measured repeatedly over time, but I don’t want to weight more recent observations more heavily than earlier ones. I just want to weight according to the number of observations, or some measure related to it.

Thank you for your help!

Do you actually have an observed measurement for each instance in which an observation unit is measured? Or do you only have the observed mean of multiple measurements, plus information on how many measurements were made? If you have an observed measurement for each instance and each individual, then there is no need to use weighting. Rather, a multilevel model would be more appropriate.
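If only the per-individual means and the measurement counts are available, one generative alternative to weighting is to let the standard deviation of each mean shrink with its count. Below is a minimal sketch, assuming the measurements of an individual are independent with a common residual sd sigma, so that the mean of n measurements has sd sigma / sqrt(n). The name n_obs, the shared a and b, and the priors are illustrative assumptions, not part of the original model.

```stan
data {
  int<lower=1> N;                  // number of individuals
  vector[N] x;                     // predictor, one value per individual
  vector[N] mean_y;                // observed mean per individual
  array[N] int<lower=1> n_obs;     // hypothetical: measurements behind each mean
}
transformed data {
  // square root of each count, computed once
  vector[N] sqrt_n = sqrt(to_vector(n_obs));
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;             // sd of a single measurement (assumed common)
}
model {
  // placeholder priors, to be adapted to the scale of the data
  a ~ normal(0, 5);
  b ~ normal(0, 5);
  sigma ~ normal(0, 5);

  // a mean of n_obs[i] measurements has sd sigma / sqrt(n_obs[i])
  mean_y ~ normal(a + b * x, sigma ./ sqrt_n);
}
```

Means based on many measurements then constrain a and b more tightly than means based on few, which is the effect the weights were meant to achieve, while the model remains generative. The array[N] syntax assumes a recent Stan version.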

Thanks for your answer. You were right: I did not need to use weighting in the end. I have used a multilevel model instead.
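For readers landing here, this is a minimal sketch of what such a multilevel model might look like when every raw measurement is available. It assumes a varying intercept per individual and a common slope. The names id, mu_a and tau_a and the priors are placeholders, not the model actually used in this thread.

```stan
data {
  int<lower=1> N;                        // total number of raw measurements
  int<lower=1> J;                        // number of individuals
  array[N] int<lower=1, upper=J> id;     // individual each measurement belongs to
  vector[N] x;                           // predictor per measurement
  vector[N] y;                           // raw measurement
}
parameters {
  real mu_a;                             // population mean intercept
  real<lower=0> tau_a;                   // between-individual sd of intercepts
  vector[J] a;                           // individual intercepts
  real b;                                // common slope (an assumption)
  real<lower=0> sigma;                   // residual sd of a single measurement
}
model {
  // placeholder priors, to be adapted to the scale of the data
  mu_a ~ normal(0, 5);
  tau_a ~ normal(0, 2);
  b ~ normal(0, 5);
  sigma ~ normal(0, 2);

  a ~ normal(mu_a, tau_a);               // partial pooling across individuals
  y ~ normal(a[id] + b * x, sigma);      // one likelihood term per raw measurement
}
```

Because each raw measurement contributes its own likelihood term, individuals with more measurements automatically carry more information about their intercept, so no explicit weights are needed.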

Yes, that’s right. This collects the sufficient statistics (number of observations) and is more efficient than just iterating over all the observations. There’s a section on this in the efficiency chapter of the manual.

That’s when they’re part of the model and not part of the generative story. Here, they’re just part of an efficiency improvement for the implementation.
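To illustrate the efficiency point, here is a sketch that assumes exact duplicate data rows have been collapsed into distinct rows plus a count. The name n_rep and the simple regression are hypothetical and only serve the illustration.

```stan
data {
  int<lower=1> K;                  // number of distinct (x, y) rows
  vector[K] x;
  vector[K] y;
  array[K] int<lower=1> n_rep;     // hypothetical: how often each row occurs
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  // priors omitted for brevity, which implies flat (improper) priors
  for (k in 1:K) {
    // same log density contribution as n_rep[k] separate identical
    // observations of y[k], evaluated only once
    target += n_rep[k] * normal_lpdf(y[k] | a + b * x[k], sigma);
  }
}
```

Here the multiplier is just a count of identical terms, so the posterior matches the one obtained by looping over the uncollapsed data. That differs from multiplying the log density of a mean by an arbitrary weight, which does not correspond to a generative model for the data.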
