I have a relatively simple regression model that I have fit to two separate datasets from the same underlying population. The questions in each dataset are essentially the same, but the results from each model are somewhat different. I suspect that much of this difference has to do with differential survey response, because one sample appears to rely heavily on survey weights to approximate representativeness. I assume that one can do survey-weighted regression by weighting the log density through “target += …”? I can’t find anything in the manual about this, but I might be missing something.
Any help would be appreciated. Thanks!
Note that I don’t want to circumvent the problem by including the weighting variables in the model specification itself.
Yes, you can do that. The closest the manual comes is a section on “Exploiting sufficient statistics”. I haven’t added anything on weighted regression since our regression experts, Ben Goodrich and Andrew Gelman, don’t like weighting (other than weighting based on sufficient statistics), because the resulting model isn’t properly Bayesian: there’s no generative process for the weights.
As far as I can tell there are 2 possibilities:
If you want to stay fully Bayesian, you can use multilevel regression and post-stratification.
If you definitely want to use weights (knowing that some people do not like them), the approach would be, as described above, to use target +=, whereby you multiply the log posterior for each case by its weight, as in the code below.
Yes, this is what I had thought to do, weighting the log posterior by the survey weight.
MRP is great, but it has drawbacks: you need the joint distribution of all characteristics in the model, which I don’t have (I would need it for cross-sections going back decades), and fitting the model I want with covariates would be extremely messy and a separate paper unto itself.
In your weightless ;-) model this line increments the log posterior:
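(The original line isn’t quoted in the thread; presumably it was the standard sampling statement:)

y ~ normal(yHat, sigma); // unweighted: each observation contributes its log density once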
To implement weighting, you replace it with:
for (i in 1:N) {
  target += normal_lpdf(yHat[n], sigma) * weights[n];
}
You could also first write all the results of normal_lpdf(yHat[n], sigma) to a vector and take its dot product with the weights, but I don’t think this would make the model faster, and it would make the model harder to read.
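A minimal sketch of that dot-product variant, written with the outcome y[n] before the bar (as the parser requires; see the errors discussed below):

vector[N] log_lik;
for (n in 1:N) {
  // pointwise log likelihood of observation n
  log_lik[n] = normal_lpdf(y[n] | yHat[n], sigma);
}
// weighted sum of the pointwise log likelihoods
target += dot_product(weights, log_lik);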
[Your proposal added the sum of the weighted outcomes y to the log posterior, which is not what you want to do.]
@Guido_Biele I’m trying to implement this on a very simple model. Your code chunk says yHat[n], but I’m guessing that is a typo and it should be i instead of n. I tried making that change and I get the following error:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Probabilty functions with suffixes _lpdf, _lpmf, _lcdf, and _lccdf,
require a vertical bar (|) between the first two arguments.
error in 'model_no_groups_with_weights' at line 21, column 35
-------------------------------------------------
19: }
20: for(i in 1:N){
21: target += normal_lpdf(yHat[i], sigma) * weights[i];
^
22: }
-------------------------------------------------
PARSER EXPECTED: "|"
Then I replaced the , with a | as indicated in the error message, but I got a different error:
error in 'model_no_groups_with_weights' at line 21, column 42
-------------------------------------------------
19: }
20: for(i in 1:N){
21: target += normal_lpdf(yHat[i]|sigma) * weights[i];
^
22: }
-------------------------------------------------
In case it’s helpful, this is my full Stan code:
data {
  int<lower=0> N;             // number of data items
  int<lower=0> K;             // number of predictors
  vector[N] y;                // outcome vector
  vector[K] x[N];             // predictor matrix
  vector<lower=0>[N] weights; // model weights
}
parameters {
  real alpha;                 // intercept
  vector[K] beta;             // coefficients for predictors
  real<lower=0> sigma;        // error sd
}
model {
  real yHat[N];
  for (i in 1:N) {
    yHat[i] = alpha + dot_product(x[i], beta);
  }
  for (i in 1:N) {
    target += normal_lpdf(yHat[i]|sigma) * weights[i];
  }
  beta ~ normal(0, 1);
}
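The parser error arises because normal_lpdf expects the outcome before the bar and the distribution’s parameters after it, and the call above never mentions y. A version of the increment that compiles and presumably does what was intended (the same form is quoted further down in this thread):

for (i in 1:N) {
  // outcome y[i] before the bar; location yHat[i] and scale sigma after it
  target += normal_lpdf(y[i] | yHat[i], sigma) * weights[i];
}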
That is not what rstanarm does when weights are included. The stan_glm function does something like
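(sketched from memory; see the rstanarm sources for the exact code:)

target += dot_product(weights, log_lik); // log_lik[i] = pointwise log likelihood of observation i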
Consequently, this makes no sense from a Bayesian perspective unless the original dataset has been collapsed to its unique rows except for a column of weights that counts the number of times that row appears in the original dataset. Otherwise, you are conditioning on something that you did not observe.
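In that collapsed case the weight is just a replication count. E.g., for a hypothetical row y_row observed three times, the increment

target += 3 * normal_lpdf(y_row | mu, sigma); // collapsed: one row with count weight 3

is exactly the same as entering the row three times:

for (k in 1:3)
  target += normal_lpdf(y_row | mu, sigma); // expanded: the same row repeated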
Instead, one has to calculate the standard deviation of each study’s effect outside Stan (see e.g. the compute.es R package, or equation 2 here) and provide it as data.
Then the following model estimates the overall effect size, with each study’s weight determined implicitly by the standard deviation of its effect size.
data {
  int N;            // number of studies
  vector[N] y;      // study effect sizes
  vector[N] sigmas; // standard deviations of study effect sizes
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, sigmas);
}
Here, weighting is “implicit”, because larger studies have smaller variances, as you can see from the equation for the variance of Cohen’s d:
\sigma^2_d = \frac{n_a + n_b}{n_a n_b} + \frac{d^2}{2(n_a + n_b)}
where n_a and n_b are sample sizes of the groups compared to obtain d. (details)
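For example, with n_a = n_b = 50 and d = 0.3, \sigma^2_d = 100/2500 + 0.09/200 \approx 0.0405, so \sigma_d \approx 0.20; with n_a = n_b = 200, \sigma^2_d = 400/40000 + 0.09/800 \approx 0.0101, so \sigma_d \approx 0.10. The larger study’s smaller standard deviation is what gives it more weight in y ~ normal(mu, sigmas).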
Not sure I follow. If I’ve “observed” the sampling, can I not include that information in my estimation? Granted, you won’t have a posterior distribution, but rather a pseudo-posterior. The pseudo-posterior enjoys many desirable properties (reference) and can still be useful, I think.
It isn’t Bayesian. The W_i people in the population who have the same background characteristics as the i-th person in the sample do not all have y_i as their outcome, but multiplying the log-likelihood for the i-th person by W_i assumes they do. I would rather post-stratify predictions from a real posterior distribution than work with something that isn’t a real posterior distribution.
Thanks for clarifying! My confusion about how weights are treated by rstanarm stems from the fact that ?stan_lm says that they’re treated the “same as lm”.
I don’t think lm can be taking an approach analogous to normal_lpdf(y[i] | yHat[i], sigma) * weights[i], though, because lm weights are invariant to scale: multiplying all the weights by a constant leaves lm’s coefficient estimates and standard errors unchanged.
Perhaps it would be clearest to say that weights in rstanarm are replication counts that treat each observation as one or more real observations (analogous to Stata’s fweights).
This would help clarify that scale very much does matter for rstanarm weights – the sum of the weights is equal to the number of observations in the original dataset.