Survey-weighted regression

Hello,

I have a relatively simple regression model that I have fit to two separate datasets from the same underlying population. The questions in each dataset are essentially the same, but the results from each model are somewhat different. I suspect that much of this difference has to do with differential survey response, because one sample appears to rely heavily on survey weights to approximate representativeness. I assume that one can do survey-weighted regression by weighting the log density through “target += …”? I can’t find anything in the manual about this, but I might be missing something.

Any help would be appreciated. Thanks!

Note that I don’t want to circumvent the problem by including the weighting variables in the model specification itself.


Yes, you can do that. The closest the manual comes is the section on “Exploiting sufficient statistics”. I haven’t added anything on weighted regression since our regression experts, Ben Goodrich and Andrew Gelman, don’t like weighting (other than weights based on sufficient statistics) because the resulting model isn’t properly Bayesian: there’s no generative process for the weights.
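
(To illustrate the sufficient-statistics exception: if the data contain exact duplicate rows, you can collapse them and weight each unique row’s log density by its duplicate count, which reproduces the full-data likelihood exactly. A minimal sketch, with hypothetical names N_unique, count, and mu:)

// row i of the collapsed data appears count[i] times in the original data;
// multiplying its log density by count[i] recovers the full-data likelihood
for (i in 1:N_unique)
  target += count[i] * normal_lpdf(y[i] | mu[i], sigma);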

@Gregory were you able to solve this? If so, I would love to see your solution.

As far as I can tell there are two possibilities:

If you want to stay fully Bayesian, you can use multilevel regression and post-stratification (MRP).

If you definitely want to use weights (knowing that some people do not like it), the approach would be, as described above, to use target +=, whereby you multiply the log density for each case by its weight, e.g.

target += normal_lpdf(y[n] | eta[n], sigma) * weight[n];

where n is the index for your case.

As far as I know, this is what rstanarm and brms do when you tell them to use weights.
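
(For reference, the user-facing calls would look something like this; dat, y, x, and w are hypothetical names for a data frame, outcome, predictor, and weight column:)

library(rstanarm)
library(brms)

# rstanarm: weights are passed as a glm-style argument
fit_rstanarm <- stan_glm(y ~ x, data = dat, weights = w, family = gaussian())

# brms: weights are attached to the outcome as an addition term
fit_brms <- brm(y | weights(w) ~ x, data = dat, family = gaussian())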

[A third way might be to add a regression for participation to your model and derive weights from this selection model.]


Yes, this is what I had thought to do, weighting the log posterior by the survey weight.

MRP is great, but it has drawbacks here: you need the joint distribution of all the characteristics in the model, which I don’t have (I would need it for cross-sections going back decades), and fitting the model I want with those covariates would be extremely messy, a separate paper unto itself.

I’m trying to figure out how to implement this in a very simple example. Suppose this is my Stan model without weights:

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta);
  }
  y ~ normal(yHat, sigma); // likelihood
  beta ~ normal(0,1);
}

Is this how I modify that code to weight the log posterior by the survey weights?

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
  vector<lower=0>[N] weights;  // model weights
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta);
  }
  y ~ normal(yHat, sigma); // likelihood
  target += dot_product(weights, y); // Do I add target here?
  beta ~ normal(0,1);
}

Thanks for the help!

In your weightless ;-) model, this line increments the log posterior:

y ~ normal(yHat, sigma); // likelihood

To implement weighting, you replace it with

for(i in 1:N){
  target +=  normal_lpdf(yHat[n], sigma) * weights[n];
}

You could also first write the results of normal_lpdf for each case to a vector and take a dot product with the weights, but I don’t think this would make the model faster, and the model would be harder to read.

[Your proposal added the sum of the weighted outcomes y to the log posterior, which is not what you want to do.]

1 Like

@Guido_Biele I’m trying to implement this on a very simple model. Your code chunk says yHat[n], but I’m guessing that is a typo and it should be i instead of n. I tried making that change and I get the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Probabilty functions with suffixes _lpdf, _lpmf, _lcdf, and _lccdf,
require a vertical bar (|) between the first two arguments.
  error in 'model_no_groups_with_weights' at line 21, column 35
  -------------------------------------------------
    19:   }
    20:   for(i in 1:N){
    21:     target +=  normal_lpdf(yHat[i], sigma) * weights[i];
                                          ^
    22:   }
  -------------------------------------------------

PARSER EXPECTED: "|"

Then I replaced the , with a | as indicated in the error message, but I got a different error:

  error in 'model_no_groups_with_weights' at line 21, column 42
  -------------------------------------------------
    19:   }
    20:   for(i in 1:N){
    21:     target +=  normal_lpdf(yHat[i]|sigma) * weights[i];
                                                 ^
    22:   }
  -------------------------------------------------

In case it’s helpful, this is my full Stan code:

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
  vector<lower=0>[N] weights;  // model weights 
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta); 
  }
  for(i in 1:N){
    target +=  normal_lpdf(yHat[i]|sigma) * weights[i];
  }
  beta ~ normal(0,1); 
}

What am I doing wrong?

Thanks again!

Try

  for(i in 1:N){
    target +=  normal_lpdf(y[i] | yHat[i], sigma) * weights[i];
  }
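
(Equivalently, you could collect the pointwise log densities in a vector and take a single dot product with the weights, as @Guido_Biele mentioned above. A minimal sketch, using the same variables as the model above:)

{
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(y[i] | yHat[i], sigma);
  target += dot_product(weights, log_lik);
}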

Really, @Guido_Biele’s post had the solution, just with a typo.

Hi,

(First-time poster, long-time user, and big-time appreciate-er of Stan. Thanks so much for all you guys do!)

Is the above suggestion equivalent to a standard inverse-variance weighting approach?

for(i in 1:N){
  target += normal_lpdf(y[i] | yHat[i], sigma / sqrt(weights[i]));
}

I may well have misunderstood, but I thought the latter was what @bgoodri does in rstanarm when weights are included.

Thanks and best,

Mariel

That is not what rstanarm does when weights are included. The stan_glm function does something like
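
(a sketch of the gist, with hypothetical names mu and weights; the actual stan_glm internals differ in detail, but the key point is that each observation’s log-likelihood contribution is multiplied by its weight)

// sketch of the weighted log-likelihood increment, not rstanarm's actual code
for (i in 1:N)
  target += weights[i] * normal_lpdf(y[i] | mu[i], sigma);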

Consequently, this makes no sense from a Bayesian perspective unless the original dataset has been collapsed to its unique rows except for a column of weights that counts the number of times that row appears in the original dataset. Otherwise, you are conditioning on something that you did not observe.


How does the sigma / sqrt(weights[i]) suggestion above do inverse-variance weighting?

It does not.

Instead, one has to calculate the standard deviation for the effect of each study outside Stan (see e.g. the compute.es R package, or equation 2 here) and provide it as data.

Then, the following model estimates the overall effect size, with each study’s weight depending on its effect size and the associated standard deviation.

data {
  int N;            // number of studies
  vector[N] y;      // study effect sizes
  vector[N] sigmas; // standard deviations of study effect sizes
}
parameters {
  real mu;
}
model {
  mu ~ normal(0,2);
  y ~ normal(mu,sigmas);
}

Here, weighting is “implicit”, because larger studies have smaller variances, as you can see from the equation for the variance of Cohen’s d:

\sigma^2_d = \frac{n_a + n_b}{n_a n_b} + \frac{d^2}{2(n_a + n_b)}

where n_a and n_b are the sample sizes of the groups compared to obtain d. (details)
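
To see the implicit weighting at work in the formula above: for n_a = n_b = 10 and d = 0.5, \sigma^2_d = 20/100 + 0.25/40 \approx 0.206, whereas for n_a = n_b = 100, \sigma^2_d = 200/10000 + 0.25/400 \approx 0.021. The tenfold-larger study has about a tenth of the variance, and since under a flat prior on mu the posterior mean reduces to the classic inverse-variance weighted average \hat{\mu} = \frac{\sum_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}, the larger study gets about ten times the weight.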


Not sure I follow. If I’ve “observed” the sampling, can I not include that information in my estimation? Granted, you won’t have a posterior distribution but rather a pseudo-posterior. Still, the pseudo-posterior enjoys many desirable properties (reference) and can be useful, I think.


Very helpful insights, @Guido_Biele, @maxbiostat, @bgoodri.

How could " the …inclusion probabilities [be used] to exponentiate the likelihood contribution of each sampled unit" in a stan program?

Thanks in advance.

In log space that would be

for(i in 1:N){
  target += normal_lpdf(y[i] | yHat[i], sigma) * weights[i];
}

as pointed out above.
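
(One practical aside: because the scale of the weights acts like a sample size for the pseudo-posterior, as the rstanarm discussion below makes explicit, survey weights are often rescaled to sum to N before use. A sketch:)

transformed data {
  // rescale the raw survey weights so they sum to the sample size N
  vector[N] w = weights * (N / sum(weights));
}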


Thanks, @maxbiostat. That is very helpful.

It isn’t Bayesian. The W_i people in the population who have the same background characteristics as the i-th person in the sample do not all have y_i as their outcome, but multiplying the log-likelihood for the i-th person by W_i assumes they do. I would rather post-stratify predictions from a real posterior distribution than work with something that isn’t a real posterior distribution.


Thanks for clarifying! My confusion about how weights are treated by rstanarm stems from the fact that ?stan_lm says that they’re treated the “same as lm”.

I don’t think lm can be taking an approach analogous to normal_lpdf(y[i] | yHat[i], sigma) * weights[i] though, because lm weights are invariant to scale:

> y <- rnorm(100)
> w <- runif(100)
> fit_w   <- lm(y ~ 1, weights = w)
> fit_10w <- lm(y ~ 1, weights = 10*w)
> SE <- function(fit) summary(fit)$coef['(Intercept)', 'Std. Error'] 
> SE(fit_w) == SE(fit_10w)
[1] TRUE

Perhaps it would be clearest to say that weights in rstanarm are replication counts that treat each observation as one or more real observations (analogous to Stata’s fweights).

This would help clarify that scale very much does matter for rstanarm weights: the sum of the weights equals the number of observations in the original dataset.
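
(To illustrate the replication-count reading with a toy example; the data here are made up:)

# with integer frequency weights, a weighted fit targets the same
# posterior as literally duplicating the weighted rows
d  <- data.frame(y = c(1.0, 2.0, 3.0), x = c(0, 1, 2), w = c(2, 1, 1))
d2 <- d[rep(seq_len(nrow(d)), d$w), ]  # row 1 appears twice

# rstanarm::stan_glm(y ~ x, data = d,  weights = w)  and
# rstanarm::stan_glm(y ~ x, data = d2)
# should give the same posterior (up to MCMC noise);
# note sum(d$w) == nrow(d2), matching the point above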
