Survey-weighted regression

Hello,

I have a relatively simple regression model that I have fit to two separate datasets from the same underlying population. The questions in each dataset are essentially the same, but the results from each model are somewhat different. I suspect that much of this difference has to do with differential survey response, because one sample appears to rely heavily on survey weights to approximate representativeness. I assume that one can do survey-weighted regression by weighting the log density through “target += …”? I can’t find anything in the manual about this, but I might be missing something.

Any help would be appreciated. Thanks!

Note that I don’t want to circumvent the problem by including the weighting variables in the model specification itself.


Yes, you can do that. The closest the manual comes is the section on “Exploiting sufficient statistics”. I haven’t added anything on weighted regression since our regression experts, Ben Goodrich and Andrew Gelman, don’t like weighting (other than weights based on sufficient statistics) because the resulting model isn’t properly Bayesian: there’s no generative process for the weights.
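
(To illustrate the sufficient-statistics exception: if the data contain exact duplicate rows, you can collapse them and weight each unique row’s log density by its duplicate count, which reproduces the full-data likelihood exactly. A minimal sketch, with hypothetical names N_unique, count, and mu:)

// row i of the collapsed data appears count[i] times in the original data;
// multiplying its log density by count[i] recovers the full-data likelihood
for (i in 1:N_unique)
  target += count[i] * normal_lpdf(y[i] | mu[i], sigma);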

@Gregory were you able to solve this? If so, I would love to see your solution.

As far as I can tell there are two possibilities:

If you want to stay fully Bayesian, you can use multilevel regression and post-stratification (MRP).

If you definitely want to use weights (knowing that some people do not like it), the approach would be, as described above, to use target +=, whereby you multiply the log density for each case by its weight, e.g.

target += normal_lpdf(y[n] | eta[n], sigma) * weight[n];

where n is the index for your case.

As far as I know, this is what rstanarm and brms do when you tell them to use weights.
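
(For reference, the user-facing calls would look something like this; dat, y, x, and w are hypothetical names for a data frame, outcome, predictor, and weight column:)

library(rstanarm)
library(brms)

# rstanarm: weights are passed as a glm-style argument
fit_rstanarm <- stan_glm(y ~ x, data = dat, weights = w, family = gaussian())

# brms: weights are attached to the outcome as an addition term
fit_brms <- brm(y | weights(w) ~ x, data = dat, family = gaussian())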

[A third way might be to add a regression for participation to your model and derive weights from this selection model.]


Yes, this is what I had thought to do, weighting the log posterior by the survey weight.

MRP is great, but it has drawbacks here: you need the joint distribution of all the characteristics in the model, which I don’t have (I would need it for cross-sections going back decades), and fitting the model I want with those covariates would be extremely messy, a separate paper unto itself.

I’m trying to figure out how to implement this in a very simple example. Suppose this is my Stan model without weights:

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta);
  }
  y ~ normal(yHat, sigma); // likelihood
  beta ~ normal(0,1);
}

Is this how I modify that code to weight the log posterior by the survey weights?

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
  vector<lower=0>[N] weights;  // model weights
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta);
  }
  y ~ normal(yHat, sigma); // likelihood
  target += dot_product(weights, y); // Do I add target here?
  beta ~ normal(0,1);
}

Thanks for the help!

In your weightless ;-) model, this line increments the log posterior:

y ~ normal(yHat, sigma); // likelihood

To implement weighting, you replace it with

for(i in 1:N){
  target +=  normal_lpdf(yHat[n], sigma) * weights[n];
}

You could also first write the results of normal_lpdf for each case to a vector and take a dot product with the weights, but I don’t think this would make the model faster, and the model would be harder to read.

[Your proposal added the sum of the weighted outcomes y to the log posterior, which is not what you want to do.]

1 Like

@Guido_Biele I’m trying to implement this on a very simple model. Your code chunk says yHat[n], but I’m guessing that is a typo and it should be i instead of n. I tried making that change and I get the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Probabilty functions with suffixes _lpdf, _lpmf, _lcdf, and _lccdf,
require a vertical bar (|) between the first two arguments.
  error in 'model_no_groups_with_weights' at line 21, column 35
  -------------------------------------------------
    19:   }
    20:   for(i in 1:N){
    21:     target +=  normal_lpdf(yHat[i], sigma) * weights[i];
                                          ^
    22:   }
  -------------------------------------------------

PARSER EXPECTED: "|"

Then I replaced the , with a | as indicated in the error message, but I got a different error:

  error in 'model_no_groups_with_weights' at line 21, column 42
  -------------------------------------------------
    19:   }
    20:   for(i in 1:N){
    21:     target +=  normal_lpdf(yHat[i]|sigma) * weights[i];
                                                 ^
    22:   }
  -------------------------------------------------

In case it’s helpful, this is my full Stan code:

data {
  int<lower=0> N;    // number of data items
  int<lower=0> K;    // number of predictors
  vector[N] y;       // outcome vector
  vector[K] x[N];    // predictor matrix
  vector<lower=0>[N] weights;  // model weights 
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta); 
  }
  for(i in 1:N){
    target +=  normal_lpdf(yHat[i]|sigma) * weights[i];
  }
  beta ~ normal(0,1); 
}

What am I doing wrong?

Thanks again!

Try

  for(i in 1:N){
    target +=  normal_lpdf(y[i] | yHat[i], sigma) * weights[i];
  }
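
(Equivalently, you could collect the pointwise log densities in a vector and take a single dot product with the weights, as @Guido_Biele mentioned above. A minimal sketch, using the same variables as the model above:)

{
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(y[i] | yHat[i], sigma);
  target += dot_product(weights, log_lik);
}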

Really, @Guido_Biele’s post had the solution, just with a typo.

Hi,

(First-time poster, long-time user, and big-time appreciate-er of Stan. Thanks so much for all you guys do!)

Is the above suggestion equivalent to a standard inverse-variance weighting approach?

for(i in 1:N){
  target += normal_lpdf(y[i] | yHat[i], sigma / sqrt(weights[i]));
}

I may well have misunderstood, but I thought the latter was what @bgoodri does in rstanarm when weights are included.

Thanks and best,

Mariel

That is not what rstanarm does when weights are included. The stan_glm function does something like
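
(a sketch of the gist, with hypothetical names mu and weights; the actual stan_glm internals differ in detail, but the key point is that each observation’s log-likelihood contribution is multiplied by its weight)

// sketch of the weighted log-likelihood increment, not rstanarm's actual code
for (i in 1:N)
  target += weights[i] * normal_lpdf(y[i] | mu[i], sigma);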

Consequently, this makes no sense from a Bayesian perspective unless the original dataset has been collapsed to its unique rows except for a column of weights that counts the number of times that row appears in the original dataset. Otherwise, you are conditioning on something that you did not observe.


How does the sigma / sqrt(weights[i]) suggestion above do inverse-variance weighting?

It does not.

Instead, one has to calculate the standard deviation for the effect of each study outside Stan (see e.g. the compute.es R package, or equation 2 here) and provide it as data.

Then, the following model estimates the overall effect size, with each study’s weight depending on its effect size and the associated standard deviation.

data {
  int N;            // number of studies
  vector[N] y;      // study effect sizes
  vector[N] sigmas; // standard deviations of study effect sizes
}
parameters {
  real mu;
}
model {
  mu ~ normal(0,2);
  y ~ normal(mu,sigmas);
}

Here, weighting is “implicit”, because larger studies have smaller variances, as you can see from the equation for the variance of Cohen’s d:

\sigma^2_d = \frac{n_a + n_b}{n_a n_b} + \frac{d^2}{2(n_a + n_b)}

where n_a and n_b are the sample sizes of the groups compared to obtain d. (details)
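
To see the implicit weighting at work in the formula above: for n_a = n_b = 10 and d = 0.5, \sigma^2_d = 20/100 + 0.25/40 \approx 0.206, whereas for n_a = n_b = 100, \sigma^2_d = 200/10000 + 0.25/400 \approx 0.021. The tenfold-larger study has about a tenth of the variance, and since under a flat prior on mu the posterior mean reduces to the classic inverse-variance weighted average \hat{\mu} = \frac{\sum_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}, the larger study gets about ten times the weight.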


Not sure I follow. If I’ve “observed” the sampling, can I not include that information in my estimation? Granted, you won’t have a posterior distribution but rather a pseudo-posterior. Still, the pseudo-posterior enjoys many desirable properties (reference) and can be useful, I think.


Very helpful insights, @Guido_Biele, @maxbiostat, @bgoodri.

How could " the …inclusion probabilities [be used] to exponentiate the likelihood contribution of each sampled unit" in a stan program?

Thanks in advance.

In log space that would be

for(i in 1:N){
  target += normal_lpdf(y[i] | yHat[i], sigma) * weights[i];
}

as pointed out above.
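
(One practical aside: because the scale of the weights acts like a sample size for the pseudo-posterior, as the rstanarm discussion below makes explicit, survey weights are often rescaled to sum to N before use. A sketch:)

transformed data {
  // rescale the raw survey weights so they sum to the sample size N
  vector[N] w = weights * (N / sum(weights));
}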


Thanks, @maxbiostat. That is very helpful.

It isn’t Bayesian. The W_i people in the population who have the same background characteristics as the i-th person in the sample do not all have y_i as their outcome, but multiplying the log-likelihood for the i-th person by W_i assumes they do. I would rather post-stratify predictions from a real posterior distribution than work with something that isn’t a real posterior distribution.


Thanks for clarifying! My confusion about how weights are treated by rstanarm stems from the fact that ?stan_lm says that they’re treated the “same as lm”.

I don’t think lm can be taking an approach analogous to normal_lpdf(y[i] | yHat[i], sigma) * weights[i] though, because lm weights are invariant to scale:

> y <- rnorm(100)
> w <- runif(100)
> fit_w   <- lm(y ~ 1, weights = w)
> fit_10w <- lm(y ~ 1, weights = 10*w)
> SE <- function(fit) summary(fit)$coef['(Intercept)', 'Std. Error'] 
> SE(fit_w) == SE(fit_10w)
[1] TRUE

Perhaps it would be clearest to say that weights in rstanarm are replication counts that treat each observation as one or more real observations (analogous to Stata’s fweights).

This would help clarify that scale very much does matter for rstanarm weights: the sum of the weights equals the number of observations in the original dataset.
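
(To illustrate the replication-count reading with a toy example; the data here are made up:)

# with integer frequency weights, a weighted fit targets the same
# posterior as literally duplicating the weighted rows
d  <- data.frame(y = c(1.0, 2.0, 3.0), x = c(0, 1, 2), w = c(2, 1, 1))
d2 <- d[rep(seq_len(nrow(d)), d$w), ]  # row 1 appears twice

# rstanarm::stan_glm(y ~ x, data = d,  weights = w)  and
# rstanarm::stan_glm(y ~ x, data = d2)
# should give the same posterior (up to MCMC noise);
# note sum(d$w) == nrow(d2), matching the point above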
