Choice between weights, MRP or other methods

Hi Stan developers and users,

I have some general questions about which weighting method is appropriate for our data and research question. We are interested in the association between disease interventions and the disease rate in the population. We have county-level data. In each county, there are 7 age groups and quarterly disease rates. Therefore, the total number of rows is ~3130 counties x 7 age groups x 4 quarters, or roughly 87,640 rows. We also have the ACS population estimates for each age group in every county.

If we were using a frequentist method, I think we would use weighted least squares regression with the population size as the weights. Then the coefficients of the intervention variables would be the association that we want to estimate in the population.
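A minimal sketch of that weighted least squares idea, in plain Python with a single predictor. The data values and the names `intervention`, `rate`, and `population` are made up for illustration; this is not our actual model, only the closed-form weighted fit of a line.

```python
def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimizing sum(w_i * (y_i - a - b*x_i)**2)."""
    total_w = sum(w)
    # Weighted means of predictor and outcome
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted cross-product and weighted sum of squares
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy rows: one intervention indicator, one rate, one population size
intervention = [0.0, 1.0, 0.0, 1.0]
rate = [10.0, 8.0, 12.0, 7.0]
population = [500.0, 1200.0, 90000.0, 250000.0]

fit = weighted_least_squares(intervention, rate, population)

# Dividing every weight by 1000 leaves the point estimates unchanged
fit_rescaled = weighted_least_squares(
    intervention, rate, [p / 1000.0 for p in population]
)
```

In this kind of fit only the relative sizes of the weights matter for the coefficients, which is why raw population counts can be used directly.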

I tried specifying weights() using population size by age group and county in the brm function. However, the model failed to converge. I guess it was because there was a lot of variation in population size by age group and county. I also read several posts that compared weighted regression and MrP. But I am not sure whether MrP is applicable in our case, since our data are not survey data with individual respondents. I was wondering whether you could give us suggestions about which method (or a different method) is appropriate in a Bayesian framework.

Hi @Zoe_Kao,

I think the right answer will depend on the kind of intervention you’re studying and the data collected. Do you have data for before and after the intervention? Or is this more of a cross-sectional study?

The data structure that you describe is really complex! To start with, most Bayesian analyses of county disease incidence use something like this:

y_i \sim \text{Poisson}( P_i \times e^{\lambda_i})
\boldsymbol \lambda \sim \text{Gaussian}\big(\boldsymbol \mu, \boldsymbol \Sigma \big)

where P_i is the population at risk for the i^{th} county, \lambda_i is the log-incidence rate (log-risk), and \boldsymbol{\Sigma} has a spatial autocorrelation structure to it, as in a conditional autoregressive specification (CAR).
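To make the role of P_i concrete, here is a small sketch of the Poisson log-likelihood above in plain Python. The county counts and populations are invented, and the Gaussian prior on the log-rates is left out; the point is only that population enters as an exposure term, log(P_i), added to the log-rate, rather than as a weight.

```python
import math


def poisson_log_pmf(y, population, log_rate):
    """log Poisson(y | population * exp(log_rate))."""
    # log of the Poisson mean: log(P_i) acts as an offset on the log-rate
    log_mean = math.log(population) + log_rate
    return y * log_mean - math.exp(log_mean) - math.lgamma(y + 1)


# Hypothetical counties: observed cases and population at risk
cases = [3, 40, 950]
population = [1_200, 15_000, 400_000]

# Without any prior, each county's likelihood peaks at its crude log-rate
crude_log_rates = [math.log(y / p) for y, p in zip(cases, population)]

log_lik = sum(
    poisson_log_pmf(y, p, lam)
    for y, p, lam in zip(cases, population, crude_log_rates)
)
```

Larger counties contribute more information simply because their Poisson means are larger, so no separate weighting step is needed in this formulation.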

If you have observations for multiple time points, then there will also be a temporal autocorrelation structure which can enter into \boldsymbol \mu, but how it enters in depends on the type of data you have, what kind of knowledge you have about disease incidence (mainly, infectious disease modeling might proceed differently than chronic disease modeling, see paper below), and your research questions.

Adding in multiple age groups introduces other challenges, though you don’t necessarily have to get into that. For example, some would not want to model seven observations (age groups) from the same county as independent from each other; I think one way this is addressed is with a more complicated CAR model for multiple outcomes (MCAR), which is not implemented in Stan, as far as I’m aware.

The weights() function in brm is scale-dependent, unlike many frequentist methods. Better to think of them as specifying frequency weights than sampling or post-stratification weights. When I generate a simple model, it includes the line target += weights[n] * (normal_lpdf(Y[n] | mu[n], sigma));, which is the same as if I had weights[n] rows of the same value of Y[n].
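A quick way to check that reading, sketched in plain Python rather than Stan. The `normal_lpdf` helper below is a hand-written stand-in for Stan's function of the same name, and the data, weights, mu, and sigma are arbitrary illustrative values.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return (
        -0.5 * math.log(2 * math.pi)
        - math.log(sigma)
        - 0.5 * ((y - mu) / sigma) ** 2
    )


def weighted_log_lik(ys, weights, mu, sigma):
    """Sum of weights[n] * normal_lpdf(Y[n] | mu, sigma), as in the generated line."""
    return sum(w * normal_lpdf(y, mu, sigma) for y, w in zip(ys, weights))


ys = [1.0, 2.5, 4.0]
weights = [3, 1, 2]

# Weighted log-likelihood on the three original rows
weighted = weighted_log_lik(ys, weights, mu=2.0, sigma=1.5)

# Same data with each row physically repeated weights[n] times, all weights 1
replicated = [y for y, w in zip(ys, weights) for _ in range(w)]
unweighted_replicated = weighted_log_lik(
    replicated, [1] * len(replicated), mu=2.0, sigma=1.5
)

# Multiplying every weight by 10 multiplies the log-likelihood by 10
scaled = weighted_log_lik(ys, [10 * w for w in weights], mu=2.0, sigma=1.5)
```

Here `weighted` and `unweighted_replicated` agree, and `scaled` is ten times `weighted`. Rescaling the weights therefore changes how much data the model thinks it has, so raw population counts used as weights would make each row count as many thousands of observations.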

If your weights are truly frequency weights, then it should be appropriate, though that doesn’t solve your computation problem. Otherwise, the weights() option probably isn’t ideal, as it would be equivalent to having observed the entire population.

Multilevel regression is appropriate; post-stratification is not necessary, since you don’t have individual respondents. Post-stratification adjusts for non-representative sampling. The assumption here is that all counties are reporting on the full population, and you have no way of knowing otherwise.
