Combining data-sets

Hi all. I’m modeling voter turnout in the US and I want to use 2 different surveys (CPS and CCES) as input. I’ve transformed them both so that I have the same variables for each (age, sex, education, race/ethnicity and population density). For now, I just “stack” the data, creating a design matrix for each data-set and then appending them (along with the number of possible voters and votes cast). Then I fit a hierarchical model since I have this data for each state.

But somehow just stacking the data like it’s one data-set seems…suspect. Is there a standard way to handle this? Another parameter that I would then somehow average over before post-stratification? Is there a good place to read more about this?

Thanks!

What you’re essentially doing here is a kind of meta-analysis (with the raw data), and there are a lot of ways to do it. With Bayes, you can write a separate likelihood for each data set and share as many parameters as you want. If you share all the parameters and the model structure, you get your current solution, which amounts to just stacking the data. If you don’t share any of them, you get completely independent predictions.
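To make the "share everything" case concrete, here is a minimal sketch in plain Python. It assumes a binomial likelihood with a logit link, one covariate, and made-up cell counts; the names `cps`, `cces`, and `binomial_loglik` are hypothetical and nothing here comes from the actual model. It only illustrates that summing two separate log-likelihoods with fully shared parameters gives the same value as one log-likelihood on the stacked data.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def binomial_loglik(cells, intercept, slope):
    """Binomial log-likelihood (dropping the constant term) summed over
    cells given as (covariate, eligible_voters, votes_cast)."""
    total = 0.0
    for x, n, k in cells:
        p = inv_logit(intercept + slope * x)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

# Hypothetical cell counts, for illustration only.
cps = [(0.0, 100, 62), (1.0, 80, 55)]
cces = [(0.0, 50, 35), (1.0, 120, 90)]

intercept, slope = 0.4, 0.3

# Two likelihoods, every parameter shared between them...
separate = binomial_loglik(cps, intercept, slope) + binomial_loglik(cces, intercept, slope)
# ...versus one likelihood on the stacked cells.
stacked = binomial_loglik(cps + cces, intercept, slope)
```

The two totals agree up to floating-point error, so the interesting modeling choice is which parameters, if any, to stop sharing.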

If you’re going to poststratify, then you need to share the variables used for poststratification. One way to do that is to assume there’s a mean parameterization across the data sets plus a difference between them. So if you have an intercept, you can write it as \mu + \beta for one data set and \mu - \beta for the other, then use \mu to poststratify. If the data sets are commensurate, \beta should be small. If \beta is large, you probably want to do some error analysis to see what’s going on.
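Here is a rough sketch of that intercept split, again in plain Python and under assumptions of my own: a logit link, a single covariate, fixed parameter values instead of posterior draws, and hypothetical names (`turnout_prob`, `poststratify`, `cells`). In a real fit, `mu`, `beta`, and `slope` would be estimated, and the poststratification would be repeated for each draw.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def turnout_prob(mu, beta, slope, x, survey):
    """Survey-specific probability: intercept is mu + beta for survey 0
    and mu - beta for survey 1."""
    sign = 1.0 if survey == 0 else -1.0
    return inv_logit(mu + sign * beta + slope * x)

def poststratify(mu, slope, cells):
    """Population-weighted turnout using only the shared intercept mu.
    cells is a list of (covariate, population_count)."""
    total = sum(n for _, n in cells)
    return sum(n * inv_logit(mu + slope * x) for x, n in cells) / total

# Hypothetical values, for illustration only.
mu, beta, slope = 0.5, 0.1, 0.2
p_cps = turnout_prob(mu, beta, slope, x=1.0, survey=0)
p_cces = turnout_prob(mu, beta, slope, x=1.0, survey=1)

cells = [(0.0, 3000), (1.0, 7000)]
estimate = poststratify(mu, slope, cells)
```

On the logit scale the two survey-specific predictions sit symmetrically around the shared one, which is why `mu` is a natural quantity to carry into the poststratification step.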

@andrewgelman and @Lauren are the experts here, so I’m pinging them in case they have anything to add.

Stacking the data is indeed the right thing to do. I wrote about this in 2018 in "Hey! Here’s what to do when you have two or more surveys on the same population!" (Statistical Modeling, Causal Inference, and Social Science).
Also relevant is this post from 2011 on the same blog: "Combining survey data obtained using different modes of sampling".
Lauren and I have also talked about writing a paper on the topic.

We are writing a paper on the topic! :)

Thanks for all the replies! I look forward to the paper, @Lauren and @andrewgelman!

I am doing something like what @Bob_Carpenter suggested, though his suggestion is better. I added a single parameter, with a multiplier set to 0 for one data-set and 1 for the other, to allow the data sets to have some overall offset in predicted probability.
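For anyone comparing the two parameterizations, this is a small sketch of how they map onto each other, assuming the offset enters the linear predictor as `alpha + delta * z` with `z` equal to 0 or 1. The names `alpha`, `delta`, and `offset_to_centered` are hypothetical. Under that assumption `alpha` is the intercept for the data set coded 0, so the centered intercept is the midpoint between the two data sets.

```python
def offset_to_centered(alpha, delta):
    """Convert a 0/1-indicator offset (alpha + delta * z) into the centered
    form with intercept mu + beta for z = 0 and mu - beta for z = 1."""
    mu = alpha + delta / 2.0
    beta = -delta / 2.0
    return mu, beta

# Hypothetical values, for illustration only.
alpha, delta = 0.4, 0.2
mu, beta = offset_to_centered(alpha, delta)

intercept_z0 = mu + beta   # matches alpha
intercept_z1 = mu - beta   # matches alpha + delta
```

The likelihood is the same either way. What changes is which quantity plays the role of the shared intercept, and priors placed on `alpha` and `delta` will not in general imply the same prior on `mu` and `beta`.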

Anyway, it helps to know I’m on the right track!