Election forecast estimation using survey and census data applying MrP method in R

Hey all!

For a project I want to calculate an election forecast. I ran a survey and collected data (the sampling process was highly selective, so the data are biased). I also have access to a census dataset, so I know the real population (the electorate) well. Now I want to do an MrP estimation in R based on my sample (the survey) and the census.


I have already coded something (see below), but I’m unsure whether this is the best or only way to go…

My questions are:

  1. How is it possible in practice to get estimates for a categorical variable like party preference with MrP? The examples I found online mostly use a binary variable such as yes/no. In my use case, I have multiple parties to estimate.
  2. Is the code below appropriate and useful for my use case? With this code I run MrP separately for each party (e.g. the German “union” party in the example), so I get estimates for all 7 parties. The problem is that the party estimates sum to less or more than 100% when I combine them into my final joint estimate. How can I estimate all parties at once with MrP, so that the total is 1 (100%)?
  3. How does MrP handle NA values in the sample variables used to fit the model, which can’t be matched to the census? Are they simply weighted as 1?
  4. Running the code for my whole dataset (n = 10,000) takes really long (more than 10 minutes) in R even though I have quite good hardware. Why is that?

Here is the actual code for the MrP estimation for the “union” party. (As described above, I’m not convinced by this approach; I would rather calculate the vote share for all parties in one MrP model.)

library(rstanarm)

# create model: estimate union vote share by gender, age group, last voting decision (party_2017) and state
fit_model_sample_union <- stan_glmer(
  union_vote ~ 1 + (1 | gender) + (1 | agegroup) + (1 | party_2017) + (1 | state),
  family = binomial(link = "logit"),
  data = sample_mrp_union,
  prior = normal(0, 1, autoscale = TRUE),
  prior_covariance = decov(scale = 0.50),
  adapt_delta = 0.99,
  refresh = 0,
  seed = 111
)
print(fit_model_sample_union)

# calculate MRP estimate mean and sd: weight each cell's predicted probability by its census share
epred_mat <- posterior_epred(fit_model_sample_union, newdata = census_mrp, draws = 1000)
cell_weights <- census_mrp$votes_valid_party / sum(census_mrp$votes_valid_party)
mrp_estimates_vector <- epred_mat %*% cell_weights
mrp_estimate <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector))
cat("MRP estimate mean, sd: ", round(mrp_estimate, 5))

Looking forward to your ideas / help :-)


@lauren @jonah

Does multinomial regression in brms work? This will constrain party probabilities to sum to 1.

So I’m going to tackle these in order. Regarding estimation of non-census variables, you will first need to fit a model predicting the variable so that you can include it in your post-stratification table. See @andrewgelman’s MrP Case Studies, which detail how to do this for a multi-category party identification variable using multinomial logistic regression via brms: Chapter 2, MRP with Noncensus Variables | Multilevel Regression and Poststratification Case Studies

It also sounds like you’ll want to use brms for your final model, since your response isn’t dichotomous but does consist of mutually exclusive categories, and rstanarm doesn’t support multinomial logit.
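To make the "sums to 1" point concrete, here is a minimal base-R sketch of the poststratification step for a categorical model. It assumes the predictions come as a 3D array of draws x cells x parties, which is the shape brms `posterior_epred()` returns for a `categorical()` model. To keep it runnable without fitting anything, the array is simulated here; the party labels and cell counts are made up.

```r
# Sketch: poststratify a categorical model so that party shares sum to 1.
# `epred` stands in for posterior_epred() output from a categorical model:
# an array of [draws, poststratification cells, parties].
# All numbers are simulated for illustration only.
set.seed(111)
n_draws <- 200
n_cells <- 12
parties <- c("union", "spd", "other")  # hypothetical party labels

# Simulate linear predictors and apply a softmax within each draw/cell,
# so that the party probabilities in every cell sum to 1.
eta <- array(rnorm(n_draws * n_cells * length(parties)),
             dim = c(n_draws, n_cells, length(parties)),
             dimnames = list(NULL, NULL, parties))
exp_eta <- exp(eta)
denom <- apply(exp_eta, c(1, 2), sum)          # [draws, cells]
epred <- sweep(exp_eta, c(1, 2), denom, "/")   # [draws, cells, parties]

# Hypothetical census cell sizes, turned into poststratification weights
cell_counts <- sample(50:500, n_cells)
w <- cell_counts / sum(cell_counts)

# For each party, weight the cell-level probabilities by the census share
share_draws <- apply(epred, 3, function(p) as.vector(p %*% w))  # [draws, parties]

# Posterior mean and sd of the population-level share per party
mrp_summary <- rbind(mean = colMeans(share_draws),
                     sd   = apply(share_draws, 2, sd))
print(round(mrp_summary, 3))
```

Because the weights sum to 1 and the probabilities sum to 1 within each cell, the poststratified shares sum to 1 in every posterior draw, which separate per-party binomial models do not guarantee.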

Second, generally speaking, Stan does not support missing values, and the default behavior is that rows containing them are dropped before the model matrix is built. One exception is that you can specify a multivariate model to impute the missing data, at the cost of increased computation time. See the brms vignette for more details on that: Handle Missing Values with brms

If you attempt to pass a data frame (i.e., your post-stratification table) to the predict method and it contains missing values, you will simply receive an error saying missing values aren’t allowed.
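A quick way to see both effects before fitting is to count incomplete rows yourself. This is only a sketch on toy data: the column names follow the thread, the values are made up, and dropping incomplete cells from the table is an assumption for illustration (repairing the table is usually the better fix).

```r
# Toy survey and poststratification table; values are made up.
sample_mrp <- data.frame(
  union_vote = c(1, 0, 1, NA, 0),
  agegroup   = c("18-29", "30-44", NA, "45-59", "60+"),
  state      = c("BY", "NW", "BY", "NW", NA)
)
census_mrp <- data.frame(
  agegroup          = c("18-29", "30-44", "45-59", "60+"),
  state             = c("BY", "NW", NA, "NW"),
  votes_valid_party = c(120, 300, 250, 410)
)

# Survey rows that default NA handling would drop from the fit
model_vars <- c("union_vote", "agegroup", "state")
n_dropped <- sum(!complete.cases(sample_mrp[, model_vars]))
message(n_dropped, " of ", nrow(sample_mrp), " survey rows contain NA")

# Poststratification cells with NA cannot be predicted; find them up front
bad_cells <- !complete.cases(census_mrp)
message(sum(bad_cells), " poststratification cells contain NA")

# One option (an assumption, not a recommendation): keep only complete cells
census_clean <- census_mrp[!bad_cells, ]
```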

Finally, MrP models tend to be incredibly complex (varying effects for basically everything) and thus computationally intensive. I have a Ryzen 9 5900X and an RTX 3090, and even after manually vectorizing my Stan code and utilizing GPU-based computation via OpenCL, a multinomial logistic regression model for MrP takes about three full days to finish (N = ~30,000). Patience is a virtue that complex models are effectively impossible to fit via MCMC without.

Oh, and regarding your model specification, drop the varying intercept for gender, since there’s nothing gained by including varying intercepts for any variable with fewer than three groups. Does your census data have the marginal for education by age group? That’s probably pretty predictive of vote choice and should be included in your model.

Ignoring party_2017, which I’m assuming is the non-census variable you mentioned, your equation should look something like this, though I am admittedly not an expert on German politics:

vote ~ gender + (1 | agegroup) + (1 | educ) + (1 | agegroup:educ) + (1 | gender:educ) + (1 | gender:agegroup) + (1 | state)
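Note that the interaction terms in this formula require the post-stratification table to hold joint counts for every gender x agegroup x educ x state cell, not just the margins. A minimal sketch of building such a table, assuming you have individual-level census records (the `census_micro` data frame and its values here are hypothetical and simulated):

```r
# Simulated individual-level census records; purely illustrative.
set.seed(1)
census_micro <- data.frame(
  gender   = sample(c("female", "male"), 500, replace = TRUE),
  agegroup = sample(c("18-29", "30-44", "45-59", "60+"), 500, replace = TRUE),
  educ     = sample(c("low", "mid", "high"), 500, replace = TRUE),
  state    = sample(c("BY", "NW", "SN"), 500, replace = TRUE)
)

# Cross-tabulate all four variables to get one row per cell with its count
poststrat <- as.data.frame(table(census_micro), responseName = "n")

# Empty cells carry no weight, so they can be dropped
poststrat <- poststrat[poststrat$n > 0, ]

# Share of the population in each cell, used to weight the predictions
poststrat$weight <- poststrat$n / sum(poststrat$n)

head(poststrat)
```

If the census only publishes margins, the joint table has to be estimated some other way, which is a separate problem.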

However, you should also add at least one state-level predictor to attempt to predict the variance between states.
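In practice that means merging the state-level variable into both the survey and the post-stratification table, so it is available for fitting and for prediction. A small sketch; `prev_union_share` is a hypothetical predictor and all values are made up:

```r
# Hypothetical state-level predictor (made-up values)
state_info <- data.frame(
  state            = c("BY", "NW", "SN"),
  prev_union_share = c(0.39, 0.33, 0.27)
)

# Toy survey and poststratification table
sample_mrp <- data.frame(
  state = c("NW", "BY", "SN", "BY"),
  vote  = c("union", "spd", "other", "union")
)
poststrat <- data.frame(state = c("BY", "NW", "SN"), n = c(100, 200, 50))

# Attach the predictor to both data sets by state
sample_mrp <- merge(sample_mrp, state_info, by = "state", all.x = TRUE)
poststrat  <- merge(poststrat,  state_info, by = "state", all.x = TRUE)

# Every state must be matched, otherwise prediction fails on NA
stopifnot(!anyNA(sample_mrp$prev_union_share),
          !anyNA(poststrat$prev_union_share))

# The predictor then enters the formula as an ordinary fixed effect, e.g.
# vote ~ prev_union_share + gender + (1 | agegroup) + (1 | state)
```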

A couple quick comments:

  1. Give strong priors on the group-level sd parameters and that can help.

  2. You can even set them to fixed values. This wasn’t so clear in my book with Jennifer, where the group-level sd parameters were all estimated solely from data, but I think it can make a lot of sense to just set them to reasonable values. Also this will speed computation.

  3. Quick comment regarding statistical practice. It’s not good to have a binary variable called gender as then you have to remember the coding. Instead create a variable called male (so that female is the base category). Or a variable called female. Either way, this will make your model more interpretable.
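For the third point, the recoding could look like this. The raw labels "male"/"female" are an assumption about how the survey stores gender; the same recoding would have to be applied to the post-stratification table.

```r
# Toy survey with an assumed raw coding of gender
survey <- data.frame(gender = c("male", "female", "female", NA, "male"))

# Indicator that names its own coding: 1 = male, 0 = female (base category)
survey$male <- as.integer(survey$gender == "male")  # NA stays NA

print(survey)
```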

If your model with only 30,000 respondents is taking 3 days to fit, I think you should be changing your model. Folk theorem and all that.
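For reference, the suggestion above about strong or fixed priors on the group-level sd parameters might be written in brms roughly as follows. This is a sketch: the scale 0.5 is an arbitrary placeholder rather than a recommendation, and the model call in the comment reuses the binary union_vote outcome from the first post. For a categorical() model, brms may expect these priors per category via the `dpar` argument, so check `get_prior()` for your model first.

```r
library(brms)

# Option 1: a strong (informative) prior on all group-level sd parameters.
# brms bounds sd parameters at zero, so this acts as a half-normal.
strong_sd_prior <- set_prior("normal(0, 0.5)", class = "sd")

# Option 2: fix the group-level sd of one grouping factor to a constant
fixed_sd_prior <- set_prior("constant(0.5)", class = "sd", group = "state")

# Either would be passed to brm() through the `prior` argument, e.g.
# fit <- brm(union_vote ~ male + (1 | agegroup) + (1 | state),
#            data = sample_mrp_union, family = bernoulli(),
#            prior = strong_sd_prior)
```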
