Regressing observed rates on self-reported frequencies measured in two stages

Hi! First post; apologies as I’m still learning the community norms. (Pointers welcome!)

I have some survey data, collected during winter months across many years, where the individual responses include:

  • burn_ydy: whether the respondent burned wood yesterday
  • burn_freq_lvl: self-reported “typical” frequency during winter months

Just looking at the means by burn_freq_lvl, it seems that self-reported “typical” rates generally exceed observed rates. See the tabulation below, where wtd_prop*7 is the mean of burn_ydy, weighted by srv_wt and multiplied by 7. For example, among people who report burning “5 day/week”, the observed frequency is more like 3.3 days/week.
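
To be explicit, for each level the last column is

  wtd_prop*7 = 7 * sum(srv_wt * burn_ydy) / sum(srv_wt)

so the 3.3151 in the “5 day/week” row corresponds to a weighted share of 3.3151/7 ≈ 0.474 burning on the surveyed day.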

Regressing burn_ydy on burn_freq_lvl would be somewhat straightforward except that burn_freq_lvl is measured as follows:

First, the respondent is asked whether they burn wood:
a. “Never”
b. “At least once a week”
c. “Less than once a week”

If “b”, the follow-up question is:

  • About how many days per week?

If “c”, the follow-up is:

  • Less than once per month, once per month, or 2-3x per month?

Complicating the analysis is that the respondent can decline to answer at either stage. So the possible responses and their counts are:

                  burn_freq_lvl     n  sum_wt wtd_prop*7
1                         Never 19864 20272.3     0.0000
2      Less than once per month   534   610.4     0.0737
3                Once per month   731   697.3     0.0934
4  Two to three times per month   761   722.5     0.4935
5         Less than once a week    41    35.1     1.1764
6          At least once a week   123    96.8     1.0461
7                    1 day/week   365   358.9     0.8565
8                    2 day/week   427   387.1     1.0310
9                    3 day/week   390   319.5     2.1829
10                   4 day/week   202   140.3     2.9454
11                   5 day/week   175   126.2     3.3151
12                   6 day/week    58    29.2     4.3718
13                   7 day/week   323   221.1     5.2995
14                      Unknown   254   193.4     0.9614

My questions:

Can this kind of two-stage measurement (gross, then refined) be explicitly modeled in Stan, and if so, how? brms syntax is appreciated. If the dataset needs reshaping for convenience, I’m more than happy to do that … here’s a link to the data, which looks like this:

  srv_year srv_wt     cnty_name ZCTA_id burn_ydy        burn_freq_lvl
1     2013  0.672 San Francisco   94131    FALSE                Never
2     2018  0.585     San Mateo   94074     TRUE           7 day/week
3     2011  1.181 San Francisco   94127    FALSE                Never
4     2015  0.224          Napa   94558    FALSE At least once a week
5     2010  3.067       Alameda   94568    FALSE                Never
6     2009  0.117          Napa   94515    FALSE                Never

As you can see, there are other variables that I’d also like to include (survey year, and ZIP code nested within county), as well as the survey weights, provided by the consultancy that has administered the survey for the past 15+ years.

The reason for doing this regression at all is that it is part of a two-step approach, which I inherited from a previous analyst (since retired). That approach is to take the fitted values from this step, and then multiply them by another value measured at the individual level: the self-reported typical amount of wood consumed during a burn event (# of logs). The ultimate goal is to predict a ZCTA (ZIP Code Tabulation Area) level distribution of the amount of wood consumed per week, for each ZCTA polygon, “circa 2020”. There is an association between Y1 (days per week) and Y2 (# logs per typical burn-day), in that people who report burning more frequently also tend to report burning more at a time (i.e., more logs per fire). Doing the calculation at the individual level preserves this association; otherwise the estimate of the product Y1*Y2 is biased low.
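
To spell out why, the identity

  E[Y1*Y2] = E[Y1]*E[Y2] + Cov(Y1, Y2)

means that multiplying separately estimated means drops the covariance term; since the association is positive, the weekly total comes out too low.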

I’m trying to focus first on this two-stage measurement problem, i.e., the first step of the approach. But if you think the two-step approach is flawed and have a suggestion for an alternative, I’d love to hear it! There are issues with the second step, which I’m trying to reserve for another post. (One is missingness not at random, and another is “humanization” of values; unsurprisingly, it appears that people report in colloquial fashion, so values like “10 logs” are much more common than values like “9 logs” or “11 logs”.)

Yes. One way to do it is just to treat it as a 14-way response. That’ll probably work in brms, too. In Stan itself, it’d be easy enough to model the progression of choices separately rather than treating it as a flat 14-way choice. Looks like that’s how your data’s organized, too :-). And you can do the weighting in Stan directly; I don’t know how flexible brms is for that.
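
To make that concrete, here is a minimal sketch of the sequential version, with the survey weights entering as multipliers on the log likelihood. The coding of stage1/stage2 below (including explicit “declined” categories) is an assumption about how you might reshape the data, not anything canonical:

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] srv_wt;              // survey weights
  array[N] int<lower=1, upper=4> stage1;  // 1=Never, 2=at least 1/wk, 3=less than 1/wk, 4=declined
  array[N] int<lower=0, upper=8> stage2;  // 0 = not asked (stage1 was 1 or 4);
                                          // if stage1==2: 1-7 days/wk, 8=declined;
                                          // if stage1==3: 1=<1/mo, 2=1/mo, 3=2-3x/mo, 4=declined
}
parameters {
  simplex[4] theta1;          // stage-1 response probabilities
  simplex[8] theta2_weekly;   // follow-up given "at least once a week"
  simplex[4] theta2_monthly;  // follow-up given "less than once a week"
}
model {
  for (n in 1:N) {
    target += srv_wt[n] * categorical_lpmf(stage1[n] | theta1);
    if (stage1[n] == 2)
      target += srv_wt[n] * categorical_lpmf(stage2[n] | theta2_weekly);
    else if (stage1[n] == 3)
      target += srv_wt[n] * categorical_lpmf(stage2[n] | theta2_monthly);
  }
}
```

To bring in covariates (survey year, ZCTA nested in county), you’d replace each simplex with a categorical_logit regression. Note that with weights multiplying the log likelihood, what you get is a weighted pseudo-posterior rather than a fully Bayesian posterior.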

I’d suggest just modeling directly rather than in two stages. If you do use two stages, you should propagate the uncertainty from the first stage (i.e., carry forward posterior draws rather than point estimates), so that it flows through to your final estimates.

To model it directly, think of the process generatively. There’s an amount of wood that a person burns, then conditioned on that, you get a potentially biased answer to a survey question. In that way, it’s like any other measurement model. You want to infer the underlying amount of wood per person. You’re going to get uncertainty because the model of their response to a survey question isn’t going to be deterministic. That is, given an amount of wood burned, there’s going to be a distribution over potential responses.
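
Here is a hedged sketch of that generative view (the names and distributional choices are illustrative assumptions): a single latent burn propensity per respondent drives both the yesterday indicator and the ordered self-report, and systematic over-reporting shows up as cutpoints sitting lower than a literal days-per-week mapping would imply. The “Unknown” and declined levels would still need the sequential handling sketched above.

```stan
data {
  int<lower=1> N;
  int<lower=2> K;                              // number of ordered report levels
  array[N] int<lower=0, upper=1> burn_ydy;     // burned yesterday?
  array[N] int<lower=1, upper=K> freq_report;  // ordered self-report level
}
parameters {
  vector[N] eta;          // latent propensity: logit of daily burn probability
  real mu;
  real<lower=0> sigma;
  ordered[K - 1] cut;     // cutpoints mapping propensity to report level
}
model {
  mu ~ normal(-2, 2);
  sigma ~ normal(0, 1);
  eta ~ normal(mu, sigma);
  burn_ydy ~ bernoulli_logit(eta);                   // the behavioral observation
  for (n in 1:N)
    freq_report[n] ~ ordered_logistic(eta[n], cut);  // the noisy self-report
}
generated quantities {
  vector[N] days_per_week = 7 * inv_logit(eta);      // full posterior draws
}
```

Carrying the posterior draws of days_per_week into the second step, rather than their means, is exactly what propagates the first-stage uncertainty mentioned above.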

This is another measurement error problem that can be handled the same way. You assume there’s a true number of logs, and then a reporting process that makes “10 logs” more common than “9 logs”. The trick is always figuring out how to estimate this stuff, but I like to follow the advice I got from @andrewgelman, who said he got it from Don Rubin: think about what you would do if you had all the data, then just turn the Bayesian crank to infer the rest.
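
For the logs report, here is a toy version of that idea; everything in it (the Poisson choice, the rounding-to-multiples-of-5 window, the names) is an illustrative assumption. With probability p_heap a respondent rounds their true count to the nearest multiple of 5, and the discrete true count is marginalized out:

```stan
functions {
  real heaped_poisson_lpmf(int y, real lambda, real p_heap) {
    // exact report with probability 1 - p_heap
    real lp_exact = log1m(p_heap) + poisson_lpmf(y | lambda);
    if (y % 5 == 0) {
      vector[5] lp;  // true counts y-2..y+2 all round to y
      for (k in 1:5) {
        int t = y - 3 + k;
        lp[k] = t >= 0 ? poisson_lpmf(t | lambda) : negative_infinity();
      }
      return log_sum_exp(lp_exact, log(p_heap) + log_sum_exp(lp));
    }
    return lp_exact;
  }
}
data {
  int<lower=1> N;
  array[N] int<lower=0> logs;     // reported logs per burn day
}
parameters {
  real<lower=0> lambda;           // mean of the true count
  real<lower=0, upper=1> p_heap;  // probability of rounding
}
model {
  lambda ~ normal(0, 10);         // half-normal via the lower bound
  for (n in 1:N)
    target += heaped_poisson_lpmf(logs[n] | lambda, p_heap);
}
```

The same marginalization pattern generalizes to other heaping targets (e.g., multiples of 10) by widening the window.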

I hope they randomized the days! We used to have fires in the fireplace on the weekends when my parents had more time in the evening.

We’re pretty flexible, so please don’t worry about it. The best thing to do is provide enough info for us to answer the question, which you did.


@Bob_Carpenter, thank you so much for your welcome! I’m reading up on Stan. And BDA, and the burgeoning ecosystem of related packages like bayesplot, tidybayes, marginaleffects, rvar, brms, easystats, etc. — understanding of course that this isn’t an R forum!

I guess my follow-up question is “How?” Any suggestions for cases to look at where people have implemented this in up-to-date Stan syntax would be more than welcome. Toy examples or real ones. Reading other people’s code is great … the trick is finding the right example! I’m happy to go straight to Stan code, don’t need brms or any other wrappers in particular.

I’m happy to hear about meta-help resources too. IIRC you were exploring using GPT-4 to write Stan code from natural-language prompts. How’s that going? Anyone else exploring Claude or other LLM assistants?

I love this way of thinking about things. Are you suggesting a multivariate outcome, with Y1 = burn days/week and Y2 = logs/burn? Or a univariate Y = logs/week, with both pieces of the envisioned data-generating process encoded in the (Stan syntax) model specification? If the latter, can I still recover/generate average marginal effects for Y1 and Y2?

Sorry again – I’m sure I’m using all the wrong words! I have very little stats training, only what I picked up in grad school for public health. I’ll get there. Or at least get farther along. Thanks again for your patience!

P.S. We used to have fires on weekends, too. The survey has a more complicated temporal design, having to do with “Spare the Air Days.” Those are when the agency I work for issues a temporary ban on wood-burning. The short story is that they get called when our meteorologists think air quality will be a problem. And then the survey administrators poll a bunch of people the next day. The original intent of the survey was to assess awareness, attitudes, and behavioral changes in relation to these alerts. More recently we’ve thought about it as a means of informing overall wood-burning emission estimates, including spatial patterns. (Small-area estimation of fireplace prevalence is another thing I hope to leverage this dataset for.) People are polled on other dates too, and the dates are known; I just tried to keep it simple here. I’m hopeful that, eventually, a lot of these wrinkles can be part of the model specification.

P.P.S. The dataset is public, since we’re a government agency. So if it can help others by serving double duty as a learning resource, I’d be happy to support that, longer-term, by writing up stuff and sharing it back here. Who knows, maybe it’s a good case study for something you’re working on?

Following this up: Custom function to reallocate probability mass from certain integers to others. Thanks!