Hi! First post; apologies as I’m still learning the community norms. (Pointers welcome!)
I have some survey data, collected during winter months across many years, where the individual responses include:
- `burn_ydy`: whether the respondent burned wood yesterday
- `burn_freq_lvl`: self-reported “typical” frequency during winter months
Just looking at the means by `burn_freq_lvl`, it seems that self-reported “typical” rates generally exceed observed rates. See the tabulation below, where `wtd_prop*7` is the mean of `burn_ydy`, weighted by `srv_wt` and multiplied by 7. For example, among people who report burning “5 day/week”, the observed frequency is more like 3.3 days/week.
Regressing `burn_ydy` on `burn_freq_lvl` would be somewhat straightforward, except that `burn_freq_lvl` is measured as follows:
First, the respondent is asked whether they burn wood:
a. “Never”
b. “At least once a week”
c. “Less than once a week”
If “b”, the follow-up question is:
- About how many days per week?
If “c”, the follow-up is:
- Less than once per month, once per month, or 2-3x per month?
Complicating the analysis is that the respondent can decline to answer at either stage. So the possible responses and their counts are:
```
   burn_freq_lvl                      n   sum_wt  wtd_prop*7
1  Never                          19864  20272.3      0.0000
2  Less than once per month         534    610.4      0.0737
3  Once per month                    731    697.3      0.0934
4  Two to three times per month      761    722.5      0.4935
5  Less than once a week              41     35.1      1.1764
6  At least once a week              123     96.8      1.0461
7  1 day/week                        365    358.9      0.8565
8  2 day/week                        427    387.1      1.0310
9  3 day/week                        390    319.5      2.1829
10 4 day/week                        202    140.3      2.9454
11 5 day/week                        175    126.2      3.3151
12 6 day/week                         58     29.2      4.3718
13 7 day/week                        323    221.1      5.2995
14 Unknown                           254    193.4      0.9614
```
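In case it helps: one way I've been thinking about these categories is as rough intervals of days per week, where the gross-stage-only answers (“Less than once a week”, “At least once a week”) become wide intervals rather than point values. Below is a hypothetical recoding sketch; the cutpoints are my own guesses, and `dat` is just whatever I call the loaded data frame.

```r
library(dplyr)
library(tibble)

# Hypothetical recoding: each reported category becomes a rough
# [lo, hi] interval of days per week. Gross-stage-only answers and
# "Unknown" get wide intervals; the cutpoints here are guesses.
freq_bounds <- tribble(
  ~burn_freq_lvl,                 ~lo,  ~hi,
  "Never",                        0,    0,
  "Less than once per month",     0,    0.25,
  "Once per month",               0.25, 0.25,
  "Two to three times per month", 0.5,  0.75,
  "Less than once a week",        0,    1,    # gross stage only
  "At least once a week",         1,    7,    # gross stage only
  "1 day/week",                   1,    1,
  "2 day/week",                   2,    2,
  "3 day/week",                   3,    3,
  "4 day/week",                   4,    4,
  "5 day/week",                   5,    5,
  "6 day/week",                   6,    6,
  "7 day/week",                   7,    7,
  "Unknown",                      0,    7     # no information
)

dat_recoded <- dat %>%
  left_join(freq_bounds, by = "burn_freq_lvl")
```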
My questions:
Can this kind of two-stage measurement (gross, then refined) be explicitly modeled in Stan, and if so, how? `brms` syntax is appreciated. If the dataset needs reshaping for convenience, I’m more than happy to do that. Here’s a link to the data, which looks like this:
```
  srv_year srv_wt cnty_name     ZCTA_id burn_ydy burn_freq_lvl
1     2013  0.672 San Francisco   94131    FALSE Never
2     2018  0.585 San Mateo       94074     TRUE 7 day/week
3     2011  1.181 San Francisco   94127    FALSE Never
4     2015  0.224 Napa            94558    FALSE At least once a week
5     2010  3.067 Alameda         94568    FALSE Never
6     2009  0.117 Napa            94515    FALSE Never
```
As you can see, there are other variables that I’d also like to include: survey year, and ZIP code nested within county. There are also survey weights, provided by the consultancy that has administered the survey for the past 15+ years.
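To make that concrete, the general shape of the `brms` call I have in mind is sketched below. This deliberately sets aside the two-stage measurement issue (it just treats `burn_freq_lvl` as a plain factor), and `dat` is again my placeholder name for the linked data frame.

```r
library(brms)

# Minimal sketch: Bernoulli model for "burned yesterday", with survey
# weights, a year effect, and ZCTA nested within county. This ignores
# the two-stage measurement of burn_freq_lvl for now.
dat$burn_ydy <- as.integer(dat$burn_ydy)  # logical -> 0/1

fit <- brm(
  burn_ydy | weights(srv_wt) ~ burn_freq_lvl +
    (1 | srv_year) + (1 | cnty_name / ZCTA_id),
  data   = dat,
  family = bernoulli()
)
```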
The reason for doing this regression at all is that it is part of a two-step approach, which I inherited from a previous analyst (since retired). That approach is to take the fitted values from this step and then multiply them by another value measured at the individual level: the self-reported typical amount of wood consumed during a burn event (# of logs). The ultimate goal is to predict a ZCTA (ZIP Code Tabulation Area) level distribution of the amount of wood consumed per week, for each ZCTA polygon, “circa 2020”. There is an association between Y1 (days per week) and Y2 (# of logs per typical burn day), in that people who report burning more frequently also tend to report burning more at a time (i.e., more logs per fire). Doing the calculation at the individual level preserves this association; otherwise the estimate of the product Y1*Y2 is biased low.
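(To spell out the bias point as I understand it: since E[Y1·Y2] = E[Y1]·E[Y2] + Cov(Y1, Y2), multiplying the two marginal means understates the mean of the product whenever the covariance is positive, as it appears to be here. That is my reading of why the inherited approach does the multiplication at the individual level.)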
I’m trying to focus first just on this two-stage measurement problem/technique for the first step of the approach. But if you think the two-step approach itself is flawed and have a suggestion for an alternative, I’d love to hear it! There are issues with the second step, which I’m trying to reserve for another post. (One is missingness that is not at random; another is “humanization” of values; unsurprisingly, people appear to report in colloquial fashion, so values like “10 logs” are much more common than “9 logs” or “11 logs”.)