Modelling with non-binary proportions as outcome

Please also provide the following information in addition to your question:

  • Operating System: Win 10
  • brms Version: 2.1

Hi everyone,

The Stan forum has been a fantastic resource as I moved from mostly frequentist/some JAGS to almost-entirely-Bayesian with Stan/brms. While all my other questions had already been asked and answered, I couldn’t find any help for my issue below.

I’m working with output data from a Hidden Markov Model. That is, for every timepoint, the HMM outputs the probability of being in state 1,2 …N. The model is fit over the entire cohort, with the same number of timepoints per participant. The net result is that, for each participant, we have a ‘fractional occupancy’ or ‘dwell time’ across the states. This is simply the average probability of that participant being in a given state across the participant’s timepoints.
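To make that concrete, here is a minimal sketch of the computation for one participant (post_probs is a hypothetical timepoints × states matrix of the HMM's output probabilities; the name is made up for illustration):

# post_probs[t, k] = P(participant is in state k at timepoint t)
# Fractional occupancy is the column-wise average across timepoints
fractional_occupancy <- colMeans(post_probs)
sum(fractional_occupancy)  # each row sums to 1, so the occupancies do too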

There are a few quirks to my data:

  1. Some participants don’t visit a state at all (i.e. there are zeros)
  2. The fractional occupancies for a given participant sum to 1, so the fractional occupancy of a state is not independent of other states
  3. There are >2 states, so it’s not a simple success/failure modelling situation

I am interested in the demographic and biological factors that increase or decrease the time spent in a given state, i.e. that shift the distribution of dwell times across states. So, for instance, does a higher score on a test result in more time spent in State A, or does higher weight increase dwell time in State B, etc.?

I found it hard to come up with a reproducible example, but here is an attempt:

library(dplyr)

set.seed(10101)

# Fake data
df <- data.frame(SubNo = 1:50,
                 Weight = rnorm(50, mean = 50, sd = 10),
                 Score = rnorm(50, mean = 10, sd = 3))

# Set up true state probabilities (they sum to 1 by construction)
df <- df %>%
  mutate(StateA = 1/3 + 0.001*Weight - 0.02*Score,
         StateB = 1/3 - 0.001*Weight,
         StateC = 1/3 + 0.02*Score)

# Silly function we will use: sample N timepoints from the true state
# probabilities and return the observed proportion of time in each state
sample_states <- function(N, states, probs){
  tmp_df <- data.frame(state = sample(states, N, prob = unlist(probs),
                                      replace = TRUE))
  tmp_df$state <- factor(tmp_df$state, levels = states)
  freq_vec <- as.data.frame(table(tmp_df$state)/N)$Freq
  return(freq_vec)
}

# Get the dwell times
df_states <- data.frame(StateA_dt = rep(0, 50),
                        StateB_dt = rep(0, 50),
                        StateC_dt = rep(0, 50))

for (i in 1:50){
  df_states[i, ] <- sample_states(100, c("A", "B", "C"), df[i, 4:6])
}

# Combine
final_df <- cbind(df, df_states)

It’s a bad example, but at least gives some data to illustrate my attempts.

Initially I thought I would just model each state separately using a zero_inflated_beta family. For instance:

model_1 <- brm(StateA_dt ~ 1 + Weight + Score, data = final_df,
               family = zero_inflated_beta(),
               save_all_pars = TRUE, cores = 4, chains = 4,
               iter = 5000, warmup = 3000,
               control = list(adapt_delta = 0.95, max_treedepth = 15))

But since a positive association between Weight & dwell time in State A would imply a negative association between Weight and dwell time in some/all other states, this doesn’t seem right.

Then I considered making the data ‘long’:

library(tidyr)

long_df <- final_df %>%
  mutate(SubNo = as.character(SubNo)) %>%
  select(-StateA, -StateB, -StateC) %>%
  gather(State, Proportion, -SubNo, -Weight, -Score)

And modelling it using state interactions, e.g.

model_1 <- brm(Proportion ~ 0 + State + Weight*State + Score*State,
               data = long_df, family = zero_inflated_beta(),
               save_all_pars = TRUE, cores = 4, chains = 4,
               iter = 5000, warmup = 3000,
               control = list(adapt_delta = 0.95, max_treedepth = 15))

I’m not sure that makes sense, however, since it fits the effects of Weight/Score/etc. for each state independently, whereas we know the proportions always sum to 1 within a participant, so an increase in one state must come at the expense of some or all of the others.

How would you model outcome data such as mine? Or more generally, if the outcome data are proportions over a set of categorical outcomes which must sum to 1 within participant, how would you approach it?

(I posted in the brms category since that’s what I’m using, but if the answer is to use some other package or base Stan, I’m open to it!)

Thanks

Hugo

Hello Hugo,

From a brief reading, it seems a multinomial, multilevel model is appropriate for your data: discrete time points with k possible states and multiple measures within subjects. Using a multinomial likelihood, you can infer the latent subject-specific probabilities of being observed in each state, while also accounting for various fixed effects influencing the state probabilities (and potential random subject slopes as well). There is no issue with not visiting particular states, as the model will explicitly account for this without the need for zero-inflation parameters (which reflect theoretical assumptions about “false zeroes” that seem inappropriate in your case).

Here is a brief example of the syntax in brms, and here is a very thorough article explaining how to estimate and interpret a multinomial, multilevel model in the context of repeated subject measures across a fixed number of possible behavioral states.

Koster, J., & McElreath, R. (2017). Multinomial analysis of behavior: statistical methods. Behavioral Ecology and Sociobiology, 71(9), 138.
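In rough outline, the syntax looks something like this (a sketch with placeholder names: y would be a matrix of counts with one column per state, total the number of trials per row, and predictor, subject, and d are made up for illustration):

library(brms)

# Multinomial likelihood with correlated subject-level intercepts per state
fit <- brm(bf(y | trials(total) ~ predictor + (1 |ID| subject)),
           data = d, family = multinomial(),
           chains = 4, cores = 4)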


Thanks so much for the detailed reply. It’ll take me a while to wrap my head around the Koster and McElreath paper, but given how much I love his ‘rethinking’ I’m sure it’ll be worth it!

The post by Paul is also great. I’ve attached a small, anonymised subset of the data (10 participants), with 157 timepoints per participant and 10 states.

# A tibble: 1,570 x 12
   sub_no   Age state_bin_01 state_bin_02 state_bin_03 state_bin_04 state_bin_05 state_bin_06 state_bin_07 state_bin_08 state_bin_09 state_bin_10
   <chr>  <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1 a       94.8            0            0            0            0            0            0            0            1            0            0
 2 a       94.8            0            0            0            0            0            0            0            1            0            0
 3 a       94.8            0            0            1            0            0            0            0            0            0            0
 4 a       94.8            0            0            1            0            0            0            0            0            0            0
 5 a       94.8            1            0            0            0            0            0            0            0            0            0
 6 a       94.8            1            0            0            0            0            0            0            0            0            0
 7 a       94.8            0            0            1            0            0            0            0            0            0            0
 8 a       94.8            0            0            1            0            0            0            0            0            0            0
 9 a       94.8            1            0            0            0            0            0            0            0            0            0
10 a       94.8            0            0            0            0            0            0            0            1            0            0
# … with 1,560 more rows

Stealing from Paul’s post:

test_subset$y <- with(test_subset, cbind(state_bin_01, state_bin_02, state_bin_03,
                                         state_bin_04, state_bin_05, state_bin_06,
                                         state_bin_07, state_bin_08, state_bin_09,
                                         state_bin_10))

If I understand correctly, the idea would be to fit something like:

fit <- brm(bf(y | trials(1) ~ (1 |ID| sub_no)), data = test_subset,
           family = multinomial(), save_all_pars = TRUE,
           cores = 4, chains = 4)

That was pretty slow to run (~1 h 15 min), but the results make sense. I suspect that modelling the correlations between the states (the |ID| part, which was recommended in another post) is what makes it far slower?
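If the correlations are indeed the bottleneck, I suppose I could check by swapping (1 |ID| sub_no) for a plain (1 | sub_no), which gives each state its own, uncorrelated subject intercept (a sketch; fit_uncor is just an illustrative name):

# Uncorrelated subject intercepts per state: should be faster to sample,
# at the cost of ignoring dependencies between states within a subject
fit_uncor <- brm(bf(y | trials(1) ~ (1 | sub_no)), data = test_subset,
                 family = multinomial(), cores = 4, chains = 4)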

Since I am interested in the effect of ‘fixed’ effects on the dwell times, the final model would look more like:

fit <- brm(bf(y | trials(1) ~ Var1 + Var2 + Var3 + ... + (1 |ID| sub_no)),
           data = actual_data, family = multinomial(),
           save_all_pars = TRUE, cores = 4, chains = 4)

I’ll let a model with ‘Age’ run overnight. Is there anything you would change in the model specification above?

I suppose another option, more in keeping with Paul’s post, would be to use the participant-level summary of the count data:

   sub_no state_bin_01 state_bin_02 state_bin_03 state_bin_04 state_bin_05 state_bin_06 state_bin_07 state_bin_08 state_bin_09 state_bin_10
   <chr>         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1 a                43            0           17            0            0           38            0           55            0            4
 2 b               106            0            0            0           18            0           11           22            0            0
 3 c                 0          132            0            0            0            0           22            0            0            3
 4 d                57            0           27            9            0           26            2           12            6           18
 5 e                 0            0            0            0          154            0            0            3            0            0
 6 f               104            0            3            0           11            7            2           30            0            0
 7 g                20           15            0            0           17            0          102            1            0            2
 8 h               101            0            0            2           18            2           13           21            0            0
 9 i                51            0           50            0            0            0            0           54            0            2
10 j                 0            0            0            0          113            0           44            0            0            0

That (guessing here) may run faster? Perhaps I need to understand the Koster paper better to know what I’d lose by aggregating.
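For reference, a minimal sketch of that aggregated version (assuming the same test_subset and column names as above; counts_df and fit_counts are just illustrative names):

library(dplyr)

# Collapse the per-timepoint indicators into per-participant counts
counts_df <- test_subset %>%
  group_by(sub_no, Age) %>%
  summarise_at(vars(starts_with("state_bin_")), sum) %>%
  ungroup()

# Bind the counts into a matrix response, as before
counts_df$y <- with(counts_df, cbind(state_bin_01, state_bin_02, state_bin_03,
                                     state_bin_04, state_bin_05, state_bin_06,
                                     state_bin_07, state_bin_08, state_bin_09,
                                     state_bin_10))
counts_df$total <- rowSums(counts_df$y)  # 157 per participant here

fit_counts <- brm(bf(y | trials(total) ~ Age), data = counts_df,
                  family = multinomial(), cores = 4, chains = 4)

(With only one row per participant, though, the (1 |ID| sub_no) term from before would no longer have repeated measures to pool over, which may be part of what I’d lose.)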

Thanks again

Hugo

test_subset.csv (61.5 KB)
