Please also provide the following information in addition to your question:
- Operating System: Win 10
- brms Version: 2.1
Hi everyone,
The Stan forum has been a fantastic resource as I moved from mostly frequentist/some JAGS to almost-entirely-Bayesian with Stan/brms. While all my other questions had already been asked and answered, I couldn’t find any help for my issue below.
I’m working with output data from a Hidden Markov Model. That is, for every timepoint, the HMM outputs the probability of being in state 1,2 …N. The model is fit over the entire cohort, with the same number of timepoints per participant. The net result is that, for each participant, we have a ‘fractional occupancy’ or ‘dwell time’ across the states. This is simply the average probability of that participant being in a given state across the participant’s timepoints.
There are a few quirks to my data:
- Some participants don’t visit a state at all (i.e. there are zeros)
- The fractional occupancies for a given participant sum to 1, so the fractional occupancy of a state is not independent of other states
- There are >2 states, so it’s not a simply success/failure model situation
I am interested in demographic and biological factors which increase/decrease the time spent in a given state / the distribution of dwell times across states. So, for instance, does a score on a test result in more time spent in State A, or does higher weight increase dwell time in state B, etc.
I found it hard to come up with a reproducible example, but here is an attempt:
set.seed(10101)
Fake data
df<-data.frame(SubNo=c(1:50),
Weight=rnorm(50,mean=50,sd=10),
Score=rnorm(50,mean=10,sd=3))Set up true state probabilities
df<-df %>%
mutate(StateA=1/3+0.001Weight-0.02Score,
StateB=1/3-0.001Weight,
StateC=1/3+0.02Score)Get the dwell times
df_states<-data.frame(StateA_dt=rep(0,50),
StateB_dt=rep(0,50),
StateC_dt=rep(0,50))for (i in 1:50){
df_states[i,]<-sample_states(100,c(“A”,“B”,“C”),df[i,4:6])
}Combine
final_df<-cbind(df,df_states)
Silly function we will use
sample_states<-function(N, states, probs){
tmp_df<-data.frame(state=sample(states,N,probs,replace = TRUE))
tmp_df$state<-factor(tmp_df$state,levels = states)
freq_vec=as.data.frame(table(tmp_df$state)/N)$Freq
return(freq_vec)
}
It’s a bad example, but at least gives some data to illustrate my attempts.
Initially I thought I would just model each state separately using a zero_inflated_beta family. For instance:
model_1<-brm(StateA_dt~1+Weight+Score,data = final_df,family = zero_inflated_beta(),
save_all_pars = TRUE,cores = 4,chains = 4,iter = 5000, warmup = 3000,
control = list(adapt_delta = 0.95,max_treedepth = 15))
But since a positive association between Weight & dwell time in State A would imply a negative association between Weight and dwell time in some/all other states, this doesn’t seem right.
Then I considered making the data ‘long’
long_df<-final_df %>% mutate(SubNo=as.character(SubNo)) %>%
select(-StateA,-StateB,-StateC) %>%
gather(State,Proportion,-SubNo,-Weight,-Score)
And modelling it using state interactions, e.g.
model_1<-brm(Proportion~0+State+WeightState+ScoreState,data = long_df,family = zero_inflated_beta(), save_all_pars = TRUE,cores = 4,chains = 4,iter = 5000, warmup = 3000,
control = list(adapt_delta = 0.95,max_treedepth = 15))
I’m not sure if that makes any sense however, since it involves fitting an effect for Weight/Score/etc over all states whereas we know they always sum to 1 within participant and as such an increase in one comes at the expense of some/all others.
How would you model outcome data such as mine? Or more generally, if the outcome data are proportions over a set of categorical outcomes which must sum to 1 within participant, how would you approach it?
(I posted in brms since that what I am using, but if the answer is to use some other package/base Stan I’m open to it!)
Thanks
Hugo