I only just discovered that I can run a binomial model (passing in data that is a summary of, e.g., hits and misses) rather than a Bernoulli model (passing in separate observations, with one row in the data for each hit or miss per respondent). With the Bernoulli version, the data was 700k rows, whereas it was about 20 rows for the binomial model. The binomial model ran in 3 seconds, whereas the Bernoulli model took so long that I gave up waiting for it.
Is there a similar way of passing in summaries for categorical models and cumulative/ordinal models? It seemed like the sampling was just so much slower with such a large amount of data, and I wondered if these other types of models could also be sped up if one could just work with a summary of the categories selected.
I use brms rather than Stan directly.
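For anyone reading along, the Bernoulli-to-binomial summary described above can be sketched roughly as follows. The data frame and the column names (group, hit, hits, n) are made up for illustration, and the brm() calls are left commented out because they need brms and a working Stan toolchain.

```r
# Hypothetical long-format data: one row per trial (names are illustrative only)
set.seed(1)
long <- data.frame(
  group = rep(c("a", "b"), each = 500),
  hit   = rbinom(1000, size = 1, prob = 0.3)
)

# Collapse to one row per group: number of hits and number of trials
hits <- tapply(long$hit, long$group, sum)
n    <- tapply(long$hit, long$group, length)
agg  <- data.frame(group = names(hits),
                   hits  = as.vector(hits),
                   n     = as.vector(n))

# Not run here (needs brms and Stan):
# library(brms)
# fit_bern  <- brm(hit ~ group, data = long, family = bernoulli())
# fit_binom <- brm(hits | trials(n) ~ group, data = agg, family = binomial())
```

Both models target the same likelihood for the regression coefficients; the binomial version just evaluates it on far fewer rows.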
Hi @JimBob, for categorical responses there is support in brms using family = multinomial(), where the response variable is expressed as a matrix. I used this model here. Post-processing from brms was really convenient for this.
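A minimal sketch of what the matrix response can look like, assuming made-up data with columns group and choice. As far as I know, the usual pattern is to store the category counts as a matrix column and pass the row totals through trials(); the brm() call is commented out since it needs brms and Stan.

```r
# Hypothetical long-format data: one categorical choice per row
set.seed(2)
long <- data.frame(
  group  = rep(c("a", "b", "c"), each = 200),
  choice = factor(sample(c("red", "green", "blue"), 600, replace = TRUE))
)

# Count how often each category was chosen within each group
counts <- table(long$group, long$choice)

# One row per group, with the counts stored as a matrix column
agg   <- data.frame(group = rownames(counts))
agg$y <- matrix(as.vector(counts), nrow = nrow(counts),
                dimnames = list(NULL, colnames(counts)))
agg$size <- rowSums(agg$y)   # total number of trials per row

# Not run here (needs brms and Stan):
# library(brms)
# fit <- brm(y | trials(size) ~ group, data = agg, family = multinomial())
```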
Not sure about ordinal responses though. I don’t think I’ve seen a situation where multiple trials per observation from ordered categories would fit, but it makes sense. As the cumulative() family, for example, accepts an ordered factor, it would need a different format for the input data.
That’s great, thank you! I think using multinomial could massively speed up some worryingly long sampling times I had for large datasets.
I would have thought it might also be possible to just extend this kind of data input format to cumulative by just having some additional input that specifies an order for the responses, or specifying the factor order in the data that is put in. I don’t know much about how that would all work under the hood though!
For any outcome where observations are independent (including the cumulative logit case), you can use weights() in the model formula to specify frequency weights. For example:
y | weights(w) ~ x1 + x2
From page 42 in the brms CRAN documentation:
For all families, weighted regression may be performed using weights in the aterms part. Internally, this is implemented by multiplying the log-posterior values of each observation by their corresponding weights. Suppose that variable wei contains the weights and that yi is the response variable. Then, formula yi | weights(wei) ~ predictors implements a weighted regression.
This would work if you have a large number of observations that are identical on the outcome and the predictors (y, x1, and x2 in the formula above). You could then count the number of rows that correspond to each combination and treat that count as your frequency weight variable (w in the formula above).
That is a very clever way of thinking about it! So I could, for example, take my very long data frame, but then just summarise it according to the number of people falling into each unique combination of outcome and predictors. Then use the number of people in each said combination as a weight. That is really cool!
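A rough sketch of that summarising step, assuming made-up data with an ordered outcome y and two categorical predictors x1 and x2 (the names just mirror the formula above). The brm() call is commented out because it needs brms and Stan.

```r
# Hypothetical long-format data: one row per respondent
set.seed(3)
long <- data.frame(
  y  = factor(sample(1:4, 2000, replace = TRUE), ordered = TRUE),
  x1 = sample(c("ctl", "trt"), 2000, replace = TRUE),
  x2 = sample(c("young", "old"), 2000, replace = TRUE)
)

# Count the rows in each unique combination of outcome and predictors
long$w <- 1
agg <- aggregate(w ~ y + x1 + x2, data = long, FUN = sum)

# Not run here (needs brms and Stan):
# library(brms)
# fit <- brm(y | weights(w) ~ x1 + x2, data = agg, family = cumulative())
```

This only shrinks the data when the predictors take a limited number of distinct values; with a continuous predictor most combinations would be unique and every weight would be 1.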