How to best analyse a table of frequencies in brms

#1

Please also provide the following information in addition to your question:

  • Operating System: linux
  • brms Version: 2.8.0

I have a fairly large data set of 45000 observations of deviance from reference performance.
Deviance has a non-normal, skewed distribution with a lower bound at zero: it is dense at low deviances, with a long tail towards higher deviances. For the analysis, I have divided deviance into meaningful ranges (severity): No, Low, Medium, and High deviance.

The data set comprises four variables whose effects I would like to investigate: customer (the customer using the component), customer_product (the product in which the customer uses the component), component_used (the actual component used), and component_type_used (the type of the component). These are all categorical factors. Not all customers make products that use the same components, and some customers have provided more data than others.

Just to give you an idea of the distribution of counts within the various categories, here’s a table of counts:

severity                             N     L     M     H
customer product component_type
A A                               1962    42     9     0
  B                               7111    84    56    10
  C                               4836    14     0     0
B A                                400    42    14     0
  B                                629    30     8     0
  C                                999     2     1     0
C A                               3744    62    12     0
  B                              13639   145    94    17
  C                              11399    47     1     0

First I would like to predict the probability of No, Low, Medium, and High deviance given the customer, the customer_product, and the component_type used.
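
As a baseline to compare any model against, the raw proportion of each severity within each row of the table can be computed directly. The sketch below is plain Python and independent of brms; the counts are copied from the table above, and the dictionary keys are simply the two row labels as printed there (an assumption about how the rows are indexed).

```python
# Counts copied from the table above, in the order N, L, M, H.
# Keys are the two row labels as printed in the table (assumed indexing).
SEVERITIES = ("N", "L", "M", "H")
counts = {
    ("A", "A"): (1962, 42, 9, 0),
    ("A", "B"): (7111, 84, 56, 10),
    ("A", "C"): (4836, 14, 0, 0),
    ("B", "A"): (400, 42, 14, 0),
    ("B", "B"): (629, 30, 8, 0),
    ("B", "C"): (999, 2, 1, 0),
    ("C", "A"): (3744, 62, 12, 0),
    ("C", "B"): (13639, 145, 94, 17),
    ("C", "C"): (11399, 47, 1, 0),
}


def proportions(row):
    """Turn one row of counts into proportions that sum to 1."""
    total = sum(row)
    return tuple(c / total for c in row)


props = {key: proportions(row) for key, row in counts.items()}
total_n = sum(sum(row) for row in counts.values())

print("total observations:", total_n)
for key, p in props.items():
    print(key, " ".join(f"{s}={v:.4f}" for s, v in zip(SEVERITIES, p)))
```

Several rows have zero High (and in one case zero Medium) counts, so the raw proportion there is exactly 0. A model-based estimate would normally not be exactly zero, which is one reason to go beyond the raw table.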

I guess there could be several approaches to analyzing such a data set:

  1. continuous data: model the parameters of a suitable distribution, compare them, and use them to predict the probability that an event lies in a certain range?
  2. count data: model each category as a Bernoulli trial, and get the probability of an event directly?
  3. aggregated count data: model each category as a binomial trial, then transform the posterior log-odds to probabilities?
  4. aggregated count data: model the response as ordinal data, then somehow transform into probabilities?
  5. frequency data: model the response as probabilities using Dirichlet regression?
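
For options 3 and 4, the "transform into probabilities" step can be sketched in a few lines. The example below is plain Python, not brms output: it assumes a cumulative logit model in the common parameterization P(Y <= k) = inv_logit(tau_k - eta), and the threshold and linear-predictor values are hypothetical numbers chosen only for illustration. The same `inv_logit` function is what turns a binomial log-odds into a probability for option 3.

```python
import math


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_logit_probs(thresholds, eta):
    """Category probabilities under P(Y <= k) = inv_logit(tau_k - eta).

    thresholds must be increasing; K thresholds give K + 1 categories.
    """
    cdf = [inv_logit(t - eta) for t in thresholds] + [1.0]
    probs = []
    prev = 0.0
    for c in cdf:
        probs.append(c - prev)  # P(Y = k) = P(Y <= k) - P(Y <= k - 1)
        prev = c
    return probs


# Hypothetical values for one posterior draw and one predictor combination.
thresholds = [4.0, 5.5, 7.5]  # assumed cut points between N|L, L|M, M|H
eta = 0.5                     # assumed linear predictor

probs = cumulative_logit_probs(thresholds, eta)
for name, p in zip(("N", "L", "M", "H"), probs):
    print(f"{name}: {p:.4f}")
```

Applying this to every posterior draw, rather than to posterior means, gives a posterior distribution for each category probability.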

Any suggestion on how to best and most efficiently approach this is most welcome…

Thanks

/Jannik

#2

Why did you categorize your response in the first place? This will likely result in a lot of information loss, so I would go with your suggestion (1).
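
To illustrate how suggestion (1) still yields the probability of each severity range, here is a minimal sketch in plain Python. It assumes a lognormal distribution purely as a stand-in for "a suitable distribution" (it cannot represent exact zeros), and the cut points and parameter values are hypothetical, not taken from the data.

```python
import math
from statistics import NormalDist


def lognormal_cdf(x, mu, sigma):
    """P(X <= x) for a lognormal with log-scale mean mu and sd sigma."""
    if x <= 0:
        return 0.0
    return NormalDist(mu, sigma).cdf(math.log(x))


def range_probs(cuts, mu, sigma):
    """Probability of falling in each range defined by increasing cut points."""
    cdf = [lognormal_cdf(c, mu, sigma) for c in cuts] + [1.0]
    out = []
    prev = 0.0
    for c in cdf:
        out.append(c - prev)  # mass between consecutive cut points
        prev = c
    return out


# Hypothetical cut points separating No | Low | Medium | High deviance.
cuts = [1.0, 2.0, 4.0]
# Hypothetical parameters for one posterior draw and one predictor combination.
mu, sigma = -1.0, 0.8

severity_probs = range_probs(cuts, mu, sigma)
for name, p in zip(("No", "Low", "Medium", "High"), severity_probs):
    print(f"{name}: {p:.4f}")
```

Repeating this for each posterior draw of the distribution's parameters gives range probabilities with uncertainty, without discarding the continuous information before fitting.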