Using Stan for modelling changes in categorical outcomes: am I doing it right?

Hi everyone,

As will probably be quite obvious from this post, I’m very new to Stan and still wrapping my head around a lot of Bayesian modelling concepts, so please forgive me if my question is better suited to a more general stats forum.

Essentially, in my field of research, we use a lot of intermittent experience sampling probes to measure people’s states of attention throughout different tasks. These probes are usually categorical, forcing people to select one of N possible options from a list that best represents their state of focus just prior (e.g. “Yes, I was focused on the task”, “No, my mind had wandered from the task inadvertently”, “No, I deliberately decided to think about something else”, “No, I was distracted by something else around me”). We’re usually interested in how different variables (some categorical, some continuous) affect how people respond to these thought probes, and typically take the approach of calculating the proportions of each response type for each participant (e.g. 50% on-task, 20% distracted, etc.) and running different frequentist tests on them that way. Of course, this type of approach ignores the fact the proportions of all the different types of response must sum to 1 and that an increase in the probability of one response type will have an effect on the probability of others, so I’m turning to Stan and Bayesian modelling to try and come up with a more holistic and sane approach to probe response analysis that models what we know about the data better.

Based on different tutorials and examples, I’ve pieced together a simple model that I think does this properly, but it works a little differently from many of the example models I’ve seen and I want to make sure I’m not doing something patently wrong or bad so far. Here’s my current model code:

data {
  int<lower=0> N; // number of trials
  int<lower=1> K; // number of categories
  int<lower=1,upper=2> cond[N]; // condition identifier
  int<lower=1,upper=3> resp[N];


parameters {
  simplex[K] resp_p[2];

model {
  for (i in 1:2) {
	  resp_p[i] ~ dirichlet(rep_vector(1, K));
  for (i in 1:N) {
	  resp[i] ~ categorical(resp_p[cond[i]]);

Basically, for this specific data, there are two between-subjects conditions and 3 different possible responses that people can make. To model the effect of the condition on the likelihood of the possible response categories, I make a simplex for each condition and then model them separately based on the condition of each trial, giving me two probability distributions for each response that I can then subtract in R and get the differences for. Although this seems to work fine, I realize I’m not actually modelling the effects of condition on response types in my model: just modelling the parameter values for each condition separately and then comparing them. However, I’m not really sure how I could model the differences in response proportions directly given the I’d have to account for the fact that an increase in one response type would mean a decrease in the proportions in the others. Is what I’m currently doing valid? If not, what would a better approach here be?

Also, I’ve been trying to figure out how I might go about implementing a mixed-effects model for data of this type, i.e. accounting for both within-subject variance and between-subject variance in the model. I’ve looked over the helpful “Hierarchical Partial Pooling for Repeated Binary Trials” tutorial from Bob Carpenter several times, but it cautions that low observation counts per participant can create major problems with hierarchical pooling in models for binary outcomes, and in the case of these experiments we often only end up with 8-20 thought probes per participant (since you end up altering people’s attention during a task if you query them too frequently). Does this mean that hierarchical models might be a worse fit for this type of data than completely pooled ones, or are there just certain things I need to be careful about if I’m going to account for within-subject variance?

Thanks in advance for any advice you have to offer! I’m really excited to delve deeper into Stan.


  • Austin

If you can recover parameters from simulated data, good stuff. Seems reasonable to me.

Usually the trick with link functions is find something that goes from an unconstrained space to the constrained space and just put a linear model in the unconstrained space.

softmax is the thing that takes you from N unconstrained parameters to an N simplex, but I’d be scared if you didn’t do this right you’d end up with an unidentifiability. Something like:

resp ~ categorical(softmax(B * X));

where B is a matrix of parameters and X are your covariates. Maybe the IRT docs would give you some leads: Stan User’s Guide ?

Check the Polytomous links here: . Maybe thumb through the software too to get an idea what folks are working on. The author for GitHub - saudiwin/idealstan: idealstan offers item-response theory (IRT) ideal-point estimation for binary, ordinal, counts and continuous responses with time-varying and missing-data inference. Latent space model also included. Full and approximate Bayesian sampling with 'Stan' ( is pretty active around here. I dunno if your models fit in exactly that framework but at worst you could get some good mileage looking at the idealstan models.

It’s hard to even estimate the parameter of a Bernoulli. A single yes or no just doesn’t say that much at the end of the day.

You can do some hypothetical experiments with a Bernoulli that are pretty revealing. Just write a little R code and try to compute how many shots you’d want to see from an IID basketball player before you’re confident your estimator is within two percentage points or something.

The message is probably that hierarchical will only get you so far. At some point you’ll just be looking at noise and then you really only have the option of looking at completely pooled models.

Resampling might not be such a practical (or possible) thing in reality, but it’s a fine thing to do for simulation studies. If you’re scared of something like this, set up fake data simulations and try to trick yourself.


Disclaimer: My expertise in hierarchical Bayesian models is also intermediate at best.

It looks to me like the response categories do not have an ordering. This is unfortunate from a methodological perspective, since an ordered categorical response is much simpler to model (and the results simpler to interpret) than if no ordering exists.

Your problem seems to be exactly what multinomial regression is for. Confusingly, these models are called by different names in different quarters – sometimes multinomial, sometimes polytomous, sometimes categorical.

Anyway, I think a hierarchical (mixed in frequentist terms) multinomial (categorical in brms terminology) logit or probit model in the correct approach. In the simplest approach, the responses would be a single vector (‘factor’ in R-speak) containing all the nominal-scale responses occurring in the data. Subject id would indeed constitute the group-level (random in frequentist terms) effect(s), and the uh…test conditions/stimuli would be the population-level (fixed in frequentist terms) effects. At least the brms package fits these models just fine. The results of such models are a little complex to interpret due to what you state above, i.e. the dependence of every category probability on every other category probability. The Nobel prize will surely go to whoever can come up with a solution to this. I’m not holding my breath. Meanwhile, third-party libraries exist to aid in the interpretation of these models.

Computational costs decrease if the data are grouped such that every distinct combination of explanatory variable values (including Subject) are grouped so that the multinomial response is a C-column matrix whose C columns represent the counts in each response category for each distinct combination of explanatory variable values. brms can do this for binomial outcomes, but I don’t know if this format is supported for categorical responses yet. Either way, the results for group and population-level effects should be identical in both approaches.

1 Like

@a-hurst You might take a look at some of the multinomial examples here It also has advice for the identifiability issues that arise with multinomial models.