Nominal data and Kruschke's "conditional logistic" approach

The current brms way to model nominal data (family = categorical) uses the softmax approach where

f(y) = \mu_y = \frac{\exp(\eta_y)}{\sum_{k=1}^K \exp(\eta_k)}.
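
For reference, here is that mapping as a minimal R sketch (the helper name softmax is just for illustration):

# Map K linear predictors to K probabilities that sum to 1
softmax <- function(eta) exp(eta) / sum(exp(eta))
softmax(c(0, 0.5, -1, 2))  # four probabilities summing to 1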

I know brms has some ability to fit conditional logistic models, based on GitHub issue #560. I’m wondering if we can use brms to fit a conditional logistic model using Kruschke’s strategy, wherein we “divide the set of outcomes into a hierarchy of two-set divisions, and use a logistic to describe the probability of each branch of the two-set divisions” (2015, p. 655; book link). Kruschke further explained:

The underlying propensity to respond with any outcome in the subset of outcomes S^* relative to (i.e., conditional on) a larger set S is denoted

\lambda_{S^* | S} = \beta_{0, S^* | S} + \beta_{1, S^* | S} x

and the conditional response probability is

\phi_{S^* | S} = \operatorname{logistic}(\lambda_{S^* | S}).

Take the case where you have a K = 4 category nominal variable y. One might divide up the hierarchy of divisions as \{ 1 \} versus \{ 2, 3, 4 \}, then \{ 2 \} versus \{ 3, 4 \}, then \{ 3 \} versus \{ 4 \}, and thus

\begin{align*} \phi_{\{ 1 \} | \{ 1, 2, 3, 4 \}} & = \operatorname{logistic}(\eta_{1 | \{ 1, 2, 3, 4 \}}) \\ \phi_{\{ 2 \} | \{ 2, 3, 4 \}} & = \operatorname{logistic}(\eta_{2 | \{ 2, 3, 4 \}}) \\ \phi_{\{ 3 \} | \{ 3, 4 \}} & = \operatorname{logistic}(\eta_{3 | \{ 3, 4 \}}), \end{align*}

where \eta_k is the k^\text{th} linear predictor, and we can recover the category probabilities from the conditional probabilities because they obey the rule

\begin{align*} \phi_1 & = \phi_{\{ 1 \} | \{ 1, 2, 3, 4 \}} \\ \phi_2 & = \phi_{\{ 2 \} | \{ 2, 3, 4 \}} \cdot (1 - \phi_{\{ 1 \} | \{ 1, 2, 3, 4 \}}) \\ \phi_3 & = \phi_{\{ 3 \} | \{ 3, 4 \}} \cdot (1 - \phi_{\{ 2 \} | \{ 2, 3, 4 \}}) \cdot (1 - \phi_{\{ 1 \} | \{ 1, 2, 3, 4 \}}) \\ \phi_4 & = (1 - \phi_{\{ 3 \} | \{ 3, 4 \}}) \cdot (1 - \phi_{\{ 2 \} | \{ 2, 3, 4 \}}) \cdot (1 - \phi_{\{ 1 \} | \{ 1, 2, 3, 4 \}}). \end{align*}
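
In R, that recovery rule might look something like the following sketch (cond_to_cat is a hypothetical helper; plogis() is R’s logistic function):

# Convert the K - 1 conditional probabilities of this left-branching tree
# into the K category probabilities
cond_to_cat <- function(phi_cond) {
  stay <- cumprod(1 - phi_cond)  # probability of surviving each split so far
  c(phi_cond * c(1, head(stay, -1)), tail(stay, 1))
}
cond_to_cat(plogis(c(-1.6, -0.9, 0)))  # four probabilities summing to 1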

The model block from Kruschke’s example JAGS code for this was

model {
  for ( i in 1:Ntotal ) {
    
    y[i] ~ dcat( mu[1:Nout,i] )
    mu[1,i] <- phi[1,i]
    mu[2,i] <- phi[2,i] * (1-phi[1,i])
    mu[3,i] <- phi[3,i] * (1-phi[2,i]) * (1-phi[1,i])
    mu[4,i] <- (1-phi[3,i]) * (1-phi[2,i]) * (1-phi[1,i]) 
    
    for ( r in 1:(Nout-1) ) {
      phi[r,i] <- ilogit( zbeta0[r] + sum( zbeta[r,1:Nx] * zx[i,1:Nx] ) ) 
    }
  }
  # (priors on zbeta0 and zbeta omitted here)
}

I have fumbled around trying to repurpose the code from issue #560 to make this work, but it’s clear this is beyond my skill set. Who has a solution? If possible, I’d really like to see a simple intercepts-only version of the model. As an example data set, I propose:

dat <- data.frame(Y = rep(1:4, times = c(80, 114, 138, 143)))
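
If it helps, the conditional probabilities and logit-scale intercepts implied by those cell counts can be computed directly (a quick sketch):

n <- c(80, 114, 138, 143)
phi_cond <- sapply(1:3, function(k) n[k] / sum(n[k:4]))
phi_cond          # 0.168 0.289 0.491
qlogis(phi_cond)  # -1.60 -0.90 -0.04: the intercepts a model should recover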

Hi @Solomon, this is a cool idea. I am in a bit of a hurry, but I just wondered whether you have thought about implementing this using a nonlinear formula? It seems like you could do it that way… (Or maybe not; I’ll have to think about it a bit more when I get some time.)

Hey @Matti. Yep, I tried adjusting the non-linear approach explored in GitHub issue #560, but without success. It might well be possible, but my non-linear chops are limited.

One approach for doing so (that is admittedly a bit cumbersome) is using the custom family functionality of brms. The idea is to use the tree structure of the conditional logistic model and treat it like a multinomial processing tree model.

Note also that the Fox & Weisberg books (i.e., the basis of the car package) call this type of analysis “nested dichotomies”. We can therefore use their example data, which has a categorical response with three categories:

library("tidyverse")
library("brms")

womenlf <- carData::Womenlf
womenlf <- womenlf %>% 
  mutate(work2 = factor(partic, 
                        levels = c("not.work", "parttime", "fulltime"))) %>% 
  mutate(work3 = as.numeric(work2))
str(womenlf)
# 'data.frame':	263 obs. of  6 variables:
#  $ partic  : Factor w/ 3 levels "fulltime","not.work",..: 2 2 2 2 2 2 2 1 2 2 ...
#  $ hincome : int  15 13 45 23 19 7 15 7 15 23 ...
#  $ children: Factor w/ 2 levels "absent","present": 2 2 2 2 2 2 2 2 2 2 ...
#  $ region  : Factor w/ 5 levels "Atlantic","BC",..: 3 3 3 3 3 3 3 3 3 3 ...
#  $ work2   : Factor w/ 3 levels "not.work","parttime",..: 1 1 1 1 1 1 1 3 1 1 ...
#  $ work3   : num  1 1 1 1 1 1 1 3 1 1 ...

The dependent variable here is women’s level of participation in the workplace, with three levels. We follow the analysis strategy in Fox and Weisberg (2010, p. 259) and use the factor ordering "not.work", "parttime", "fulltime" (i.e., not.work is the first factor level).

Because we have only three categories, the formula simplifies to:
\begin{align*} \phi_1 & = \textrm{mu} \\ \phi_2 & = (1 - \textrm{mu}) \cdot \textrm{mub} \\ \phi_3 & = (1 - \textrm{mu}) \cdot (1 - \textrm{mub}) \end{align*}

Also note that I have already used the parameter names we will be using in brms, where the first parameter of a custom family is always called mu. We then call the second parameter mub (as we have only three categories, only two independent parameters are needed).
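
As a quick sanity check that these three probabilities always sum to 1, here is a plain-R sketch using plogis() as the inverse logit (the values of mu and mub are arbitrary):

mu  <- 0.3   # logit-scale value for the first split
mub <- -1.2  # logit-scale value for the second split
p1 <- plogis(mu)
p2 <- (1 - plogis(mu)) * plogis(mub)
p3 <- (1 - plogis(mu)) * (1 - plogis(mub))
p1 + p2 + p3  # 1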

We can then set up the model as follows:

mpt_cond <- custom_family("mpt_cond", 
                          links = "identity", dpars = c("mu", "mub"), 
                          type = "int",
                          vars = c("n_cat"))

stanvars <- stanvar(x = 3, name = "n_cat", scode = "  int n_cat;")

stan_lpmf <- stanvar(block = "functions", scode = "real mpt_cond_lpmf(int y, real mu, real mu_b, int n_cat) {
  // transform both linear predictors to the probability scale
  real p_mu = inv_logit(mu);
  real p_mub = inv_logit(mu_b);
  // tree structure: first split category 1 vs. {2, 3}, then 2 vs. 3
  vector[n_cat] prob;
  prob[1] = p_mu;
  prob[2] = (1 - p_mu) * p_mub;
  prob[3] = (1 - p_mu) * (1 - p_mub);
  return(categorical_lpmf(y | prob));
}") 

The key part is the new _lpmf function, which contains the formula given above on the probability scale (i.e., after applying the inverse-logit function to transform from the linear to the probability scale).

We can then specify one formula per parameter and fit the model. We use the same predictors as Fox and Weisberg (2010, p. 265) for these data.

my_formula <- bf(work3 ~ hincome + children + region, 
                 mub ~ hincome + children + region)
fit1 <- brm(my_formula, data = womenlf, family = mpt_cond, 
              stanvars = stanvars + stan_lpmf)

This gives us the following results:

> fit1
 Family: mpt_cond 
  Links: mu = identity; mub = identity 
Formula: work3 ~ hincome + children + region 
         mub ~ hincome + children + region
   Data: womenlf (Number of observations: 263) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -1.31      0.57    -2.43    -0.19 1.00     2262     2570
mub_Intercept          -4.12      1.12    -6.43    -2.02 1.00     2380     2610
hincome                 0.05      0.02     0.01     0.09 1.00     4468     2724
childrenpresent         1.65      0.30     1.06     2.27 1.00     3903     2898
regionBC               -0.36      0.59    -1.52     0.80 1.00     2489     2800
regionOntario          -0.20      0.47    -1.11     0.70 1.00     2231     2611
regionPrairie          -0.49      0.57    -1.65     0.63 1.00     2588     3002
regionQuebec            0.17      0.51    -0.84     1.17 1.00     2340     2740
mub_hincome             0.12      0.04     0.03     0.20 1.00     4090     2784
mub_childrenpresent     3.01      0.60     1.90     4.22 1.00     3654     2695
mub_regionBC            1.28      1.09    -0.72     3.44 1.00     2257     2778
mub_regionOntario       0.16      0.91    -1.62     1.96 1.00     1963     2580
mub_regionPrairie       0.43      1.02    -1.52     2.49 1.00     2517     2910
mub_regionQuebec       -0.17      1.01    -2.18     1.76 1.00     2270     2793

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
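
To aid interpretation: the intercepts are on the logit scale, so plogis() converts them back to conditional probabilities. A small sketch using values from the summary above, for the reference cell (hincome = 0, children absent, region Atlantic):

plogis(-1.31)  # ~0.21:  Pr(not.work)
plogis(-4.12)  # ~0.016: Pr(parttime | works)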

The results for the first dichotomy (level 1 versus 2 & 3) match the results reported in Fox and Weisberg (2010, p. 265) and only differ in sign:
[screenshot of the corresponding Fox and Weisberg estimates]
The results for the second dichotomy (level 2 versus 3), which are indicated by mub_, are a bit off, but still similar:
[screenshot of the corresponding Fox and Weisberg estimates]


Heya 👋

Using sequential ordinal regression (ref. https://doi.org/10.1177/2515245918823199):

library(brms)
library(dplyr)
library(purrr)

dat <- data.frame(Y = ordered(1:4),
                  w = c(80, 114, 138, 143))

fit <- brm(Y | resp_weights(w) ~ 1, 
           family = sratio(), 
           data = dat,
           backend = "cmdstanr")

# True parameters (as conditional odds: the count for category k
# divided by the total count of all later categories)
map_dbl(1:3, ~ dat$w[.x] / sum(dat$w[-(1:.x)]))
#> [1] 0.2025316 0.4056940 0.9650350

# Recovered parameters (exponentiate the median and CI columns by name
# so everything lands on the odds scale)
parameters::model_parameters(fit, test = NULL) |> 
  mutate(across(c(Median, CI_low, CI_high), exp))
#> Parameter    | Median |       95% CI |  Rhat |     ESS
#> ------------------------------------------------------
#> Intercept[1] |   0.20 | [0.16, 0.26] | 1.000 | 3875.00
#> Intercept[2] |   0.41 | [0.32, 0.51] | 1.000 | 3687.00
#> Intercept[3] |   0.97 | [0.77, 1.23] | 0.999 | 4128.00


Does this do the job?

https://kevinstadler.github.io/notes/bayesian-ordinal-regression-with-random-effects-using-brms/


Indeed, that also works for the data from Fox and Weisberg:

library("tidyverse")
library("brms")

womenlf <- carData::Womenlf
womenlf <- womenlf %>% 
  mutate(work2 = factor(partic, 
                        levels = c("not.work", "parttime", "fulltime"))) %>% 
  mutate(work3 = as.numeric(work2))
fit2 <- brm(work3 ~ cs(hincome + children + region), 
            family = sratio(), 
            data = womenlf)

The important bit is the category-specific formula using cs(). This gives results similar to the ones in my previous post (which are, again, also quite similar to the results of Fox and Weisberg).

> fit2
 Family: sratio 
  Links: mu = logit; disc = identity 
Formula: work3 ~ cs(hincome + children + region) 
   Data: womenlf (Number of observations: 263) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]          -1.24      0.56    -2.30    -0.15 1.00     1516     2209
Intercept[2]          -3.58      1.06    -5.87    -1.63 1.00     1536     1915
hincome[1]            -0.05      0.02    -0.09    -0.01 1.00     3038     3036
hincome[2]            -0.10      0.04    -0.19    -0.03 1.00     2976     2763
childrenpresent[1]    -1.64      0.30    -2.23    -1.06 1.00     2844     2842
childrenpresent[2]    -2.84      0.59    -4.04    -1.72 1.00     2711     2554
regionBC[1]            0.40      0.61    -0.74     1.63 1.00     1901     2181
regionBC[2]           -1.01      1.08    -3.25     0.99 1.00     1794     2464
regionOntario[1]       0.24      0.47    -0.66     1.20 1.00     1715     1876
regionOntario[2]       0.09      0.88    -1.74     1.79 1.00     1703     2524
regionPrairie[1]       0.53      0.57    -0.54     1.70 1.00     2025     2205
regionPrairie[2]      -0.15      1.00    -2.15     1.81 1.00     1806     2589
regionQuebec[1]       -0.13      0.51    -1.10     0.92 1.00     1697     2017
regionQuebec[2]        0.41      0.98    -1.55     2.30 1.00     1899     2373

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

@Henrik_Singmann, I can confirm your method works great. I’m not the best at working with custom families. Would you mind sharing the code required for post-processing with functions like fitted() and predict() for models fit with your family = mpt_cond? You know, the code that would follow after executing expose_functions(fit1, vectorize = TRUE). Thanks.

That was more difficult than initially thought. The epred functionality needed a bit of tinkering. As you will see, it is all possible, but it simply replicates the functionality of the sratio() family and does so less flexibly (i.e., it requires a specific custom family for each number of categories). So although it works, the sratio() family is probably the way to go (the custom family does, however, sample quite a bit more efficiently/faster).

To sum up the problem: we need to pass specials = "categorical" to custom_family. This unfortunately renames the parameters, as you can see below. However, it only requires one formula, which is nice.

library("tidyverse")
library("brms")

womenlf <- carData::Womenlf
womenlf <- womenlf %>% 
  mutate(work2 = factor(partic, 
                        levels = c("not.work", "parttime", "fulltime"))) %>% 
  mutate(work3 = as.numeric(work2))
str(womenlf)
# 'data.frame':	263 obs. of  6 variables:
#  $ partic  : Factor w/ 3 levels "fulltime","not.work",..: 2 2 2 2 2 2 2 1 2 2 ...
#  $ hincome : int  15 13 45 23 19 7 15 7 15 23 ...
#  $ children: Factor w/ 2 levels "absent","present": 2 2 2 2 2 2 2 2 2 2 ...
#  $ region  : Factor w/ 5 levels "Atlantic","BC",..: 3 3 3 3 3 3 3 3 3 3 ...
#  $ work2   : Factor w/ 3 levels "not.work","parttime",..: 1 1 1 1 1 1 1 3 1 1 ...
#  $ work3   : num  1 1 1 1 1 1 1 3 1 1 ...

mpt_cond <- custom_family("mpt_cond", 
                          links = "identity", dpars = c("mu", "mub"), 
                          type = "int",
                          vars = c("n_cat"), specials = "categorical")

stanvars <- stanvar(x = 3, name = "n_cat", scode = "  int n_cat;")

stan_lpmf <- stanvar(block = "functions", scode = "
// category probabilities implied by the tree (used for prediction)
vector mpt_cond_pred(int y, real mu, real mu_b, int n_cat) {
  real p_mu = inv_logit(mu);
  real p_mub = inv_logit(mu_b);
  vector[n_cat] prob;
  prob[1] = p_mu;
  prob[2] = (1 - p_mu) * p_mub;
  prob[3] = (1 - p_mu) * (1 - p_mub);
  return(prob);
}
// log-likelihood of a single observation
real mpt_cond_lpmf(int y, real mu, real mu_b, int n_cat) {
  real p_mu = inv_logit(mu);
  real p_mub = inv_logit(mu_b);
  vector[n_cat] prob;
  prob[1] = p_mu;
  prob[2] = (1 - p_mu) * p_mub;
  prob[3] = (1 - p_mu) * (1 - p_mub);
  return(categorical_lpmf(y | prob));
}") 
fit1 <- brm(work3 ~ hincome + children + region, 
            data = womenlf, family = mpt_cond, 
            stanvars = stanvars + stan_lpmf)
fit1

This gives the same results as before:

> fit1
 Family: mpt_cond 
  Links: mu2 = identity; mu3 = identity 
Formula: work3 ~ hincome + children + region 
   Data: womenlf (Number of observations: 263) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu2_Intercept          -1.30      0.55    -2.38    -0.19 1.00     2717     2404
mu3_Intercept          -4.14      1.12    -6.41    -1.98 1.00     2627     2718
mu2_hincome             0.05      0.02     0.01     0.09 1.00     4302     3010
mu2_childrenpresent     1.65      0.31     1.05     2.26 1.00     3853     3037
mu2_regionBC           -0.37      0.60    -1.55     0.82 1.00     2924     2948
mu2_regionOntario      -0.22      0.49    -1.20     0.72 1.00     2513     2610
mu2_regionPrairie      -0.51      0.58    -1.67     0.56 1.00     2960     3018
mu2_regionQuebec        0.16      0.51    -0.85     1.17 1.00     2513     2441
mu3_hincome             0.11      0.04     0.04     0.20 1.00     4019     3098
mu3_childrenpresent     3.01      0.61     1.90     4.29 1.00     3698     2458
mu3_regionBC            1.32      1.10    -0.75     3.55 1.00     2709     2951
mu3_regionOntario       0.20      0.90    -1.54     1.96 1.00     2290     2767
mu3_regionPrairie       0.43      1.02    -1.55     2.49 1.00     2682     2986
mu3_regionQuebec       -0.15      0.99    -2.09     1.84 1.00     2583     2892

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

To get post-processing to work, we need the following:

expose_functions(fit1, vectorize = TRUE)

log_lik_mpt_cond <- function(i, prep) {
  mu <- brms::get_dpar(prep, "mu2", i = i)
  mu_b <- brms::get_dpar(prep, "mu3", i = i)
  n_cat <- prep$data$n_cat
  y <- prep$data$Y[i]
  mpt_cond_lpmf(y, mu, mu_b, n_cat)
}

loo(fit1)
# Computed from 4000 by 263 log-likelihood matrix
# 
#          Estimate   SE
# elpd_loo   -225.7 12.2
# p_loo        16.4  1.7
# looic       451.4 24.3
# ------
# Monte Carlo SE of elpd_loo is 0.1.
# 
# All Pareto k estimates are good (k < 0.5).
# See help('pareto-k-diagnostic') for details.

posterior_predict_mpt_cond <- function(i, prep, ...) {
  mu <- brms::get_dpar(prep, "mu2", i = i)
  mu_b <- brms::get_dpar(prep, "mu3", i = i)
  n_cat <- prep$data$n_cat
  y <- prep$data$Y[i]
  prob <- mpt_cond_pred(y, mu, mu_b, n_cat)
  extraDistr::rcat(length(mu), t(prob))
}

pp_check(fit1, type = "bars")

We can also get the expected predictions:

posterior_epred_mpt_cond <- function(prep) {
  mu <- brms::get_dpar(prep, "mu2")
  mu_b <- brms::get_dpar(prep, "mu3")
  n_cat <- prep$data$n_cat
  y <- prep$data$Y
  prob <- mpt_cond_pred(y = y, mu = mu, mu_b = mu_b, n_cat = n_cat)
  # reshape from (categories x draws*obs) to draws x obs x categories
  dim(prob) <- c(dim(prob)[1], dim(mu))
  prob <- aperm(prob, c(2, 3, 1))
  dimnames(prob) <- list(
    as.character(seq_len(dim(prob)[1])), 
    NULL, 
    as.character(seq_len(dim(prob)[3]))
  )
  prob
}

epred1 <- posterior_epred(fit1)
str(epred1)
 # num [1:4000, 1:263, 1:3] 0.612 0.697 0.731 0.733 0.694 ...
 # - attr(*, "dimnames")=List of 3
 #  ..$ : chr [1:4000] "1" "2" "3" "4" ...
 #  ..$ : NULL
 #  ..$ : chr [1:3] "1" "2" "3"

conditional_effects(fit1, "region", categorical = TRUE)

So as I said, this is pretty much equivalent to the following (which is however slower in sampling):

fit2 <- brm(work3 ~ cs(hincome + children + region), 
            family = sratio(), 
            data = womenlf)
fit2

pp_check(fit2, type = "bars")
conditional_effects(fit2, "region", categorical = TRUE)

@Henrik_Singmann, this is incredibly helpful. Thank you! You’re a boon to the community.


Hey @mattansb, a quirk about using your family = sratio approach is that it reverses the signs of the parameter coefficients (except for the intercepts), relative to @Henrik_Singmann’s custom-family approach. Can you think of an easy way to order the criterion variable’s levels to overcome this?

I don’t think so.

In the conditional logistic model, you’re modeling, for each split:

  • If bX > T_i then C_i
  • else if bX > T_{i+1} then C_{i+1}
  • etc…

While for the sequential ordinal model, you’re essentially modeling the splits the other way around:

  • C_i
  • unless bX > T_i, then C_{i+1}
  • unless bX > T_{i+1}, then C_{i+2}
  • etc…

In other words, in the conditional logistic model you are predicting the probability of not going down the “tree”, whereas in the sequential model you’re predicting the probability of the next category.
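
One way to see the sign flip algebraically (a sketch, assuming brms’s sratio() parameterization, in which the linear predictor enters with a negative sign relative to the thresholds):

\begin{align*} \text{custom family:} \quad \phi_{k | \geq k} & = \operatorname{logistic}(\beta_{0,k} + \beta_k x) \\ \texttt{sratio():} \quad \phi_{k | \geq k} & = \operatorname{logistic}(\tau_k - b_k x), \end{align*}

so the slopes satisfy \beta_k = -b_k, while the intercepts \beta_{0,k} play the role of the thresholds \tau_k — matching the pattern in the two summaries above.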

Thank you for the walk-through, @mattansb. That makes sense. I’m guessing, then, that the sequential ordinal approach also wouldn’t be able to handle a splitting tree that looked like this:

[diagram of an alternative splitting tree]

No, I suspect not (at least not without some horrible data re-shaping…)

Okay. Thank you for the clarification. I think I’ve just about cracked this nut.

The nut is officially cracked. Read all about it in this blog post.


Hi all, this was really fascinating. @Solomon or anyone else, is it possible to generalise the custom families that you came up with to suit an arbitrary number of categories rather than having to rewrite the stanvar block each time you run your model?

Also, how will this handle situations where choice sets differ? In my own case I want to model vote choice in Britain, but the constituent nations that make it up (England, Wales, Scotland) all have different party systems. For instance, a voter in London cannot vote for the Scottish National Party.

The custom family approach we’ve explored so far has pushed me to the edges of my brms competence, so someone else will have to chime in on how it might be generalized.

For situations with differing response sets, my intuition is that as long as you have a predictor variable clarifying who has which response set, you’ll be good. In such a case, I would expect the propensity for any unavailable response to plummet to zero. But since the log-odds scale behaves badly when probabilities get close to zero or one, you might need to go beyond the default priors to keep your chains healthy.

Another option would be fitting separate models for those with different response sets.
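
For what it’s worth, a minimal sketch of that last idea (everything here is hypothetical: the data frame votes and its columns nation and choice):

# Fit one model per nation, so each model only ever sees the parties
# actually on that nation's ballot
fits <- lapply(split(votes, votes$nation), function(d) {
  d$choice <- droplevels(factor(d$choice, ordered = TRUE))
  brm(choice ~ 1, family = sratio(), data = d)
})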