Understanding odd estimates from cumulative probit model

I’ve been fitting two cumulative probit models to some ordinal (Likert-style) response data. Both models include two group-level effects (varying intercepts for respondent ID and question number). The sole difference between them is that in one (let’s call it Model A) the mean of the latent variable has a 2-level categorical population-level predictor, whereas the other (Model B) contains no such predictor. In Model A, the effect of that predictor is estimated to be fairly small. Posterior predictive checks show that both models fit the data quite well, though Model A does a little better than Model B.

What is confusing me is that the estimated SDs of the group-level effects are wildly different across the two models. For Model A, I get the following:

Group-Level Effects: 
~respondent (Number of levels: 196) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.14      0.08     1.00     1.29 1.00     5459     8690

~question (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.52      0.24     0.25     1.13 1.00     6385     9769

By contrast, for Model B, the relevant output is:

Group-Level Effects: 
~respondent (Number of levels: 196) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    11.32      6.69     4.77    28.05 1.00     8215     6370

~question (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.52      1.41     0.70     6.11 1.00     5782     5379

I’m stumped as to what might be leading to this discrepancy, and would appreciate any insight that others who are more familiar with such models might have. (Note that the estimates for the thresholds are a little different too, as shown below.)

The full code for Model A is as follows:

library(brms)

# Model A: cumulative probit with a 2-level population-level predictor (group)
groupModel <- bf(likertResp ~ group + (1 | respondent) + (1 | question),
                 family = cumulative("probit"))
groupPriors <- c(prior(normal(0, 0.5), class = "b"))
group <- brm(groupModel, family = cumulative("probit"),
             data = data,
             iter = 10000, control = list(adapt_delta = .99),
             prior = groupPriors, init = 0, save_pars = save_pars(all = TRUE),
             sample_prior = "yes")

This produces the following output:

 Family: cumulative 
  Links: mu = probit; disc = identity 
Formula: likertResp ~ group + (1 | respondent) + (1 | question) 
   Data: data (Number of observations: 1176) 
  Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
         total post-warmup draws = 20000

Group-Level Effects: 
~respondent (Number of levels: 196) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.14      0.08     1.00     1.29 1.00     5459     8690

~question (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.52      0.24     0.25     1.13 1.00     6385     9769

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]     -1.86      0.26    -2.39    -1.36 1.00     5701     8647
Intercept[2]     -0.67      0.26    -1.19    -0.18 1.00     5644     8824
Intercept[3]      0.09      0.26    -0.42     0.59 1.00     5725     9061
Intercept[4]      0.62      0.26     0.11     1.12 1.00     5763     8835
Intercept[5]      1.66      0.26     1.14     2.17 1.00     5875     8887
Intercept[6]      2.81      0.27     2.27     3.35 1.00     6542     9518
groupY           -0.14      0.16    -0.46     0.19 1.00     4445     8052

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The code for Model B is:

# Model B: intercept-only (no population-level group predictor)
noGroupModel <- bf(likertResp ~ 1 + (1 | respondent) + (1 | question),
                   family = cumulative("probit"))
noGroup <- brm(noGroupModel, family = cumulative("probit"),
               data = data,
               iter = 10000, control = list(adapt_delta = .99),
               init = 0, save_pars = save_pars(all = TRUE),
               sample_prior = "yes")

It produces the following output:

 Family: cumulative 
  Links: mu = probit; disc = identity 
Formula: likertResp ~ 1 + (1 | respondent) + (1 | question) 
   Data: data (Number of observations: 1176) 
  Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
         total post-warmup draws = 20000

Group-Level Effects: 
~respondent (Number of levels: 196) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    11.32      6.69     4.77    28.05 1.00     8215     6370

~question (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.52      1.41     0.70     6.11 1.00     5782     5379

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -0.73      0.06    -0.84    -0.62 1.00    12418    14704
Intercept[2]     0.10      0.05     0.00     0.20 1.00    11021    14942
Intercept[3]     0.68      0.05     0.59     0.78 1.00    11054    14563
Intercept[4]     1.10      0.05     1.01     1.20 1.00    11587    15185
Intercept[5]     1.93      0.06     1.82     2.04 1.00    15529    16239
Intercept[6]     2.84      0.09     2.67     3.02 1.00    25097    16032

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Here is the result of pp_check() (1000 samples) for the first model:

And here is the result for the second:

Thanks in advance for your assistance.

Have you looked at pairs plots for group, the SDs of the group-level effects, and the thresholds (intercepts)? Off the top of my head and without the data, I can’t think of why this would be, but there are clearly shifts in where the variation ends up (in the thresholds vs. the varying intercepts) when group is added vs. not. What is group?

Thanks for your response. I hadn’t looked at the pairs plots, but having now checked them for both models: In Model A (with the effect of group) the only parameters that are obviously related are the intercepts (unsurprisingly). In Model B, the intercepts are also correlated, but much more weakly for some reason. For instance, the pairs plot for Intercepts 4 and 3 from Model B looks like this:

By contrast, the analogous plot from Model A looks like this:

The “group” variable is an experimental condition to which the respondents were randomly assigned prior to completing the response measure.

I think what you are actually seeing is the difference in scale on the x and y axes of the two plots. The SEs (as well as the values) for Intercepts 3 and 4 are much larger in Model A than in Model B, which is reflected in the axis ranges on the plot and makes the grouping look tighter for Model A.

But nothing related to group and the SDs of the varying intercepts?
Have you looked at the actual values of the varying intercepts with ranef(model) to see if they provide some clue? As I understand it, these are offsets subtracted from the various thresholds (the population-level Intercepts): the same offset is applied to every threshold, rather than a different varying offset for each threshold. Somehow, adding group seems to shift the variation from the individuals’ offsets to the thresholds…that’s why I wondered whether group was correlated with something. But, yeah, not sure. Maybe someone smarter and/or more experienced with ordinal models will chime in.
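
For example, something like the following would pull those offsets out for comparison (a minimal sketch; it assumes group and noGroup are the fitted brmsfit objects from the code above):

# Posterior means of the per-respondent intercept offsets in each model
re_A <- ranef(group)$respondent[, "Estimate", "Intercept"]
re_B <- ranef(noGroup)$respondent[, "Estimate", "Intercept"]

# Compare their spread side by side
par(mfrow = c(1, 2))
hist(re_A, main = "Model A respondent offsets", xlab = "Offset")
hist(re_B, main = "Model B respondent offsets", xlab = "Offset")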

There is definitely a difference in scale, but the correlation coefficients differ too (~.99 for those two parameters in Model A vs. ~.80 in Model B). Whether that is relevant to the odd estimates, though, I haven’t the faintest idea.

The varying intercepts are much wilder in Model B, particularly for the respondent effect. For example, here is the histogram of the respondent intercepts for Model A:

And here is the plot for Model B:

Actually, looking at these plots makes me wonder if I just need a stronger prior on the SDs of the group-level effects, rather than accepting the default (which is student_t(3, 0, 2.5)).

As an aside, I have a second data set from an experiment very similar to this one, where the analogous models don’t have this weird discrepancy. So, it seems to be a unique feature of these particular data.

Yeah, that is just wacky looking. I don’t think a stronger prior will solve your problem here (it will shrink the SD, but something weird is going on). The plot for Model A is how I would expect the distribution to look; the one for B is not, even allowing for a higher SD.
@Solomon @simonbrauer @saudiwin may have ideas, as they may have more experience with ordinal models

Hmm, what does the output look like for pp_check(group, type = "bars_grouped", group = "group"), where group is the name of your model with the group predictor variable?

I am also not sure why this is happening. One of the challenges of hierarchical ordinal models is that if someone always reports the highest or lowest value, the likelihood keeps improving as their intercept moves towards positive or negative infinity. Shrinkage from the hierarchical prior should keep this in check, and in your case, if this were the explanation, it seems like it should have occurred in both models. But I have seen it happen in some cases.

I’m just going to throw out some ideas that might reveal some relevant feature of the model results. Hopefully at least one of them identifies something useful.

  1. Check the variability within and between respondents and questions. Are there many cases where a respondent always gives the highest or lowest response?
  2. Compare the predicted probabilities between the two models for some subset of observations. Are they essentially doing the same thing?
  3. Try constraining the SD terms in Model B to the same values as in Model A (1.14 and 0.52); see the sketch after this list. Do you get essentially the same predictions and model fit as when you let them be estimated from the data?
  4. What do the sample distributions of the outcome look like for the two groups? Is there some substantial difference in overlap?
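
For point 3, here is a rough sketch of one way to impose that constraint in brms (reusing the model and data objects from above; constant() fixes a parameter at a given value):

# Fix the group-level SDs in Model B at Model A's point estimates
constrainedPriors <- c(
  prior(constant(1.14), class = sd, group = respondent),
  prior(constant(0.52), class = sd, group = question))

noGroupConstrained <- brm(noGroupModel,
                          data = data,
                          prior = constrainedPriors,
                          iter = 10000, control = list(adapt_delta = .99),
                          init = 0)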

So, contrary to all expectations (mine included), it turns out that setting a stronger (though not super-restrictive) prior really does help. Replacing the default student_t(3, 0, 2.5) prior on the group-level SDs with normal(0, 1) in the model without the group effect suddenly brings the estimates of all the parameters much closer to those from the model with the effect:

 Family: cumulative 
  Links: mu = probit; disc = identity 
Formula: likertResp ~ 1 + (1 | participant) + (1 | whichQuestion) 
   Data: subset(data, data$expPart == "postTest") (Number of observations: 1176) 
  Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
         total post-warmup draws = 20000

Group-Level Effects: 
~participant (Number of levels: 196) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.13      0.08     0.99     1.29 1.00     4507     9063

~whichQuestion (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.49      0.20     0.24     1.01 1.00     5776    10023

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -1.79      0.23    -2.27    -1.34 1.00     4055     6766
Intercept[2]    -0.60      0.23    -1.07    -0.16 1.00     4043     6655
Intercept[3]     0.16      0.23    -0.30     0.60 1.00     4087     6854
Intercept[4]     0.69      0.23     0.23     1.14 1.00     4117     7220
Intercept[5]     1.73      0.23     1.26     2.18 1.00     4322     7383
Intercept[6]     2.88      0.25     2.39     3.36 1.00     4950     8637

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
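
For reference, the only change relative to the original Model B fit is the prior on the group-level SDs, i.e. something along these lines (a sketch, reusing the simplified object names from the code above):

sdPrior <- prior(normal(0, 1), class = sd)

noGroupNarrow <- brm(noGroupModel,
                     data = data,
                     prior = sdPrior,
                     iter = 10000, control = list(adapt_delta = .99),
                     init = 0, save_pars = save_pars(all = TRUE),
                     sample_prior = "yes")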

The posterior predictives look no worse either; if anything, they look better than with the original prior:

The only thing bugging me is that I still don’t really understand where the difference came from in the first place.

@Solomon: Like this:

Interesting. Obviously the default is pretty wide, but I don’t see why the default prior works fine for Model A but not for Model B…that’s the part I would like to know. I would have expected similar behavior regarding the prior for both models.

Although I am surprised it made that big of a difference!

Yes, thank you. I was curious if the two distributions were drastically different in some way. It looks like they aren’t.

Yeah, I was starting to wonder if the problem was your reliance on the brms default priors, which are very generous in models like this. In my experience, a little prior information can go a long way for upper-level variance parameters and ordinal thresholds. Speaking of which, you might want to specify non-default priors for those thresholds. If you wanted to take a null approach in which each of the 7 categories is assumed equally probable, you could derive their mean values like this:

library(tidyverse)

tibble(rating = 1:7) %>% 
  mutate(proportion = 1/7) %>% 
  mutate(cumulative_proportion = cumsum(proportion)) %>% 
  mutate(right_hand_threshold = qnorm(cumulative_proportion))
# A tibble: 7 × 4
  rating proportion cumulative_proportion right_hand_threshold
   <int>      <dbl>                 <dbl>                <dbl>
1      1      0.143                 0.143               -1.07 
2      2      0.143                 0.286               -0.566
3      3      0.143                 0.429               -0.180
4      4      0.143                 0.571                0.180
5      5      0.143                 0.714                0.566
6      6      0.143                 0.857                1.07 
7      7      0.143                 1                  Inf  

And thus, you could add in your threshold priors with something like this:

my_priors <- c(
  prior(normal(-1.0675705, 1), class = Intercept, coef = 1),
  prior(normal(-0.5659488, 1), class = Intercept, coef = 2),
  prior(normal(-0.1800124, 1), class = Intercept, coef = 3),
  prior(normal( 0.1800124, 1), class = Intercept, coef = 4),
  prior(normal( 0.5659488, 1), class = Intercept, coef = 5),
  prior(normal( 1.0675705, 1), class = Intercept, coef = 6),
  
  prior(normal(0, 1), class = sd),
  
  prior(normal(0, 0.5), class = b))
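
These would then be passed to brm() via the prior argument, e.g. (a sketch for the model with the group predictor; drop the class = b line for the intercept-only model, since it has no population-level coefficients):

groupInformed <- brm(groupModel,
                     data = data,
                     prior = my_priors,
                     iter = 10000, control = list(adapt_delta = .99),
                     init = 0, sample_prior = "yes")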

Thanks, I appreciate it. Though my practical problem is “solved”, I’ve tried to follow your suggestions anyway, in the hope that doing so will make it clearer to me (or someone else) where the issue comes from.

There are three respondents who always responded with 1 (the minimum response), and none who always responded with 7 (the maximum). There is a fair bit of variability across respondents; in fact, if I treat the responses as numeric and average them for each respondent, the resulting histogram looks almost uniform (with a few bumps around the middle-range values, which I guess reflects those who always selected 3, 4, or 5).

The responses to the individual questions are all within a fairly narrow range (“average” responses between 3.1 and 4.2), so nothing really sticks out to me there.

I’m afraid I’m not completely sure what you mean by this (or possibly just not sure how to go about it). For instance, are you suggesting using posterior_predict() to generate data for (e.g.) a subset of the respondents?

I think this is partly answered by the result of changing the SD prior, which led to changes in the estimates of both group-level effects as well as some changes to the overall model fit. As an aside, when comparing the models with loo(), the default prior on Model B gives a strong advantage to Model A (e.g., elpd_diff = 170 with se_diff = 17), whereas with the altered prior Model B is now marginally better than Model A. So the outcome does change, and for the better, with a narrower prior on the SDs.

Not as far as I can see (see my earlier post, which didn’t exist when you posted).

Hi @ps2, glad to hear you’ve solved it, although the mystery remains. These models always end up being funky. Have you tried just a logit link? That tends to work better and usually samples faster too. It would also be an “easier” solution (if it works), and you can roughly rescale the coefficients as x/1.6 to put them on the scale of the normal distribution.
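
For concreteness, that would look something like this (a sketch reusing the no-group formula from above; the 1.6 divisor is only a rough logit-to-probit conversion):

logitModel <- bf(likertResp ~ 1 + (1 | respondent) + (1 | question),
                 family = cumulative("logit"))

noGroupLogit <- brm(logitModel,
                    data = data,
                    iter = 10000, control = list(adapt_delta = .99),
                    init = 0, sample_prior = "yes")

# Roughly convert logit-scale estimates to the probit (latent-normal) scale
fixef(noGroupLogit) / 1.6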

I like this idea, and now that you mention it, it seems an obviously superior approach to using the defaults; I just hadn’t thought of it previously (which perhaps shows my lack of experience with ordinal models).

For interest’s sake, I tried changing the intercept priors as you suggested, while leaving the SD prior at its default, to see whether that alone was enough to convince the model to behave better. It was!

Glad you found a solution to the unexpected behavior.

By “predicted probabilities,” I meant the probabilities of each level of the outcome for a specific respondent and question. Posterior predictive checks are, in some ways, trivial for discrete models like this, because you can always produce a perfect fit to the marginal distribution with a completely empty model (i.e., with thresholds alone). The predicted probabilities might reveal some divergence between the two models.
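
In brms, something like this would pull those probabilities out (a sketch; "R001", "Q1", and "Y" are placeholders for actual levels of respondent, question, and group in your data):

new_obs <- data.frame(respondent = "R001", question = "Q1", group = "Y")

# posterior_epred() returns a draws x observations x categories array of
# probabilities; the unused group column is simply ignored by Model B
probs_A <- posterior_epred(group, newdata = new_obs)
probs_B <- posterior_epred(noGroup, newdata = new_obs)

# Posterior mean probability of each response category under each model
round(apply(probs_A, 3, mean), 3)
round(apply(probs_B, 3, mean), 3)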

@matti You’re right: Using a logit instead of probit link leads to much more sensible estimates without changing any priors from the defaults.

Cool! I think the logit link in combination with @Solomon’s suggestion (but using qlogis() rather than qnorm()) should be pretty robust while keeping the priors only weakly informative.
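
That is, the same equal-probability derivation as before, just on the logit scale (a sketch):

tibble(rating = 1:7) %>% 
  mutate(proportion = 1/7) %>% 
  mutate(cumulative_proportion = cumsum(proportion)) %>% 
  mutate(right_hand_threshold = qlogis(cumulative_proportion))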
