Deviance, Priors, and Fitted Values of a brms Multinomial Mixed Model

I’m new both to this forum and Bayesianism. And a humanities student.

I’ve used only non-Bayesian GL(M)Ms before and have now had to venture into Bayesian territory, after failing to find R packages capable of fitting frequentist multinomial GLMMs. After I fit my first Bayesian model with brm(), several problems arose immediately:

  1. I cannot find Deviance reported anywhere. I do see a function named log_lik(), but it returns a huge matrix rather than one statistic. Since its dimensions are 4000-by-N, I assume they are the observation-specific, “point-wise” log-likelihoods calculated for each MC sample. If so, does this mean Deviance is simply:
[1] 4172.319

If not, then how exactly does one compute good ole Deviance for this kind of model?

  1. I can’t access the fitted values. Both fitted(fit) and predict(fit, newdata = NULL, re_formula = NULL) report:
Error in Summary.factor(c(2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 4L, 4L,  : 
  ‘max’ not meaningful for factors


  1. I don’t know how to access residuals of any kind – neither Pearson nor Deviance ones, neither regular nor standardized. And I can’t even begin to calculate them on my own without access to the fitted values.
  2. Trying to obtain the model’s WAIC() results in a warning:
> waic(fit)

Computed from 4000 by 2321 log-likelihood matrix

          Estimate   SE
elpd_waic  -2144.1 38.6
p_waic       109.5  4.1
waic        4288.1 77.2
Warning message: 32 (1.4%) p_waic estimates greater than 0.4. We recommend trying loo instead.

Can this test statistic still be used despite the warning? When I do what it says and try loo() instead, I get an error instead of a warning:

> loo(fit, fit2, compare = TRUE, reloo = TRUE)
Error: At least three response categories are required.

What’s going on? There are 4 response categories.

Lastly, there are three priors whose purpose I don’t understand:

> prior_summary(fit)
                 prior     class                   coef   group resp         dpar nlpar bound
1       normal(0, 2.5)         b                                     muindicative            
2                              b actualized1_actualized              muindicative            
3                              b                adjPoly              muindicative            
58                     Intercept                                     muindicative            
59                     Intercept                                          muother            
60                     Intercept                                         mushould            
61 student_t(3, 0, 10)        sd                                     muindicative            
62 student_t(3, 0, 10)        sd                                          muother            
63 student_t(3, 0, 10)        sd                                         mushould            
64        cauchy(0, 3)        sd                        trigger      muindicative            
65                            sd              Intercept trigger      muindicative            
66        cauchy(0, 3)        sd                        trigger           muother            
67                            sd              Intercept trigger           muother            
68        cauchy(0, 3)        sd                        trigger          mushould            
69                            sd              Intercept trigger          mushould             

What are these student priors and what do they do? I deliberately set a fairly skeptical prior on the betas (the normal one) and group-level effects (cauchy), and I deliberately refrained from setting any priors on the fixed intercepts. Yet there appear to be default student priors on the fixed intercepts? It doesn’t say “Intercept” in the class column, but the dpar column leads me to suspect that these priors have something to do with the intercepts because the three non-baseline response categories are named. The helpfile for set_prior() doesn’t seem to say anything about ‘class sd’ priors for fixed intercepts, nor have I ever heard of such a thing before, so I’m confused.

Here’s the code I used to fit the model:

priors <- prior(
  normal(0, 2.5), class = b) + #prior for betas
  prior(cauchy(0, 3), class = sd, group = "trigger", dpar = "muindicative") + #group-level prior, response category 2
  prior(cauchy(0, 3), class = sd, group = "trigger", dpar = "mushould") + #group-level prior, response category 3
  prior(cauchy(0, 3), class = sd, group = "trigger", dpar = "muother") #group-level prior, response category 4
fit <- brm(dep.var ~ (1|trigger) + syn.type + sem.stat + tense + ISCmod + should.merge + sub.verb + time.period + polarity + broadsheet2 + that + actualized + verbPoly + adjPoly, data = testdata, family = categorical(link = "logit"), prior = priors) 

Congrats on getting this far; there is a long way to go. It would help us a bit if you use the brms category when asking questions that focus on brms.

You don’t really want the in-sample deviance the way frequentists do. The WAIC (or better the LOOIC) is an estimate of the expected deviance of future data, which is conceptually what a Bayesian should want when comparing models.

Here I think you want posterior_linpred(fit, transform = TRUE), which will give you a matrix of posterior conditional expectations for each observation. You can subtract these from the raw counts to obtain residuals in a sense.

No. Indeed, you should not use waic; just use loo because they are asymptotically equivalent and loo has better diagnostics. I don’t know why loo is not working in your case, but first try it with just one model at a time, like


and see what happens.

I believe the student_t(3, 0, 10) priors pertain to the standard deviation of the intercepts across levels of trigger for each outcome category. These are default priors on group-level standard deviations in brms and are very wide, since there are only 3 degrees of freedom, a scale of 10, and a truncation at zero. So, under that prior there is about a 5% chance that the standard deviation in the intercepts across groups is more than 32.

1 Like

The error ‘max’ not meaningful for factors has been fixed in the github version of brms to be installed via

if (!requireNamespace("devtools")) {
1 Like

Thanks guys. I did the update and have access to fitted values now.

Sounds like a prior for the standard deviation of the varying intercepts. There’s only one group-level (random) effect, i.e. the varying intercepts across levels of ‘trigger’. But I thought my code specifically set that to Cauchy with mean 0 and scale 3 (see 1st post). This is also reported by prior_summary(fit) (see 1st post). It’s as if there are TWO sets of priors for the varying intercept: one that I specified and a default one, when there should only be one. And I don’t know which of the two priors has actually been applied to the varying intercepts.

While those are doubtless far superior diagnostics of model trueness, there are reasons why I want access to plain old deviance as well.

A) Just to be sure everything is working correctly, I need to verify that a brms single-level (fixed-effects) model with flat priors has the same deviance as a corresponding fixed-effects model that was fit using frequentist software.

B) I want to be able to analyze the significance of individual covariates through likelihood-ratio tests of nested models, which is done by comparing deviances.

C) Deviance doesn’t require expensive computations to calculate, so imho it has value as a quick-and-dirty initial heuristic.

So, does the sum of the the point-wise mean LLs calculated from the log_lik(fit) matrix equal the deviance?

Here’s what happens:

> leave1oat <- loo(fit)
Warning message:
Found 1 observations with a pareto_k > 0.7 in model 'fit'. It is recommended to set 'reloo = TRUE' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly. 

> lootest <- loo(fit, reloo = TRUE)
1 problematic observation(s) found.
The model will be refit 1 times.

Fitting model 1 out of 1 (leaving out observation 1106)
Start sampling
Gradient evaluation took 0.005 seconds
1000 transitions using 10 leapfrog steps per transition would take 50 seconds.
Adjust your expectations accordingly!

 Elapsed Time: 417.01 seconds (Warm-up)
               483.011 seconds (Sampling)
               900.021 seconds (Total)

Gradient evaluation took 0.005 seconds
1000 transitions using 10 leapfrog steps per transition would take 50 seconds.
Adjust your expectations accordingly!

 Elapsed Time: 419.207 seconds (Warm-up)
               402.659 seconds (Sampling)
               821.866 seconds (Total)

Gradient evaluation took 0.004 seconds
1000 transitions using 10 leapfrog steps per transition would take 40 seconds.
Adjust your expectations accordingly!

 Elapsed Time: 433.343 seconds (Warm-up)
               430.745 seconds (Sampling)
               864.088 seconds (Total)

Gradient evaluation took 0 seconds
1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Adjust your expectations accordingly!

 Elapsed Time: 400.988 seconds (Warm-up)
               380.321 seconds (Sampling)
               781.309 seconds (Total)

Error: At least three response categories are required.

Despite the warning and despite the error message, a new object seems to have appeared in the workspace in both cases. They report identical results, so reloo = TRUE seems to have had no effect:

> leave1oat$estimates
           Estimate        SE
elpd_loo -2145.2173 38.654766
p_loo      110.6341  4.271084
looic     4290.4346 77.309533

> lootest$estimates
           Estimate        SE
elpd_loo -2145.2173 38.654766
p_loo      110.6341  4.271084
looic     4290.4346 77.309533

Are these to be trusted given the warning, the error message, and the fact that they show identical results despite a supposedly different procedure?

Let me check why the error At least three response categories are required. occurs? On question to that: Are all four response categories present in more than one response?

With regard to the prior summary: brms way to specifiying priors is hierarchical. If the coefficients specfic priors are specified, those are used. If not, check if there are group specific priors (as in your case) and use them. If not check if there are global priors (those student-t ones). Since you already speficy group-specific priors, the global priors will be ignored.

1 Like

With regard, to the error At least three response categories are required to you perhaps have a reproducible example for me?

update: reloo should now work with categorical models in the github version of brms.

1 Like

The row-wise sums of the log_lik matrix multiplied by -2 plus 2 times the log-likelihood of a saturated model are the posterior deviances. Now, think about why that goes against your reasons for wanting to calculate it.

1 Like

Got it, thank you.

It does. Many thanks, sir.

I might get it now (fingers crossed). Perhaps that’s the wrong place to look for the in-sample Deviance.

If I took the fitted probabilities F.mat (an N-by-4 matrix since there’s 4 response categories), then represented the observed responses by Y.mat, an N-by-4 matrix of 0s and 1s (every row having one 1 for the response category that occurred for that observation), then could the point-wise log-likelihoods be obtained by taking the log of the binomial probability density for each point in Y.mat, with sample size 1 and p = the corresponding element in F.mat? In R code I guess it would be something like dbinom(Y.mat, size = 1, prob = as.vector(t(F.mat)), log = TRUE). Would minus twice the sum of those log-likelihoods then be the in-sample Deviance?

The deviance is defined relative to a saturated model. If you have two deviances relative to the same saturated model, then the differences between the two deviances is the same irrespective of how well the saturated model fits.

In this case, you don’t have a saturated model, it would be computationally expensive to fit one and hard to get it to converge, at best you would have a value of the in-sample deviance for every iteration and every chain (none of which are going to exactly correspond to the single deviance produced by a frequentist estimator), and using them to decide which model is preferable is a worse procedure than using something like LOOIC that estimates the out-of-sample deviance in a way where you can very the assumptions it is making.

I’ve been firmly under the impression that that the term saturated model refers to the data itself, i.e., is a way of conceptualizing the data as if it were a model with N parameters – a kind of convenient fiction enabling us to assess how real models fit. If the saturated model is the data itself, then I don’t easily see how there could be several different saturated models for a given dataset, nor how a saturated model would need to be “fit”. Hosmer et al. (2013: 12) say:

In logistic regression, comparison of observed to predicted values is based on the log-likelihood function defined in equation (1.4). To better understand this comparison, it is helpful conceptually to think of an observed value of the response variable as also being a predicted value resulting from a saturated model. A saturated model is one that contains as many parameters as there are data points.


The comparison of observed to predicted values using the likelihood function is based on the following expression:

\begin{gather*} D = -2 ln \Bigg[\frac{(likelihood~of~the~fitted~model)}{(likelihood~of~the~saturated~model)}\Bigg] \end{gather*} (1.9)

Specifically, it follows from the definition of a saturated model that \hat{\pi_i} = y_i and the likelihood is

\begin{gather*} l(saturated~model) = \prod_{i=1}^{n}y_i^{y_i} \times (1 - y_i)^{(1 - y_i)} = 1.0. \end{gather*}

Thus it follows from equation (1.9) that the deviance is

\begin{gather*} D = -2ln(likelihood~of~the~fitted~model) \end{gather*}

They seem to be saying that the since the saturated model has perfect fit, resulting in a log-likelihood of 0, it can be ignored – and the in-sample deviance is just twice the negative log-likelihood of the model of interest. That’s the quantity whose manual calculation I’m trying to figure out here.

I’ve no doubt that LOOIC and WAIC are better diagnostics. But they have their own complications in my case – WAIC complains about high p_waic estimates and LOOIC takes an hour to compute (at least with reloo = TRUE, which it’s telling me to use). Also, comparing deviances of nested models is a computationally cheap way to assess the significance of individual predictors. And I’d like to have recourse to a diagnostic that I’m familiar with, just to feel a bit less like a fish on dry ground amidst all this arcane Bayes stuff.

Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Hoboken, N.J.: Wiley.

If WAIC and LOO computations give warnings, then it means that also deviance is not reliable measure and it doesn’t matter how cheap it is to compute. It is likely to be over-optimistic favoring too complex models and inflated estimates for the significance of individual predictors. Deviance is fine only when WAIC and LOO computations do not give warnings and the estimated effective number of parameters is much smaller than the number of observations.

It’s fine to compute that for the comparison and learning, but given that you got warnings don’t rely on it for any decision making or publication.

1 Like

I think I’ve now managed to write a convenience function that calculates the in-sample deviance for brms multinomial logit models fit to ungrouped data. Based on the formula seen in Hosmer et al (2013: 271), each point-wise log-likelihood is calculated as:

\begin{gather*} l(y_i) = \bigg(\sum_{j=2}^{J}y_{ji}g_{ji}(x_i)\bigg) - ln\bigg[1 + \sum_{j=2}^Jexp[g_{ji}(x_i)]\bigg] \end{gather*}

where j_2...J are the non-baseline categories of the response, and g_{ji}(x_i) is the linear predictor for a response in category j, as opposed to the baseline category, for observation i. Deviance is calculated as minus twice the sum of these point-wise log-likelihoods.

Some headache was caused by the fact that the fitted values produced by predict(fit, newdata = NULL, re_formula = NULL) rounds very small fitted probabilities down to zero, causing the final output to be NaN. I had to make the function detect fitted probabilities of zero and add 0.000000000001 to them.

Got it. Thanks.

Unless someone points out a fresh blunder of mine, then I guess the problems listed in the first post have now been solved.

Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Hoboken, N.J.: Wiley.

Paul: I think I found a little error:

> summary(fit)
Family: categorical 
  Links: muindicative = identity; mushould = identity; muother = identity 

The formula specified family = categorical(link = “logit”). I guess this must be an error in the summary output. I don’t see how an identity link could produce coefficients outside the [0,1] range while still constraining all fitted values to that interval.

Indeed this is just a bug in the summary output. When fitting models with the latest version of brms, I see the logit link to be displayed correctly.

(I figured I shouldn’t start a new thread since this stuff, although new, is very much related to the topic)

New question about the fitted values of this multinomial multilevel model (one group-level intercept). Calculating the fitted values with predict(mod, newdata = NULL, re_formula = NA) should “predict” the training data by using population-level coefficients only, correct? When I do that, I get slightly different fitted values than when I use those same population-level coefficients to calculate the fitted values manually. I think the discrepancy MIGHT be due to rounding, but I’m not number-savvy enough to say for sure.

Here’s the code and output:

> #Auto-calculated fitted values:
> <- predict(mod, newdata = NULL, re_formula = NA)
> #Extract the population-level coefficients as a matrix (using a homemade function):
> coef.matrix <- fixedmult(mod)[,,1] 
           (Intercept) syn.type0_verb syn.typeCCadj syn.typeNPCAdj syn.typeConjunction syn.typeWR sem.stat1_polysemous
indicative  0.22632285     -0.5309613      3.435623      1.3641014           3.3795838  -2.650339            0.5472354
should     -0.06570302     -0.5044594      0.955685      1.5902828          -1.7527169  -2.739128            0.8833987
other      -0.40097588     -0.2412527     -2.119067      0.6757036           0.7860911   1.430628            1.3281295
           tense1_past    ISCmod should.merge sub.verbBePassAux sub.verbBe_ActiveC time.period1_2017 polarity1_Negated
indicative  -1.4429293 2.4356208   -0.8170758       -0.86914042          0.2660553         0.4715617          1.930942
should      -0.1705327 0.1519158    0.4060827        0.07695408          1.1884271        -0.4550236          2.053665
other       -0.3566856 0.0830840   -1.1012577       -0.50384902          0.8481126         0.3844860          2.117250
           broadsheet2 that1_omitted actualized1_actualized verbPoly    adjPoly
indicative  -0.8335300    -0.2507014             -0.1674223 1.099798 -0.7782793
should      -0.2630681    -0.6726740             -0.6079875 1.116736 -1.4259344
other       -0.3408224    -0.4919738             -0.9874672 1.397991 -0.9312663

The lower and upper bound of the 95% confidence interval are returned if the result is assigned to an object.
> #Add a row of zeros for the reference outcome:
> coef.matrix <- rbind(0, coef.matrix) 
> #Extract the design matrix, WITHOUT the group-level effect:
> designMatrix <- model.matrix(
+   formula(paste0(names(mod$data)[1],"~", paste0(names(mod$data)[2:(ncol(mod$data)-1)], collapse = "+"))), 
+   data = mod$data) 
> #Manually calculate fitted values:
> fitted.manual <- sapply( #Manually calculate the same fitted values
+   1:ncol(, function(x){
+     exp(designMatrix %*% coef.matrix[x,]) / 
+       (exp(designMatrix %*% coef.matrix[1,]) + 
+         exp(designMatrix %*% coef.matrix[2,]) + 
+         exp(designMatrix %*% coef.matrix[3,]) + 
+         exp(designMatrix %*% coef.matrix[4,]))})
> #Auto-fitted values look like this:
> head( #Rounded to 5 decimals
     P(Y = 1) P(Y = 2) P(Y = 3) P(Y = 4)
[1,]  0.20100  0.07275  0.58025  0.14600
[2,]  0.14650  0.15775  0.44575  0.25000
[3,]  0.41850  0.16800  0.21200  0.20150
[4,]  0.01725  0.25200  0.33900  0.39175
[5,]  0.39325  0.02175  0.50475  0.08025
[6,]  0.29875  0.34400  0.11550  0.24175
>#Manually fitted values look like this:
> head(fitted.manual) 
           [,1]       [,2]      [,3]       [,4]
[1,] 0.20350966 0.06876404 0.5859579 0.14176841
[2,] 0.14893410 0.15480483 0.4633910 0.23287007
[3,] 0.43821326 0.16043996 0.2161268 0.18522003
[4,] 0.01474427 0.25401339 0.3487195 0.38252283
[5,] 0.40802951 0.01954921 0.4977440 0.07467726
[6,] 0.30182416 0.35665460 0.1082733 0.23324794
> #Overall mean difference seems 0 however:
> summary(as.vector( - as.vector(fitted.manual)))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.128631 -0.006333  0.002295  0.000000  0.008634  0.086163 

Are these differences just due to rounding or is something more sinister going on?

Predict really predicts new values of the response. You see that when running predict(..., summary = FALSE). Thus, the difference you see are likely just random error in the predictions. You may also want to compare you values with the output of fitted().

1 Like

Oh, so I was using the wrong function *slaps self*. Trying again:

> <- matrix(fitted(mod, re_formula = NA)[,1,], ncol = 4)
> head(
          [,1]       [,2]      [,3]       [,4]
[1,] 0.2023560 0.07204305 0.5712995 0.15430143
[2,] 0.1465971 0.16178891 0.4424908 0.24912321
[3,] 0.4215123 0.16349730 0.2167325 0.19825791
[4,] 0.0161272 0.25690249 0.3412763 0.38569405
[5,] 0.4038917 0.02093166 0.4908743 0.08430233
[6,] 0.2927359 0.35175291 0.1111998 0.24431140
> head(fitted.manual)
           [,1]       [,2]      [,3]       [,4]
[1,] 0.20350966 0.06876404 0.5859579 0.14176841
[2,] 0.14893410 0.15480483 0.4633910 0.23287007
[3,] 0.43821326 0.16043996 0.2161268 0.18522003
[4,] 0.01474427 0.25401339 0.3487195 0.38252283
[5,] 0.40802951 0.01954921 0.4977440 0.07467726
[6,] 0.30182416 0.35665460 0.1082733 0.23324794
> summary(as.vector( - as.vector(fitted.manual))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.130760 -0.004902  0.002576  0.000000  0.007173  0.090803 
> plot(1:prod(dim(, as.vector( - as.vector(fitted.manual))

Minor differences remain. Is it safe to conclude that this is due to rounding? I notice that brmsfit’s point estimates for all coefficients are rounded to just two (2) decimal places. I imagine that could easily cause the small discrepancies?

brms doesn’t round coefficients except for pretty printing. All computation are done at the usual precision used by R. Not sure where the minor differences are coming from, but they seem small enough to be ignorable.

1 Like

Ah I see the reason. You likely compute point estimates and then transform afterwards. That’s not the right approach in general. Correct is to transform all the posterior samples and compute point estimates (or basically any summary statistics) only after all transformations have been made. For the purpose of your analysis, I would probably trust what fitted.brmsfit gives you more than your own function.