How to interpret brms estimation on the correct scale

I posted this yesterday but used the wrong category. So I posted it again and deleted yesterday’s post. Also, some modifications are made after doing some more tests.

The question is in general regarding how to interpret the summary result of a brms model. I have a hard time understanding the scale of each estimate and also how they are calculated. Therefore, it is hard for me to tell whether the estimates agree with the true value or not when I simulate the data. This seems a pretty basic question so if it has already been asked (I did a quick search but didn’t find anything that can answer all my questions in a consistent way), feel free to direct me to the answer. Any help is appreciated!

Below is the structure of a hierarchical model:

Screen Shot 2020-12-10 at 9.37.06 AM

In a biological context, you can consider x as a matrix containing gene expression of all genes in all cells measured by technology A. y is a matrix containing the gene expression measured by technology B. And alpha here is the gene-specific distortion between the two techniques. The distribution of alpha is the key information I want to get.

Below is the code for generating input data and model fitting. To make it simple and also easy to evaluate, I specify alpha to be all 0. My hope is that the estimated random intercepts should all be 0 and the distribution of sigma in the model should also center on 0.

library(tidyverse)
library(brms)
library(tidybayes)
library(MASS)

N=3         #number of genes
M=50        #number of cells
theta=0.1   #overdispersion

#generate x
x = rbind(rbind(rep(1, M), rep(2, M)), rep(3, M))

#generate alpha
sigma = 0           
set.seed(7)
alpha = rnorm(N, mean = 0, sd = sigma)     

#generate y
y = matrix(0, N, M)
for (i in 1:N){
  for (j in 1:M){
    set.seed(7+i*10+j)
    y[i,j] = rnegbin(1, mu = exp(alpha[i] + x[i, j]), theta = theta) 
  }
}

#gene label
gene.symbol=matrix(NA, nrow = N, ncol = M)
for (i in 1:N){
  gene.symbol[i, ] = rep(paste("gene", i, sep = ""), M)
}

data2fit <-
  data_frame(
    gene = as.character(gene.symbol),
    X = as.numeric(x),
    response = as.numeric(y)
  )

#prior
bpriors <-  prior(constant(0), class = "b", coef = "Intercept") +                    #fix the population level intercept to be 0 (no global distortion)         
            prior(constant(1), class = "b", coef = "X") +                            #fix the coefficient of X to be 1  
            prior(normal(0, sigma_gene_sd), class = "sd") +                           
            prior("target += cauchy_lpdf(sigma_gene_sd | 0, 10)", check = FALSE) +    
            prior(gamma(0.01, 0.01), class = "shape")                                        

# here is where we add the user defined parameters
stanvars <- stanvar(scode = "real<lower=0> sigma_gene_sd;",                # by providing lower=0, we force tau to use just half cauchy                                          
                    block = "parameters")

#fit the model
m <- brm(    
  bf(response ~ 0 + Intercept + X + (1 | gene)),
  data = data2fit, 
  prior = bpriors, stanvars = stanvars,                                      # this line passes the prior & modified parameter blocks into brm()
  family = negbinomial(),
  iter = 2000, chains = 4, cores = 4
)

#summary
summary(m)
 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: response ~ 0 + Intercept + X + (1 | gene) 
   Data: data2fit (Number of observations: 150) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~gene (Number of levels: 3) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.29      0.35     0.01     1.26 1.00     1463     1770

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.00      0.00     0.00     0.00 1.00     4000     4000
X             1.00      0.00     1.00     1.00 1.00     4000     4000

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.12      0.02     0.08     0.15 1.00     2775     2119

#check random intercept
ranef(m)$gene[,,"Intercept"][,"Estimate"]
     gene1      gene2      gene3 
0.08128817 0.02652586 0.03820017 

#exponentiate the random intercept
exp(ranef(m)$gene[,,"Intercept"][,"Estimate"])
   gene1    gene2    gene3 
1.084683 1.026881 1.038939

The first major question is for the interpretation of the random intercept:
(1) Is the estimate returned by ranef(m) on the original scale or the log scale?
(2) Based on my understanding, since I specify alpha to be 0 during simulation, the random intercept estimate returned by ranef(m) should all be 0. However, the random intercepts of the three genes (before or after exp) are not 0. Why is that?
(3) How is the estimate of sd(Intercept) calculated? And is it on log scale or original scale? Based on the summary, the estimate of sd(Intercept) is 0.29. Is it the standard deviation of all three random intercepts (i.e., the estimation of sigma in the model)? However, it doesn’t equal to sd(exp(ranef(m)$gene[,“Intercept”][,“Estimate”])) or sd(ranef(m)$gene[,“Intercept”][,“Estimate”]). exp(0.29) also doesn’t equal any of the two standard deviations. In theory, if it is the estimation of sigma in the model, it should equal 0 since I force alpha to be 0. There should be no variance of random intercept. So it seems that it is not the standard deviation of random intercepts? It is also not the mean. So how should I understand it?

The second major question is for the interpretation of the population-level effects:
(1) for the population-level Intercept and X, based on what I found online, due to the log link function, they are on the log scale and I should exponentiate them to get the values on the original scale?
(2) If that is the case, then it contradicts with the prior I set. When setting the prior, I force the Intercept to be 0 and X to be 1. However, in the summary, if the estimate of Intercept and X are on log scale, then the estimate on the original scale is exp(0) and exp(1). This contradicts with the prior I set. So is the population level estimate on log scale or not? This leads to another question: when I fix a prior, should I consider the scale? For example, if I want to fix the coefficient of X to be 1, should I just specify constant(1) as prior or I should use constant(0) because exp(0) = 1?

A few other questions:

  1. If I include offset in the model (the offset is equal to all genes), will that affect the estimation of the distribution of alpha? I tried to include offset when I estimate the random intercepts, it changes the estimation but I cannot figure out the magnitude of difference. If the true random intercepts are c(1,2,3) for three groups and I use log(500) as offset, I thought the new random intercepts with offset will be c(1,2,3)-log(500) but it is actually not.
  2. Eventually I need to use zero inflated negative binomial distribution. Since both ZINB and NB use log link function, I assume the interpretation of the two results are the same?

Sorry for the long question. I hope I make myself clear. Overall, I cannot find a comprehensive description of how to interpret the summary output of a brms model, especially which estimate is on what scale. This also affects the prior setting (should I set the prior on the original scale or on the scale based on link function?).

  • Operating System: Linux server
  • brms Version: brms_2.14.6

Hi, sorry to take so long to give you a reply, and sorry I won’t answer your questions directly.

There are a lot of brms specific questions here so I think it’s better to ask @paul.buerkner directly, otherwise, I might confuse you by giving the wrong answer to some of your questions ;)

Thank you for your reply! I am new to the forum and I didn’t know I can ask @paul.buerkner directly. Thank you for letting me know!