Priors on mixing distributions - probability vs log-odds

Leaving the answer here in case someone else finds this thread while looking for the same thing. When estimating mixing proportions in brms (e.g. via theta2 ~ 1), the intercept is indeed on the logit scale, and priors should be specified on that scale.

I was still somewhat uncertain after reading the Stan code produced by make_stancode(), so I fit some simple models on simulated data, which convinced me. The code below demonstrates the point:

library(brms)


set.seed(717)

# Generate 100 observations from each of two well-separated normal distributions
# (so the true mixing proportion for the second component is 0.5).
N <- 100
x1 <- rnorm(N, -5, 1)
x2 <- rnorm(N, 5, 1)

x <- c(x1, x2)
d <- data.frame(x)
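
For anyone who wants to check the parameterization themselves, the generated Stan code can be inspected directly, which is what I did first. A minimal sketch using the formula and data above (output omitted; the exact generated code varies by brms version):

# Print the Stan code brms generates for this mixture model and check how
# theta2 enters the likelihood.
make_stancode(bf(x ~ 1, theta2 ~ 1),
              data = d,
              family = mixture(gaussian(), gaussian(), order = TRUE))

# get_prior() lists the parameter classes you can set priors on.
get_prior(bf(x ~ 1, theta2 ~ 1),
          data = d,
          family = mixture(gaussian(), gaussian(), order = TRUE))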

I then fit a model with a normal(0, 1) prior on the mixing proportion, which is appropriate for the logit scale, and wide but non-exchangeable priors on the locations of the two components (assuming one is negative and one is positive). The model fits well, and theta2_Intercept is estimated at zero, which is where it should be: even mixing proportions correspond to a log-odds of zero.

b1 <- brm(bf(x ~ 1,
             theta2 ~ 1),
          data = d,
          family = mixture(gaussian(),gaussian(), order = TRUE),
          cores = 4,
          prior = c(
            prior(exponential(1), class = "sigma1"),
            prior(exponential(1), class = "sigma2"),
            prior(normal(-5, 2.5), class = "Intercept", dpar = "mu1"),
            prior(normal(5, 2.5), class = "Intercept", dpar = "mu2"),
            prior(normal(0, 1), class = "Intercept", dpar = "theta2")
          ),
          backend = "cmdstanr"
)
summary(b1)
Family: mixture(gaussian, gaussian) 
  Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
Formula: x ~ 1 
         theta2 ~ 1
   Data: d (Number of observations: 200) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept       -4.90      0.09    -5.08    -4.73 1.00     4126     2907
mu2_Intercept        5.01      0.11     4.79     5.22 1.00     5148     3758
theta2_Intercept    -0.00      0.14    -0.28     0.27 1.00     4333     2882

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.89      0.06     0.77     1.02 1.00     4694     3080
sigma2     1.09      0.08     0.96     1.26 1.00     4794     2633

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
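
As a sanity check on the scale, the theta2 intercept can be back-transformed with plogis(). A quick sketch, assuming the row name theta2_Intercept matches the summary above:

# Back-transform the logit-scale estimate to a probability;
# plogis(0) = 0.5, i.e. even mixing, as expected.
plogis(fixef(b1)["theta2_Intercept", "Estimate"])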

I then fit the same model with a beta(4, 4) prior on the mixing proportion. This would be a reasonable weakly informative prior on the probability scale, but on the logit scale it is informative and strange, since its support restricts theta2_Intercept to (0, 1). The true mixing proportion is not recovered cleanly and there are now a few divergent transitions.
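
To see what this prior implies on the wrong scale, you can push draws from beta(4, 4) through the inverse logit. A quick sketch:

# A beta(4, 4) prior on the logit-scale intercept only has support on (0, 1),
# which maps to mixing proportions between plogis(0) = 0.5 and plogis(1) ~ 0.73.
quantile(plogis(rbeta(1e4, 4, 4)), c(0.025, 0.5, 0.975))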

b2 <- brm(bf(x ~ 1,
             theta2 ~ 1),
          data = d,
          family = mixture(gaussian(),gaussian(), order = TRUE),
          cores = 4,
          prior = c(
            prior(exponential(1), class = "sigma1"),
            prior(exponential(1), class = "sigma2"),
            prior(normal(-5, 2.5), class = "Intercept", dpar = "mu1"),
            prior(normal(5, 2.5), class = "Intercept", dpar = "mu2"),
            prior(beta(4, 4), class = "Intercept", dpar = "theta2")
          ),
          backend = "cmdstanr"
)
summary(b2)
Family: mixture(gaussian, gaussian) 
  Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
Formula: x ~ 1 
         theta2 ~ 1
   Data: d (Number of observations: 200) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept       -4.90      0.09    -5.08    -4.73 1.00     3992     3236
mu2_Intercept        5.01      0.11     4.79     5.23 1.00     5524     3304
theta2_Intercept     0.23      0.09     0.08     0.42 1.00     3499     2149

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.89      0.07     0.77     1.03 1.00     4296     2495
sigma2     1.09      0.08     0.96     1.26 1.00     4092     2686

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
There were 14 divergent transitions after warmup. Increasing adapt_delta above  may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
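
Back-transforming the theta2 estimate above shows the distortion:

plogis(0.23)  # ~0.56: pulled above the true 0.5 by the bounded prior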

I then regenerated the data with a mixing proportion of 0.25 for the second component and ran the same two models again. The weakly informative normal(0, 1) prior on the logit scale lets the model recover the true mixing proportion: the estimate of -1.08 on the logit scale back-transforms to approximately 0.25 in probability.

# Creating unbalanced data for probability of theta2 = 0.25 (probability scale)
set.seed(717)

x1 <- rnorm(150, -5, 1)
x2 <- rnorm(50, 5, 1)

x <- c(x1, x2)
d <- data.frame(x)

b3 <- brm(bf(x ~ 1,
             theta2 ~ 1),
          data = d,
          family = mixture(gaussian(),gaussian(), order = TRUE),
          cores = 4,
          prior = c(
            prior(exponential(1), class = "sigma1"),
            prior(exponential(1), class = "sigma2"),
            prior(normal(-5, 2.5), class = "Intercept", dpar = "mu1"),
            prior(normal(5, 2.5), class = "Intercept", dpar = "mu2"),
            prior(normal(0, 1), class = "Intercept", dpar = "theta2")
          ),
          backend = "cmdstanr"
)
summary(b3)
 Family: mixture(gaussian, gaussian) 
  Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
Formula: x ~ 1 
         theta2 ~ 1
   Data: d (Number of observations: 200) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept       -5.00      0.08    -5.16    -4.84 1.00     4352     2901
mu2_Intercept        4.88      0.14     4.62     5.15 1.00     5091     3547
theta2_Intercept    -1.08      0.16    -1.40    -0.76 1.00     5389     2836

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.99      0.06     0.88     1.11 1.00     4558     2747
sigma2     0.96      0.10     0.79     1.18 1.00     4197     2180

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
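
Back-transforming confirms the recovered proportion:

plogis(-1.08)  # ~0.25, matching the simulated mixing proportion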

Here the beta(4, 4) prior is really not good. Because its support is (0, 1), it effectively constrains the logit-scale intercept to lie between 0 and 1, i.e. mixing proportions between roughly 0.5 and 0.73, nowhere near the true value of 0.25. Chaos ensues and the model can't sample properly.

b4 <- brm(bf(x ~ 1,
             theta2 ~ 1),
          data = d,
          family = mixture(gaussian(),gaussian(), order = TRUE),
          cores = 4,
          prior = c(
            prior(exponential(1), class = "sigma1"),
            prior(exponential(1), class = "sigma2"),
            prior(normal(-5, 2.5), class = "Intercept", dpar = "mu1"),
            prior(normal(5, 2.5), class = "Intercept", dpar = "mu2"),
            prior(beta(4, 4), class = "Intercept", dpar = "theta2")
          ),
          backend = "cmdstanr"
)
summary(b4)
Family: mixture(gaussian, gaussian) 
  Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
Formula: x ~ 1 
         theta2 ~ 1
   Data: d (Number of observations: 200) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept       -5.05      0.16    -5.52    -4.84 1.06       74       NA
mu2_Intercept       -0.03      4.92    -5.08     5.13 1.73        6       NA
theta2_Intercept     0.30      0.25     0.02     0.75 1.73        6       NA

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     4.51      3.55     0.90     9.19 1.74        6       NA
sigma2     0.89      0.11     0.71     1.13 1.36        9       NA

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning messages:
1: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors. 
2: There were 61 divergent transitions after warmup. Increasing adapt_delta above  may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup