# Background

I am trying to model data from a psychological experiment in which participants use a slider on a computer screen to give a response. There are actually two very similar experiments that both have the same problem. The response variable y can hence vary between 0 and 1. Unfortunately, I think I made a mistake while designing the experiment: the starting position of the slider is always 0.5, and participants seem to always move the slider at least a bit, never leaving it at the middle even when they “should”. This can be seen in the raw distribution of y:

ggplot(Experiment1, aes(x = est_dist)) +
  geom_density() +
  geom_vline(xintercept = 0.36) +
  geom_vline(xintercept = 0.65) +
  theme_classic() +
  labs(x = "Raw distribution of y", y = "Density")


As we can see, the raw distribution of y has two modes, approximately at y = 0.36 and y = 0.65. My interpretation is that participants fairly randomly move the slider to the left or to the right even when they want to give a centre response. My original aim was simply to build a hierarchical regression model predicting y from the two predictor variables x_1 and x_2, because I would like to make inferences about whether there is evidence that the predictors are different from zero. However, pp_check() of a model with a single Gaussian response distribution still shows two modes, so I feel I have to do something different.

I played around with my data to see whether anything would determine a left or right shift, but I could not find anything, so I turned to mixture models, which I understand might be useful in this context. Here is how I currently model the data:

First, I scaled y, x_1 and x_2, so don't be surprised that the numbers differ from the raw distribution. The model I fit, with its priors, is:

# Create Gaussian mixture model
mix <- mixture(gaussian, gaussian)

# Prior
prior <- c(prior(normal(-0.7, 1), class = "Intercept", dpar = mu1),
           prior(normal(0, 1),    class = "b",               dpar = mu1),
           prior(normal(0, 0.5),  class = "b", coef = "x2",  dpar = mu1),
           prior(normal(0.57, 1), class = "Intercept",       dpar = mu2),
           prior(normal(0, 1),    class = "b",               dpar = mu2),
           prior(normal(0, 0.5),  class = "b", coef = "x2",  dpar = mu2))

# Fit model
m1 <- brm(y ~ x1 * x2 + (x1 * x2 | ppid),
          data = Experiment2,
          family = mix,
          prior = prior,
          iter = iter,
          warmup = warmup,
          chains = chains,
          cores = cores,
          init = 0,
          seed = 1412)


The summary

Family: mixture(gaussian, gaussian)
Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity
Formula: y ~ x1 * x2 + (x1 * x2 | ppid)
Data: Experiment2 (Number of observations: 3240)
Draws: 10 chains, each with iter = 34000; warmup = 7000; thin = 1;
total post-warmup draws = 270000

Group-Level Effects:
~ppid (Number of levels: 27)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(mu1_Intercept)             0.33      0.05     0.24     0.45 1.00    79474   136602
sd(mu1_x1)                    0.55      0.09     0.40     0.76 1.00    88560   144127
sd(mu1_x2)                    0.24      0.08     0.11     0.41 1.00    59470   109863
sd(mu1_x1:x2)                 0.31      0.15     0.04     0.61 1.00    43261    67606
sd(mu2_Intercept)             0.40      0.07     0.29     0.54 1.00    71623   125214
sd(mu2_x1)                    0.50      0.09     0.35     0.70 1.00    91114   152636
sd(mu2_x2)                    0.22      0.09     0.05     0.41 1.00    53965    62593
sd(mu2_x1:x2)                 0.46      0.15     0.17     0.78 1.00    80310    96385
cor(mu1_Intercept,mu1_x1)     0.24      0.19    -0.16     0.59 1.00    85937   136518
cor(mu1_Intercept,mu1_x2)    -0.14      0.25    -0.62     0.35 1.00   165462   178790
cor(mu1_x1,mu1_x2)           -0.39      0.25    -0.83     0.12 1.00   116450   138334
cor(mu1_Intercept,mu1_x1:x2) -0.25      0.30    -0.80     0.36 1.00   168084   143507
cor(mu1_x1,mu1_x1:x2)        -0.15      0.31    -0.73     0.46 1.00   183502   149842
cor(mu1_x2,mu1_x1:x2)         0.46      0.34    -0.42     0.90 1.00    74380   115312
cor(mu2_Intercept,mu2_x1)    -0.28      0.19    -0.63     0.12 1.00    99610   151304
cor(mu2_Intercept,mu2_x2)    -0.20      0.28    -0.69     0.41 1.00   171919   151727
cor(mu2_x1,mu2_x2)           -0.29      0.29    -0.80     0.31 1.00   158194   154069
cor(mu2_Intercept,mu2_x1:x2) -0.01      0.27    -0.55     0.49 1.00   165735   175828
cor(mu2_x1,mu2_x1:x2)         0.58      0.21     0.09     0.90 1.00   168609   177125
cor(mu2_x2,mu2_x1:x2)        -0.52      0.30    -0.93     0.18 1.00    91228   124046

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept -0.56      0.07    -0.70    -0.43 1.00    50255    93363
mu2_Intercept  0.77      0.08     0.60     0.93 1.00    48193    90484
mu1_x1         0.56      0.11     0.34     0.79 1.00    67273   110857
mu1_x2        -0.23      0.07    -0.37    -0.10 1.00   144838   158255
mu1_x1:x2      0.03      0.11    -0.18     0.25 1.00   169239   171867
mu2_x1         0.47      0.11     0.26     0.68 1.00    72559   120684
mu2_x2        -0.19      0.07    -0.33    -0.06 1.00   138909   158043
mu2_x1:x2      0.25      0.13    -0.00     0.51 1.00   122729   166378

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.48      0.01     0.45     0.50 1.00   173057   197944
sigma2     0.45      0.02     0.42     0.48 1.00   150047   181474
theta1     0.56      0.02     0.53     0.60 1.00   129195   173993
theta2     0.44      0.02     0.40     0.47 1.00   129195   173993

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).


and the pp_check() of the model:

pp_check(m1, ndraws = 1000)


I would argue that this already looks much better than what I got using one Gaussian distribution instead of two, but it doesn't seem to be perfect.

Possibly interesting side note for readers who try to fit similar models: the model only behaved well after I set more informative priors on the intercepts, used a very high number of samples (warmup = 7,000 and iter = 34,000 with 10 chains) and set init = 0, which seemed pretty crucial for getting it to converge.
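A related idea that might spare others some of these workarounds: convergence trouble in two-component mixtures is often label switching between the components, and if I remember brms's API correctly, mixture() has an order argument that imposes an ordering constraint on the component means (treat the exact argument value as an assumption to check against the brms documentation):

```r
library(brms)

# Possibly an alternative to very long chains and init = 0: constrain the
# component means to be ordered so the two Gaussians cannot swap labels.
# The order argument is from my reading of ?mixture; verify before relying on it.
mix_ordered <- mixture(gaussian, gaussian, order = TRUE)
```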

# Questions

I've looked at some responses on here and also read the chapter on finite mixture models in Bayesian Data Analysis by Gelman et al., but I am still not exactly sure how to approach this issue. Mainly, I have these questions:

## 1. Does my approach make any sense?

My first question is whether what I attempt makes any sense.

## 2. Is it acceptable to eye-ball the priors for the intercept based on the raw data?

Generally, I think my field prefers specifying priors before seeing the data, but in this particular case I feel that was not possible. I therefore wonder whether the approach of eye-balling plausible locations of the intercepts based on the data is fine.

## 3. How do I do inference on the effect of my predictors?

Assuming that my current model, or a similar version, makes any sense, I would like to know how I can infer whether or not there is an effect of my predictor variables. Since we have two Gaussian distributions, I have two slopes per predictor, etc.

Is it possible to average mu1_x1 and mu2_x1 together? Or do I have to strictly interpret them in terms of the two Gaussian distributions? My idea was to average them like this using the $\theta$s:

# Extract the posterior draws from the model
m1_draws <- as_draws_df(m1)
m1_draws$b_mu1_x1 <- m1_draws$b_mu1_s_true_dist
m1_draws$b_mu2_x1 <- m1_draws$b_mu2_s_true_dist

# Calculate the average effect of x1, weighting each component's slope
# by its mixing proportion
m1_draws$b_x1 <- m1_draws$theta1 * m1_draws$b_mu1_x1 + m1_draws$theta2 * m1_draws$b_mu2_x1

# Plot
ggplot(m1_draws, aes(x = b_x1)) +
  geom_histogram(bins = 1000) +
  theme_classic()


Is that reasonable? Relatedly, what is the implied prior for this averaged effect of x_1, so I can compare it to its posterior?
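One way to approach the implied-prior question is to simulate it. A sketch, assuming the normal(0, 1) slope priors from the model above and brms's default dirichlet(1) prior on the mixing proportions (under which theta1 is marginally uniform on [0, 1]; worth verifying against the brms docs):

```r
# Simulate the prior implied for the averaged slope
# b_x1 = theta1 * b_mu1_x1 + theta2 * b_mu2_x1
set.seed(1412)
n <- 1e5
b1 <- rnorm(n, 0, 1)   # prior draws for the mu1 slope of x1
b2 <- rnorm(n, 0, 1)   # prior draws for the mu2 slope of x1
theta1 <- runif(n)     # dirichlet(1, 1) makes theta1 marginally uniform(0, 1)
b_x1_prior <- theta1 * b1 + (1 - theta1) * b2

quantile(b_x1_prior, c(0.025, 0.5, 0.975))
```

This prior is symmetric around zero and somewhat tighter than the normal(0, 1) component priors, since the weighted average shrinks the draws toward each other.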

When we look at the relationship between the posterior samples of mu1_x1 and mu2_x1, we incidentally find that they are not correlated, which I guess is a good thing.

ggplot(m1_draws, aes(x = b_mu1_x1, y = b_mu2_x1)) +
  geom_point(alpha = 0.01) +
  geom_smooth() +
  theme_classic()


## 4. How are theta_1 and theta_2 interpreted?

When I want to describe the results of the model, would I simply report the values of $\theta_1$ and $\theta_2$ along with the intercepts and the $\sigma$ values to describe the two distributions? Can I interpret $\theta$ as the proportion of the data belonging to each distribution?
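Related to this: as I understand it, brms also provides pp_mixture(), which returns posterior probabilities that each observation belongs to each mixture component, a per-observation counterpart to the marginal $\theta$s. A sketch, not run here (it needs the fitted m1):

```r
library(brms)

# Posterior probabilities of component membership for each observation;
# the thetas are the marginal (overall) mixing proportions.
memb <- pp_mixture(m1)

# With the default summary, this should give one row per observation with
# summarised membership probabilities for each of the two components.
head(memb)
```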

Interestingly, conditional_effects() does seem to combine both slopes, but it is unclear to me how.

conditional_effects(m1)


As a way of thinking about this problem, I recommend my 2007 article with Iain Pardoe, Average predictive comparisons for models with nonlinearity, interactions, and variance components. The basic idea is that the comparison (the expected difference in y, comparing two people who differ by 1 unit in x1 but are identical in the other predictors) depends on the values of all the predictors. In general, the way to do this is using simulations, although if you're fitting a linear model (rather than, say, a logistic one), you can do some of the computations analytically. We also discuss these ideas in Section 21.4 of my book with Jennifer and in Section 14.4 of Regression and Other Stories. Using a mixture model rather than, say, an ordinary regression model with interactions introduces no new conceptual challenges.

If you're working with Stan, you should compute the average predictive comparisons in the generated quantities block, and then your Stan fit will automatically give you the posterior distribution of all the average predictive comparisons that you compute.
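In brms terms, a sketch of such an average predictive comparison might look like the following (not run here; it assumes the m1 and Experiment2 objects from the question, and that posterior_epred() for a mixture family returns the mixture-weighted expectation, which is my understanding of the brms behaviour):

```r
library(brms)

# Predictor values as observed, and with x1 shifted by one (scaled) unit
# while everything else is held fixed
d_lo <- Experiment2
d_hi <- Experiment2
d_hi$x1 <- d_hi$x1 + 1

# For a mixture family, posterior_epred() should return the overall mean,
# so the component slopes and the thetas are combined automatically
epred_lo <- posterior_epred(m1, newdata = d_lo)
epred_hi <- posterior_epred(m1, newdata = d_hi)

# Average over observations within each posterior draw; this yields the
# posterior distribution of the average predictive comparison for x1
apc_x1 <- rowMeans(epred_hi - epred_lo)
quantile(apc_x1, c(0.025, 0.5, 0.975))
```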

P.S. We prefer the term “comparisons” rather than “effects” because the interpretation as a comparison in a predictive model is more general than the interpretation as a causal effect. The comparison reduces to an effect in the special case that the model has a valid causal interpretation.
