Logistic regression with imbalanced nested data

  • Operating System: Windows 10 64-bit
  • brms Version: 2.4.0

I am unsure how to optimally model a data set obtained from a crude experiment to compare performance of two plastic molding tools used to manufacture widgets. Widgets are tested and results are recorded as either a pass or fail.

Tool 1 has two molding cavities, A2 and B2, meaning that two widgets can be produced at once. Tool 2 has 16 cavities, but they are paired according to the letter prefix; for example, C1 and C2 are more similar to each other than C1 and F1. Tool 2 was also evaluated under two operating conditions: “previous” and “new”. However, I only have data for five out of 16 cavities under the “previous” condition. Pass and fail counts grouped by Cavity are shown below:

> Tool  Condition  Cavity  n_pass n_fail
> 1     Control    A2      177    2
> 1     Control    B2      168    1
> 2     New        C1      159    1
> 2     New        C2      160    1
> 2     New        D1      154    1
> 2     New        D2      176    3
> 2     New        F1      147    0
> 2     New        F2      168    0
> 2     New        G1      166    0
> 2     New        G2      166    0
> 2     New        P1      163    1
> 2     New        P2      172    0
> 2     New        R1      135    2
> 2     New        R2      175    1
> 2     New        S1      169    4
> 2     New        S2      171    2
> 2     New        T1      176    2
> 2     New        T2      148    0
> 2     Previous   C1      138    2
> 2     Previous   C2      23     0
> 2     Previous   F1      30     0
> 2     Previous   F2      21     1
> 2     Previous   G1      96     0

Based on the number of failures, I want to determine the best tool and operating condition, as well as the performance of each cavity.

Since the data is available as pass/fail, my first thought was to use logistic regression. I’m unsure how to correctly specify the model, however, given the imbalanced nested structure of the data. My first attempt was

formula = Fail ~ Tool + (1|Tool/Condition/Cavity)
fit = brm(formula, data=data, family=bernoulli, warmup=1000, iter=2000, chains=4, control=list(adapt_delta-0.99)

where Fail is a binary variable with 1 indicating a failure and 0 a pass.

This led to 46 divergent transitions and 800 transitions exceeding max treedepth. I’m currently rerunning the model with an adapt_delta of 0.999 and max_treedepth of 20, but I suspect that I may not have enough data to do all the inference I want, and it would be difficult to come up with strong priors. Is my model correctly specified? Is there a better way of approaching this problem?

Thanks!

The data you used in your model seems to be somewhat different than the data you showed. For instance, how is “Fail” generated from the shown data?

Using the shown data, you may try

data$n_trials <- data$n_fail + data$n_pass
formula <- n_fail | trials(n_trails) ~  (1|Tool/Condition/Cavity)
# and then use family binomial() in brm

Using Tool as both an overall and a varying effect does not make much sense and I would recommend leaving out one of the two. Maybe it already gets better through that.

Thanks, Paul. Sorry for not being clear - the data shown is an aggregate, but I have results for individual pass/fail as well, hence my original thought for using a logistic regression.

Attempts at fitting either model
formula = 'n_fail|trials(n_trials) ~ (1|Tool/Condition/Cavity)'
or
formula = 'n_fail|trials(n_trials) ~ Tool + (1|Condition/Cavity)'
always result in a dozen or so divergent transitions, respectively, despite setting a higher warmup to 4000 and adapt_delta to 0.999999. Would it help to somehow also model the increased similarity between cavities with the same letter, e.g., C1 and C2?

Edit: I realized I was missing a factor in my dataset and it seems to have occasionally solved the divergence problem - some simulations run fine, but in others there are half a dozen divergences. Are these spurious, and would it still be ok to use the model?

The new model is:
n_fail|trials(n_trials) ~ (1|Fluid) + (1|Condition/Cavity)
where Fluid is two-level factor occurring for samples tested in each cavity.

I would suggest you start with a simple model, let’s say only by using Tools as a predictor and then increase complexity step by step in order to find out what is causing the problems.

That is working better, thanks!

Using a binomial model, is there a way to directly get posterior predictive proportions rather than counts using brms::predict or brms::posterior_predict? That is, I would like to plot the predicted distribution of the proportions for each Tool, rather than the specific number of failures for a given number of trials. Or would I have to revert to a bernoulli model to do this? With over 3000 observations, the bernoulli model takes much longer to run than the binomial using aggregated data…

Call the methods with argument newdata, which can be your original data, but with your trials variable (n_trials) to 1.

It looks like that just returns point estimates without correct quantiles when calling
predict(fit, newdata=data.frame(Condition=c('Control', 'Previous', 'New'), n_trials=1))

Estimate  Est.Error Q2.5 Q97.5
[1,] 0.0057500 0.07561280    0     0
[2,] 0.0060625 0.07762811    0     0
[3,] 0.0061875 0.07841938    0     0

Setting summary=FALSE results in a matrix of count data for each Monte Carlo sample (most entries being zero, because the proportions are so small).

I suppose one crude alternative would be to set n_trials to some very high number (say 1E6) and then divide the counts in the matrix by n_trials to approximate proportions.

use fitted instead of predict to get the probabities with reasonable quantiles.

That did the trick, thanks! Hopefully just one more question: how can I model the additional similarities between cavities with the same letter? Would I parse out the cavity ID into new columns “Cavity Letter” and “Cavity Number” and nest Number within Letter? I will be getting more data, so hopefully this type of model will be estimable.

Given that you have both variables called CL and CN, you could use something like (1 | CL/CN) to model one random effect per “cavity letter” and one per “cavity”.