I have been asked to re-examine a small dataset of 36 mothers who were administered a PTSD questionnaire. The original analysis included a set of exploratory analyses in which each of the four instrument subscales was regressed on a set of nine participant characteristics. The motivation for my re-analysis is to take advantage of the shrinkage properties of linear models in rstanarm to help rein in implausibly large regression slopes.
I had originally set out to repeat the analyses as four separate regressions, as in the original paper, but I wondered if approaching this as a multivariate problem might provide some added protection against inflated estimates. To approach this analysis, I arranged my data as follows:
studyid = vector of participant ids
predictorx… to n = columns of predictor values
outcome = index indicating the subscale
score = score on a given subscale
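For concreteness, the stacking can be sketched like this (the wide-format object `wide` and the `sub_` column prefix are hypothetical; only `studyid`, the predictors, `outcome`, and `score` come from my actual data):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per mother, one column per subscale
# (sub_B ... sub_E) plus the participant characteristics.
example <- wide %>%
  # scale each subscale separately, since their maximums differ
  mutate(across(starts_with("sub_"), ~ as.numeric(scale(.x)))) %>%
  # predictors scaled the same way, for standardized coefficients
  mutate(across(c(mi9t1, GA, Sex), ~ as.numeric(scale(.x)))) %>%
  # stack into long format: one row per mother x subscale
  pivot_longer(starts_with("sub_"),
               names_to = "outcome", values_to = "score")
```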
My understanding of the goal here is that I provide an intercept for each outcome and a random slope for each predictor by outcome. Since the subscales have different maximums, I mean-centred and scaled each individually before stacking into long format. All predictors are continuous and scaled in the same way, as the original authors want the results interpretable as standardized coefficients. My expectation is that the individual slopes will be shrunk towards the mean across outcomes. For the sake of speed I am fitting preliminary models using lme4, but the plan is to move to rstanarm once I am confident I'm getting the right answer. I am currently working with a subset of predictors. To protect myself from happily reporting results from a misspecified model, I did the following:
No pooling: lm(score ~ (mi9t1 + GA + Sex):outcome, data = example)
Complete pooling: lm(score ~ mi9t1 + GA + Sex, data = example)
Finally my actual model:
partial = lmer(score ~ (1|outcome) + (mi9t1 + GA + Sex - 1|outcome), data = example)
mi9t1 GA Sex (Intercept)
B 0.3566662 0.02388319 -0.04291586 -1.101495e-17
C 0.2919617 0.01955043 -0.03513029 -1.101495e-17
D 0.1792658 0.01200405 -0.02157016 -1.101495e-17
E 0.2293102 0.01535514 -0.02759175 -1.101495e-17
At first glance I was pretty happy that it seemed to work, but these results don't make sense to me. For mi9t1 the no-pooling estimate for outcome D was 0.189 and the complete-pooling estimate was 0.30, but the partial-pooling estimate is 0.179. Shouldn't this value have been shrunk up towards the complete-pooling mean instead of down below its original estimate?
Any help would be much appreciated! I am absolutely positive I have made one or several mistakes along the way, but if I’m right that this CAN work I think it will be a huge benefit to the research we are doing.
You typically want to put mi9t1 + GA + Sex both outside and inside the grouping construct. What you have now restricts the common coefficients to zero. So, in rstanarm it would be
stan_lmer(score ~ mi9t1 + GA + Sex + (1 | outcome) +
(mi9t1 + GA + Sex - 1| outcome), data = example, QR = TRUE)
but it is weird to make the variation in the intercept across outcomes independent of the variation in the other coefficients across outcomes.
Thanks for the quick reply. Using your specification (with random intercepts allowed to correlate with slopes, although I have to remove the fixed intercept to converge) I get the following from lmer:
test = lmer(score ~ -1 + mi9t1 + GA + Sex + (mi9t1 + GA + Sex|outcome), data = example)
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ -1 + mi9t1 + GA + Sex + (mi9t1 + GA + Sex | outcome)
REML criterion at convergence: 396.5
Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.6607 -0.6379 -0.2698 0.4374 3.9129
Random effects:
 Groups   Name        Variance Std.Dev. Corr
outcome (Intercept) 0.000000 0.00000
mi9t1 0.002809 0.05300 NaN
GA 0.003283 0.05730 NaN -1.00
Sex 0.005996 0.07744 NaN 1.00 -1.00
Residual 0.871238 0.93340
Number of obs: 144, groups: outcome, 4
Fixed effects:
      Estimate Std. Error t value
mi9t1 0.33056 0.08393 3.939
GA 0.04112 0.08573 0.480
Sex -0.05963 0.08963 -0.665
Correlation of Fixed Effects:
Sex 0.213 -0.306
(Intercept) mi9t1 GA Sex
B 0 0.3423532 0.028369091 -0.042400693
C 0 0.3652228 0.003644452 -0.008987116
D 0 0.3057213 0.067972278 -0.095921562
E 0 0.3089524 0.064479145 -0.091200843
Still seems off, in that for outcome D the slope for GA is 0.047 with no pooling, 0.041 with complete pooling, and 0.068 with partial pooling. If I've actually set this up properly, I would expect the partial-pooling estimate to be somewhere between 0.047 and 0.041, wouldn't I?
In rstanarm, the output is more confusing still. If I DON'T include the fixed intercept then I am told there are too many divergences, so I include it:
stan_lmer(score ~ mi9t1 + GA + Sex + (mi9t1 + GA + Sex | outcome), data = example, QR = TRUE)
and get this:
b[(Intercept) outcome:B] -1.717148e-04
b[mi9t1 outcome:B] 2.991796e-03
b[GA outcome:B] -2.689281e-03
b[Sex outcome:B] -1.190121e-03
b[(Intercept) outcome:C] -6.200280e-05
b[mi9t1 outcome:C] 1.537505e-03
b[GA outcome:C] -2.379720e-03
b[Sex outcome:C] 8.071129e-03
b[(Intercept) outcome:D] 9.884686e-06
b[mi9t1 outcome:D] -3.009933e-03
b[GA outcome:D] 2.427731e-06
b[Sex outcome:D] -5.135490e-03
b[(Intercept) outcome:E] 9.282460e-05
b[mi9t1 outcome:E] -1.340291e-03
b[GA outcome:E] 7.609781e-03
b[Sex outcome:E] -1.518275e-04
lmer results are suspect because the correlations are 1 or NaN.
The stan_lmer ones might be okay, but you get a very small intercept (I think) due to standardizing your variables, which isn't recommended. Just do it with unstandardized variables, and if you have to interpret things in terms of standard deviations, do posterior_predict once with the actual data and once with a one standard deviation increase for all observations on the variable of interest. The distribution of the difference between those two is what you want.
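Roughly, in code (a sketch; `fit` here stands for the fitted stan_lmer object and `mi9t1` is the variable of interest):

```r
# Posterior predictions at the observed data
y0 <- posterior_predict(fit, newdata = example)

# Same data, but with mi9t1 one standard deviation higher for everyone
example_plus <- example
example_plus$mi9t1 <- example_plus$mi9t1 + sd(example$mi9t1)
y1 <- posterior_predict(fit, newdata = example_plus)

# Distribution (over posterior draws) of the average difference
delta <- rowMeans(y1 - y0)
quantile(delta, probs = c(0.025, 0.5, 0.975))
```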
I suppose my rationale for standardizing the scores is that the scales they are measured on are not the same length (e.g. the maximum score can be higher on one subscale than another). My understanding of this approach to multivariate problems is that I will be shrinking the slope coefficients for each outcome together. Wouldn't the difference in scale be an issue?
Dividing by the square root of the length of the test would be fine if the length of the test is fixed. Standardizing involves an estimate of the unknown mean and standard deviation, so Bayesian results are not quite right when you standardize the dependent variable.
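In other words, something like this, where `n_items` is a hypothetical named vector of (fixed, known) subscale lengths:

```r
# e.g. n_items <- c(B = 10, C = 12, D = 8, E = 15)
# dividing by a known constant, not an estimated standard deviation
example$score <- example$score / sqrt(n_items[example$outcome])
```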
Still no luck getting sensible answers with everything unstandardized, or with the dependent variable scaled by the length of its subscale. With lmer I still get perfect correlations in the random effects, and shrinkage is not working as expected. Stan gives me 13 divergent transitions. I am wondering now if this is just a consequence of having 36 data points per outcome? The only other idea I have is whether it would make any difference if the data were arranged as in this description: https://stats.idre.ucla.edu/r/faq/multivariate-random-coefficient-model/
With stan_lmer you should be able to get rid of the divergent transitions by specifying the adapt_delta = 0.99 argument.
Still have four divergent transitions with
stan_lmer(score ~ mi9t1 + GA + Sex + (mi9t1 + GA + Sex| outcome), data = example, QR = TRUE, adapt_delta = 0.99)
Still no dice, and the pairs plot isn't playing nicely for some reason, so I think I will call it a day for now, but thank you!