Hello all,
I’m doing some work trying to understand how academic performance predicts freshmen retention. I have the GPA for each student each semester, along with some metrics measured as freshmen. At it’s core, the model is that struggling students are more likely to drop out, & that poor performance is one indicator of struggling (though not the only one!)
My data are: if/when the student left, the GPA gap between high school-fall semester, the GPA gap between high school-spring semester (or fall semester-spring semester), and a variety of non-academic “struggle” metrics measured before the student begins fall semester.
The issue I’m having is, if a student does badly in the Fall semester, I think that it increases their chance of dropping out after the Spring semester, even if they did ok. That is, a student’s dropout chance in Spring is a function of both their Spring performance & their fall performance…but I’m not sure the best way to actually model this kind of accumulating hazard as survival analyses aren’t really a thing I do a lot.
Any and all advice would be immensely appreciated!
Here’s code to simulate broadly what I think is going on:
#### Survival model simulation
inv.logit <- function(x) {
exp(x) / (1+exp(x))
}
logit <- function(x) {
log( x / (1 - x))
}
# parameters
sBetas <- c(0.5, 1.2, 0.75)
aBetas <- c(0.25, 1.1, 0.9)
bStrugGPA <- -1
bDiff <- -1
bDropS <- 1
bDropA <- -1
fBaseDrop <- -5
sBaseDrop <- 1.5
# struggles
Struggles <- rgamma(1e3, 2, 2)
names(Struggles) <- 1:length(Struggles)
fStruggles <- Struggles
Rating1 <- sBetas[1] * Struggles + rnorm(1e3, 0, 0.01)
Rating2 <- sBetas[2] * Struggles + rnorm(1e3, 0, 0.01)
Rating3 <- sBetas[3] * Struggles + rnorm(1e3, 0, 0.01)
# Course difficulty
Diff <- rnorm(1e3)
fDiff <- Diff
aRating1 <- aBetas[1] * Diff + rnorm(1e3, 0, 0.01)
aRating2 <- aBetas[2] * Diff + rnorm(1e3, 0, 0.01)
aRating3 <- aBetas[3] * Diff + rnorm(1e3, 0, 0.01)
# GPA measured as gap between observed GPA minus expected GPA
GPA_gap <- rnorm(1e3, bStrugGPA * Struggles + bDiff * Diff + rnorm(1e3, 0, 0.01), 2)
GPA_gap[which(GPA_gap > 2)] <- 2
GPA_gap[which(GPA_gap < -4)] <- -4
GPA_gap1 <- GPA_gap
# Perform
sStruggles <- fBaseDrop + bDropS * Struggles + bDropA * GPA_gap + rnorm(1e3, 0, 0.01)
fallDrop <- rbinom(1e3, 1, inv.logit(sStruggles))
names(fallDrop) <- names(sStruggles)
spStruggles <- Struggles[which(fallDrop == 0)] + -0.05 * GPA_gap[which(fallDrop == 0)]
spStruggles[which(spStruggles < 0)] <- 0
# second semester
Struggles <- sStruggles
Diff <- rnorm(length(Struggles))
GPA_gap <- rnorm(length(Struggles), bStrugGPA * Struggles + bDiff * Diff + rnorm(length(Struggles), 0, 0.1), 2)
GPA_gap[which(GPA_gap > 2)] <- 2
GPA_gap[which(GPA_gap < -4)] <- -4
GPA_gap2 <- GPA_gap
eStruggles <- sBaseDrop + bDropS * Struggles + bDropA * GPA_gap + rnorm(length(Struggles), 0, 0.01)
sDrop <- rbinom(length(Struggles), 1, inv.logit(eStruggles))
names(sDrop) <- names(Struggles)
left <- c(fallDrop, sDrop)
## fake data
regdat <- data.frame(id = names(left), left=left, gap = c(GPA_gap1, GPA_gap2), pred1 = Rating1[as.numeric(names(left))], pred2 = Rating2[as.numeric(names(left))], pred3 = Rating3[as.numeric(names(left))], apred1 = aRating1[as.numeric(names(left))], apred2 = aRating2[as.numeric(names(left))], apred3 = aRating3[as.numeric(names(left))])
And here’s the model I’m not satisfied with.
# fit survival model
library(brms)
model <- bf(left | trials(1) ~ 0 + Intercept + gap + pred1 + pred2 + pred3 + apred1 + apred2 + apred3 + (1 + gap|id))
fit <- brm(model,
data = regdat,
family = binomial(),
prior(normal(0, 1), class = b),
chains = 4, cores = 4, iter = 2000, warmup = 1000)
It does ok, but I feel like the model itself could be greatly improved, specifically it seems like it is treating the Spring semester dropout rate as a function of only the spring semester data, really.
I’m looking for advice/guidance on the best way to do so. I used brms here, but rstarnarm or even just straight stan are fine–I just am unclear on the best way to actually build the model here.
Thank you!