Survival analysis with cumulative hazards

Hello all,

I’m doing some work trying to understand how academic performance predicts freshmen retention. I have the GPA for each student each semester, along with some metrics measured as freshmen. At it’s core, the model is that struggling students are more likely to drop out, & that poor performance is one indicator of struggling (though not the only one!)

My data are: if/when the student left, the GPA gap between high school-fall semester, the GPA gap between high school-spring semester (or fall semester-spring semester), and a variety of non-academic “struggle” metrics measured before the student begins fall semester.

The issue I’m having is, if a student does badly in the Fall semester, I think that it increases their chance of dropping out after the Spring semester, even if they did ok. That is, a student’s dropout chance in Spring is a function of both their Spring performance & their fall performance…but I’m not sure the best way to actually model this kind of accumulating hazard as survival analyses aren’t really a thing I do a lot.

Any and all advice would be immensely appreciated!

Here’s code to simulate broadly what I think is going on:

#### Survival model simulation
inv.logit <- function(x)	{
	exp(x) / (1+exp(x))
}
logit <- function(x)	{
	log( x / (1 - x))
}

# parameters
sBetas <- c(0.5, 1.2, 0.75)
aBetas <- c(0.25, 1.1, 0.9)
bStrugGPA <- -1
bDiff <- -1
bDropS <- 1
bDropA <- -1
fBaseDrop <- -5
sBaseDrop <- 1.5

# struggles
Struggles <- rgamma(1e3, 2, 2)
names(Struggles) <- 1:length(Struggles)
fStruggles <- Struggles

Rating1 <- sBetas[1] * Struggles + rnorm(1e3, 0, 0.01)
Rating2 <- sBetas[2] * Struggles + rnorm(1e3, 0, 0.01)
Rating3 <- sBetas[3] * Struggles + rnorm(1e3, 0, 0.01)

# Course difficulty
Diff <- rnorm(1e3)
fDiff <- Diff
aRating1 <- aBetas[1] * Diff + rnorm(1e3, 0, 0.01)
aRating2 <- aBetas[2] * Diff + rnorm(1e3, 0, 0.01)
aRating3 <- aBetas[3] * Diff + rnorm(1e3, 0, 0.01)

# GPA measured as gap between observed GPA minus expected GPA
GPA_gap <- rnorm(1e3, bStrugGPA * Struggles + bDiff * Diff + rnorm(1e3, 0, 0.01), 2)
GPA_gap[which(GPA_gap > 2)] <- 2
GPA_gap[which(GPA_gap < -4)] <- -4
GPA_gap1 <- GPA_gap

# Perform
sStruggles <- fBaseDrop + bDropS * Struggles + bDropA * GPA_gap + rnorm(1e3, 0, 0.01)

fallDrop <- rbinom(1e3, 1, inv.logit(sStruggles))
names(fallDrop) <- names(sStruggles)
spStruggles <- Struggles[which(fallDrop == 0)] + -0.05 * GPA_gap[which(fallDrop == 0)]
spStruggles[which(spStruggles < 0)] <- 0

# second semester
Struggles <- sStruggles
Diff <- rnorm(length(Struggles))
GPA_gap <- rnorm(length(Struggles), bStrugGPA * Struggles + bDiff * Diff + rnorm(length(Struggles), 0, 0.1), 2)
GPA_gap[which(GPA_gap > 2)] <- 2
GPA_gap[which(GPA_gap < -4)] <- -4
GPA_gap2 <- GPA_gap
eStruggles <- sBaseDrop + bDropS * Struggles + bDropA * GPA_gap + rnorm(length(Struggles), 0, 0.01)
sDrop <- rbinom(length(Struggles), 1, inv.logit(eStruggles))
names(sDrop) <- names(Struggles)

left <- c(fallDrop, sDrop) 

## fake data
regdat <- data.frame(id = names(left), left=left, gap = c(GPA_gap1, GPA_gap2), pred1 = Rating1[as.numeric(names(left))], pred2 = Rating2[as.numeric(names(left))], pred3 = Rating3[as.numeric(names(left))], apred1 = aRating1[as.numeric(names(left))], apred2 = aRating2[as.numeric(names(left))], apred3 = aRating3[as.numeric(names(left))])

And here’s the model I’m not satisfied with.

# fit survival model
library(brms)

model <- bf(left | trials(1) ~ 0 + Intercept + gap + pred1 + pred2 + pred3 + apred1 + apred2 + apred3 + (1 + gap|id))

fit <- brm(model, 
	data = regdat,
	family = binomial(),
	prior(normal(0, 1), class = b),
	chains = 4, cores = 4, iter = 2000, warmup = 1000)

It does ok, but I feel like the model itself could be greatly improved, specifically it seems like it is treating the Spring semester dropout rate as a function of only the spring semester data, really.

I’m looking for advice/guidance on the best way to do so. I used brms here, but rstarnarm or even just straight stan are fine–I just am unclear on the best way to actually build the model here.

Thank you!

Why not just code that model up in Stan directly?

If there are just two time periods and you aren’t looking at continuous dropout times, this doesn’t look like a standard survival model.