posterior_survfit() is not using the new data frame (newdata)

Hello, I am having trouble generating posterior predictions with posterior_survfit(). I am trying to pass a new data frame, but the function is not using it and instead uses values from the dataset I fit the model to. The variables in the fitted model are New.Treatment (categorical, 6 treatments), Openness (a continuous light index; min = 2.22, mean = 6.90, max = 10.54), subplot_by_site (categorical, 720 sites), and New.Species.name (categorical, 165 species). My new data frame has 94 rows, but posterior_survfit() is returning 3,017,800 rows. Help, please!

head(nd)
  New.Treatment Openness
1            BE        5
2            BE        6
3            BE        7
4            BE        8
5            BE        9
6            BE       10


fit <- stan_surv(
  formula = Surv(days, Status_surv) ~ New.Treatment * Openness + (1 | subplot_by_site) + (1 | New.Species.name),
  data = dataset,
  basehaz = "weibull",
  chains = 4,
  iter = 2000,
  cores = 4)

Post <- posterior_survfit(fit, type = "surv",
                          newdata = nd)

head(Post)
  id cond_time    time median  ci_lb  ci_ub
1  1        NA 62.0000 0.9626 0.9623 1.0000
2  1        NA 69.1313 0.9603 0.9600 0.9997
3  1        NA 76.2626 0.9581 0.9579 0.9696
4  1        NA 83.3939 0.9561 0.9557 0.9665
5  1        NA 90.5253 0.9541 0.9537 0.9545
6  1        NA 97.6566 0.9522 0.9517 0.9526

# here is a sample dataset to replicate my problem
library(rstanarm)

data_NHN <- expand.grid(New.Treatment = c("A", "B", "C"), Openness = seq(2, 11, by = 0.15))
data_NHN$subplot_by_site <- c(rep("P1", 63), rep("P2", 60), rep("P3", 60))
data_NHN$Status_surv <- sample(0:1, 183, replace = TRUE)
data_NHN$New.Species.name <- c(rep("sp1", 10), rep("sp2", 40), rep("sp1", 80), rep("sp2", 20), rep("sp1", 33))
data_NHN$days <- sample(10, size = nrow(data_NHN), replace = TRUE)

nd_t <- expand.grid(New.Treatment = c("A", "B", "C"), Openness = seq(2, 11, by = 1))


mod <- stan_surv(
  formula = Surv(days, Status_surv) ~ New.Treatment + Openness + (1 | subplot_by_site) + (1 | New.Species.name),
  data = data_NHN,
  basehaz = "weibull",
  chains = 4,
  iter = 30,
  cores = 4)

summary(mod)
pos <- posterior_survfit(mod, type = "surv",
                         newdataEvent = nd_t,
                         times = 0)
head(pos)
# I am interested in predicting values for specific Openness values (nd_t = 30 rows), but I am instead getting values for each point in time (pos = 18300 rows).
  • Operating System: Mac OS Catalina 10.15.6
  • R version: 4.0
  • rstan version: 2.21.2
  • rstanarm Version: rstanarm_2.21.2

Any suggestions on why it is not working? Also, it's not clear how to plot the effect of one variable in the interaction as the other changes, with the associated uncertainty (i.e. a marginal effects plot). TIA.

Hi @nohemihuanca, welcome to the Stan forums.

I don’t know much about the posterior_survfit() function, but I’m tagging the author
@sambrilleman. Sam, any idea what's going on here? Is this a bug or a misunderstanding?


hey @nohemihuanca. Sorry this was confusing for you.

So I think the first issue (at least with the reproducible example) is simply the name of the newdata argument in posterior_survfit(). In your example you have specified newdataEvent instead of just newdata. The newdataLong / newdataEvent arguments are only for predictions from joint models (i.e. stanjm objects). It should really throw an informative error when you specify that by accident, but I mustn't have considered that; instead, your misspecified newdataEvent is just getting absorbed as one of the additional ... arguments, and the model estimation dataset is being used for the predictions instead.

Hopefully that will fix the issue you are seeing.

But that hasn’t solved it entirely. When you specify newdata = nd_t you will get another error in your reprex (at least I did):

> pos=posterior_survfit(mod, type="surv",
+                       newdata=nd_t,
+                       times = 0)
Error: The following variables are missing from the data: subplot_by_site, New.Species.name

It seems you will have to specify something for the grouping factors. It depends on what you want to do. If you want to predict based on the random effects for a specific group, then just specify its subplot_by_site and New.Species.name values in the newdata.
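For example, a minimal sketch using the reprex objects above (the levels "P1" and "sp1" are just illustrative; any levels present in the estimation data would work):

```r
# predict using the estimated random intercepts for site "P1" and
# species "sp1" (both levels exist in the estimation data)
nd_grp <- expand.grid(
    New.Treatment    = c("A", "B", "C"),
    Openness         = seq(2, 11, by = 1),
    subplot_by_site  = "P1",
    New.Species.name = "sp1")

pos_grp <- posterior_survfit(mod, type = "surv",
                             newdata = nd_grp,
                             times = 0)
```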

Alternatively, if you want to marginalise over the random effects for each of those grouping factors, then from memory you need to specify a value for each of the grouping factors that doesn’t correspond to a group in the estimation dataset. This is way more clunky than stan_glmer so we should endeavour to fix this. I am also encountering a bug with new levels when the grouping variable is a character variable (@jonah fyi – I think this is a bug here because the object$xlevs includes “x levels” for the grouping factors, but ideally it shouldn’t because otherwise new groups can’t be included in the prediction model frame).

Long story (kinda) short… @nohemihuanca use numeric values for your grouping factors (the numeric values get converted to factors anyway) and then specify a new value in your prediction data and it should work until we fix that bug. Something like:

library(rstanarm)

data_NHN <- expand.grid(
    New.Treatment = c("A","B","C"),
    Openness = c(seq(2, 11, by=0.15)))
data_NHN$subplot_by_site = c(rep(1,63), rep(2,60), rep(3,60))
data_NHN$Status_surv = sample(0:1,183, replace = TRUE) 
data_NHN$New.Species.name = c(rep(1,10), rep(2,40), rep(1,80), rep(2,20), rep(1,33))
data_NHN$days = sample(10, size = nrow(data_NHN), replace = TRUE)

mod = stan_surv(
    formula = Surv(days, Status_surv) ~ New.Treatment + Openness + (1 | subplot_by_site) + (1 | New.Species.name),
    data = data_NHN,
    basehaz = "weibull",
    chains = 4,
    iter = 30,
    cores = 4)

nd <- expand.grid(
    New.Treatment = c("A","B","C"), 
    Openness = c(seq(2, 11, by = 1)),
    subplot_by_site = -999,   # new value in the prediction data
    New.Species.name = -999)  # new value in the prediction data

pos = posterior_survfit(
    mod, type = "surv",
    newdata = nd,
    times = 0)

# to confirm the new value for the grouping factor doesn't matter
nd2 <- expand.grid(
    New.Treatment = c("A","B","C"), 
    Openness = c(seq(2, 11, by = 1)),
    subplot_by_site = +999,   # new value in the prediction data
    New.Species.name = +999)  # new value in the prediction data

pos2 = posterior_survfit(
    mod, type = "surv",
    newdata = nd2,
    times = 0)

identical(pos, pos2)

# can plot the predictions for each covariate profile
# NB: you are marginalising over the random effects so the uncertainty bounds will be huge...
plot(pos)

Note that your prediction data pos has 3000 rows – that is, 30 covariate profiles and 100 rows per covariate profile, where the 100 rows correspond to 100 evenly distributed time points between t=0 and t=10 (the min and max times). You can change that behaviour via the times, control, and extrapolate arguments to posterior_survfit().
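As a rough sketch of how to change that behaviour (the epoints control option is from memory, so double-check ?posterior_survfit before relying on it):

```r
# start predictions at t = 0 and extrapolate forward over a coarser
# grid of time points than the default 100
pos_coarse <- posterior_survfit(mod, type = "surv",
                                newdata = nd,
                                times = 0,
                                extrapolate = TRUE,
                                control = list(epoints = 20))

# should give 30 covariate profiles x 20 time points = 600 rows
nrow(pos_coarse)
```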


Thank you! I appreciate your help @sambrilleman. I am sorry if this is a basic question, but I want the specific 30 covariate profiles that I set in my new dataset, and I have 100 evenly distributed time points for each profile (3000 rows). Does this mean that if I want a single estimate for id = 23 (New.Treatment = B, Openness = 9), i.e. the estimated survival time in days for that specific id, I should take the row where the median is ~0.5, because that is the median event-free probability? Rather than taking the mean of pos$median (over the 100 rows for id = 23) and then finding the pos$time (in days) it corresponds to?

Thank you so much!

@nohemihuanca The survival probability (the default quantity returned by posterior_survfit()) is a time-varying quantity. So if you want just one value for, say, id = 23, you have to predict at a specific time: set the times argument to the time you want and specify extrapolate = FALSE. This gives you the predicted survival probability at that specific time, which you can then compare across different ids or covariate profiles. For example, if you specified times = 5 you could answer questions like "what is the estimated survival probability at t = 5?" and compare that estimate across your different ids or covariate profiles.
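Continuing the reprex, that would look something like this (a sketch, assuming the nd prediction frame defined earlier):

```r
# predicted survival probability at a single time point (t = 5)
# for each covariate profile; no extrapolation beyond that time
pos_t5 <- posterior_survfit(mod, type = "surv",
                            newdata = nd,
                            times = 5,
                            extrapolate = FALSE)

head(pos_t5)  # one row per covariate profile, all at time = 5
```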

However, the "one value" you seem to be after in your question is the median survival time? The posterior_survfit() function doesn't return the median survival time at the moment; I just haven't implemented it yet. Having said that, most people define the median survival time as the time at which the survival probability equals 0.5. So if you can use posterior_survfit() to find the pos$time at which pos$median = 0.5, then you have the median survival time, although, as mentioned, the function isn't set up to return that directly. Also, it isn't the same as taking the mean of pos$median like you had suggested; the mean of pos$median is just a function of the times you choose to predict at.
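If you want a rough approximation from the existing output, you could interpolate each id's survival curve yourself. This is my own workaround using base R's approx(), not an rstanarm feature, so treat it as a sketch:

```r
# approximate median survival time per id: the time at which the
# posterior median survival probability crosses 0.5
# (returns NA for an id whose curve never reaches 0.5 within the
# predicted time range)
med_surv <- sapply(split(pos, pos$id), function(d) {
  approx(x = d$median, y = d$time, xout = 0.5)$y
})
```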
