Interpretation of stan_surv with categorical variables in rstanarm

Hello,
I am trying to compare survival between site and treatment (categorical variables) in rstanarm using stan_surv(). I am first looking to compare the hazard of death and I am not sure if my interpretation is correct but I think print() is giving me in “SiteB:New.TreatmentNE” the hazard rate with respect to the “site A and Treatment NE” but I would like to compare the hazard of death from Site A in treatment “E” to the same site A in treatment "NE ", then site B in treatment “E” to site B in treatment “NE” as I am interested in compare “hazard of death” for each site by treatment. So, my question is how I can extract those values?
I tried posterior_surv () type =“haz” but that is giving me an estimate for each point in time.

Also, is it possible with this model to get the hazard death for species “S1” in site “A” under the treatment “E”?
Here is a trial data

library(rstanarm)

data_NHN <- expand.grid(
  Site = c("A","B","C"),
  New.Treatment = c(rep("E",21),rep("NE",40)))
data_NHN$subplot_by_site = c(rep(1,63), rep(2,60), rep(3,60))
data_NHN$Status_surv = sample(0:1,183, replace = TRUE) 
data_NHN$New.Species.name = c(rep("S1",10), rep("S2",40), rep("S3",80), rep("S2",20), rep("S1",33))
data_NHN$days = sample(10, size = nrow(data_NHN), replace = TRUE)

mod1 = stan_surv(
  formula = Surv(days, Status_surv) ~  Site*New.Treatment + (1| subplot_by_site) + (Site | New.Species.name),
  data = data_NHN,
  basehaz = "weibull",
  chains = 4,
  iter = 30,
  cores = 4)

print(mod1, digits=2)
summary(mod1, digits=2)

  • Operating System: Mac OS Catalina 10.15.6
  • R version: 4.0
  • rstan version: 2.21.2
  • rstanarm Version: rstanarm_2.21.2

TIA!

Hi @nohemihuanca, I’m not sure but I’m tagging @sambrilleman and @ermeel who worked on that function and know it better than I do. Hopefully one of them can help you out here (I’m sure they’ll also be happy to know that people are using stan_surv!)

2 Likes

I assume you are interested in the hazard-ratio? One way to get it goes as follows:

mod1 %>%  
  as_tibble() %>% 
  transmute(A = New.TreatmentNE,
            B = New.TreatmentNE + `SiteB:New.TreatmentNE`,
            C = New.TreatmentNE + `SiteC:New.TreatmentNE`
         ) %>% 
  bayesplot::mcmc_intervals(prob_outer = .95)+
  coord_flip()

I can see no direct way to get this hazard ratio using the rstanarm interface (due to the interaction) - Maybe @sambrilleman knows one?

image

Please note that the model leads to sampler divergences on my computer, so you should be careful here and follow established Stan advice/workflow recommendations.

Are you interested in the hazard ratio here as well?

If you would like to compare HR’s of treatment E versus NE, one way would be to code New.Treatment as an ordered factor (with NE < E).

2 Likes

@ermeel’s answer is great. Yeah, the issue is that the contrasts you are interested in need to be formed using more than just one parameter/coefficient from the model (because of the inclusion of the interaction term). @ermeel’s answer gives you a method to add the required parameters together which in turn gives you the log hazard ratio for the contrasts/comparisons you are interested in. Note that if you want the hazard ratio (instead of the log hazard ratio) then you need to take the exponential of those estimates. So slightly adapting @ermeel’s answer to include the exp():

mod1 %>%  
    as_tibble() %>% 
    transmute(A = exp(New.TreatmentNE),
              B = exp(New.TreatmentNE + `SiteB:New.TreatmentNE`),
              C = exp(New.TreatmentNE + `SiteC:New.TreatmentNE`)
    ) %>% 
    bayesplot::mcmc_intervals(prob_outer = .95)

You should be able to use posterior_survfit() for this. Something like:

pred_data <- data.frame(
  Site = 'A',
  New.Species.name = 'S1',
  New.Treatment = 'E',
  subplot_by_site = 'New'
)

pred_haz <- posterior_survfit(
  mod1,
  newdata = pred_data,
  type = 'haz',
  times = 0
)

You get some warnings about the New.Species.name not being a factor, but I wouldn’t worry about them. It is an issue somewhere in the handling of character variables when building the prediction matrices I think. I probably need to fix it but I don’t think it is an issue for the predictions.

The above predictions used subplot_by_site = 'New' in the prediction data – which means a subplot_by_site was used that wasn’t present in the estimation data – so the predictions will be marginalised over that grouping factor (potentially increasing the uncertainty in the predicted estimates).

The predictions will also be a function of time. This is because the hazard is a function of time in your model. If you want an estimate that isn’t a function of time, then maybe you are after something like the hazard ratio, as @ermeel suggested.

Hope that helps somewhat!

3 Likes

Thank you very much! @ermeel and @sambrilleman yes, I was looking for hazard-ratio. Please, since my goal is to contrast if there is any difference in hazard ratio at each “Site” between treatment “E” and “NE”, I think I did what you mention @ermeel and converted the New.Treatment variable to a numerical value to have it as an ordered factor (NE=1, N=2)?. here

library(rstanarm)

data_NHN <- expand.grid(
  Site = c("A","B","C"),
  New.Treatment = c(rep(2,21),rep(1,40)))#"NE=1, E=2
data_NHN$subplot_by_site = c(rep(1,63), rep(2,60), rep(3,60))
data_NHN$Status_surv = sample(0:1,183, replace = TRUE) 
data_NHN$New.Species.name = c(rep("S1",10), rep("S2",40), rep("S3",80), rep("S2",20), rep("S1",33))
data_NHN$days = sample(10, size = nrow(data_NHN), replace = TRUE)

mod1 = stan_surv(
  formula = Surv(days, Status_surv) ~ Site*New.Treatment + (1| subplot_by_site) + (Site | New.Species.name),
  data = data_NHN,
  basehaz = "weibull",
  chains = 4,
  iter = 30,
  cores = 4)

print(mod1, digits=3)
summary(mod1, digits=2)



a=mod1 %>%  
  as_tibble() %>% 
  transmute(A = New.Treatment,
            B = New.Treatment + `SiteB:New.Treatment`,
            C = New.Treatment + `SiteC:New.Treatment`
  ) %>% 
  bayesplot::mcmc_intervals(prob_outer = .95)+
  coord_flip()

a

Now my question is :

  1. After using the code you described above, I got this figure.
    image
    So, If my interpretation is correct, then that is already the ratio of how much Treatment “N” was higher to treatment “NE” by site, correct? Does this mean that in Site A, the hazard ratio in treatment “E” is 0.5-fold (or exp(-0.690)) higher than treatment “NE”?
    Similarly, B = New.Treatment + SiteB:New.Treatment" will also give me a media of -1.37 which means that treatment “E” has a 0.25 -fold increase in the hazard of death from treatment “NE” in site B. However, as the 95% CI in Site A and B overlap zero it means that treatment “NE” and “E” may not be different and the only different site will be Site C, correct?

  2. Now if I am interested in the hazard-ratio for each species at each site by treatment, then for species “S1” in treatment “N” versus “NE” I will have

b=mod1 %>%  
  as_tibble() %>% 
  transmute(A_S1 = `b[(Intercept) New.Species.name:S1]`,
            B_S1 = `b[(Intercept) New.Species.name:S1]` + `b[SiteB New.Species.name:S1]`,
            C_S1 = `b[(Intercept) New.Species.name:S1]` + `b[SiteC New.Species.name:S1]`
  ) %>% 
  bayesplot::mcmc_intervals(prob_outer = .95)+
  coord_flip()

b

Will the above code give me the appropriate contrast between species S1 in treatment “NE” versus “N” at each site? Or should I need to add, for example, for site B the following "New.Treatment + SiteB:New.Treatment"?

(pd) This trial data set will give sampler divergences but my real data does not, thank you!

TIA

Just be careful here – because you want the variable to still be categorical and the intercept (i.e. baseline hazard) to hold it’s meaning as a reference group – so use indicator variables (i.e. 0 and 1 instead of 1 and 2), or convert it to an actual factor variable before fitting your model, as in:

data_NHN <- expand.grid(
  Site = c("A","B","C"),
  New.Treatment = factor(
    c(rep("E",21),rep("NE",40)),
    levels = c("NE", "E"),
    ordered = TRUE)
)

Just be a bit careful with terminology here. It is the hazard that is 0.5 fold higher, not the hazard ratio. The hazard ratio is the relative measure of the difference in hazards. So “the hazard ratio tells us that, for site A, the estimated hazard in treatment E is 0.5-fold the estimated hazard in treatment NE”.

Again, try to be a bit careful with terminology here. It is a bit unclear what you mean by “only different site”. Site C appears to be the only site where there is evidence that the hazard in treament NE is different to treatment E. At Sites A and B there is no real evidence that the hazard differs between treatments NE and E."

If you want to know the association with treatment (E vs NE) then you need to add in each coefficient that is for a variable related to New.Treatment. But if I understand your model correctly, the effect of New.Treatment (i.e. E vs NE, or vice versa) doesn’t change as a function of New.Species.name. So if you are interested in the effect of treatment at each site – then the answer is the same as in your earlier question, because the New.Species.name does not change the hazard ratio you are interested in. In other words the hazard ratio for treament varies by site, but it does not vary by species. Hope that makes sense?

1 Like

Thank you @sambrilleman, Yes, it does make sense!