How do I reproduce the fit from the conditional_effects function of the brms package when using the Stan fit myself?

When I use the brms function to fit a hurdle model with Stan and then plot the obtained fit with conditional_effects, the result is never the same as when I run the actual Stan code with the stan function myself. It seems as if brms applies some black-box transformation to the parameters that I cannot reproduce when using the stan function directly. Is this normal?

I am essentially using the following hurdle fit:

hurdlefit <- brm(brmsformula(Y ~ X, hu ~ X),
                 family="hurdle_lognormal", data = data_fitting)

And the following stan fit:

N <- length(Y)
K <- 2
X <- cbind(rep(c(1), each=N), X)
K_hu <- 2
X_hu <- X
prior_only <- 0

hurdlefit <- stan('hurdle_model.stan', data = list(N=N, Y=Y, K=K, X=X, K_hu=K_hu, X_hu=X_hu, prior_only=prior_only))

Where the file hurdle_model.stan was simply generated by doing this:

make_stancode(brmsformula(Y ~ X, hu ~ X),
              family = "hurdle_lognormal", data = data_fitting)

Now, afterwards, when I obtain the best-fit parameters from the Stan fit and try to plot the curve, it never follows the data and never looks like the curve from the brms fit. Why is this?

Can you show the code you’re running to go from the hurdlefit object to the plot that’s intended to mimic the output of conditional_effects?


Yes. I will first explain the reasoning so the code is easier to follow. I start by storing the hurdlefit object. Afterwards, I store the simulation attributes in the object summary_data. With this, I essentially take the median of every parameter’s draws across all 4 chains of simulations. I do this for all 4 parameters, which I call beta0, beta1, gamma0, and gamma1. Since the gamma parameters come from a lognormal regression, I take the exp() of them.

hurdlefit <- stan('backup.stan', data = list(N=N, Y_obs=Y_obs, K=K, X=X, K_hu=K_hu, X_hu=X_hu, prior_only=prior_only))

summary_data <- attr(x=hurdlefit, which='sim')

beta0 <- median((c(summary_data[[1]][[1]]$b_hu_Intercept, summary_data[[1]][[2]]$b_hu_Intercept, summary_data[[1]][[3]]$b_hu_Intercept, summary_data[[1]][[4]]$b_hu_Intercept)))
beta1 <- median((c(summary_data[[1]][[1]]$`b_hu[1]`, summary_data[[1]][[2]]$`b_hu[1]`, summary_data[[1]][[3]]$`b_hu[1]`, summary_data[[1]][[4]]$`b_hu[1]`)))
gamma0 <- median(exp(c(summary_data[[1]][[1]]$b_Intercept, summary_data[[1]][[2]]$b_Intercept, summary_data[[1]][[3]]$b_Intercept, summary_data[[1]][[4]]$b_Intercept)))
gamma1 <- median(exp(c(summary_data[[1]][[1]]$`b[1]`, summary_data[[1]][[2]]$`b[1]`, summary_data[[1]][[3]]$`b[1]`, summary_data[[1]][[4]]$`b[1]`)))

Afterwards (it is a little unusual, but I prefer plotting in Python), I simply save these values to a file and load them in Python, where I plot them using the theoretical equation for a lognormal hurdle model, which, if I’m not mistaken, is:

\frac{1}{1+e^{-(\beta_0+\beta_1 x)}} \cdot (\gamma_0 + \gamma_1 x)

Where the first part of the equation is the logistic portion and the second is the linear portion.

Here is the code for plotting:

# beta0, beta1, gamma0, gamma1 are loaded from the file saved out of R above
import numpy as np
import matplotlib.pyplot as plt

# logistic (inverse-logit) function
def logit(x, beta0, beta1):
    return 1/(1 + np.exp(-(beta0 + beta1*x)))

# linear function (gamma0 and gamma1 were already exponentiated in R)
def linear(x, gamma0, gamma1):
    return gamma0 + gamma1*x

# hurdle model functional form
def hurdle(x, beta0, beta1, gamma0, gamma1):
    return logit(x, beta0, beta1)*linear(x, gamma0, gamma1)

sigma_plot_values = np.linspace(1, 2.8, num=100)

fitted_line = hurdle(sigma_plot_values, beta0, beta1, gamma0, gamma1)

fig = plt.figure(figsize=(10, 8), layout='constrained')
plt.xlim(1.0, 2.8)
plt.ylim(-0.5, 11)

plt.plot(sigma_plot_values, fitted_line, color='black')
plt.show()

And for some reason, this is never the same curve as the one obtained from conditional_effects() on the brms object, even though I would expect them to match.

I suspect that the brms conditional_effects plots show the mean of the function of the parameters, whereas you are plotting a function of the mean of the parameters. You should first calculate the conditional effects for each sample of the posterior, and then take the mean of the result.

Thanks for the info, this could be the difference. I was wondering, how would you take the mean of the function? I’m not sure how you could find this computationally. Also, how would you extract the parameter values after? Because a brms fit outputs the parameter values. And my last question is: wouldn’t this change the intrinsic function?

It’s quite straightforward. Instead of using the mean of each parameter, use the full posterior (a vector of values) for each parameter to compute the (full posterior distribution of the) conditional effects. Then take the mean of that posterior distribution of conditional effects. This is what I mean by “take the mean of a function of the posterior distribution”. You can extract the posterior samples of each parameter directly from your stanfit object. See also the rstan documentation (assuming you’re using Stan, as you indicate above): Extract samples from a fitted Stan model — extract • rstan
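To make that concrete, here is a minimal sketch of the per-draw idea in R at a single predictor value. It assumes the parameter names from your brms-generated Stan code (b_hu_Intercept, b_hu, b_Intercept, b), a single predictor column, and the functional form you posted earlier (with exp() applied to the whole lognormal linear predictor rather than to each coefficient); it is only meant to illustrate the difference, not to exactly replicate conditional_effects:

library(rstan)

draws <- rstan::extract(hurdlefit)  # full posterior draws, all chains combined

x0 <- 2  # one illustrative value of the predictor

# conditional effect at x0, computed once per posterior draw
# (indexing [, 1] assumes a single predictor column, as in your model)
effect_draws <- plogis(draws$b_hu_Intercept + draws$b_hu[, 1] * x0) *
  exp(draws$b_Intercept + draws$b[, 1] * x0)

# mean of the function of the parameters ...
mean(effect_draws)

# ... versus the function of the mean of the parameters
plogis(mean(draws$b_hu_Intercept) + mean(draws$b_hu[, 1]) * x0) *
  exp(mean(draws$b_Intercept) + mean(draws$b[, 1]) * x0)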

In case it might be helpful, in this answer I showed how to reproduce what conditional_effects is doing using tidybayes::epred_draws to extract the expected values from the posterior predictive distribution and summarize them to get the median and credible intervals.

The other thing to be aware of is that for each predictor variable in the model, conditional_effects creates a data frame of values at which to calculate the conditional effect of that predictor variable. For a numeric variable, this data frame will have 100 rows of values spanning the range of that variable. For a categorical variable, the number of rows will equal the number of categories. If there are other predictor variables in the model, they will be set to their mean if numeric or the reference category if categorical.

For example, for the regression in your initial post, you can see the data frame conditional_effects creates for X by running:

ce = conditional_effects(hurdlefit)
ce[["X"]]

ce is a list of data frames, one for each of the predictor variables. Each data frame also includes the median and credible intervals for each row, since conditional_effects takes care of summarizing the posterior expectations. You would need to create a similar data frame (or data frames, for multiple predictor variables) to pass to epred_draws.
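For example, a rough sketch of that manual route (it assumes hurdlefit is the brmsfit from your first post and that data_fitting contains the predictor X; the 100-value grid mimics what conditional_effects builds):

library(brms)
library(tidybayes)
library(dplyr)

# a grid like the one conditional_effects creates: 100 values spanning the range of X
newdata <- data.frame(X = seq(min(data_fitting$X), max(data_fitting$X), length.out = 100))

# expected values of the posterior predictive for each grid row,
# summarized to a median and 95% credible interval
ce_manual <- epred_draws(hurdlefit, newdata = newdata) %>%
  group_by(X) %>%
  summarise(estimate = median(.epred),
            lower = quantile(.epred, 0.025),
            upper = quantile(.epred, 0.975))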

Thank you very much for your help, but if I understand correctly, this would translate to doing this:

testing <- extract(hurdlefit)

beta0 <- median(testing$b_hu_Intercept)
beta1 <- median(testing$b_hu)
gamma0 <- median(exp(testing$b_Intercept))
gamma1 <- median(exp(testing$b))

to obtain the best-fit values of the 4 parameters. But this yields the same result as what I was doing before:

hurdlefit <- stan('hurdle_model.stan', data = list(N=N, Y=Y, K=K, X=X, K_hu=K_hu, X_hu=X_hu, prior_only=prior_only))
    

summary_data <- attr(x=hurdlefit, which='sim')

beta0 <- median((c(summary_data[[1]][[1]]$b_hu_Intercept, summary_data[[1]][[2]]$b_hu_Intercept, summary_data[[1]][[3]]$b_hu_Intercept, summary_data[[1]][[4]]$b_hu_Intercept)))
beta1 <- median((c(summary_data[[1]][[1]]$`b_hu[1]`, summary_data[[1]][[2]]$`b_hu[1]`, summary_data[[1]][[3]]$`b_hu[1]`, summary_data[[1]][[4]]$`b_hu[1]`)))
gamma0 <- median(exp(c(summary_data[[1]][[1]]$b_Intercept, summary_data[[1]][[2]]$b_Intercept, summary_data[[1]][[3]]$b_Intercept, summary_data[[1]][[4]]$b_Intercept)))
gamma1 <- median(exp(c(summary_data[[1]][[1]]$`b[1]`, summary_data[[1]][[2]]$`b[1]`, summary_data[[1]][[3]]$`b[1]`, summary_data[[1]][[4]]$`b[1]`)))

But maybe I am misunderstanding what you meant.

Thank you very much for the detailed explanation. The only problem is that this method requires the model to be created with brm() in order to use the function:

tidybayes::epred_draws

But in my case, I cannot use brm, since I modified the Stan code directly and, as far as I know, there is no way to make brm use my exact Stan code. Is there a way around this?

If I understand your goal correctly, you’ll have to calculate the conditional effects (mu) for new data (population effects fixed at their means) in the generated quantities block, or calculate them outside of the Stan program using draws of the parameter estimates and multiplying by the means of your predictors.
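For the second option, a sketch in R could look like the following. It again assumes the parameter names from the brms-generated Stan code and the functional form posted earlier in the thread, evaluated over a grid of new X values (the grid bounds just mirror the plotting range used above):

draws <- rstan::extract(hurdlefit)   # posterior draws from the stanfit, all chains combined

x_new <- seq(1, 2.8, length.out = 100)   # grid of predictor values for the curve

# matrix of conditional effects: one row per posterior draw, one column per grid value
ce_draws <- sapply(x_new, function(x) {
  plogis(draws$b_hu_Intercept + draws$b_hu[, 1] * x) *
    exp(draws$b_Intercept + draws$b[, 1] * x)
})

# summarize over draws to get the fitted curve and a 95% interval
ce_summary <- data.frame(X = x_new,
                         estimate = apply(ce_draws, 2, mean),
                         lower = apply(ce_draws, 2, quantile, probs = 0.025),
                         upper = apply(ce_draws, 2, quantile, probs = 0.975))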

Ignore my last response; I think I understand better now.

If I am not mistaken, your method is essentially to take the vector of posterior samples for each parameter, generate a predicted Y for a range of X values (for each x entry in the X vector) using each entry of that vector of posterior samples, and then take the mean of the resulting Y values at each given x to get a smoothed best-fit curve.

The only issue I am having with understanding this method is: after obtaining this smoothed best-fit curve, how do I obtain the 4 parameter values that generate this exact curve? Would I have to refit this curve with another method (using Stan again, a Python curve fit, etc.)?

Glad it’s clearer now.

You don’t. The source of this curve is the full posterior. If you want readers of your paper to be able to reproduce it, you can simply include the full set of posterior samples of parameter values as a supplemental file.

Thanks, this makes total sense. It now seems as if the linear portion of my hurdle model underfits the data compared to the brms fit, which fits the data very well. Do you know if brms does a certain transformation behind the scenes to the posterior samples of the linear parameters?

I don’t know, but my first intuition would be to recheck whether I have calculated the linear portion correctly. For instance, assuming that you have a log link function, you should make sure that you exponentiate the linear predictor before taking the mean of posterior distribution of conditional effects. If you don’t, you are essentially plotting the geometric mean instead of the arithmetic mean of the posterior, which is closer to zero (explaining your reported pattern).
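As a quick check of that last point with your own draws (a sketch; x0 is just an illustrative predictor value and the parameter names again come from the generated Stan code):

draws <- rstan::extract(hurdlefit)
x0 <- 2

eta <- draws$b_Intercept + draws$b[, 1] * x0   # lognormal linear predictor, one value per draw

mean(exp(eta))   # arithmetic mean on the outcome scale: exponentiate each draw, then average
exp(mean(eta))   # geometric mean: average first, then exponentiate; never larger than the above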