I am new to Stan, and I am trying to fit a log binomial random intercept model. I was encouraged to use Stan by colleagues who suggested it was quite efficient, so I used the brms package in R to output the Stan code for a logistic random intercept model. I then updated the link to be log instead of logit (and added a line to directly calculate an additive interaction parameter of interest, the relative excess risk due to interaction). My code is below.
This code works well for the unadjusted model and when I adjust for one dichotomous covariate. However, when I try to adjust for a few more covariates, I get sampling errors. I believe this is because I'm using a log link and the fitted risks are being pushed outside the parameter space (above 1). So I would like to add explicit constraints to the code such that each mu[n] < 0 or, equivalently, exp(mu[n]) < 1. However, I have not been able to figure out how to incorporate that constraint into the Stan code.
Any suggestions would be much appreciated!
// generated with brms 1.6.1 (and then edited to change from logit to log link; and to directly calculate RERI)
functions {
}
data {
  int<lower=1> N;              // total number of observations
  int Y[N];                    // response variable
  int<lower=1> K;              // number of population-level effects
  matrix[N, K] X;              // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];         // grouping indicator per observation
  int<lower=1> N_1;            // number of grouping levels
  int<lower=1> M_1;            // number of group-level effects
  vector[N] Z_1_1;             // group-level predictor values
  int prior_only;              // should the likelihood be ignored?
}
transformed data {
  int Kc;
  matrix[N, K - 1] Xc;         // centered version of X
  vector[K - 1] means_X;       // column means of X before centering
  Kc = K - 1;                  // the intercept is removed from the design matrix
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;                // population-level effects
  real temp_Intercept;         // temporary intercept
  vector<lower=0>[M_1] sd_1;   // group-level standard deviations
  vector[N_1] z_1[M_1];        // unscaled group-level effects
}
transformed parameters {
  // group-level effects
  vector[N_1] r_1_1;
  r_1_1 = sd_1[1] * (z_1[1]);
}
model {
  vector[N] mu;
  mu = Xc * b + temp_Intercept;
  for (n in 1:N) {
    mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n];
  }
  // prior specifications
  sd_1 ~ gamma(2, 0.1);
  z_1[1] ~ normal(0, 1);
  // likelihood contribution
  if (!prior_only) {
    Y ~ bernoulli(exp(mu));
  }
}
generated quantities {
  real b_Intercept;            // population-level intercept
  real RERI;                   // relative excess risk due to interaction
  b_Intercept = temp_Intercept - dot_product(means_X, b);
  RERI = exp(b[1] + b[2] + b[6]) - exp(b[1]) - exp(b[2]) + 1;
}
What is the motivation for a log link, which is typically reserved for modeling positive but otherwise unbounded quantities, when modeling a probability constrained to [0, 1]?
Thank you for your prompt response! The motivation for using the log link is that our applications have common outcomes, and we are interested in a parameter that is a function of three relative risks; using a logit link, and thus odds ratios, would result in severe overestimation of our parameter of interest. There is a paper on a Bayesian approach to estimating a log binomial model (without random intercepts), but they use WinBUGS. I have been told that Stan is much better/faster than WinBUGS :)
Chu H, Cole SR. Estimation of risk ratios in cohort studies with common outcomes: A Bayesian approach. Epidemiology. 2010;21(6):855-862.
Quickly searching around the literature, I can't find any motivation for a log binomial model that isn't trying to make something nonlinear appear linear (relative risks) or to improve the performance of point estimators. Neither is particularly relevant to Bayesian inference.
To understand relative risks, for example, the full probability curve (with uncertainties!) can be produced in the generated quantities block by varying the input covariates, as in the sketch below. The inclusion of principled priors and the reframing of inferences in terms of entire distributions avoid the need for fragile estimators.
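As a minimal sketch of what that could look like (for a plain logistic regression; `alpha`, `b`, `N_new`, and `X_new` are illustrative names, with `X_new` a grid of covariate settings passed in as data, not anything from the model above):

// in the data block:
//   int<lower=0> N_new;        // number of covariate settings to predict at
//   matrix[N_new, K] X_new;    // the grid of covariate values
generated quantities {
  vector[N_new] p_new;          // probability at each covariate setting
  for (n in 1:N_new) {
    p_new[n] = inv_logit(alpha + X_new[n] * b);
  }
}

Each posterior draw then yields an entire curve of probabilities, so the uncertainty comes along for free.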
The computational issues encountered in Chu et al. are clear signs that their model is not appropriate to the system being analyzed. While the constraint that the exponentiated linear predictor stay below one could possibly be implemented (depending on the structure of the design matrix), I don't think it would be a wise modeling choice.
In my past work with clinicians, we occasionally used a logit link and generated probabilities based on particular combinations of covariate values. However, at least for our particular objectives, we found the log binomial model to be more useful, though not perfect (as no model is). This was all under a frequentist approach (for which there is much literature on estimating relative risks from regression), but it sounds like Stan makes it pretty straightforward to generate the full probability curve for various covariate combinations from a logistic model. I will look more into that!
(Along those lines, do you know of any references or good examples of succinctly reporting such probabilities when there are a medium-large number of covariates?)
Thank you again for your responses. I appreciate it!
The huge advantage of Stan is that you can just build the full model, let Stan figure out which model configurations (i.e., parameters) are consistent with the data, and then use those consistent model configurations in any number of ways depending on the application or decision to be made.
The log binomial model is an approximation that works only for certain regimes of covariates and responses. The violations you see (and that were seen in Chu et al.) indicate that the approximation isn't valid for the data you are considering. Trying to avoid the violations through computational means doesn't make the approximation any better; it just obfuscates the problems.
Hence my asking about the modeling goals before worrying about the computation.
The curse of enabling very general Bayesian inference is that everyone's application will be different, from the model specified to how the model is applied, including how inferences are reported. Which probabilities (or differences in probabilities, ratios of probabilities, or summaries thereof) you report depends intimately on the exact message you're trying to convey and the audience you are targeting.
For example, in practice one often ends up having to create proxy measures that summarize the Stan output to mirror the metrics that have been internalized in a field, even if those metrics are not particularly useful.
Ultimately you'll probably end up having to tailor custom reports/visualizations/etc. Sharing more about your specific communication goals and audience would make it easier to give targeted recommendations and advice; that's probably worth starting a new thread.
The Gelman and Pardoe (2007) paper on average predictive comparisons might be a starting point. It can be a bit computationally intensive, but if you do it in Stan's generated quantities block (which is compiled to C++), that is probably not the biggest issue.
In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty). In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels. We fit three models: (1) a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; (2) a hierarchical model with varying intercepts only; and (3) a nonhierarchical model that ignores the multilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models and also illustrate the relative size of varying county effects.
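A rough sketch of that idea for a logistic regression, done in generated quantities (here `alpha`, `b`, `X`, and the predictor index `k` are illustrative names; `k` would be passed as data, and the average is taken over the observed covariate values, ignoring some of the refinements in the paper):

generated quantities {
  real apc_k;    // average change in probability for a unit increase in predictor k
  {
    vector[N] p_lo;
    vector[N] p_hi;
    for (n in 1:N) {
      p_lo[n] = inv_logit(alpha + X[n] * b);
      p_hi[n] = inv_logit(alpha + X[n] * b + b[k]);
    }
    apc_k = mean(p_hi - p_lo);
  }
}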
The big problem, as @betanalpha is pointing out, is that the log isn't an adequate link function for probabilities. So what happens is that you chop out the bits of parameter space that don't lead to probabilities in (0, 1). This might not be a problem in practice: sometimes you can just do simple linear regressions for probabilities if everything's constrained enough. If it does crop up, what happens is that Stan has to do a rejection when it steps into an illegal region. Also, you have to be careful with normalization if the volume of the illegal region depends on the parameters.
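To make that concrete, here is a sketch only (reusing `mu`, `Y`, and `N` from the model above); the explicit `reject` just spells out what effectively happens anyway when `exp(mu[n])` exceeds 1:

for (n in 1:N) {
  if (mu[n] > 0) {
    // exp(mu[n]) > 1 is not a valid Bernoulli probability, so this region
    // of parameter space gets chopped off and the proposal is rejected
    reject("linear predictor ", mu[n], " implies a probability above 1");
  }
  Y[n] ~ bernoulli(exp(mu[n]));
}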
The compound declare-define syntax makes things cleaner:
int Kc = K - 1;
...
vector[N_1] r_1_1 = sd_1[1] * z_1[1];
You can turn the definition of `mu` into a one-liner by using elementwise products and multi-indexing:
vector[N] mu = Xc * b + temp_Intercept + r_1_1[J_1] .* Z_1_1;
There might be a way to remove some redundant multiplies in that `r_1_1` and `Z_1_1` term, depending on how `J_1` is structured. The point is to eliminate redundant/repetitive computation whenever possible.
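For instance, if `Z_1_1` is a vector of ones (as it is for a plain random intercept), the elementwise product drops out entirely:

vector[N] mu = Xc * b + temp_Intercept + r_1_1[J_1];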
I'll leave the two expressions in the generated quantities block up to you.
I'm not sure why you only give one entry of `z_1` a prior here:
z_1[1] ~ normal(0, 1);
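(If `M_1` can be larger than 1, the other entries need priors too; the usual pattern would be a loop over all of them:)

for (m in 1:M_1) {
  z_1[m] ~ normal(0, 1);
}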
The easiest way to deal with "prior only" conditions is to just have size zero data.
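For example, here's a minimal standalone sketch (separate from the model above): with `N = 0`, `Y` and `x` are empty, the sampling statement contributes nothing, and no flag is needed.

data {
  int<lower=0> N;                         // pass N = 0 for a prior-only run
  int<lower=0, upper=1> Y[N];             // empty when N == 0
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  Y ~ bernoulli_logit(alpha + beta * x);  // a no-op when N == 0
}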