Making a multiple regression model more efficient

I have the following Stan model that I would like to apply to cases where N is of the order of 10^5, ND is at most 10, and NF is around 20. The following options come to mind for improving efficiency in terms of computational resources per effective sample:

  • Optimisation of this model through different datatypes. Is there any way to get rid of the last for loop, or is it actually not a problem? Any other changes to the model that could improve it?
  • Using a QR decomposition for X (see the sketch after the model below).
  • Conditioned on X, all models (the normal as well as the logistic regression models) are independent. I could equally well run ND+1 different Stan models, right? I wouldn’t lose anything? What about the run time? The benefit here is that I can parallelise further, right?

I would appreciate your suggestions and comments. Many thanks.

data {
    int<lower=0> N;     // number of individuals
    int<lower=0> NF;   // number of covariates (e.g. age and PC's)
    int<lower=0> ND;  // number of diseases
    vector<lower=0>[N] biomarker; // biomarker 
    int<lower=0, upper=1> has_disease[ND,N];  // disease indicators, one row per disease
    matrix[N, NF] X;   // design matrix
}
/******************************************************************************/
parameters {
    real<lower=0> sigma_biomarker;
    real alpha_biomarker;
    vector[ND] alpha_disease;
    vector[NF] beta_biomarker;
    matrix[ND,NF] beta_disease;
}
/******************************************************************************/
model {
    sigma_biomarker ~ cauchy(0,1);                         
    alpha_biomarker ~ normal(80,20);                  
    beta_biomarker ~ normal(0,5);

    alpha_disease ~ normal(0,5);
    to_vector(beta_disease) ~ normal(0,5);
    // we do not add a T[0,] here, since the prior for alpha_biomarker effectively constrains it to be positive
    biomarker ~ normal(alpha_biomarker + X * beta_biomarker, sigma_biomarker);
    for (i in 1:ND)
        has_disease[i] ~ bernoulli_logit(alpha_disease[i] + X * (beta_disease[i])');
}
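For the QR option (second bullet above), this is roughly what I have in mind, following the QR reparameterization from the Stan User's Guide and applied only to the biomarker part. Q_ast, R_ast, and theta_biomarker are placeholder names; qr_thin_Q/qr_thin_R require Stan 2.18+ (older versions use qr_Q/qr_R with slicing):

transformed data {
  // thin QR decomposition of X, scaled as in the Stan User's Guide
  matrix[N, NF] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[NF, NF] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[NF, NF] R_ast_inverse = inverse(R_ast);
}
parameters {
  vector[NF] theta_biomarker;  // coefficients on the Q_ast scale
  // ... remaining parameters as before, minus beta_biomarker
}
model {
  // a prior would now be placed on theta_biomarker instead of beta_biomarker
  biomarker ~ normal(alpha_biomarker + Q_ast * theta_biomarker, sigma_biomarker);
}
generated quantities {
  // recover the coefficients on the original scale of X
  vector[NF] beta_biomarker = R_ast_inverse * theta_biomarker;
}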

I know that the efficiency also depends on the data (that is biomarker, has_disease and X), but are there any “algorithmic” tricks that would help in general?

One option could be to center the columns of the design matrix and then re-adjust the intercept afterwards. You can see how this works using the make_stancode() function of the brms R package. For instance

library(brms)
make_stancode(y ~ x, data = data.frame(y = rnorm(10), x = rnorm(10)))


Thanks, I will try that. What about introducing dependencies between the logistic regressions, i.e. putting a multivariate prior over the regression coefficients? Would that cause a serious bottleneck when scaling to N = 10^5?

Also, I just saw that I could have used brm (as described in the brms_multivariate vignette) for the above.

I think a multivariate normal prior will likely increase computation time even if set up via the non-centered parameterization. Also, for 10^5 observations, the prior will likely not make much of a difference unless set very informatively.
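For reference, the non-centered parameterization of such a prior would look roughly like the following sketch (z, L_Omega, and tau are placeholder names; this version correlates the coefficients of the ND diseases for each covariate):

parameters {
  matrix[ND, NF] z;                  // standard normal innovations
  cholesky_factor_corr[ND] L_Omega;  // correlation across diseases
  vector<lower=0>[ND] tau;           // per-disease prior scales
}
transformed parameters {
  // column j holds the correlated coefficients of covariate j across diseases
  matrix[ND, NF] beta_disease = diag_pre_multiply(tau, L_Omega) * z;
}
model {
  to_vector(z) ~ normal(0, 1);       // implies each column ~ MVN(0, Sigma)
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ normal(0, 5);
}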

And indeed, I also think you can set up the model via brms directly if you want to.

Thanks.

So if not done via the prior, how would I account for dependencies between different disease outcomes (Bernoulli outcomes in has_disease) in my model on that scale?

Where is the dependency coming from?

There are scientific reasons to assume that, say, having disease A is not independent of having disease B for a given individual. But I guess this is the point of conditioning on the data?!

Out of curiosity, why do you leave one column out when centering? Does brms create a column for the intercept (which is anyway not used in this case, since a separate intercept parameter exists)?

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, Kc] Xc;  // centered version of X 
  vector[Kc] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 

I remove the intercept column and add it back again outside of the design matrix as a single parameter. If you specify a model without an intercept in brms, no centering will happen. To see this, replace y ~ x with y ~ 0 + x in my example above.
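The intercept is then re-adjusted in generated quantities, roughly as follows (I am assuming here that the generated parameters block names the centered-scale intercept temp_Intercept and the coefficients b):

generated quantities {
  // intercept on the original, uncentered scale of X
  real b_Intercept = temp_Intercept - dot_product(means_X, b);
}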

How are you controlling for possible endogeneity (due to confounding, measurement error, etc.)?

To be honest, I haven’t considered this yet. Any suggestions for how to approach this (or literature pointers) would be appreciated.

If I understand your model(s) correctly (and forgive any ignorance on my part), disease presence is a function of a biomarker plus covariates. I would imagine that there may be some measurement error and omitted confounders (I would be hard-pressed to imagine a situation where there would not be, save perhaps for an RCT). The statistical consequence of this is correlations between predictors and the error term(s) of the model, which leads to inconsistent estimates. Instrumental variables are the classical way to deal with this issue, but finding relevant and valid instruments is nigh impossible (unless, for example, you have measures on random genetic variation/alleles). The difficulties with locating good instruments have led to a variety of other solutions, the most recent of which is just modeling the correlations between the predictors and the error term using copula functions (in which case the requisite assumptions are milder). I am pretty sure that one can implement copulas fairly handily in the Stan environment…? :-)
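For instance, the log density of the bivariate Gaussian copula is straightforward to write as a user-defined Stan function. A sketch (gauss_copula_lpdf is a made-up name, and the marginal models still have to be specified separately; use via target += gauss_copula_lpdf(u | v, rho)):

functions {
  // log density of the bivariate Gaussian copula at (u, v) in (0,1)^2,
  // with correlation parameter rho in (-1, 1)
  real gauss_copula_lpdf(real u, real v, real rho) {
    real z1 = inv_Phi(u);
    real z2 = inv_Phi(v);
    real r2 = square(rho);
    return -0.5 * log1m(r2)
           + (2 * rho * z1 * z2 - r2 * (square(z1) + square(z2)))
             / (2 * (1 - r2));
  }
}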

Park, S., & Gupta, S. (2012). Handling Endogenous Regressors by Joint Estimation Using Copulas. Marketing Science, 31(4), 567-586.

Tran, K.C., & Tsionas, E.G. (2015). Endogeneity in Stochastic Frontier Models: Copula Approach without External Instruments. Economics Letters, 133, 85-88.

Blauw, S.L., & Franses, Ph.H.B.F. (2016). Off the Hook: Measuring the Impact of Mobile Telephone Use on Economic Development of Households in Uganda using Copulas. Journal of Development Studies, 52(3), 315-330.

Shemyakin, A., & Kniazev, A. (2017). Introduction to Bayesian Estimation and Copula Models of Dependence. Hoboken, NJ: Wiley.

Craiu, V.R., & Sabeti, A. (2012). In mixed company: Bayesian inference for bivariate conditional copula models with discrete and continuous outcomes. Journal of Multivariate Analysis, 110, 106-120.

Atique, F., & Attoh-Okine, N. (2018). Copula parameter estimation using Bayesian inference for pipe data analysis. Canadian Journal of Civil Engineering, 45(1), 61-70.

Pitt, M., Chan, D., & Kohn, R. (2006). Efficient Bayesian Inference for Gaussian Copula Regression Models. Biometrika, 93(3), 537-554.

My two cents.

If you know the measurement error of a covariate x you can use me() terms in brms, for instance me(x, sdx) instead of x.
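Conceptually, this corresponds to a latent-variable measurement model along the following lines (a hand-written sketch of the idea, not the exact code brms generates):

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x_obs;         // noisy measurements of the covariate
  vector<lower=0>[N] sdx;  // known measurement error SDs
}
parameters {
  vector[N] x_true;        // latent true covariate values
  real mu_x;               // location of the latent covariate
  real<lower=0> sigma_x;   // scale of the latent covariate
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  x_true ~ normal(mu_x, sigma_x);            // prior on the latent values
  x_obs ~ normal(x_true, sdx);               // measurement model
  y ~ normal(alpha + beta * x_true, sigma);  // outcome regression on the true values
  // priors on the remaining parameters omitted for brevity
}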

In addition, I co-authored a recent paper that gives a fairly friendly intro/backgrounder to the endogeneity problem:

Ketokivi, M., & McIntosh, C.N. (2017). Addressing the endogeneity dilemma in operations management research: Theoretical, empirical, and pragmatic considerations. Journal of Operations Management, 52, 1-14. https://doi.org/10.1016/j.jom.2017.05.001

Thanks for the references, I will check them. I’m wondering whether someone out there in the Stanyverse has already worked with copulas?

I did so and was happy that make_stancode is also available for these multivariate models!

In this regard I realized that you avoid working with for-loops and rather “hardcode” the different outcomes. Any deeper reason behind this (e.g. optimization)?

That’s because the predictor terms of the outcomes may vary structurally, which cannot be represented in a loop.
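Schematically, the generated code follows this pattern (purely illustrative, with made-up names; not actual brms output):

model {
  // each outcome gets its own, possibly structurally different, linear predictor
  vector[N] mu_y1 = alpha_y1 + X1 * b_y1;           // e.g. linear terms only
  vector[N] mu_y2 = alpha_y2 + X2 * b_y2 + Zs * s;  // e.g. with an extra smooth term
  y1 ~ normal(mu_y1, sigma_y1);
  y2 ~ bernoulli_logit(mu_y2);
}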