Making a multiple regression model more efficient

I have the following Stan model that I would like to apply to cases where N is of the order of 10^5, ND is at most 10, and NF is around 20. The following options come to mind for improving efficiency in terms of computational resources per effective sample:

  • Optimisation of this model through different datatypes. Is there any way to get rid of the last for loop, or is it actually not a problem? Any other changes to the model that could improve it?
  • Using a QR decomposition for X (see the sketch after the model below).
  • Conditioned on X, all models (the normal as well as the logistic regression models) are independent. I could equally well run ND+1 different Stan models, right? I wouldn’t lose anything? What about the run time? The benefit here is that I can parallelise further, right?

I would appreciate your suggestions and comments. Many thanks.

data {
    int<lower=0> N;     // number of individuals
    int<lower=0> NF;   // number of covariates (e.g. age and PC's)
    int<lower=0> ND;  // number of diseases
    vector<lower=0>[N] biomarker; // biomarker 
    int<lower=0, upper=1> has_disease[ND,N];  // disease indicators, one row per disease
    matrix[N, NF] X;   // design matrix
}
/******************************************************************************/
parameters {
    real<lower=0> sigma_biomarker;
    real alpha_biomarker;
    vector[ND] alpha_disease;
    vector[NF] beta_biomarker;
    matrix[ND,NF] beta_disease;
}
/******************************************************************************/
model {
    sigma_biomarker ~ cauchy(0,1);                         
    alpha_biomarker ~ normal(80,20);                  
    beta_biomarker ~ normal(0,5);

    alpha_disease ~ normal(0,5);
    to_vector(beta_disease) ~ normal(0,5);
    // we do not add a T[0,] here, since the prior for alpha_biomarker effectively constrains it to be positive
    biomarker ~ normal(alpha_biomarker + X * beta_biomarker, sigma_biomarker);
    for (i in 1:ND)
        has_disease[i] ~ bernoulli_logit(alpha_disease[i] + X * (beta_disease[i])');
}
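For the QR option (second bullet above), this is roughly what I have in mind, following the QR reparameterization from the Stan User's Guide and applied only to the biomarker part. Q_ast, R_ast, and theta_biomarker are placeholder names; qr_thin_Q/qr_thin_R require Stan 2.18+ (older versions use qr_Q/qr_R with slicing):

transformed data {
  // thin QR decomposition of X, scaled as in the Stan User's Guide
  matrix[N, NF] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[NF, NF] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[NF, NF] R_ast_inverse = inverse(R_ast);
}
parameters {
  vector[NF] theta_biomarker;  // coefficients on the Q_ast scale
  // ... remaining parameters as before, minus beta_biomarker
}
model {
  // a prior would now be placed on theta_biomarker instead of beta_biomarker
  biomarker ~ normal(alpha_biomarker + Q_ast * theta_biomarker, sigma_biomarker);
}
generated quantities {
  // recover the coefficients on the original scale of X
  vector[NF] beta_biomarker = R_ast_inverse * theta_biomarker;
}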

I know that the efficiency also depends on the data (that is biomarker, has_disease and X), but are there any “algorithmic” tricks that would help in general?

One option could be to center the columns of the design matrix and then re-adjust the intercept afterwards. You can see how this works using the make_stancode() function of the brms R package. For instance

library(brms)
make_stancode(y ~ x, data = data.frame(y = rnorm(10), x = rnorm(10)))


Thanks, I will try that. What about introducing dependencies between the logistic regressions, i.e. putting a multivariate prior over the regression coefficients? Would that cause a serious bottleneck when scaling to N = 10^5?

Also, I just saw that I could have used brm (as described in the brms_multivariate vignette) for the above.

I think a multivariate normal prior will likely increase computation time even if set up via the non-centered parameterization. Also, for 10^5 observations, the prior will likely not make much of a difference unless set very informatively.
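For reference, the non-centered parameterization of such a prior would look roughly like the following sketch (z, L_Omega, and tau are placeholder names; this version correlates the coefficients of the ND diseases for each covariate):

parameters {
  matrix[ND, NF] z;                  // standard normal innovations
  cholesky_factor_corr[ND] L_Omega;  // correlation across diseases
  vector<lower=0>[ND] tau;           // per-disease prior scales
}
transformed parameters {
  // column j holds the correlated coefficients of covariate j across diseases
  matrix[ND, NF] beta_disease = diag_pre_multiply(tau, L_Omega) * z;
}
model {
  to_vector(z) ~ normal(0, 1);       // implies each column ~ MVN(0, Sigma)
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ normal(0, 5);
}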

And indeed, I also think you can set up the model via brms directly if you want to.

Thanks.

So if not done via the prior, how would I account for dependencies between different disease outcomes (Bernoulli outcomes in has_disease) in my model on that scale?

Where is the dependency coming from?

There are scientific reasons to assume that, say, having disease A is not independent of having disease B for a given individual. But I guess this is the point of conditioning on the data?!

Out of curiosity, why do you leave one column out when centering? Does brms create a column for the intercept (which is anyway not used in this case, since a separate intercept parameter exists)?

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, Kc] Xc;  // centered version of X 
  vector[Kc] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 

I remove the intercept column and add it back again outside of the design matrix as a single parameter. If you specify a model without an intercept in brms, no centering will happen. To see this, replace y ~ x with y ~ 0 + x in my example above.
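The intercept is then re-adjusted in generated quantities, roughly as follows (I am assuming here that the generated parameters block names the centered-scale intercept temp_Intercept and the coefficients b):

generated quantities {
  // intercept on the original, uncentered scale of X
  real b_Intercept = temp_Intercept - dot_product(means_X, b);
}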

How are you controlling for possible endogeneity (due to confounding, measurement error, etc.)?

To be honest, I haven’t considered this yet. Any suggestions for how to approach this (or literature pointers) would be appreciated.

If I understand your model(s) correctly (and forgive any ignorance on my part), disease presence is a function of a biomarker plus covariates. I would imagine that there may be some measurement error and omitted confounders (I would be hard-pressed to imagine a situation where there would not be, save perhaps for an RCT). The statistical consequence of this is correlations between predictors and the error term(s) of the model, which leads to inconsistent estimates. Instrumental variables are the classical way to deal with this issue, but finding relevant and valid instruments is nigh impossible (unless, for example, you have measures on random genetic variation/alleles). The difficulties with locating good instruments have led to a variety of other solutions, the most recent of which is just modeling the correlations between the predictors and the error term using copula functions (in which case the requisite assumptions are milder). I am pretty sure that one can implement copulas fairly handily in the Stan environment…? :-)
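For instance, the log density of the bivariate Gaussian copula is straightforward to write as a user-defined Stan function. A sketch (gauss_copula_lpdf is a made-up name, and the marginal models still have to be specified separately; use via target += gauss_copula_lpdf(u | v, rho)):

functions {
  // log density of the bivariate Gaussian copula at (u, v) in (0,1)^2,
  // with correlation parameter rho in (-1, 1)
  real gauss_copula_lpdf(real u, real v, real rho) {
    real z1 = inv_Phi(u);
    real z2 = inv_Phi(v);
    real r2 = square(rho);
    return -0.5 * log1m(r2)
           + (2 * rho * z1 * z2 - r2 * (square(z1) + square(z2)))
             / (2 * (1 - r2));
  }
}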

Park, S., & Gupta, S. (2012). Handling Endogenous Regressors by Joint Estimation Using Copulas. Marketing Science, 31(4), 567-586.

Tran, K.C., & Tsionas, E.G. (2015). Endogeneity in Stochastic Frontier Models: Copula Approach without External Instruments. Economics Letters, 133, 85-88.

Blauw, S.L., & Franses, Ph.H.B.F. (2016). Off the Hook: Measuring the Impact of Mobile Telephone Use on Economic Development of Households in Uganda using Copulas. Journal of Development Studies, 52(3), 315-330.

Shemyakin, A., & Kniazev, A. (2017). Introduction to Bayesian Estimation and Copula Models of Dependence. Hoboken, NJ: Wiley.

Craiu, V.R., & Sabeti, A. (2012). In mixed company: Bayesian inference for bivariate conditional copula models with discrete and continuous outcomes. Journal of Multivariate Analysis, 110, 106-120.

Atique, F., & Attoh-Okine, N. (2018). Copula parameter estimation using Bayesian inference for pipe data analysis. Canadian Journal of Civil Engineering, 45(1), 61-70.

Pitt, M., Chan, D., & Kohn, R. (2006). Efficient Bayesian Inference for Gaussian Copula Regression Models. Biometrika, 93(3), 537-554.

My two cents.

If you know the measurement error of a covariate x you can use me() terms in brms, for instance me(x, sdx) instead of x.
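Conceptually, this corresponds to a latent-variable measurement model along the following lines (a hand-written sketch of the idea, not the exact code brms generates):

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x_obs;         // noisy measurements of the covariate
  vector<lower=0>[N] sdx;  // known measurement error SDs
}
parameters {
  vector[N] x_true;        // latent true covariate values
  real mu_x;               // location of the latent covariate
  real<lower=0> sigma_x;   // scale of the latent covariate
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  x_true ~ normal(mu_x, sigma_x);            // prior on the latent values
  x_obs ~ normal(x_true, sdx);               // measurement model
  y ~ normal(alpha + beta * x_true, sigma);  // outcome regression on the true values
  // priors on the remaining parameters omitted for brevity
}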

In addition, I co-authored a recent paper that gives a fairly friendly intro/backgrounder to the endogeneity problem:

Ketokivi, M., & McIntosh, C.N. (2017). Addressing the endogeneity dilemma in operations management research: Theoretical, empirical, and pragmatic considerations. Journal of Operations Management, 52, 1-14. https://doi.org/10.1016/j.jom.2017.05.001

Thanks for the references, I will check them. I’m wondering whether someone out there in the Stanyverse has already worked with copulas?

I did so and was happy that make_stancode is also available for these multivariate models!

In this regard I realized that you avoid working with for-loops and rather “hardcode” the different outcomes. Any deeper reason behind this (e.g. optimization)?

That’s because the predictor terms of the outcomes may vary structurally, which cannot be represented in a loop.
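Schematically, the generated code follows this pattern (purely illustrative, with made-up names; not actual brms output):

model {
  // each outcome gets its own, possibly structurally different, linear predictor
  vector[N] mu_y1 = alpha_y1 + X1 * b_y1;           // e.g. linear terms only
  vector[N] mu_y2 = alpha_y2 + X2 * b_y2 + Zs * s;  // e.g. with an extra smooth term
  y1 ~ normal(mu_y1, sigma_y1);
  y2 ~ bernoulli_logit(mu_y2);
}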