Optimizing performance of a mixed-effects model on a large data set

Hi all,

I’m running a linear mixed-effects model on a large data set (around 1 million observations) with ten predictor variables and one response variable. The data are time series from around 80,000 individuals, and I want to include a random intercept for each individual. I’m running this on my laptop, which has 32 GB of RAM, and I get the following performance estimate when the model starts.

Chain 1: Gradient evaluation took 0.201982 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2019.82 seconds.

I have to repeat this analysis for two similar data sets, and I also need to fit a similar model with a random slope for one variable, so I would like to reduce the runtime as much as possible. By my estimate, each model would take around two days to run.
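To show where my two-day figure comes from, here is the back-of-the-envelope arithmetic based on Stan's own message (the average treedepth of 8 is a hypothetical value I plugged in, not something I have measured):

```python
# Stan's printed estimate is simply
#   time ~= grad_time * leapfrog_steps_per_transition * transitions.
grad_time = 0.201982            # seconds per gradient evaluation (from the output above)

# 1000 transitions at 10 leapfrog steps each, as in Stan's message:
print(grad_time * 10 * 1000)    # ~= 2019.8 seconds

# With the default 2000 iterations and a hypothetical average treedepth
# of 8 (2**8 - 1 = 255 leapfrog steps per transition), this becomes:
hours = grad_time * 255 * 2000 / 3600
print(round(hours, 1))          # ~= 28.6 hours per chain
```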

So what I’m actually wondering is: are these estimates reasonable for my model, and are there any optimizations that could improve the runtime? Should I subsample my data, or should I consider running my analyses on a server with more computing power?

Here is my Stan model. It is worth mentioning that I do the QR reparameterization in R before feeding the data to RStan, and I have Z-normalized my data.
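For reference, this is the scaled thin QR decomposition I mean, sketched here with NumPy instead of R for illustration; I'm assuming the sqrt(N - 1) scaling recommended in the Stan user's guide, which keeps the columns of Q_star on a scale comparable to the standardized predictors:

```python
import numpy as np

def qr_reparam(X):
    """Return Q_star, R_star, R_star_inv for an N x K predictor matrix X."""
    N = X.shape[0]
    # Thin QR: Q is N x K with orthonormal columns, R is K x K upper triangular.
    Q, R = np.linalg.qr(X, mode="reduced")
    Q_star = Q * np.sqrt(N - 1)
    R_star = R / np.sqrt(N - 1)
    R_star_inv = np.linalg.inv(R_star)
    return Q_star, R_star, R_star_inv
```

Since X @ beta == Q_star @ (R_star @ beta), the model samples theta = R_star @ beta, and beta = R_star_inv @ theta is recovered in the generated quantities block.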

data {
    // Train data
    int<lower=1> N; // Number of observations (events)
    int<lower=1> K; // Number of predictors measured at each event
    int<lower=1> Nind; // Number of individuals in the data
    matrix[N, K] Q_star; // Q_matrix
    matrix[K, K] R_star; // R_matrix
    matrix[K, K] R_star_inv;
    vector[N] Y; // Response values (that we are trying to predict)
    int<lower=1, upper=Nind> ind[N]; // individuals identifier

    // Test data
    int<lower=1> Ntest; // Number of test observations
    int<lower=1> Ntest_ind; // Number of test-set individuals
    matrix[Ntest, K] x_test; // Test data
    int<lower = 1, upper=Ntest_ind> test_ind[Ntest]; // Test individual identifier
}
parameters {
    vector[K] theta; // Coefficients on Q_star
    real<lower=0> sigmab; // Standard deviation of the individual random effects b_ind
    real<lower=0> sigmaeps; // Residual standard deviation
    real alpha;

    vector[Nind] b_ind; // Individual specific random effect
}
model {
    vector[N] yhat; 

    alpha ~ normal(0,10);
    theta ~ normal(0,10);
    sigmab ~ inv_gamma(0.01, 0.01);
    sigmaeps ~ inv_gamma(0.01, 0.01);
    b_ind ~ normal(0,sigmab);

    yhat = alpha + Q_star * theta + b_ind[ind]; // Linear predictor for each observation

    Y ~ normal(yhat, sigmaeps);   
}
generated quantities {
    vector[K] beta; // Actual coefficients on x
    vector[Ntest_ind] test_ind_b;
    vector[Ntest] y_pred;
    beta = R_star_inv * theta;

    for (i in 1:Ntest_ind) {
        test_ind_b[i] = normal_rng(0,sigmab);
    }

    // Predict using parameters and test data
    for (i in 1:Ntest) {
        y_pred[i] = normal_rng(alpha + x_test[i]*beta + test_ind_b[test_ind[i]], sigmaeps);
    }
}

Best regards,
Yrjö