PyStan model takes too long to fit

Operating System: macOS
Python Version: 3.6

Hello,

I am trying to fit a model using PyStan; the code is below. I have approximately 600,000 records (N), and the dimension of x is 60 (D). x is a mixture of binary, categorical, and continuous variables, and the outcome is binary (0/1). When I fit the model with all the data, it takes more than 10 hours to run just one iteration. If I run it with 60,000 records and reduce dim(x) to 10, the model fits (2000 iterations, 2 chains) in about 1 hour. Is there a way to use all 600,000 records and all 60 features in x at the same time and have the model fit in, say, 10-12 hours without increasing the hardware resources at my disposal? :) Can the following code be optimized further? I tried to get rid of the for loop but ran into errors…

data {
  int<lower=1> N;                     // number of records
  int<lower=1> D;                     // dim of x
  row_vector[D] x[N];
  int<lower=0, upper=1> outcome[N];   // Bernoulli outcome of record
}

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
}

parameters {
  real logOdds;
  vector[D] beta;
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  beta ~ normal(rep_vector(0, D), rep_vector(1, D));
  for (n in 1:N)
    outcome[n] ~ bernoulli_logit(logOdds + x[n] * beta);
}

Can you change row_vector[D] x[N]; to matrix[N, D] x; and then do:

outcome ~ bernoulli_logit(logOdds + x * beta);

?
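Something like this (untested sketch; the transformed data and parameters blocks are unchanged from your version):

data {
  int<lower=1> N;                     // number of records
  int<lower=1> D;                     // dim of x
  matrix[N, D] x;                     // was: row_vector[D] x[N];
  int<lower=0, upper=1> outcome[N];
}

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
}

parameters {
  real logOdds;
  vector[D] beta;
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  beta ~ normal(0, 1);                             // same prior as rep_vector(0, D), rep_vector(1, D)
  outcome ~ bernoulli_logit(logOdds + x * beta);   // vectorized, no loop over N
}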

I think that’ll vectorize the Bernoulli statement. But if you can do 1/10th of your data in an hour, then I’d think the reason you’re having trouble scaling up is that there are correlations and other structure in the parameters. Have you run the full 60 parameters on a small piece of the data set, maybe 1,000 or 2,000 points, and looked at the pairs plot? Maybe some of the later parameters are the ones being difficult.

Have you looked at the mass matrix from your little run? With that much data I imagine your scaling is all wrong, so pre-scaling might help you avoid a lot of adaptation.
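By pre-scaling I mean something like this (just a sketch, untested, and assuming the matrix[N, D] version of x from above): standardize the columns in transformed data so every covariate is roughly unit scale. Whether that is sensible for your binary/categorical dummies is a judgment call.

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
  matrix[N, D] x_std = x;                      // standardized copy of the design matrix
  for (d in 1:D) {
    real mu_d = mean(col(x, d));
    real sd_d = sd(col(x, d));
    for (n in 1:N)
      x_std[n, d] = (x[n, d] - mu_d) / sd_d;   // assumes no constant columns (sd_d > 0)
  }
}

Then use x_std in place of x in the likelihood, keeping in mind that beta is then on the standardized scale.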

I will try what you suggested regarding changing x to a matrix and see what happens.

I have not run the full 60 parameters on a smaller set, but I did run 60,000 records with 10 parameters (took 1 hour) and then 80,000 records with 10 parameters (took approximately 1.5 hours). That does make it seem that increasing the number of records increases the time, right? I also ran the 600,000 records with 10 parameters but never got past the first iteration; I shut it down after 12 hours.

@sakrejda: I was not able to completely follow your suggestion. If you can elaborate and guide me a bit more, that would be great.

Hm… if you can’t get past the first iteration this is probably not relevant, but Stan estimates a step size and mass matrix during warmup. If you have a huge amount of data, both will need to be adapted to very small numbers, and that will slow things down. You can now set starting points for both in the arguments (recent PR, not sure if it’s released outside of develop), which will save time, so it might be worth trying on an intermediate data size to see the effect.

Yeah, more data will take more time simply because it takes more work to evaluate the log density.

Another way for your model to slow down is if your posterior gets more complicated, which will happen as you add more and more covariates (and a parameter for each of them). It wouldn’t be surprising if you needed to reparameterize this to get it to sample; the ticket to figuring this out is to find a way to sample all your parameters and see whether any of them are being difficult.
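If you do end up needing to reparameterize, one standard candidate for correlated regression coefficients is the QR reparameterization from the Stan manual. A rough sketch (untested; it uses the qr_thin_* functions, which need a fairly recent Stan, and again assumes the matrix[N, D] form of x):

// data block as in the matrix[N, D] version above
transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
  // thin QR decomposition of the design matrix, scaled as in the Stan manual
  matrix[N, D] Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  matrix[D, D] R_ast = qr_thin_R(x) / sqrt(N - 1);
  matrix[D, D] R_ast_inverse = inverse(R_ast);
}

parameters {
  real logOdds;
  vector[D] theta;                          // coefficients on the Q_ast scale
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  theta ~ normal(0, 1);                     // note: the prior is now on theta, not directly on beta
  outcome ~ bernoulli_logit(logOdds + Q_ast * theta);   // vectorized likelihood
}

generated quantities {
  vector[D] beta = R_ast_inverse * theta;   // coefficients back on the original x scale
}

The point of the QR trick is that the columns of Q_ast are orthogonal, so theta tends to be much less correlated in the posterior than beta, which usually makes adaptation and sampling easier.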