PyStan model takes too long to fit

Operating System: macOS
Python Version: 3.6

Hello,

I am trying to fit a model using PyStan; the code is below. I have approximately 600,000 records (N), and the dimension of x is 60 (D). x is a mixture of binary, categorical, and continuous variables, and the outcome is binary (0/1). When I fit the model with all the data, it takes more than 10 hours to run just one iteration. If I run it with 60,000 records and reduce dim(x) to 10, the model fits (2000 iterations, 2 chains) in about 1 hour. Is there a way to use all 600,000 records and all 60 features in x at the same time and have the model fit in, say, 10-12 hours without increasing the hardware resources at my disposal? :) Can the following code be optimized further? I tried to get rid of the for loop but ran into errors…

data {
  int<lower=1> N;                     // number of records
  int<lower=1> D;                     // dim of x
  row_vector[D] x[N];
  int<lower=0, upper=1> outcome[N];   // Bernoulli outcome of record
}

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
}

parameters {
  real logOdds;
  vector[D] beta;
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  beta ~ normal(rep_vector(0, D), rep_vector(1, D));
  for (n in 1:N)
    outcome[n] ~ bernoulli_logit(logOdds + x[n] * beta);
}

Can you change row_vector[D] x[N]; to matrix[N, D] x; and then do:

outcome ~ bernoulli_logit(logOdds + x * beta);

?
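Something like this (untested sketch; the transformed data and parameters blocks are unchanged from your version):

data {
  int<lower=1> N;                     // number of records
  int<lower=1> D;                     // dim of x
  matrix[N, D] x;                     // was: row_vector[D] x[N];
  int<lower=0, upper=1> outcome[N];
}

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
}

parameters {
  real logOdds;
  vector[D] beta;
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  beta ~ normal(0, 1);                             // same prior as rep_vector(0, D), rep_vector(1, D)
  outcome ~ bernoulli_logit(logOdds + x * beta);   // vectorized, no loop over N
}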

I think that’ll vectorize the Bernoulli statement. But if you can do 1/10th of your data in an hour, then I’d think the reason you’re having trouble scaling up is that there are correlations and other structure in the parameters. Have you run the full 60 parameters on a small piece of the data set, maybe 1,000 or 2,000 points, and looked at the pairs plot? Maybe some of the later parameters are the ones being difficult.

Have you looked at the mass matrix from your little run? With that much data I imagine your scaling is all wrong, so pre-scaling might help you avoid a lot of adaptation.
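By pre-scaling I mean something like this (just a sketch, untested, and assuming the matrix[N, D] version of x from above): standardize the columns in transformed data so every covariate is roughly unit scale. Whether that is sensible for your binary/categorical dummies is a judgment call.

transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
  matrix[N, D] x_std = x;                      // standardized copy of the design matrix
  for (d in 1:D) {
    real mu_d = mean(col(x, d));
    real sd_d = sd(col(x, d));
    for (n in 1:N)
      x_std[n, d] = (x[n, d] - mu_d) / sd_d;   // assumes no constant columns (sd_d > 0)
  }
}

Then use x_std in place of x in the likelihood, keeping in mind that beta is then on the standardized scale.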

I will try what you suggested regarding changing x to a matrix and see what happens.

I have not run the full 60 parameters on a smaller set, but I did run 60,000 records with 10 parameters (took 1 hour) and then 80,000 records with 10 parameters (took approximately 1.5 hours). That does make it seem that increasing the number of records increases the time, right? I also ran the 600,000 records with 10 parameters but never got past the first iteration; I shut it down after 12 hours.

@sakrejda: I was not able to completely follow your suggestion. If you can elaborate and guide me a bit more, that would be great.

Hm… if you can’t get past the first iteration this is probably not relevant, but Stan estimates a step size and mass matrix during warmup. If you have a huge amount of data, both will need to be adapted to very small numbers, and that will slow things down. You can now set starting points for both in the arguments (recent PR, not sure if it’s released outside of develop), which will save time, so it might be worth trying on an intermediate data size to see the effect.

Yeah, more data will take more time simply because it takes more work to evaluate the log density.

Another way for your model to slow down is if your posterior gets more complicated, which will happen as you add more and more covariates (and a parameter for each of them). It wouldn’t be surprising if you needed to reparameterize this to get it to sample; the ticket to figuring this out is to find a way to sample all your parameters and see whether any of them are being difficult.
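If you do end up needing to reparameterize, one standard candidate for correlated regression coefficients is the QR reparameterization from the Stan manual. A rough sketch (untested; it uses the qr_thin_* functions, which need a fairly recent Stan, and again assumes the matrix[N, D] form of x):

// data block as in the matrix[N, D] version above
transformed data {
  real N_real = N;
  real avg = sum(outcome) / N_real;
  real logOddsAvg = log(avg / (1 - avg));
  // thin QR decomposition of the design matrix, scaled as in the Stan manual
  matrix[N, D] Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  matrix[D, D] R_ast = qr_thin_R(x) / sqrt(N - 1);
  matrix[D, D] R_ast_inverse = inverse(R_ast);
}

parameters {
  real logOdds;
  vector[D] theta;                          // coefficients on the Q_ast scale
}

model {
  logOdds ~ normal(logOddsAvg, 0.1);
  theta ~ normal(0, 1);                     // note: the prior is now on theta, not directly on beta
  outcome ~ bernoulli_logit(logOdds + Q_ast * theta);   // vectorized likelihood
}

generated quantities {
  vector[D] beta = R_ast_inverse * theta;   // coefficients back on the original x scale
}

The point of the QR trick is that the columns of Q_ast are orthogonal, so theta tends to be much less correlated in the posterior than beta, which usually makes adaptation and sampling easier.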