fit.extract() takes an unreasonably long time

Dear forum,

I am using pickle to load a saved model+fit and then trying to extract the data using the following code:

import pickle

# the with-block closes the file automatically; no explicit f.close() is needed
with open('model.pkl', 'rb') as f:
    data = pickle.load(f)

fit = data[1]
params = fit.extract(pars='theta', permuted=True)

The line with fit.extract() just takes too long - over two hours. The *.pkl file is ~3GB, so I understand it won't finish in 5 minutes, but over two hours is just wrong. I tried using permuted=False, but it didn't help.
I am using Python 3.6, PyStan 2.19, and PyCharm 2020.1.
Please help.

What are the dimensions of theta?

Edit: how many draws/samples, and what is the shape of theta?

It's 4 chains with 1500 iterations each (thinning = 1, warmup = 500), so the shape of the permuted array is 4000x30x2000.
I know it is not small, but extracting the first model in each Python instance takes about 10 minutes. Extracting the second model takes two hours or more, even if I close and delete the previously loaded data.
Just to clarify, "two hours or more" is not a final estimate; it's just the point at which I give up and open another Python instance to load each model manually.

Ok, that is interesting. So you can read one model and extract data, but the second model will be really slow?

How much RAM do you have?

Have you considered using ArviZ InferenceData? It is built on top of xarray datasets, which can be saved in the netCDF4 format.

import arviz as az
idata = az.from_pystan(fit)
idata.to_netcdf("my_file.nc")
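
Once the file is written, a later session could read it back directly instead of re-extracting from the fit. A small sketch, assuming the file name above and a posterior variable called "theta":

idata = az.from_netcdf("my_file.nc")    # load the saved InferenceData
theta = idata.posterior["theta"]        # xarray DataArray with (chain, draw, ...) dims
theta_np = theta.values                 # plain numpy array if needed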

Yes, the first model loads fairly well, but then it gets stuck. I have 16GB of memory, so that should be fine for this data.
I am now trying ArviZ, but judging by the time it takes to extract data from the first model's fit, it looks like it performs worse than fit.extract().

It uses basically the same algorithm as PyStan's extract does.

But once you have your data in idata, it is easier / faster to access

Yes, you are right. It works about the same for the first model, but the second model was extracted in 15 minutes! I hope this will persist in a loop. Thanks!
Btw, is there an equivalent of permuted=True when I extract from idata?

Nothing ready-made. You can probably use a numpy / xarray shuffle if needed, as sketched below. Why do you want to permute the samples?
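
A rough sketch of such a shuffle, assuming the posterior variable is called "theta" and has the usual chain/draw dimensions:

import numpy as np

theta = idata.posterior["theta"].stack(sample=("chain", "draw"))  # merge chains and draws
theta_np = theta.transpose("sample", ...).values                  # samples along the first axis
rng = np.random.default_rng()
theta_shuffled = rng.permutation(theta_np, axis=0)                # mimics permuted=True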

Can you call del oldfit; gc.collect() between the loads?

At the end I only need to extract the mean value. I cannot use the summary stats because I have NaNs in my posterior samples.
I can call all of that, but could you be more specific about gc.collect()? I have never used it before.

If there are RAM problems, then calling gc.collect() will "collect" the garbage (Python does this automatically, but sometimes it is "easier" to collect manually). A minimal sketch of how that could look in the loading loop is below.
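
A minimal sketch of the loading loop with manual collection; the file names here are just placeholders:

import gc
import pickle

for path in ['model1.pkl', 'model2.pkl']:   # hypothetical file names
    with open(path, 'rb') as f:
        data = pickle.load(f)
    fit = data[1]
    theta = fit.extract(pars='theta', permuted=True)
    # ... do something with theta ...
    del data, fit, theta                    # drop references to the large objects
    gc.collect()                            # force collection before the next load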

Do you have NaNs in your draws? Or do you have some cells with only NaN values?

np.nanmean can be used with idata.
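
For example, a small sketch, again assuming the posterior variable is called "theta":

import numpy as np

theta = idata.posterior["theta"]                     # dims: (chain, draw, ...)
theta_mean = np.nanmean(theta.values, axis=(0, 1))   # mean over chains and draws, ignoring NaNs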

Thanks! I have now incorporated garbage collection into my Python code.
I have some cells with NaNs because of how my data is structured (I have a different number of trials for different subjects, so for posterior predictive checks some trials are NaN).
np.nanmean has been a great solution so far.
Thank you so much!

It still doesn't work. It has been almost an hour and it is still loading with idata = az.from_pystan(fit).