fit.extract() takes an unreasonably long time

Dear forum,

I am using pickle to load a saved model+fit and then trying to extract the data using the following code:

import pickle

# the with-block closes the file automatically; no explicit f.close() is needed
with open('model.pkl', 'rb') as f:
    data = pickle.load(f)

fit = data[1]
params = fit.extract(pars='theta', permuted=True)

The line with fit.extract() just takes too long - over two hours. The *.pkl file is ~3GB, so I understand it won't finish in 5 minutes, but over two hours is just wrong. I tried using permuted=False, but it didn't help.
I am using Python 3.6, PyStan 2.19, and PyCharm 2020.1.
Please help.

What are the dimensions of theta?

Edit: how many draws/samples, and what is the shape of theta?

It's 4 chains with 1500 iterations each (thinning = 1, warmup = 500), so the shape of the permuted array is 4000x30x2000.
I know it is not small, but extracting the first model in each Python instance takes about 10 minutes. Extracting the second model takes two hours or more, even if I close and delete the previously loaded data.
Just to clarify, "two hours or more" is not a final estimate; it's just the point at which I give up and open another Python instance to load each model manually.

Ok, that is interesting. So you can read one model and extract data, but the second model will be really slow?

How much RAM do you have?

Have you considered using ArviZ InferenceData? It is built on top of xarray datasets, which can be saved in the netCDF4 format.

import arviz as az
idata = az.from_pystan(fit)
idata.to_netcdf("my_file.nc")
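
Once the file is written, a later session could read it back directly instead of re-extracting from the fit. A small sketch, assuming the file name above and a posterior variable called "theta":

idata = az.from_netcdf("my_file.nc")    # load the saved InferenceData
theta = idata.posterior["theta"]        # xarray DataArray with (chain, draw, ...) dims
theta_np = theta.values                 # plain numpy array if needed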

Yes, the first model loads fairly well, but then it gets stuck. I have 16GB of memory, so that should be fine for this data.
I am now trying ArviZ, but judging by the time it takes to extract data from the first model's fit, it looks like it performs worse than fit.extract().

It uses basically the same algorithm as PyStan's extract does.

But once you have your data in idata, it is easier / faster to access

Yes, you are right. It works about the same for the first model, but the second model was extracted in 15 minutes! I hope this will persist in a loop. Thanks!
Btw, is there an equivalent of permuted=True when I extract from idata?

Nothing ready-made. You can probably use a numpy / xarray shuffle if needed, as sketched below. Why do you want to permute the samples?
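
A rough sketch of such a shuffle, assuming the posterior variable is called "theta" and has the usual chain/draw dimensions:

import numpy as np

theta = idata.posterior["theta"].stack(sample=("chain", "draw"))  # merge chains and draws
theta_np = theta.transpose("sample", ...).values                  # samples along the first axis
rng = np.random.default_rng()
theta_shuffled = rng.permutation(theta_np, axis=0)                # mimics permuted=True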

Can you call del oldfit; gc.collect() between the loads?

At the end I only need to extract the mean value. I cannot use the summary stats because I have NaNs in my posterior samples.
I can call all of that, but could you be more specific about gc.collect()? I have never used it before.

If there are RAM problems, then calling gc.collect() will "collect" the garbage (Python does this automatically, but sometimes it is "easier" to collect manually). A minimal sketch of how that could look in the loading loop is below.
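
A minimal sketch of the loading loop with manual collection; the file names here are just placeholders:

import gc
import pickle

for path in ['model1.pkl', 'model2.pkl']:   # hypothetical file names
    with open(path, 'rb') as f:
        data = pickle.load(f)
    fit = data[1]
    theta = fit.extract(pars='theta', permuted=True)
    # ... do something with theta ...
    del data, fit, theta                    # drop references to the large objects
    gc.collect()                            # force collection before the next load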

Do you have NaNs in your draws? Or do you have some cells with only NaN values?

np.nanmean can be used with idata.
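
For example, a small sketch, again assuming the posterior variable is called "theta":

import numpy as np

theta = idata.posterior["theta"]                     # dims: (chain, draw, ...)
theta_mean = np.nanmean(theta.values, axis=(0, 1))   # mean over chains and draws, ignoring NaNs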

Thanks! I have now incorporated garbage collection into my Python code.
I have some cells with NaNs because of how my data is structured (I have a different number of trials for different subjects, so for posterior predictive checks some trials are NaN).
np.nanmean has been a great solution so far.
Thank you so much!

It still doesn't work. It has been almost an hour and it is still loading with idata = az.from_pystan(fit).