Inference on huge data causes an encoding error in PyStan

TaskeHAMANO · October 22, 2017, 4:24pm

Operating System: CentOS Linux release 7.2.1511
Python Version: Python 3.6.1 by miniconda3-4.1.1
Interface Version: PyStan 2.16.0
Compiler/Toolkit: GCC 6.2.1

Hello, stanimaniacs

I am currently analyzing a very large data series with Stan, using a state space model.
The model is shown below (the actual model has changed slightly, but it is almost the same).

data {
  int I ;     // number of observations
  int D[I] ;  // observed counts
}
parameters {
  real flex0 ;  // initial state
  real<lower=0, upper=I> O ;
  vector<lower=-pi()/2, upper=pi()/2>[I-1] flex_raw ;
  real<lower=0> sigma_flex ;
}
transformed parameters{
  vector[I] lambda ;
  vector[I] flex ;
  vector[I] trend ;
  flex[1] = flex0 ;
  // tan() of a uniform angle on (-pi/2, pi/2) gives Cauchy-distributed increments
  for(i in 2:I){
    flex[i] = flex[i-1] + sigma_flex * tan(flex_raw[i-1]) ;
  }
  for(i in 1:I){
    trend[i] = 4.0 / I * fabs(fabs(i - O / 2.0 / pi() * I) - I / 2.0) ;
  }
  lambda = exp(flex + trend) ;
}
model {
  D ~ poisson(lambda) ;
}
generated quantities {
  // pointwise log-likelihood, one value per observation
  vector[I] log_lik ;
  for(i in 1:I){
    log_lik[i] = poisson_lpmf(D[i] | lambda[i]) ;
  }
}

This model converges reasonably well, but if I is too large, it fails with the error below. I confirmed that the error occurs when I is more than 50000, but not when I is 10000. I run 3 chains of 3000 iterations each, of which 1000 are discarded as warmup; all other settings are the defaults.

multiprocessing.pool.MaybeEncodingError: Error sending result: '[(0, <stanfit4anon_model_411a89ab6b222a0cdc0bd46bcfb83dbf_5214787532780245793.PyStanHolder object at 0x2b5b565fc7b8>)]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'

It seems that an overflow occurs when the C++ results are converted to Python. Perhaps the amount of data (and the number of parameters) is too large.
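The 'i' in the error message is the struct module's format code for a signed 32-bit integer, which suggests that a length above 2**31 - 1 was being packed when the result was sent back. A rough size estimate for the model above is consistent with that. The sketch below assumes 8-byte values, counts only the 2000 post-warmup draws of a single chain, and ignores all serialization overhead; the helper names are made up for illustration.

```python
import struct


def fits_in_int32_header(n_bytes):
    """Return True if a payload length can be packed with the 'i' format."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


def approx_chain_bytes(I, n_draws, bytes_per_value=8):
    """Rough size of one chain's saved draws for the model above.

    Per draw: flex0, O, sigma_flex (3 scalars), flex_raw (I - 1 values),
    lambda, flex, trend, log_lik (I values each), plus lp__.
    """
    values_per_draw = 3 + (I - 1) + 4 * I + 1
    return values_per_draw * n_draws * bytes_per_value


for I in (10_000, 50_000):
    size = approx_chain_bytes(I, n_draws=2000)
    print(I, size, fits_in_int32_header(size))
```

Under these assumptions one chain is about 0.8 GB for I = 10000 and about 4 GB for I = 50000, so only the smaller case fits under the 32-bit limit.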

Is there any way to solve this? For now, I am compressing the data by taking median values. However, if possible, I would like to know how much worse the result is compared to using all the data.
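One possible reading of "compressing by median" is to replace each run of consecutive counts by a single median value. The sketch below is only an assumption about that step, not the exact procedure used here; `block_median` and `block_size` are hypothetical names. It uses `median_low` so that every reduced value is still an integer count, as the Poisson likelihood requires.

```python
from statistics import median_low


def block_median(counts, block_size):
    """Replace each run of `block_size` counts by its low median."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [
        median_low(counts[start:start + block_size])
        for start in range(0, len(counts), block_size)
    ]


D = [3, 5, 4, 10, 12, 11, 0, 1]
D_small = block_median(D, block_size=3)
print(D_small)  # [4, 11, 0]
```

The last block may be shorter than the others, which is worth keeping in mind when interpreting the reduced series.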

sakrejda · October 22, 2017, 4:28pm

The easiest way to solve it currently is to use cmdstan and paste the resulting .CSV files.

ahartikainen · October 22, 2017, 6:29pm

Hi,

Yes, this is a bug in multiprocessing. It uses pickle to serialize the objects, apparently with the default protocol (3), which fails if the object is over 4 GB. I will look into whether it is easy to change the protocol from our side. There is a PR for multiprocessing that fixes this, but it is from 2015 and I'm not sure when it will be merged.

Right now you can run your models without multiprocessing (n_jobs=1).
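A minimal sketch of that workaround with the PyStan 2 interface, using the settings from the original post. The function name `fit_serially` and its arguments are placeholders; pystan is imported inside the function so the settings can be inspected without it installed.

```python
# 3 chains, 3000 iterations, 1000 of them warmup, as in the original post.
# n_jobs=1 runs the chains one after another in the current process,
# so no result has to be sent back through a multiprocessing pool.
SAMPLING_KWARGS = dict(chains=3, iter=3000, warmup=1000, n_jobs=1)


def fit_serially(model_code, data):
    """Compile the model and sample with PyStan 2 without a worker pool."""
    import pystan  # PyStan 2.x

    model = pystan.StanModel(model_code=model_code)
    return model.sampling(data=data, **SAMPLING_KWARGS)
```

It would be called as `fit_serially(model_code, {"I": len(D), "D": D})`. The chains then run sequentially, so expect the wall time to grow roughly with the number of chains, and the full result still has to fit in memory.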

Edit: This probably will not fail with threading. @ariddell, was there something specific that prevented us from using threading? (Was the C++ not thread-safe?)

TaskeHAMANO · October 26, 2017, 1:49am

@sakrejda and @ahartikainen

Thanks for your replies.
For the moment I will try using CmdStan.
I hope that the problems related to parallel computation will be solved!

florapython · March 25, 2022, 3:18pm

Hello
I have the same issue. Is there a simpler way to solve this multiprocessing problem nowadays than using CmdStan?
Thank you in advance.

ahartikainen · March 25, 2022, 4:31pm

This is probably not an issue with PyStan 3, but I am not sure. Using CmdStanPy is another option.
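A rough sketch of the CmdStanPy route, assuming CmdStanPy and CmdStan are installed. The function name and the file paths passed to it are placeholders. CmdStan writes each chain's draws to a CSV file, so nothing has to be pickled between processes.

```python
# 1000 warmup + 2000 sampling iterations per chain, matching the
# 3000 iterations with 1000 warmup used in the original post.
SAMPLE_KWARGS = dict(chains=3, iter_warmup=1000, iter_sampling=2000)


def fit_with_cmdstanpy(stan_file, data, output_dir):
    """Sample with CmdStanPy; draws are written to CSV files in output_dir."""
    from cmdstanpy import CmdStanModel  # needs CmdStanPy and a CmdStan install

    model = CmdStanModel(stan_file=stan_file)
    return model.sample(data=data, output_dir=output_dir, **SAMPLE_KWARGS)
```

As far as I know, recent Stan releases no longer accept the old `int D[I]` declaration or `fabs`, so the model from the first post would need `array[I] int D` and `abs` before it compiles there.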

florapython · March 25, 2022, 5:48pm

Thank you, I will try!
