Inference on huge data causes an encoding error in PyStan

Operating System: CentOS Linux release 7.2.1511
Python Version: Python 3.6.1 by miniconda3-4.1.1
Interface Version: PyStan 2.16.0
Compiler/Toolkit: GCC 6.2.1

Hello, stanimaniacs

Currently, I am analyzing a very large time series with Stan, using a state space model.
The model is as follows (it has changed slightly since, but it is almost the same).

data {
  int I ;
  int D[I] ;
}
parameters {
  real flex0 ;
  real<lower=0, upper=I> O ;
  vector<lower=-pi()/2, upper=pi()/2>[I-1] flex_raw ;
  real<lower=0> sigma_flex ;
}
transformed parameters {
  vector[I] lambda ;
  vector[I] flex ;
  vector[I] trend ;
  flex[1] = flex0 ;
  for(i in 2:I){
    flex[i] = flex[i-1] + sigma_flex * tan(flex_raw[i-1]) ;
  }
  for(i in 1:I){
    trend[i] = 4.0 / I * fabs(fabs(i - O / 2.0 / pi() * I) - I / 2.0) ;
  }
  lambda = exp(flex + trend) ;
}
model {
  D ~ poisson(lambda) ;
}
generated quantities {
  vector[I] log_lik ;
  for(i in 1:I){
    log_lik[i] = poisson_lpmf(D[i] | lambda[i]) ;
  }
}

This model converges reasonably well, but if the value of I is too large, it causes the following error. I confirmed that the error occurs when I is more than 50000, but not when I is 10000. I run 3 chains with 3000 iterations each, of which 1000 are discarded as warmup; all other settings are the defaults.

multiprocessing.pool.MaybeEncodingError: Error sending result: '[(0, <stanfit4anon_model_411a89ab6b222a0cdc0bd46bcfb83dbf_5214787532780245793.PyStanHolder object at 0x2b5b565fc7b8>)]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'

It seems that an overflow occurs when the C++ results are converted to Python. Perhaps the number of data points (and parameters) is too large.

Is there any way to solve this? For now, I am compressing the data by taking median values. However, if possible, I would like to know how much worse the result is compared to using all the data.
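For reference, the 'i' in the error message is the `struct` format code for a signed 32-bit integer, and the two bounds in the message are exactly its range. The sketch below reproduces that kind of failure without PyStan. The helper name `fits_in_int32_header` is made up for illustration, and the idea that multiprocessing packs the byte length of each result this way in the Python version used here is an assumption, not something confirmed in this thread.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value the 'i' (signed 32-bit) format can hold


def fits_in_int32_header(n_bytes):
    """Return True if a payload length can be packed with the 'i' format.

    Hypothetical helper: it only mimics a length header written as a
    signed 32-bit integer, which is an assumption about the transport.
    """
    try:
        struct.pack("!i", n_bytes)
    except struct.error:
        # Raised when the number is outside -2147483648..2147483647.
        return False
    return True


if __name__ == "__main__":
    print(fits_in_int32_header(INT32_MAX))      # length just within range
    print(fits_in_int32_header(INT32_MAX + 1))  # one byte too many
```

If that assumption holds, the limit is on the size in bytes of what one worker process sends back, not on the value of I itself.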


The easiest way to solve this currently is to use CmdStan and paste the resulting .csv files.

Hi,

Yes, this is a bug in multiprocessing. It presumably uses pickle to serialize the objects, and apparently it uses the default protocol (=3). That fails if the object is over 4 GB. I will look into whether it is easy to change the protocol from our side. There is a PR open for multiprocessing to fix this, but it's from 2015 and I'm not sure when it will be merged.

Right now you can run your models without multiprocessing (n_jobs=1).

Edit: This probably would not fail with threading. @ariddell, was there something specific that prevents us from using threading? (Was the C++ not thread-safe?)
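To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch for the model posted above. It assumes every saved scalar is an 8-byte double, that each chain keeps the 2000 post-warmup draws, and that a chain's draws travel back to the parent process in one piece. These are all assumptions, and the real object will carry some overhead on top. The function names are made up for illustration.

```python
BYTES_PER_DOUBLE = 8      # assumption: every saved value is a 64-bit float
INT32_MAX = 2**31 - 1     # upper bound quoted in the error message


def values_per_draw(I):
    """Count the scalars saved per draw for the model in this thread."""
    parameters = 1 + 1 + (I - 1) + 1   # flex0, O, flex_raw, sigma_flex
    transformed = 3 * I                # lambda, flex, trend
    generated = I                      # log_lik
    return parameters + transformed + generated


def bytes_per_chain(I, n_draws):
    """Rough size of one chain's draws, ignoring any container overhead."""
    return values_per_draw(I) * n_draws * BYTES_PER_DOUBLE


if __name__ == "__main__":
    for I in (10000, 50000):
        size = bytes_per_chain(I, n_draws=2000)
        print(I, size, "exceeds limit" if size > INT32_MAX else "within limit")
```

Under these assumptions a chain comes to roughly 0.8 GB for I = 10000 and roughly 4.0 GB for I = 50000, which would be consistent with the report that 10000 works and 50000 fails. Most of that volume comes from lambda, flex, trend and log_lik, so saving fewer quantities is one way to shrink the result.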


@sakrejda and @ahartikainen

Thanks for your replies.
For the moment I will try using CmdStan.
I hope that the problems related to parallel computation will be solved!

Hello,
I have the same issue. Is there a simpler way to solve this multiprocessing problem nowadays than using CmdStan?
Thank you in advance.

There is probably no such issue with PyStan 3, but I am not sure. Using CmdStanPy is another option.

Thank you, will try!