First Pystan Poisson Model

After getting my first pystan model to infer just the mean (see: Help with first model in Pystan), I am now going through and adding one variable at a time.

However, after adding my first variable, I’m getting the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-69-8a82764b37a4> in <module>
----> 1 sm_fit = sm.sampling(data=data1)
      2 print(sm_fit)

~\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\pystan\model.py in sampling(self, data, pars, chains, iter, warmup, thin, seed, init, sample_file, diagnostic_file, verbose, algorithm, control, n_jobs, **kwargs)
    756                 raise ValueError("Warmup samples must be greater than 0 when adaptation is enabled (`adapt_engaged=True`)")
    757         seed = pystan.misc._check_seed(seed)
--> 758         fit = self.fit_class(data, seed)
    759 
    760         m_pars = fit._get_param_names()

stanfit4anon_model_6a60194cf24baf0e81dd4e0d08e2c3ea_5929338110356435324.pyx in stanfit4anon_model_6a60194cf24baf0e81dd4e0d08e2c3ea_5929338110356435324.StanFit4Model.__cinit__()

RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=y1; base type=int  (in 'unknown file name' at line 5)

Below is the code for my model (note: my x variable has been normalized for easier sampling):

liability_code = """
data {
    int<lower=0> N;
    vector[N] driver_age;
    int<lower=0> y1;
}
transformed data {}
parameters {            
    real a;        //parameters is what we want to infer....'a' in this case is the coefficient for driver_age
    real b;        

}
transformed parameters{
    real mu[N];
    for (i in 1:N) {
        mu[i] <- a*driver_age[i] + b;
    }
}
model {
    a ~ normal(0, 1);  
    b ~ normal(0,5);
    y1 ~ poisson(mu);   
}

sm = StanModel(model_code=liability_code)

Here is the data and fit line:

data1 = {'N':len(y), 'driver_age':x1.values.tolist(), 'y1':y.iloc[:, 0].values.tolist()}
sm_fit = sm.sampling(data=data1)

y1 is a float.

Thank you.

That’s the problem, since in a Poisson model it has to be an int (and that’s how it’s defined in the Stan model).

The error message was already giving you most of the information: Exception: int variable contained non-int values; processing stage=data initialization; variable name=y1;

You don’t need to transform numpy array to list.

PyStan (2) works with numpy arrays / list
PyStan (3; pystan-next) works any data that can be transformed to numpy array.

.values creates a numpy array

So:

'y1':y.iloc[:, 0].values.astype(int)

edit. Assuming y1 contains integer values as floats)

Thank you. That error went away but now I have another error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\jordan.howell\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\jordan.howell\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "stanfit4anon_model_269d3633d188cbe5d52b63cdd659b1c8_5427897343901720503.pyx", line 373, in stanfit4anon_model_269d3633d188cbe5d52b63cdd659b1c8_5427897343901720503._call_sampler_star
  File "stanfit4anon_model_269d3633d188cbe5d52b63cdd659b1c8_5427897343901720503.pyx", line 406, in stanfit4anon_model_269d3633d188cbe5d52b63cdd659b1c8_5427897343901720503._call_sampler
RuntimeError: std::bad_alloc
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-97-8a82764b37a4> in <module>
----> 1 sm_fit = sm.sampling(data=data1)
      2 print(sm_fit)

~\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\pystan\model.py in sampling(self, data, pars, chains, iter, warmup, thin, seed, init, sample_file, diagnostic_file, verbose, algorithm, control, n_jobs, **kwargs)
    811         call_sampler_args = izip(itertools.repeat(data), args_list, itertools.repeat(pars))
    812         call_sampler_star = self.module._call_sampler_star
--> 813         ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)
    814         samples = [smpl for _, smpl in ret_and_samples]
    815 

~\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\pystan\model.py in _map_parallel(function, args, n_jobs)
     83         try:
     84             pool = multiprocessing.Pool(processes=n_jobs)
---> 85             map_result = pool.map(function, args)
     86         finally:
     87             pool.close()

~\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

~\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

RuntimeError: std::bad_alloc

I thought the dims declared was the int<lower=0> N: line in my data declaration? Full model code below for reference:

liability_code = """
data {
    int<lower=0> N;
    vector[N] driver_experience;
    int<lower=0> y1[N];
}
transformed data {}
parameters {            
    real a;        //parameters is what we want to infer....'a' in this case is the coefficient for driver_age
    real b;        

}
transformed parameters{
    real mu[N];
    for (i in 1:N) {
        mu[i] <- a*driver_experience[i] + b;
    }
}
model {
    a ~ normal(0, 1);  
    b ~ normal(0,5);
    y1 ~ poisson(mu);   
}
"""

Hi

edit. Wait is that memory error? How is your memory use?

See https://mc-stan.org/docs/2_21/functions-reference/poisson.html

Your mu should be (strickly?) positive.

To this to happen, you need to figure out, how you can force a * x + b be positive. Maybe add <lower=0> for mu, a and b, but that kind of depends on your x (are they pos or neg)?

I think I see a potential problem here:

transformed parameters{
    real mu[N];
    for (i in 1:N) {
        mu[i] <- a*driver_experience[i] + b;
    }
}

mu is a parameter. That means that it isn’t a temporary value. Rather, PyStan has to have enough memory to store not only a single instance of mu, but however many mu's are generated, i.e. the number of iterations times the number of parallel chains. Now if mu were just a single real value, 8 bytes in size, that wouldn’t be that big a deal. For 1000 iterations and 2 chains (which you had in your first model), that would be 8\ \mathrm{bytes} \times 1000 \times 2 or about 16 kilobytes – a fairly trivial amount of memory on a modern computer.

However, I remember that in your first model that you set N to about 1 million. So now, at each iteration mu is about 8 megabytes, and if you have the same number of iterations and chains as before, you now need to store about 16 gigabytes in total. That’s far less trivial.

I would recommend getting rid of your transformed parameters block and writing your model block like this:

model {
    a ~ normal(0, 1);  
    b ~ normal(0,5);
    y1 ~ poisson(a*driver_experience + b);   
}

That should use far less memory.

As long as driver_experience is positive, declaring a and b with <lower=0> should be enough to keep the argument to poisson() positive, right?

Yes, I think that should work.