Chain not finishing for linear regression

I’m trying to perform linear regression using some synthetic data that I generated. When I try to fit the model, one chain typically fails to run (see output below) though once in a while all four chains finish. The model, python code, and synthetic data can be download as a zip at this link. This seems like a simple example so I’d like to understand why it’s failing.

My model is

data {
  int<lower=0> N;
  vector[N] p1;
  vector[N] sex;
}
parameters {
  real alpha;
  real beta; 
  real<lower=0> sigma;
}
model {
  p1 ~ normal(alpha + beta * sex, sigma);
}

and my python code is

import pandas as pd
import pystan

data = pd.read_table('example.tsv', index_col=0)

data.ix[data.sex == 1, 'p1'] += 1
data['p1'] += 2

p1_dat = {'N': data.shape[0],
          'p1': data.p1.values,
          'sex': data.sex.values,
         }

fit = pystan.stan(file='model.stan', data=p1_dat, iter=1000, chains=4)
print('finished')

Python info:

$ python --version
Python 2.7.13 :: Anaconda 4.3.0 (x86_64)

I’m on OSX 10.12.5. I installed pystan using conda:

>>> pystan.__version__
'2.14.0.0'

Output:

$ python model.py 
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_57b35883c13d5c0484d7cd7da3f6582c NOW.
Iteration:   1 / 1000 [  0%]  (Warmup) (Chain 0)
Iteration:   1 / 1000 [  0%]  (Warmup) (Chain 1)
Iteration:   1 / 1000 [  0%]  (Warmup) (Chain 2)
Iteration:   1 / 1000 [  0%]  (Warmup) (Chain 3)
Iteration: 100 / 1000 [ 10%]  (Warmup) (Chain 1)
Iteration: 100 / 1000 [ 10%]  (Warmup) (Chain 2)
Iteration: 100 / 1000 [ 10%]  (Warmup) (Chain 3)
Iteration: 200 / 1000 [ 20%]  (Warmup) (Chain 1)
Iteration: 200 / 1000 [ 20%]  (Warmup) (Chain 2)
Iteration: 200 / 1000 [ 20%]  (Warmup) (Chain 3)
Iteration: 300 / 1000 [ 30%]  (Warmup) (Chain 3)
Iteration: 300 / 1000 [ 30%]  (Warmup) (Chain 1)
Iteration: 300 / 1000 [ 30%]  (Warmup) (Chain 2)
Iteration: 400 / 1000 [ 40%]  (Warmup) (Chain 3)
Iteration: 400 / 1000 [ 40%]  (Warmup) (Chain 2)
Iteration: 400 / 1000 [ 40%]  (Warmup) (Chain 1)
Iteration: 500 / 1000 [ 50%]  (Warmup) (Chain 2)
Iteration: 501 / 1000 [ 50%]  (Sampling) (Chain 2)
Iteration: 500 / 1000 [ 50%]  (Warmup) (Chain 1)
Iteration: 500 / 1000 [ 50%]  (Warmup) (Chain 3)
Iteration: 501 / 1000 [ 50%]  (Sampling) (Chain 3)
Iteration: 501 / 1000 [ 50%]  (Sampling) (Chain 1)
Iteration: 600 / 1000 [ 60%]  (Sampling) (Chain 1)
Iteration: 600 / 1000 [ 60%]  (Sampling) (Chain 2)
Iteration: 600 / 1000 [ 60%]  (Sampling) (Chain 3)
Iteration: 700 / 1000 [ 70%]  (Sampling) (Chain 1)
Iteration: 700 / 1000 [ 70%]  (Sampling) (Chain 2)
Iteration: 700 / 1000 [ 70%]  (Sampling) (Chain 3)
Iteration: 800 / 1000 [ 80%]  (Sampling) (Chain 1)
Iteration: 800 / 1000 [ 80%]  (Sampling) (Chain 2)
Iteration: 800 / 1000 [ 80%]  (Sampling) (Chain 3)
Iteration: 900 / 1000 [ 90%]  (Sampling) (Chain 1)
Iteration: 900 / 1000 [ 90%]  (Sampling) (Chain 2)
Iteration: 900 / 1000 [ 90%]  (Sampling) (Chain 3)
Iteration: 1000 / 1000 [100%]  (Sampling) (Chain 2)
# 
#  Elapsed Time: 0.282239 seconds (Warm-up)
#                0.277842 seconds (Sampling)
#                0.560081 seconds (Total)
# 
Iteration: 1000 / 1000 [100%]  (Sampling) (Chain 1)
# 
#  Elapsed Time: 0.28384 seconds (Warm-up)
#                0.284098 seconds (Sampling)
#                0.567938 seconds (Total)
# 
Iteration: 1000 / 1000 [100%]  (Sampling) (Chain 3)
# 
#  Elapsed Time: 0.281971 seconds (Warm-up)
#                0.317535 seconds (Sampling)
#                0.599506 seconds (Total)

Hi, I was unable to replicate the error.

I created a new conda-env and then executed the code ~20 times.

conda create -n stan27 python=2.7 numpy cython pandas matplotlib pystan
source activate stan27
for i in {1..20}; do    python model.py; done

Your data has 3 unique points and all the parameters have uniform prior so it could be a problem with initial values.

Thanks, I apologize I posted the wrong data file. I’ve fixed it now: https://www.dropbox.com/s/csw3lt4isybb734/linreg.zip?dl=0

When I try running the model on my home computer, it seems to finish but I also get this warning that doesn’t show up on my work machine:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception thrown at line 12: normal_log: Scale parameter is 0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

I’m using

$ python --version
Python 2.7.13

and

>>> pystan.__version__
'2.14.0.0'

So the problem seems to be specific to my work machine perhaps.

As @ahartikainen points out, you are probably going to need some prior information to control your regression with three data points and three parameters.

Sorry @Bob_Carpenter, I think I’m misunderstanding something. The example.tsv file has 2,000 data points (rows) right? Here’s a description of what the data looks before I try to fit the model:

In [1]: import pandas as pd

In [2]: data = pd.read_table('example.tsv', index_col=0)

In [3]: data.ix[data.sex == 1, 'p1'] += 1

In [4]: data['p1'] += 2

In [5]: data['p1'].describe()
Out[5]: 
count    2000.000000
mean        2.739000
std         1.076905
min        -1.509304
25%         1.974494
50%         2.776316
75%         3.470591
max         5.873075
Name: p1, dtype: float64

I tried adding interval constraints but still generally see at least one chain fail to finish.

data {
  int<lower=0> N;
  vector[N] p1;
  vector[N] sex;
}
parameters {
  real<lower=-5, upper=5>alpha;
  real<lower=-5, upper=5> beta; 
  real<lower=0> sigma;
}
model {
  p1 ~ normal(alpha + beta * sex, sigma);
}

This sounds like it may be a problem with some aspect of PyStan. This shouldn’t be a problem for Stan, as you can see by the 0.5 second run times. I’m guessing there’s something wrong with your installation, memory, or something else. You may need to start another thread with “PyStan” in the title or open an issue on the PyStan issue tracker to get the PyStan developers’ attention.

P.S. We fit models like this all the time, even ones with improper priors like you’ve provided, though we’d recommend only bounding sigma below at zero and putting proper priors on all the parameters.

Thanks @Bob_Carpenter, I agree that this looks like a pystan error. I’ll open a pystan thread to investigate. Thanks for the help.