I have a dataset for which I would essentially like to run the same Stan model on each column. Rather than using one core per chain, I'd like to use one core per column. I was able to get this working with PyStan 2.19, but I'm not sure how to do so (if it's even possible) with the beta version. I recognize that things are still in development, so apologies if this is not supported at the moment.
I've simplified the script to focus on the issue at hand, but please feel free to ask for more information/data/code.
import dask
from dask.distributed import Client
import numpy as np
import pandas as pd
import stan


def main():
    """
    Format of the data table:
    ============================================
               OTU9  OTU15   OTU20  OTU41  OTU47
    11835.11  103.0   89.0  1271.0   39.0   64.0
    11835.12  895.0  616.0    66.0   29.0   47.0
    11835.13    0.0    0.0    14.0  314.0  140.0
    11835.14   27.0   30.0     2.0  103.0   50.0
    11835.15    0.0    0.0    36.0   68.0   55.0
    """
    """
    Format of the Stan code:
    ============================================
    data {
        int<lower=0> N;
        int y[N];
        ( ... )
    }
    parameters { ... }
    model {
        y ~ ( ... )
    }
    """
    dat = { ... }

    @dask.delayed
    def fit_single_column(values):
        dat["y"] = values.astype(int)  # update dat each iteration
        sm = stan.build(stancode, data=dat, random_seed=42)
        fit = sm.sample(num_chains=1, num_samples=100)
        return fit

    fits = []
    for col in tbl.columns:
        values = tbl[col].values.astype(int)
        fits.append(fit_single_column(values))
    fits = dask.compute(*fits)


if __name__ == "__main__":
    client = Client(n_workers=4)  # run 4 columns at a time
    main()
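One thing worth noting in the script above: `fit_single_column` mutates the shared `dat` dict from the enclosing scope. Whether or not that is related to the httpstan error, building a fresh data dict inside each task removes any cross-task interaction. Below is a minimal, runnable sketch of that pattern; the thread pool, the stub in place of `stan.build`/`sample`, and the toy column data are all placeholders of mine, not from the actual script:

```python
from concurrent.futures import ThreadPoolExecutor


def fit_single_column(stancode, values):
    # Build a fresh data dict per task instead of mutating a shared one.
    dat = {"N": len(values), "y": [int(v) for v in values]}
    # Stub: a real task would instead do something like
    #   sm = stan.build(stancode, data=dat, random_seed=42)
    #   return sm.sample(num_chains=1, num_samples=100)
    return dat


# Toy stand-ins for `stancode` and the real table's columns.
stancode = "data { ... }"
columns = {"OTU9": [103.0, 895.0, 0.0], "OTU15": [89.0, 616.0, 0.0]}

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {name: pool.submit(fit_single_column, stancode, vals)
               for name, vals in columns.items()}
    fits = {name: fut.result() for name, fut in futures.items()}
```

With real Stan fits you would swap the stub back to `stan.build(...)`/`sample(...)` and use processes (or Dask workers, as in the script above) rather than threads, since sampling is CPU-bound.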
Output here: output.txt (22.9 KB)
It looks like it's an issue with caching in httpstan, but I'm having trouble diagnosing it further.
Runtime details
- macOS Big Sur
- 2.3 GHz Dual-Core Intel Core i5
- 8 GB RAM
- PyStan version 3.0.0b7