Good evening,
Apologies for the double post (I posted this on the pystan-dev github issues page, but I am not sure if this is a bug here: Order of samples from to_frame() is scrambled (?) · Issue #357 · stan-dev/pystan · GitHub).
In a nutshell, I want to use stan instead of OLS, or other regression types. However, I would like to maintain the ability to evaluate the coefficient of the ‘stan-regression’ as these are fed in another system. What I have noticed, is that the fit.to_frame() method is not always returning the parameters in the same order when compared to the coefficients order from an OLS.
Take for instance the following stan code.
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
transformed data {
matrix[N, K] Q_ast;
matrix[K, K] R_ast;
matrix[K, K] R_ast_inverse;
// thin and scale the QR decomposition
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);
R_ast_inverse = inverse(R_ast);
}
parameters {
real alpha; // intercept
vector[K] theta; // coefficients on Q_ast
real<lower=0> sigma; // error scale
}
model {
y ~ normal(Q_ast * theta + alpha, sigma); // likelihood
}
generated quantities {
vector[K] beta;
beta = R_ast_inverse * theta; // coefficients on x
}
and run and compare to the following:
# Your code here
N = 100
y= np.random.normal(100, 10, N)
f1 = y+np.random.normal(np.mean(y), np.std(y), len(y))
f2 = y+np.random.normal(1.5*np.mean(y), 6*np.std(y), len(y))
f3 = y+np.random.normal(3*np.mean(y), 2*np.std(y), len(y))
f4 = y+np.random.normal(10*np.mean(y), 0.5*np.std(y), len(y))
n0 = np.matrix([f1,f2,f3,f4]).T
obj = {
"y" : y,
"N" : N,
"x" : n0,
"K" : 4
}
posterior = stan.build(stan_code, data=obj)
fit = posterior.sample(num_chains=4, num_samples=1000)
df = fit.to_frame()
print(df.describe().T)
import statsmodels.formula.api as sm
X = pd.DataFrame(n0, columns={"f1","f2","f3","f4"})
X['y']=y
results = sm.ols(formula="y ~ f1+f2+f3+f4", data=X).fit()
results.summary()
What I noticed is that the parameters beta.1, beta.2, beta.3, and beta.4 are not referring to the parameters in the order I have specified them in the data input ([f1, f2, f3, f4]).
OLS
| coef | std err | |
|---|---|---|
| Intercept | -771.0307 | 56.405 |
| f1 | 0.0523 | 0.020 |
| f2 | 0.0652 | 0.042 |
| f3 | 0.7690 | 0.058 |
| f4 | -0.0140 | 0.007 |
STAN
| mean | std | |
|---|---|---|
| alpha | -769.905 | 57.38756 |
| beta.1 | 0.064788 | 0.042176 |
| beta.2 | -0.01387 | 0.006887 |
| beta.3 | 0.052726 | 0.0202 |
| beta.4 | 0.767813 | 0.058688 |
.You see that the coefficient for beta.4 should be the coefficient for f3, or beta.2 for f4. I am 100% certain that I am doing something horribly wrong, or even worse, I do not understand what I am doing wrong. I would appreciate any help from your end.
Also, the same behaviour exists when I use a different approach such as normal_id_glm (I can post the code here if you want). I also use this approach to generate samples for prediction but haven’t checked if the order there is correct or not.
Describe your system
- Operating System = Ubuntu in Windows (Linux DESKTOP-49RJEMJ 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux)
- Python Version = Python ver: Python 3.8.10
- Compiler/Toolkit = gcc version: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
- PyStan Version = 3.5
Thank you