# Scrambled order of parameters vs regression coefficients

Good evening,

Apologies for the double post (I posted this on the pystan-dev github issues page, but I am not sure if this is a bug here: Order of samples from to_frame() is scrambled (?) · Issue #357 · stan-dev/pystan · GitHub).

In a nutshell, I want to use stan instead of OLS, or other regression types. However, I would like to maintain the ability to evaluate the coefficient of the ‘stan-regression’ as these are fed in another system. What I have noticed, is that the `fit.to_frame()` method is not always returning the parameters in the same order when compared to the coefficients order from an OLS.

Take for instance the following stan code.

``````data {
int<lower=0> N;   // number of data items
int<lower=0> K;   // number of predictors

matrix[N, K] x;   // predictor matrix
vector[N] y;      // outcome vector
}
transformed data {
matrix[N, K] Q_ast;
matrix[K, K] R_ast;
matrix[K, K] R_ast_inverse;

// thin and scale the QR decomposition
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);
R_ast_inverse = inverse(R_ast);
}

parameters {
real alpha;           // intercept
vector[K] theta;      // coefficients on Q_ast
real<lower=0> sigma;  // error scale
}

model {
y ~ normal(Q_ast * theta + alpha, sigma);  // likelihood
}
generated quantities {
vector[K] beta;
beta = R_ast_inverse * theta; // coefficients on x
}
``````

and run and compare to the following:

``````# Your code here
N = 100
y= np.random.normal(100, 10, N)
f1 = y+np.random.normal(np.mean(y), np.std(y), len(y))
f2 = y+np.random.normal(1.5*np.mean(y), 6*np.std(y), len(y))
f3 = y+np.random.normal(3*np.mean(y), 2*np.std(y), len(y))
f4 = y+np.random.normal(10*np.mean(y), 0.5*np.std(y), len(y))
n0 = np.matrix([f1,f2,f3,f4]).T
obj = {
"y"      : y,
"N"      : N,
"x"      : n0,
"K"      : 4
}

posterior = stan.build(stan_code, data=obj)
fit = posterior.sample(num_chains=4, num_samples=1000)

df = fit.to_frame()
print(df.describe().T)

import statsmodels.formula.api as sm
X = pd.DataFrame(n0, columns={"f1","f2","f3","f4"})
X['y']=y
results = sm.ols(formula="y ~ f1+f2+f3+f4", data=X).fit()
results.summary()
``````

What I noticed is that the parameters `beta.1`, `beta.2`, `beta.3`, and `beta.4` are not referring to the parameters in the order I have specified them in the data input (`[f1, f2, f3, f4]`).

`OLS`

coef std err
Intercept -771.0307 56.405
f1 0.0523 0.020
f2 0.0652 0.042
f3 0.7690 0.058
f4 -0.0140 0.007

`STAN`

mean std
alpha -769.905 57.38756
beta.1 0.064788 0.042176
beta.2 -0.01387 0.006887
beta.3 0.052726 0.0202
beta.4 0.767813 0.058688

.You see that the coefficient for `beta.4` should be the coefficient for `f3`, or `beta.2` for `f4`. I am 100% certain that I am doing something horribly wrong, or even worse, I do not understand what I am doing wrong. I would appreciate any help from your end.

Also, the same behaviour exists when I use a different approach such as `normal_id_glm` (I can post the code here if you want). I also use this approach to generate samples for prediction but haven’t checked if the order there is correct or not.

• Operating System = Ubuntu in Windows (Linux DESKTOP-49RJEMJ 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux)
• Python Version = Python ver: Python 3.8.10
• Compiler/Toolkit = gcc version: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
• PyStan Version = 3.5

Thank you

1 Like

Will this scrambling also happen if you access `fit["beta"]`

Hi Ari

Thank you so much for replying.

Indeed this scrambling exists when using `fit['beta']`. Of course, this returns the whole sample data, but with using something like `np.mean(fit['beta'], axis=1)` I still get the points in different order (but same as the `to_frame()` method.

Ok, can you add print calls for beta in .stan program?

Is the order always shifted or mirrored?

A gentleman in the pystan issues forum helped me. My issue had nothing to do with stan, I was (stupidly) using a set for setting up the column names in the dataframe which does not maintain the order.

The correct code should have read:

``````X = pd.DataFrame(n0, columns=["f1","f2","f3","f4"]) #use array [], and not a set {}
X['y']=y
results = sm.ols(formula="y ~ f1+f2+f3+f4", data=X).fit()
results.summary()
``````