Scrambled order of parameters vs regression coefficients

Good evening,

Apologies for the double post (I posted this on the pystan-dev github issues page, but I am not sure if this is a bug here: Order of samples from to_frame() is scrambled (?) · Issue #357 · stan-dev/pystan · GitHub).

In a nutshell, I want to use stan instead of OLS, or other regression types. However, I would like to maintain the ability to evaluate the coefficient of the ‘stan-regression’ as these are fed in another system. What I have noticed, is that the fit.to_frame() method is not always returning the parameters in the same order when compared to the coefficients order from an OLS.

Take for instance the following stan code.

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
 }
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;

// thin and scale the QR decomposition
  Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  R_ast = qr_thin_R(x) / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}

parameters {
  real alpha;           // intercept
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;  // error scale
}

model {
  y ~ normal(Q_ast * theta + alpha, sigma);  // likelihood
}
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}

and run and compare to the following:

# Your code here
N = 100
y= np.random.normal(100, 10, N)
f1 = y+np.random.normal(np.mean(y), np.std(y), len(y))
f2 = y+np.random.normal(1.5*np.mean(y), 6*np.std(y), len(y))
f3 = y+np.random.normal(3*np.mean(y), 2*np.std(y), len(y))
f4 = y+np.random.normal(10*np.mean(y), 0.5*np.std(y), len(y))
n0 = np.matrix([f1,f2,f3,f4]).T
obj = {
        "y"      : y,
        "N"      : N,
        "x"      : n0,
        "K"      : 4
}

posterior = stan.build(stan_code, data=obj)
fit = posterior.sample(num_chains=4, num_samples=1000)

df = fit.to_frame()
print(df.describe().T)

import statsmodels.formula.api as sm
X = pd.DataFrame(n0, columns={"f1","f2","f3","f4"})
X['y']=y
results = sm.ols(formula="y ~ f1+f2+f3+f4", data=X).fit()
results.summary()  

What I noticed is that the parameters beta.1, beta.2, beta.3, and beta.4 are not referring to the parameters in the order I have specified them in the data input ([f1, f2, f3, f4]).

OLS

coef std err
Intercept -771.0307 56.405
f1 0.0523 0.020
f2 0.0652 0.042
f3 0.7690 0.058
f4 -0.0140 0.007

STAN

mean std
alpha -769.905 57.38756
beta.1 0.064788 0.042176
beta.2 -0.01387 0.006887
beta.3 0.052726 0.0202
beta.4 0.767813 0.058688

.You see that the coefficient for beta.4 should be the coefficient for f3, or beta.2 for f4. I am 100% certain that I am doing something horribly wrong, or even worse, I do not understand what I am doing wrong. I would appreciate any help from your end.

Also, the same behaviour exists when I use a different approach such as normal_id_glm (I can post the code here if you want). I also use this approach to generate samples for prediction but haven’t checked if the order there is correct or not.

Describe your system

  • Operating System = Ubuntu in Windows (Linux DESKTOP-49RJEMJ 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux)
  • Python Version = Python ver: Python 3.8.10
  • Compiler/Toolkit = gcc version: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  • PyStan Version = 3.5

Thank you

1 Like

Will this scrambling also happen if you access fit["beta"]

Hi Ari

Thank you so much for replying.

Indeed this scrambling exists when using fit['beta']. Of course, this returns the whole sample data, but with using something like np.mean(fit['beta'], axis=1) I still get the points in different order (but same as the to_frame() method.

Ok, can you add print calls for beta in .stan program?

Is the order always shifted or mirrored?

A gentleman in the pystan issues forum helped me. My issue had nothing to do with stan, I was (stupidly) using a set for setting up the column names in the dataframe which does not maintain the order.

The correct code should have read:

X = pd.DataFrame(n0, columns=["f1","f2","f3","f4"]) #use array [], and not a set {}
X['y']=y
results = sm.ols(formula="y ~ f1+f2+f3+f4", data=X).fit()
results.summary()