Remove/estimate correlation of parameters (or get to know if correlation is real)

For the R I used one of the institutional clusters. I installed it using

install.packages("rstan", dependencies = TRUE)
install.packages("hBayesDM", dependencies=TRUE)


(I followed the instructions from STAN), I remember having problems with it and we’re solving it for quite some time but unfortunately, I do not remember the details.

For Pystan, I’m using Ubuntu 18.04 LTS or Win10. I have anaconda on both together with python 3.7. In Windows I mostly followed this protocol https://pystan.readthedocs.io/en/latest/windows.html#windows :

conda update conda
conda install libpython m2w64-toolchain -c msys2
conda install numpy cython -c conda-forge
pip install pystan


On Ubuntu I also don’t remember, I guess only install pystan worked. But I never had any problem before this one on any of the devices with any code.
Is it working on yours?

I’m on Ubuntu 19.10 running R 3.6.1 and RStan 2.19.3, and the model compiles fine for me. I do get an error when I attempt to sample:

Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Exception: mismatch in number dimensions declared and found in context; processing stage=data initialization; variable name=Tsubj; dims declared=(1); dims found=()  (in 'model4bc36b91a558_temp' at line 4)

failed to create the sampler; sampling not done


which is due to you attempting to subset the data down to one participant, so when the Tsubj vector is created in R it actually gets turned into a single value while Stan is expecting a vector. I vaguely recall there was a way to handle this scenario but can’t remember precisely. Can you post a data set with multiple subjects?

here

@mike-lawrence, @ahartikainen, my apologies, I have not realised I have given pickled data and did not mention it. Also, @mike-lawrence, it has been prepared for pystan so it is in a format of python dictionary. I can resave it in another way if it would help you (as I said, I’m not familiar with R so I don’t know what would be the best).

So the code works as follows:

import pickle

data_path = '/home/user/LossAversion/'

with open(os.path.join(data_path, 'data_LA_exp1_STAN_correlation.txt'), "rb") as fp:   # Unpickling


Edit:
The data in not pickled way (but still python dict) would be:
csv – data_LA_exp1_STAN_correlation_unpickle.csv (72.3 KB)
txt – data_LA_exp1_STAN_correlation_unpickle.txt (101.5 KB)

Hey!

I don’t know if you saw my message. My question would be, why there are some gamble which are -1, since these would cause trouble for bernoulli. Also instead of Phi_approx, I’d just use inv_logit.

I think I’ve got soe working code, but it runs slowly, and depending on what your answer is wrt tothe -1 issue, it could likely be sped up significantly.

Cheers!
Max

Hi Max,

thanks for the message, I got it. So I double-checked the -1 in the data. When I prepare my data, I initialise it with -1 and then replace where I have valid trials. Of course, the max trials for a subject, Tsubj' is then smaller by the number of invalid trials. So they should not cause any problems since they’re not accounted for by STAN.

I also double checked that the prospect.stan file:

works. It compiles and runs without any problems.

Why would you use inv_logit instead of Phi_approx?

And thanks again for your reply and reaching out, I very much appreciate.
J

I have looked into the console what stan prints out (haven’t thought about it before), gives me series of failures to sample (which is also happening with prospect but with decreasing rate of rejection until it starts sampling) and followed/ended by this:

Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can't start sampling from this initial value.

Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can't start sampling from this initial value.


Ah, got that now, thanks!

I reckon its a bit more stable, as the approx Phi is just a scaled inverse logit (see here Normal_lcdf underflows much faster than pnorm(log=TRUE)).

Here’s my go at it:

This model works fine for me (tested with cmdstanr):

data {
int<lower=1> N;
int<lower=1> T;
int<lower=1, upper=T> Tsubj[N];
real<lower=0> gain[N, T];
real<lower=0> loss[N, T];
real cert[N, T];
int<lower=-1, upper=1> gamble[N, T];
}
parameters {
vector[3] mu_pr;
vector<lower=0>[3] sigma;
cholesky_factor_corr[3] cors;
vector[3] rho_lambda_tau_tilde[N];
}
transformed parameters {
vector<lower=0, upper=2>[N]  rho;
vector<lower=0, upper=5>[N]  lambda;
vector<lower=0, upper=30>[N] tau;
cholesky_factor_cov[3] L_Sigma = diag_pre_multiply(sigma, cors);

for (i in 1:N) {
vector[3] logit_rho_lambda_tau_pr = inv_logit(mu_pr + L_Sigma*rho_lambda_tau_tilde[i]);
rho[i]    = logit_rho_lambda_tau_pr[1] * 2;
lambda[i] = logit_rho_lambda_tau_pr[2] * 5;
tau[i]    = logit_rho_lambda_tau_pr[3] * 30;
}
}
model {
mu_pr ~ std_normal();
sigma ~ normal(0, 0.2);
cors ~ lkj_corr_cholesky(3);

for( i in 1:N)
rho_lambda_tau_tilde[i] ~ std_normal();

for (i in 1:N) {
for (t in 1:Tsubj[i]) {
real evSafe;    // evSafe, evGamble, pGamble can be a scalar to save memory and increase speed.
real evGamble;  // they are left as arrays as an example for RL models.

// loss[i, t]=absolute amount of loss (pre-converted in R)
evSafe   = pow(cert[i, t], rho[i]);
evGamble = 0.5 * (pow(gain[i, t], rho[i]) - lambda[i] * pow(loss[i, t], rho[i]));
gamble[i, t] ~ bernoulli_logit(tau[i] * (evGamble - evSafe));
}
}
}



This version is a bit faster (better “vectorization” of the bernoulli_logit):

data {
int<lower=1> N;
int<lower=1> T;
int<lower=1, upper=T> Tsubj[N];
real<lower=0> gain[N, T];
real<lower=0> loss[N, T];
real cert[N, T];
int<lower=-1, upper=1> gamble[N, T];
}
parameters {
vector[3] mu_pr;
vector<lower=0>[3] sigma;
cholesky_factor_corr[3] cors;
vector[3] rho_lambda_tau_tilde[N];
}
transformed parameters {
vector<lower=0, upper=2>[N]  rho;
vector<lower=0, upper=5>[N]  lambda;
vector<lower=0, upper=30>[N] tau;
cholesky_factor_cov[3] L_Sigma = diag_pre_multiply(sigma, cors);

for (i in 1:N) {
vector[3] logit_rho_lambda_tau_pr = inv_logit(mu_pr + L_Sigma*rho_lambda_tau_tilde[i]);
rho[i]    = logit_rho_lambda_tau_pr[1] * 2;
lambda[i] = logit_rho_lambda_tau_pr[2] * 5;
tau[i]    = logit_rho_lambda_tau_pr[3] * 30;
}

}
model {

mu_pr ~ std_normal();
sigma ~ normal(0, 0.2);
cors ~ lkj_corr_cholesky(3);

for (i in 1:N){

vector[Tsubj[i]] evSafe;
vector[Tsubj[i]] evGamble;

for (t in 1:Tsubj[i]){
evSafe[t] = pow(cert[i, t], rho[i]);
evGamble[t] = 0.5 * (pow(gain[i, t], rho[i]) - lambda[i] * pow(loss[i, t], rho[i]));
}

head(gamble[i], Tsubj[i]) ~ bernoulli_logit(tau[i] * (evGamble - evSafe));
rho_lambda_tau_tilde[i] ~ std_normal();

}

}


You might have to fix the generated quantities block, which I have ignored here.

Results

3 chains, 600 warm-up each, 600 post-warmup iterations each

   variable    mean median     sd    mad      q5     q95   rhat ess_bulk ess_tail
<chr>      <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1 mu_pr[1]  -0.150 -0.158 0.113  0.109  -0.328   0.0412  1.00      690.     904.
2 mu_pr[2]  -1.66  -1.65  0.0716 0.0698 -1.77   -1.54    0.999     991.    1079.
3 mu_pr[3]  -3.60  -3.60  0.237  0.243  -4.00   -3.22    1.00      846.    1057.
4 sigma[1]   0.444  0.441 0.0581 0.0592  0.353   0.545   1.00     1236.    1397.
5 sigma[2]   0.221  0.218 0.0542 0.0518  0.136   0.311   1.00      906.     702.
6 sigma[3]   0.902  0.901 0.0977 0.0996  0.742   1.06    1.01     1035.    1318.
7 cors[1,1]  1      1     0      0       1       1      NA          NA       NA
8 cors[2,1]  0.283  0.298 0.177  0.177  -0.0285  0.550   1.00     1317.    1249.
9 cors[3,1] -0.703 -0.714 0.0943 0.0903 -0.834  -0.537   1.00     1177.    1317.
10 cors[1,2]  0      0     0      0       0       0      NA          NA       NA


Cheers!
Max

1 Like

Thank you very much Max! The code works.
I have added the generated parameters, although I’m not sure about a few steps:

1. what is the difference between the quantities generated now with your code and quantities which I’d generated in the last block using:
  Omega_corr = multiply_lower_tri_self_transpose(cors);
// corr_matrix[3] Omega = cors * cors';


? Or in general between what is generated in this block and sampled, eg. mu_pr and corresponding generated quantities or the covariance/correlation matrices?

1. Is there any difference between:
  Omega_corr = multiply_lower_tri_self_transpose(cors);


Again, thank you very much!
and

  corr_matrix[3] Omega = cors * cors';


?

1. Would you/Should I use the inv_logit instead of Phi_approx also in the generated quantities block?

Bellow my stan code (working):

Stan code
data {
int<lower=1> N;            // num of subjects
int<lower=1> T;            // num of trials per subject
int<lower=1, upper=T> Tsubj[N];  // maximum possible num of trials
real<lower=0> gain[N, T];
real<lower=0> loss[N, T];  // absolute loss amount
real cert[N, T];
int<lower=-1, upper=1> gamble[N, T];
}
parameters {
vector[3] mu_pr;
vector<lower=0>[3] sigma;
cholesky_factor_corr[3] cors;
vector[3] rho_lambda_tau_tilde[N];
}
transformed parameters {
vector<lower=0, upper=2>[N]  rho;
vector<lower=0, upper=5>[N]  lambda;
vector<lower=0, upper=30>[N] tau;
cholesky_factor_cov[3] L_Sigma = diag_pre_multiply(sigma, cors);

for (i in 1:N) {
vector[3] logit_rho_lambda_tau_pr = inv_logit(mu_pr + L_Sigma*rho_lambda_tau_tilde[i]);
rho[i]    = logit_rho_lambda_tau_pr[1] * 2;
lambda[i] = logit_rho_lambda_tau_pr[2] * 5;
tau[i]    = logit_rho_lambda_tau_pr[3] * 30;
}
}
model {
mu_pr ~ std_normal();
sigma ~ normal(0, 0.2);
cors ~ lkj_corr_cholesky(3);

for( i in 1:N)
rho_lambda_tau_tilde[i] ~ std_normal();

for (i in 1:N) {
for (t in 1:Tsubj[i]) {
real evSafe;    // evSafe, evGamble, pGamble can be a scalar to save memory and increase speed.
real evGamble;  // they are left as arrays as an example for RL models.

// loss[i, t]=absolute amount of loss (pre-converted in R)
evSafe   = pow(cert[i, t], rho[i]);
evGamble = 0.5 * (pow(gain[i, t], rho[i]) - lambda[i] * pow(loss[i, t], rho[i]));
gamble[i, t] ~ bernoulli_logit(tau[i] * (evGamble - evSafe));
}
}
}

generated quantities {
real<lower=0, upper=2>  mu_rho;
real<lower=0, upper=5>  mu_lambda;
real<lower=0, upper=30> mu_tau;

matrix[3,3] Omega_corr;
matrix[3,3] Sigma;

real log_lik[N];

// For posterior predictive check
real y_pred[N, T];

// Set all posterior predictions to 0 (avoids NULL values)
for (i in 1:N) {
for (t in 1:T) {
y_pred[i, t] = -1;
}
}

mu_rho    = Phi_approx(mu_pr[1]) * 2;
mu_lambda = Phi_approx(mu_pr[2]) * 5;
mu_tau    = Phi_approx(mu_pr[3]) * 30;
//vector[3] logit_rho_lambda_tau_pr = inv_logit(mu_pr + L_Sigma*rho_lambda_tau_tilde[i]);
//rho[i]    = logit_rho_lambda_tau_pr[1] * 2;
//lambda[i] = logit_rho_lambda_tau_pr[2] * 5;
//tau[i]    = logit_rho_lambda_tau_pr[3] * 30;

Omega_corr = multiply_lower_tri_self_transpose(cors);
// corr_matrix[3] Omega_corr = cors * cors';

{ // local section, this saves time and space
for (i in 1:N) {
log_lik[i] = 0;
for (t in 1:Tsubj[i]) {
real evSafe;    // evSafe, evGamble, pGamble can be a scalar to save memory and increase speed.
real evGamble;  // they are left as arrays as an example for RL models.
real pGamble;

// loss[i, t]=absolute amount of loss (pre-converted in R)
evSafe   = pow(cert[i, t], rho[i]);
//evGamble = 0.5 * (pow(gain[i, t], rho[i]) - lambda[i] * pow(loss[i, t], rho[i]));
evGamble = 0.5 * (pow(gain[i, t], rho[i]) - lambda[i] * pow(fabs(loss[i, t]), rho[i]));
pGamble    = inv_logit(tau[i] * (evGamble - evSafe));
log_lik[i] += bernoulli_lpmf(gamble[i, t] | pGamble);

// generate posterior prediction for current trial
y_pred[i, t] = bernoulli_rng(pGamble);
}
}
}
}


A few clarifications:
0) why rho[i] = logit_rho_lambda_tau_pr[1] * 2; and not rho[i] = logit_rho_lambda_tau_pr[i][1] * 2;?
Is it because the vector[3] logit_rho_lambda_tau_pr = inv_logit(mu_pr + L_Sigma*rho_lambda_tau_tilde[i]); is actually a matrix?

1. the tilda multivariate parameter should be the one sampled and without any correlation. That works but why is that? Could you please explain a bit more? There is a bit weird relation between rho_lambda_tau_tilde[1] and rho_lambda_tau_tilde[2], looks like a parabola.

2. the \tau, \lambda, \rho parameters will be correlated but we know their correlation (Omega) or covariance (Sigma) since we have also estimated it, those are the cors and L_sigma respectively. Meaning if I then test a correlation between eg. rho[1], lambda[1], tau[1], I should find the same correlation as from cors, right? Or does this apply only to the mu_pr[1-3] distributions?

What would the correlation be for the means, ie if I extract means for each parameter (mean of rho[1], rho[2],...lambda[51]) and plot a correlation between them?

Because weirdly enough to me, if I do it, there is some exponential correlation between rho and tau and some weird (looks like parabolic) correlation between rho and lambda.

Cool! :)

In the model code it just builds the Cholesky facor of the covariance matrix. In the generated quantities it build the (proper) covariance matrix.

I don’t understand your question here, sorry.

No, not really. The first function is a specialized version and is more efficient for lower triangular matrices, which cors is. It’s a tiny matrix, so the difference is probably not noticeable…

Yes.

In the transformed parameters logit_rho_lambda_tau_pr is a local variable, a (column) vector of length 3. It is newly created for every i in N. Therefore, we don’t need the indexing by i.

Hm, not sure if I understood correctly. rho_lambda_tau_tilde is an array of size N (for every subject) of vectors of length 3 (for \tilde\rho, \tilde\lambda, \tilde\tau). All of these are uncorrelated standard normal random variables. Their correlation is then induced by the Cholesky factor of the covariance matrix, L_Sigma. This is the so called Matt trick, or more commonly now non-centered parameterization of the multivariate normal distribution. In the univariate case z = \mu_z + \sigma_z \tilde z, \tilde z \sim N(0,1) \rightarrow z \sim N(\mu_z, \sigma_z), and the multivariate case replaces the \sigma with a Cholesky factor of the covariance matrix and everything else with vectors.

I’m not sure what you’re indexing here and I don’t think looking at the tilde variables is that insightful here.

Well, technically the model captures the (linear) correlations of the \rho, \lambda, \tau on the logit scale. The multivariate prior on these parameters is effectively a (scaled) multivariate Logit-Normal distribution.The induced correlation of the actual (loosely speaking) parameters will surely be non-linear. So this:

… makes sense to me. I thought from the model code that you started off with, that this is what you were aiming at?

The good thing about modeling the correlations on the logit scale is, that the parameters are unconstrained in that space. We model them (hierarchically) there and then transform them to the constrained space.Thus, we gain flexibility at the cost of loosing the interpretability of the correlations as parameters of linear dependence.

Hope this clears thing up? :)

Cheers!
Max

2 Likes

Thank you very much for your detailed answer, I very much appreciate it, you answered most of my queries :). The few left are bellow:

The question was what is the difference between the mu_pr which I get from stan automatically and the mu_tau, mu_lambda, mu_rho which I get from the generated quantities block. Normally, this should do the reverse Matt trick – going back from centred to non-centred distribution if I understood correctly. But I got a bit lost with the multivariate distribution. (You know – I tried to write down the code and it did not work so I went one step back and realised that I’m not sure what’s going on :).)

Thank you :).

And thank you for this, yes, this is what I was asking about. I did not realise where the Cholesky factor jumps into play.

Sorry, I should have been clearer. By rho_lambda_tau_tilde[1] I mean that I take and plot the distribution of means for all subjects for the first dimension=parameter, ie. \tilde\rho. Similarly, rho_lambda_tau_tilde[2] corresponds to a distribution of N(number of subjects) means for \tilde\lambda and rho_lambda_tau_tilde[3] for \tilde\tau.

Yes, I aimed to estimate the correlation which exists between the parameters and which is imposed by the model/sampling and then interpret the result, to figure out what to do with it. The motivation is that I think that if I have a model in which two parameters are correlated, then I could be able to rewrite the model in a way that simplifies the model, eg. remove one of the parameters, because I should be able to recover the information given by the correlated parameter without this parameter. Correct me if I’m wrong.
So having established that there is a correlation and how the correlation looks like (using Omega_corr), the question would be if I can simplify/rewrite my model by removing some parameters. Or how do I actually use the knowledge of the parameters’ correlation.

Thanks again :).
Cheers,
Jan

Hey!

Yes these should be the same. They’re are merely different ways to arrive at the same result.

Yeah, got it – I think I glossed over the plot too quickly the first time. Hm, my answer still stands I guess. The relationship between the parameters are (modeled) non-linearly, so the plot does not surprise me that much.

Not sure if that’s valid reasoning, but I might be missing something. You could just fit the two parameter version an then compare posterior predictive distributions of the two models and see what the simpler model is missing, or where they differ. If I understood you correctly, then you want to recover one parameter from the other two in the simpler model? This strikes me as a bit unusual, but, again, I could be missing something. Usually, if parameters are correlated, you’d want to estimate them jointly – and if they are not correlated you might get away with estimating them marginally (Think of a hurdle model where you could estimate the two parts separately when you assume they’re not correlated.)

Why do you want a simpler model? To reduce fitting time? For interpretability? The former could be dealt with by re-working the model. The latter maybe by clever presentation of the posterior predictive draws. (not saying there could be other good reasons to fit the simpler model.) ;)

Cheers!

Hi Max,

thanks again :).

What do you mean by this? WAIC/LOOIC or something else?

For the second reason. Usually, I believe the best way to build up a model is based on some theory/expectations and with minimum possible parameters. Both because of its interpretability, clarity and simplicity. Having two parameters where one can be expressed using the other is a thing I want to avoid and correlation is hinting that this could be the case (not always true). And what do you mean by the

? :)

Thanks again :).

For example, yes. Or simple visual diagnostics even (looking at predicted probabilities etc.).

Well, actually the same as above: Finding interesting configurations of gain, loss, and cert and plotting what the models have to say about these (three parameter model vs. two parameter model).

Sorry, I’m not able to give any clear cut answers here. I think this goes into subject matter and what points you want to convey and which results are interesting (to you or your audience).

Cheers! :)

Thank you very much for the help and for you opinion :).

1 Like