Question on mutlivariate non-centered parametrizations

Hey folks !

I recently did some non-centered reparametrization on some of my models, and at least for my models who sample the parameters from independent normals this worked out very well (faster, no divergences etc.). However, I have now implemented the same non-centered reparametrization in the multivariate version of the model (which worked out ok but could be better asit has some divergences and convergence problems). The non-centered model works though, but not as well as expected. I think this is more about some errors I made as about the reparamtrization itself. So I have two questions on this:

1.) I used the reference manual to implement the cholesky non-centered, but I’m just not sure whether this works as intended, because I estimate the correlation matrix as well as sigma simultanioulsy. If I would assume uncorrelated parameters and just surpass an identity matrix as correlation matrix, would this be a valid method to simplyfy the model in this case? And: Is this non-centered implementation “correct” so far?

2.) How could I provide initial values in this case, that actually satisfy the sampler to be positiv ? As One Parameter is on log_scale, this should have boudaries from -Inf to Inf. Sampling all Parameters from the same std_normal is a problem here. Would it be “effiecient” or reasonable to outsource this parameter from the MVN and sample it independently?

I maybe I can find some ideas here, help and opinions are appreciated ! Thanks for reading and your Opinions !

*J

Here are model codes for the non-centered and centered model:

Non-Centered:

data {
  int <lower=0> N;  // number rownumber
  int <lower=0> K;  // categories 
  int <lower=0> J;  // Dims of Cov Matrix
  int <lower=0> Con;  // number of FT categories
  int R[K];         // number of responses per category
  int count[N*Con,K];   // observed data
  real scale_b; // set scaling for background noise
  vector[Con] Freetime;  // freetime conditions after distractor encoding (e.g. 500 ms, 1000 ms or 1500 ms) 
  int retrievals;
  matrix[N*Con,2] d_weight; // Weights for Response Category inbalances
}

parameters {
  
  // Defining vector for hyper and subject parameters 
  
  corr_matrix[J] Omega;
  vector<lower=0>[J] sigma;
  vector [J] hyper_pars;
  vector[J] theta[N];
  
  
}

transformed parameters{
  
  // Transform f Parameter
  vector[J] subj_pars[N];
  matrix[J, J] L;
  
  real f[N] = inv_logit(theta[,3]);
  real mu_f = inv_logit(hyper_pars[3]);
  
  L = cholesky_decompose(Omega);

  
  for (i in 1:N)
  {
    
    subj_pars[i,] = hyper_pars + sigma .* (L * theta[i,]);
    
  }

  // activations
  real acts_IIP[N*Con];
  real acts_IOP[N*Con];
  real acts_DIP[N*Con];
  real acts_DIOP[N*Con];
  real acts_NPL[N*Con];
   
  
  // probabilities
  vector[K] probs[N*Con];
  real SummedActs[N*Con];
  
  
  // loop over subjects and conditions to compute activations and probabilites
  
  
  for (i in 1:N){ // for each subject
  
  for(j in 1:Con) {
    
    acts_IIP[j + (i-1)*Con] = scale_b +  (1+ subj_pars[i,4]*Freetime[j])* (subj_pars[i,1] + subj_pars[i,2]); // Item in Position                      
    acts_IOP[j + (i-1)*Con] = scale_b +  (1+ subj_pars[i,4]*Freetime[j])* subj_pars[i,2];        // Item in Other Position
    acts_DIP[j + (i-1)*Con] = scale_b + f[i]*((exp(-subj_pars[i,5]*Freetime[j])*subj_pars[i,1]) + subj_pars[i,2]); // Distractor in Position
    acts_DIOP[j + (i-1)*Con] = scale_b + f[i]*subj_pars[i,2]; // Distractor in other Position
    acts_NPL[j + (i-1)*Con] = scale_b; // non presented Lure
    
    SummedActs[j + (i-1)*Con] = R[1] * acts_IIP[j + (i-1)*Con] + R[2] * acts_IOP[j + (i-1)*Con] + d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]+ // account for different numbers auf DIP / DIOP 
    d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]+ R[5] * acts_NPL[j + (i-1)*Con];
    
    probs[j + (i-1)*Con,1] = (R[1] * acts_IIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);  
    probs[j + (i-1)*Con,2] = (R[2] * acts_IOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,3] = (d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,4] = (d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,5] = (R[5] * acts_NPL[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
  }
  }
}


model {
  
  // priors for hyper parameters
  
  hyper_pars[1] ~ normal(20,10); // c
  hyper_pars[2] ~ normal(2,10); // a
  hyper_pars[3] ~ normal(0,10); // f
  hyper_pars[4] ~ normal(1,10); // EE
  hyper_pars[5] ~ normal(1,10); // r
  

  // 
  Omega ~ lkj_corr(2);
  sigma ~ gamma(1,0.01);
  
  
  // Loop over subjects
  
  for (i in 1:N)
  {
    theta[i,] ~ normal(0,1);
  
  }

  
  
  for (j in 1:Con) {
    for (i in 1:N) {
      // draw data from probabilities determined by MMM parms
      count[j + (i-1)*Con,]  ~ multinomial(probs[j + (i-1)*Con,]);  
    }
  }
}

Centered:


data {
  int <lower=0> N;  // number rownumber
  int <lower=0> K;  // categories 
  int <lower=0> J;  // Dims of Cov Matrix
  int <lower=0> Con;  // number of categories
  int R[K];         // number of responses per category
  int count[N*Con,K];   // observed data
  real scale_b;     // set scaling for background noise
  vector[Con] Freetime ;        // freetime conditions after distractor encoding (e.g. 500 ms, 1000 ms or 1500 ms) 
}

parameters {
  
  // Defining vector for hyper and subject parameters 
  
  corr_matrix[J] Omega;
  vector<lower=0>[J] sigma;
  vector [J] hyper_pars;
  vector [J] subj_pars[N];
  
  
}


transformed parameters{
  // Transform f Parameter
  
  real f[N] = inv_logit(subj_pars[,3]);
  real mu_f = inv_logit(hyper_pars[3]);
  
  // activations
  real acts_IIP[N*Con];
  real acts_IOP[N*Con];
  real acts_DIP[N*Con];
  real acts_DIOP[N*Con];
  real acts_NPL[N*Con];
  
  
  // probabilities
  vector[K] probs[N*Con];
  real SummedActs[N*Con];
  
  
  // loop over subjects and conditions to compute activations and probabilites
  
  for (i in 1:N){ // for each subject
  
  for(j in 1:Con) {
    
    acts_IIP[j + (i-1)*Con] = scale_b + (1+ subj_pars[i,4]*Freetime[j])*( subj_pars[i,2] + subj_pars[i,1]); // Item in Position                      
    acts_IOP[j + (i-1)*Con] = scale_b + (1+ subj_pars[i,4]*Freetime[j])* subj_pars[i,2];        // Item in Other Position
    acts_DIP[j + (i-1)*Con] = scale_b + exp(-subj_pars[i,5]*Freetime[j])* subj_pars[i,3]*( subj_pars[i,2]+ subj_pars[i,1]);// Distractor in Position
    acts_DIOP[j + (i-1)*Con] = scale_b + exp(-subj_pars[i,5]*Freetime[j])* subj_pars[i,3]* subj_pars[i,2]; // Distractor in other Position
    acts_NPL[j + (i-1)*Con] = scale_b; // non presented Lure
    
    SummedActs[j + (i-1)*Con] = R[1] * acts_IIP[j + (i-1)*Con] + R[2] * acts_IOP[j + (i-1)*Con] + R[3] * acts_DIP[j + (i-1)*Con]+
    R[4] * acts_DIOP[j + (i-1)*Con]+ R[5] * acts_NPL[j + (i-1)*Con];
    
    probs[j + (i-1)*Con,1] = (R[1] * acts_IIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);  
    probs[j + (i-1)*Con,2] = (R[2] * acts_IOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,3] = (R[3] * acts_DIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,4] = (R[4] * acts_DIOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,5] = (R[5] * acts_NPL[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
  }
  }
}


model {
  
  // priors for hyper parameters
  
  hyper_pars[1] ~ normal(20,10); // c
  hyper_pars[2] ~ normal(2,10); // a
  hyper_pars[3] ~ normal(0,10); //f
  hyper_pars[4] ~ normal(1,10); // EE
  hyper_pars[5] ~ normal(1,10); // r
  
  Omega ~ lkj_corr(2);
  sigma ~ gamma(1,0.01);
  
  
  // Loop over subjects
  subj_pars[,] ~ multi_normal(hyper_pars,quad_form_diag(Omega, sigma));
  
  
  for (j in 1:Con) {
    for (i in 1:N) {
      // draw data from probabilities determined by MMM parms
      count[j + (i-1)*Con,]  ~ multinomial(probs[j + (i-1)*Con,]);  
    }
  }
}

See SUG 1.13 (“Multivariate priors for hierarchical models”); Your implementation looks correct, but it will sample faster if you structure things so that the correlation-related paramter is of type cholesky_factor_corr. Below is a version with that plus derivation of the correlation matrix in the GQ; note that subj_pars is transposed relative to what you had, but I made the appropriate transpose of the indices where it’s used later:

functions{
	// flatten_lower_tri: function that returns the lower-tri of a matrix, flattened to a vector
	vector flatten_lower_tri(matrix mat) {
		int n_cols = cols(mat) ;
		int n_uniq = (n_cols * (n_cols - 1)) %/% 2;
		vector[n_uniq] out ;
		int i = 1;
		for(c in 1:(n_cols-1)){
			for(r in (c+1):n_cols){
				out[i] = mat[r,c];
				i += 1;
			}
		}
		return(out) ;
	}

}
...
parameters {
  
  // Defining vector for hyper and subject parameters 
  
  cholesky_factor_corr[J] L_Omega;
  vector<lower=0>[J] sigma;
  vector [J] hyper_pars;
  matrix[J,N] theta ;
  
  
}

transformed parameters{
  // non-centered multivariate
  matrix[J,N] subj_pars =  (
    diag_pre_multiply( sigma, L_Omega )
    * theta
    + rep_matrix(hyper_pars,N)
  ) ;

...  
for(j in 1:Con) {
    
    acts_IIP[j + (i-1)*Con] = scale_b +  (1+ subj_pars[4,i]*Freetime[j])* (subj_pars[1,i] + subj_pars[2,i]); // Item in Position                      
    acts_IOP[j + (i-1)*Con] = scale_b +  (1+ subj_pars[4,i]*Freetime[j])* subj_pars[2,i];        // Item in Other Position
    acts_DIP[j + (i-1)*Con] = scale_b + f[i]*((exp(-subj_pars[5,i]*Freetime[j])*subj_pars[1,i]) + subj_pars[2,i]); // Distractor in Position
    acts_DIOP[j + (i-1)*Con] = scale_b + f[i]*subj_pars[2,i]; // Distractor in other Position
    acts_NPL[j + (i-1)*Con] = scale_b; // non presented Lure
    
    SummedActs[j + (i-1)*Con] = R[1] * acts_IIP[j + (i-1)*Con] + R[2] * acts_IOP[j + (i-1)*Con] + d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]+ // account for different numbers auf DIP / DIOP 
    d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]+ R[5] * acts_NPL[j + (i-1)*Con];
    
    probs[j + (i-1)*Con,1] = (R[1] * acts_IIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);  
    probs[j + (i-1)*Con,2] = (R[2] * acts_IOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,3] = (d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,4] = (d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,5] = (R[5] * acts_NPL[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
  }
...
}

...

generated quantities{
	vector[(J*(J-1))%/%2] cor_mat_lower_tri = flatten_lower_tri(multiply_lower_tri_self_transpose(L_Omega)) ;
}

But do you intentionally not use the 3rd parameter in subj_pars in the model? (In your version that’s subj_pars[,3] and in mine it’s subj_pars[3,])

The default behaviour of Stan’s sampler is to initialize with random values sampled as uniform from -2 to +2 (on the unconstrained scale). Do you encounter issues with the default initialization?

2 Likes

Hey Mike,

thanks for your very detailed and helpful answer ! I hope my answer here is no overkill !
The model runs better from what I see from the the gradient evaluation times and only a handful divergences! I still have the problem, that I have negative probabilities in the likelihood of the mutlinomial, which seems to be related to the transformation of the theta[3, ] (f) parameter:

Chain 1: Rejecting initial value: Chain 1: Error evaluating the log probability at the initial value. Chain 1: Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -4.51233, but should be greater than or equal to 0 (in 'string', line 124, column 6 to column 66)

I think this is about your question about the subj_par[3,i]. This parameter is strictly speaking a simplex between 0 and 1 and modulates the activations in this lines.

  acts_DIP[j + (i-1)*Con] = scale_b + f[i]*((exp(-subj_pars[5,i]*Freetime[j])*subj_pars[1,i]) + subj_pars[2,i]);

  acts_DIOP[j + (i-1)*Con] = scale_b + f[i]*subj_pars[2,i]; 

and these activations are subsequently normalized to probabilities:

 probs[j + (i-1)*Con,3] = (d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
 probs[j + (i-1)*Con,4] = (d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);

As this model is meant to describe individual differences, it’s convenient to use normal distributions for
the parameter values, even though it is only an approximation as the parameters are contrained to be positive. As the the f paramter is meant to be a in a range between [0;1] to modulate the beforementioned activations, it is is initially sampled from a normally distributed variable though, but subsequently transformed through a logistic transformation. That’s why I passed inits to the model to prevent the negative probabilites in a MVN, as it is not possible to constrain only a subset of parameters and f has to be un contrained. For some reason, f becomes a negative, regardless of the transformation, which I suspect causes the problem described above.

In a less complex multivariate model version I solved this like this:

Inits:

init_fun <- function() 
{
  
  list(subj_pars=cbind(runif(stan_full_noFT.dat$N,1,100), # c
                       runif(stan_full_noFT.dat$N,1,100), # a
                       runif(stan_full_noFT.dat$N,-10,100), # f
                       runif(stan_full_noFT.dat$N,1,100), # ee
                       runif(stan_full_noFT.dat$N,1,100))) #  r
}

// Transform f Parameter
  
real f[N] = inv_logit(subj_pars[,3]);
real mu_f = inv_logit(hyper_pars[3]);
.....

acts[i,3] = scale_b + f[i]* (subj_pars[i,1]+subj_pars[i,2]);// Distractor in Position
acts[i,4] = scale_b + f[i]* subj_pars[i,2]; // Distractor in other Position

model{
 
  // priors for hyper parameters
  
  Omega ~ lkj_corr(2);
  sigma ~ gamma(1,0.01);
  
  hyper_pars[1] ~ normal(20,10);
  hyper_pars[2] ~ normal(2,10);
  hyper_pars[3] ~ normal(0,10);

  // Sample Parameters
 
  subj_pars[,] ~ multi_normal(hyper_pars,quad_form_diag(Omega, sigma));
   
  // Loop over subjects
  for(i in 1:N){

    // draw data from probabilities determined by MMM parms
    count[i,]  ~ multinomial(probs[i,]);
  }
}

Now for the non-centered version I passed inits for theta:

init_fun_FT <- function() 
{
  
  list(theta=cbind(runif(stan_full_FT.dat$N,0,1), # c
                       runif(stan_full_FT.dat$N,0,1), # a
                       runif(stan_full_FT.dat$N,-0.5,0.5), # f
                       runif(stan_full_FT.dat$N,0,1), # ee
                       runif(stan_full_FT.dat$N,0,1))) #  r
}

So I now transform theta[3, ] (f) with the same logic as above:

Sample set of theta:

model {
  
  // priors for hyper parameters
  
   ....
  // Priors for variance 
  ...
  
  
  // Loop over subjects and sample theta[,i] for all subjects
  
  for (i in 1:N) 
  {
    
    theta[,i] ~ normal(0,1);
  }

Transform theta[3,], reparametrize subject parameters and subsequently calculate activations and probabilites.

 row_vector[N] f = inv_logit(theta[3,]);

  matrix[J,N] subj_pars =  (
    diag_pre_multiply( sigma, L_Omega )
    * theta
    + rep_matrix(hyper_pars,N)) ;


....
for(j in 1:Con) {
    

acts_DIP[j + (i-1)*Con] = scale_b + f[i]*((exp(-subj_pars[5,i]*Freetime[j])*subj_pars[1,i]) + subj_pars[2,i]); // Distractor in Position
acts_DIOP[j + (i-1)*Con] = scale_b + f[i]*subj_pars[2,i]; // Distractor in other Position

    
    probs[j + (i-1)*Con,3] = (d_weight[j + (i-1)*Con,1] * acts_DIP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
    probs[j + (i-1)*Con,4] = (d_weight[j + (i-1)*Con, 2] * acts_DIOP[j + (i-1)*Con]) ./ (SummedActs[j + (i-1)*Con]);
  
   }
  }

In this version, the model runs faster, but has problmes with the init values - sampling begins after some time, but its random whether all chains are initialized in the end (even thoug this happens most of the time) . It makes no difference, if I don’t pass a init function. Chains look good though!

Again, thank you for your time !

best wishes

Jan