My first (hierarchical) model, and it's not working


I’m trying to fit a (relatively) simple hierarchical model and it runs very slowly if at all (using RStan). The set up: I have a number of pairs of portfolios. Each pair of portfolios has a bunch of returns (say r_1,t, r_2,t). The returns are correlated. These are stacked into a matrix and I have a second index variable.

For each portfolio I want to model the difference in risk adjusted returns (so mean (r_1)/ sd (r_1) - mean (r_2) / sd (r_2)) and this difference is then drawn from a parent distribution. This is my model code:

data {
  int<lower=1>   N;  // Number of periods
  int<lower=1>   M;  // Number of pairs of portfolios  

  int            index [N * M];
  vector [2]     rets [N * M];

parameters {
    real                          pop_mu;      // Population mean
    real<lower=0>                 pop_sigma;   // Population sd
    vector [M]                    mu1 ; 
    vector<lower=0> [2]           sigma[M] ;   // 2 groups
    vector<lower=0, upper=1> [M]  r ;
    real                          diff [M]; 

transformed parameters {
  cov_matrix [2]    T [M];
  vector [2]        mu [M];

  for (m in 1:M) {
    real mu2;       // Temporary variable 
    mu2 = (diff [m] + mu1 [m] /  sigma [m, 1]) *  sigma [m, 2];  

    mu [m, 1] = mu1 [m];
    mu [m, 2] = mu2; 

    T [m, 1, 1] = square (sigma [m, 1]);
    T [m, 2, 2] = square (sigma [m, 2]);
    T [m, 1, 2] = r [m] * sigma [m, 1] * sigma [m, 2];
    T [m, 2, 1] = T [m, 1, 2];

model {
  diff ~ normal (pop_mu, pop_sigma); 

  pop_mu ~ normal (0.0, 1.0/12.0);
  pop_sigma ~ gamma (6.0, 150.0);
  r ~ uniform (0, 1); 
  mu1 ~ normal (0, 0.04);

  for (m in 1:M)
    sigma [m] ~ normal (0, 0.04);

  for (m in 1:M)
    for (n in 1:N)
      rets [(m - 1) * N + n] ~ multi_normal (mu [index [(m - 1) * N + n]], T [index [(m - 1) * N + n]]); 

generated quantities {
  real  diff_ppc [10];    // How many of these do I need?  

  for (n in 1:10) 
    diff_ppc [n] = normal_rng (pop_mu, pop_sigma);

This runs for a simply generated test set where (say) M is 10 but when I try to run with M = 48 (and N = 240) I get nothing really happening. So have I set the code up wrong? Is there a better way to do this?



Do you recover your original parameters? Are the Neffs high with no divergences?


I ran with some real data but only for 1 chain / 300 iterations / 150 draws. n_eff varies from 150 (yay) down to 33 for some of the sigma parameters. There don’t seem to be any divergences though.

Looking at the traceplots the pop_mu and pop_sigma seem to converge after around 50 draws.

The correlation parameters seem to be noisier - given I know they are going to be positive would a beta prior (as opposed to a uniform one) help?


Certainly could change things. Since you’re estimating correlations, have you checked out the LKJ priors in the reference manual?

In terms of just computing speed, you might try to rewrite:

for (m in 1:M)
    for (n in 1:N)
      rets [(m - 1) * N + n] ~ multi_normal (mu [index [(m - 1) * N + n]], T [index [(m - 1) * N + n]]); 


for (m in 1:M)
  rets[starts[m]:ends[m]] ~ multi_normal_cholesky (mu[m]], cholesky_of_T[m]); 

where rets has been sorted so that all the m == 1, m == 2, etc. elements are grouped together, starts/ends are length M integer arrays of the start and finish indices of those groups, and cholesky_of_T is the Cholesky factor of your covariance (which since you’re constructing a 2x2 manually you can probably get without too much trouble).

But I’d focus on getting the parameterization right first.

Do you see any super correlated parameters in the posterior? This can suggest a reparameterization could help things. When you run with a bigger dataset do you get divergences or low Neffs?


Thank you for the code suggestion - on a smaller data set that halved the run time (after a brief bit of algebra to calculate the cholesky of a 2x2 covariance matrix!). No, there aren’t any divergences, and n_eff doesn’t look dreadful. There’s nothing “odd” about the data - I think it’s just an issue of computing time. Sticking it on a 20 core Linux box also helped (actually slower than my PC but I can run more chains so get more samples overall).

Thanks for all the suggestions - all helped.

Maybe would be cleaner to just specify the covariance matrix with lkj prior and not manually break it up into correlation and variances.

Yeah and a follow up to Andrew, if you ended up using the LKJ prior, there’s a version (lkj_corr_cholesky) that works directly on a Cholesky factor (so you don’t have to do the decomposition yourself and it’s faster).

Sorry for slow reply. That helped too - thank you very much to everyone for being helpful.

I think this should be lower = -1, as it’s a correlation.

Even if you believe all the true values are positive, uncertainty can place mass at the boundary at 0, which will then bias the posterior means in a positive direction.

We generally recommend using our LKJ priors on correlation matrices and then scaling. You can do it all in matrix operations and then just push it straight into the multi_normal_cholesky parameterization.

I’m not sure what you’re suggesting here. Our LKJ priors only have support over correlation matrices, not full covariance matrices.

Sorry, I was being sloppy: when talking about LKJ covariance matrix, I was taking about the decomposition that Bob describes.

Been on holiday – a belated thank you to Bob and Andrew. I’ll try running the different variations and see how they compare.

disclaim.txt (1.16 KB)