I’m trying to fit a (relatively) simple hierarchical model and it runs very slowly, if at all (using RStan). The setup: I have a number of pairs of portfolios. Each pair has a series of returns (say r_1,t and r_2,t), and the two return series are correlated. These are stacked into a matrix, and I have a second index variable identifying which pair each row belongs to.

For each pair of portfolios I want to model the difference in risk-adjusted returns, i.e. mean(r_1)/sd(r_1) - mean(r_2)/sd(r_2), and this difference is then drawn from a parent distribution. This is my model code:

data {
  int<lower=1> N;                   // Number of periods
  int<lower=1> M;                   // Number of pairs of portfolios
  int<lower=1, upper=M> index[N * M];
  vector[2] rets[N * M];
}
parameters {
  real pop_mu;                      // Population mean
  real<lower=0> pop_sigma;          // Population sd
  vector[M] mu1;
  vector<lower=0>[2] sigma[M];      // sds for the two portfolios in each pair
  vector<lower=0, upper=1>[M] r;
  real diff[M];
}
transformed parameters {
  cov_matrix[2] T[M];
  vector[2] mu[M];
  for (m in 1:M) {
    real mu2;                       // Temporary variable
    mu2 = (diff[m] + mu1[m] / sigma[m, 1]) * sigma[m, 2];
    mu[m, 1] = mu1[m];
    mu[m, 2] = mu2;
    T[m, 1, 1] = square(sigma[m, 1]);
    T[m, 2, 2] = square(sigma[m, 2]);
    T[m, 1, 2] = r[m] * sigma[m, 1] * sigma[m, 2];
    T[m, 2, 1] = T[m, 1, 2];
  }
}
model {
  diff ~ normal(pop_mu, pop_sigma);
  pop_mu ~ normal(0.0, 1.0 / 12.0);
  pop_sigma ~ gamma(6.0, 150.0);
  r ~ uniform(0, 1);
  mu1 ~ normal(0, 0.04);
  for (m in 1:M)
    sigma[m] ~ normal(0, 0.04);
  for (m in 1:M)
    for (n in 1:N)
      rets[(m - 1) * N + n] ~ multi_normal(mu[index[(m - 1) * N + n]], T[index[(m - 1) * N + n]]);
}
generated quantities {
  real diff_ppc[10];                // How many of these do I need?
  for (n in 1:10)
    diff_ppc[n] = normal_rng(pop_mu, pop_sigma);
}

This runs on a simple simulated test set where (say) M is 10, but when I try to run with M = 48 (and N = 240), nothing much seems to happen. So have I set the code up wrong? Is there a better way to do this?

I ran it on some real data, but only for 1 chain / 300 iterations / 150 post-warmup draws. n_eff varies from 150 (yay) down to 33 for some of the sigma parameters. There don’t seem to be any divergences, though.

Looking at the traceplots, pop_mu and pop_sigma seem to converge after around 50 draws.

The correlation parameters seem to be noisier. Given that I know they will be positive, would a beta prior (as opposed to a uniform one) help?

It certainly could change things. Since you’re estimating correlations, have you looked at the LKJ priors in the reference manual?

In terms of just computing speed, you might try to rewrite:

for (m in 1:M)
  for (n in 1:N)
    rets[(m - 1) * N + n] ~ multi_normal(mu[index[(m - 1) * N + n]], T[index[(m - 1) * N + n]]);

as

for (m in 1:M)
  rets[starts[m]:ends[m]] ~ multi_normal_cholesky(mu[m], cholesky_of_T[m]);

where rets has been sorted so that all the m == 1, m == 2, etc. elements are grouped together; starts and ends are length-M integer arrays holding the first and last indices of each group; and cholesky_of_T is the Cholesky factor of your covariance (which, since you’re constructing a 2x2 manually, you can probably work out without too much trouble).
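For reference, the 2x2 Cholesky has a simple closed form in terms of the two sds and the correlation. A quick numerical check (Python/NumPy, with made-up values):

```python
import numpy as np

# Hypothetical values for one pair: the two sds and their correlation.
s1, s2, rho = 0.03, 0.05, 0.4

# Covariance matrix as built in the transformed parameters block.
T = np.array([[s1**2,         rho * s1 * s2],
              [rho * s1 * s2, s2**2        ]])

# Closed-form lower-triangular Cholesky factor of a 2x2 covariance:
# L = [[s1, 0], [rho * s2, s2 * sqrt(1 - rho^2)]]
L = np.array([[s1,       0.0                     ],
              [rho * s2, s2 * np.sqrt(1 - rho**2)]])

assert np.allclose(L @ L.T, T)
```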

But I’d focus on getting the parameterization right first.

Do you see any strongly correlated parameters in the posterior? That can suggest a reparameterization would help. When you run with the bigger dataset, do you get divergences or low n_eff?

Thank you for the code suggestion - on a smaller data set that halved the run time (after a brief bit of algebra to calculate the Cholesky of a 2x2 covariance matrix!). No, there aren’t any divergences, and n_eff doesn’t look dreadful. There’s nothing “odd” about the data; I think it’s just an issue of computing time. Sticking it on a 20-core Linux box also helped (each chain is actually slower than on my PC, but I can run more chains and so get more samples overall).

Yeah, and a follow-up to Andrew: if you end up using the LKJ prior, there’s a version (lkj_corr_cholesky) that works directly on a Cholesky factor, so you don’t have to do the decomposition yourself, and it’s faster.

I think the bound on r should be lower = -1, as it’s a correlation.

Even if you believe all the true values are positive, uncertainty can place mass at the boundary at 0, which will then bias the posterior means in a positive direction.
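A quick illustration of that boundary effect (Python/NumPy; the “posterior” here is just a made-up normal, truncated at 0 to mimic the lower = 0 constraint):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior for a correlation whose true value is small and positive.
draws = rng.normal(0.1, 0.3, size=100_000)

# A lower = 0 bound keeps all the posterior mass above zero; the simplest
# stand-in for that is truncating the unconstrained draws at 0.
truncated = draws[draws > 0]

# The truncated mean is pulled above the untruncated one: positive bias.
assert truncated.mean() > draws.mean()
```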

We generally recommend using our LKJ priors on correlation matrices and then scaling. You can do it all in matrix operations and then just push it straight into the multi_normal_cholesky parameterization.
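In Stan that scaling is diag_pre_multiply(sigma, L_corr), where L_corr is the correlation Cholesky factor that lkj_corr_cholesky puts a prior on. The equivalent matrix algebra, checked numerically in Python/NumPy with made-up values:

```python
import numpy as np

# Hypothetical correlation Cholesky factor and scale vector for one 2x2 pair.
rho = 0.4
L_corr = np.array([[1.0, 0.0                  ],
                   [rho, np.sqrt(1 - rho**2)]])
sigma = np.array([0.03, 0.05])

# Stan's diag_pre_multiply(sigma, L_corr) is diag(sigma) @ L_corr; the result
# is the Cholesky factor of the full covariance matrix.
L_cov = np.diag(sigma) @ L_corr

C = L_corr @ L_corr.T                          # correlation matrix
Sigma = np.diag(sigma) @ C @ np.diag(sigma)    # scaled to a covariance matrix
assert np.allclose(L_cov @ L_cov.T, Sigma)
```

L_cov can then go straight into multi_normal_cholesky without ever forming the covariance matrix itself.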

I’m not sure what you’re suggesting here. Our LKJ priors only have support over correlation matrices, not full covariance matrices.