Reparameterization of correlation parameter for improved computation?

Sorry about this long post. Experience tells me it’s better to include too much information than too little.

Background
I am working on a model for analysis (and later prediction) of hunting harvest. The system has three geographical levels of interest: the nation, counties, and hunting management precincts. Each precinct has a number of reporting hunting teams, whose locations are unknown beyond the level of precinct, that report their area and number of harvested animals per game species. I am primarily after a model that estimates the harvest rate per area unit, \mu_{k,l}, for every precinct k in county l. There are a few other bits and pieces to the framework, such as a potential dependence on hunting-team area and/or an association between average hunting rate and variability among hunting teams, but the most essential bit is a simple hierarchical structure with (random) effects at the county (\lambda_l) and precinct (\chi_{k,l}) level:

\log(\mu_{k,l}) = \chi_{k,l} + \lambda_l + \omega,

and the number of harvested individuals of team r in precinct k in county l can be modeled e.g. as

K_{r,k,l} \sim \mathrm{Gamma\text{-}Poisson}(\mu_{k,l}, \alpha),

where \alpha is the shape parameter of the associated gamma distribution. For \omega, I’ve implemented a vague normal prior, and for \chi_{k,l} and \lambda_l, I’ve implemented

\begin{aligned} \chi_{k,l} &\sim \mathrm{Normal}(0, \sigma) \\ \lambda_l &\sim \mathrm{Normal}(0, \tau), \end{aligned}

where the standard deviations \sigma and \tau are estimated in the analyses.
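As an aside for readers less familiar with the mixture: the Gamma-Poisson above is exactly the negative binomial used later in the Stan code (`neg_binomial_2`), with mean \mu and variance \mu + \mu^2/\alpha. A quick Monte Carlo sketch in Python, with \mu and \alpha set to arbitrary illustrative values (not from the model):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, alpha = 4.0, 2.5  # arbitrary example values, not estimates from the model

# Gamma-Poisson mixture: a gamma-distributed rate with mean mu and shape alpha,
# then a Poisson count given that rate
rates = rng.gamma(shape=alpha, scale=mu / alpha, size=1_000_000)
K = rng.poisson(rates)

print(K.mean())  # ~ mu = 4.0
print(K.var())   # ~ mu + mu**2 / alpha = 10.4, the neg_binomial_2 variance
```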

I am currently adding a temporal component to the framework. This is partly motivated by the fact that, for some species with low and variable harvest, posteriors get very wide, with 95% credibility intervals of harvest rates often spanning several orders of magnitude. Since hunting is expected to be at least somewhat similar across years, it makes sense to borrow some information across years. At this stage I am not after a model that, e.g., investigates a (linear) trend in the hunting rate. As a first step towards a temporal model, I’ve assumed (adding subscripts t for time steps in years) that

\omega_t \sim \mathrm{Normal}(\omega_{t-1}, S),

where S is the standard deviation of the between-year change in (geometric) mean hunting rate.

To ensure that county and precinct level effects are distributed around 0, I’ve taken a slightly different approach for \chi_{k,l} and \lambda_l, assuming that for t > 1,

\begin{aligned} [\lambda_{t-1,l}, \lambda_{t,l}]^\intercal &\sim \mathrm{MVN}([0, 0]^\intercal, \Sigma_\lambda) \\ [\chi_{t-1,k,l}, \chi_{t,k,l}]^\intercal &\sim \mathrm{MVN}([0, 0]^\intercal, \Sigma_\chi), \end{aligned}
where
\Sigma_\lambda = \begin{bmatrix} \tau^2 & \rho_\lambda \tau^2 \\ \rho_\lambda \tau^2 & \tau^2 \end{bmatrix}, \qquad \Sigma_\chi = \begin{bmatrix} \sigma^2 & \rho_\chi \sigma^2 \\ \rho_\chi \sigma^2 & \sigma^2 \end{bmatrix},

i.e. (at least for now) assuming equal variance across years, with any correlation over larger time scales only implicitly modelled via the correlation of intermediate time steps. If I’ve done my math right, the conditional distributions of \lambda_{t,l} and \chi_{t,k,l} are

\begin{aligned} \lambda_{t,l} &\sim \mathrm{Normal}\!\left(\rho_\lambda \lambda_{t-1,l},\, \tau\sqrt{1-\rho_\lambda^2}\right) \\ \chi_{t,k,l} &\sim \mathrm{Normal}\!\left(\rho_\chi \chi_{t-1,k,l},\, \sigma\sqrt{1-\rho_\chi^2}\right), \end{aligned}
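The conditional can also be checked numerically: for a bivariate normal with covariance \Sigma_\lambda as above, the standard conditioning formula gives mean \rho_\lambda x and standard deviation \tau\sqrt{1-\rho_\lambda^2}. A Monte Carlo sketch (with \tau, \rho_\lambda and the conditioning value chosen arbitrarily, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tau, rho = 0.8, 0.6  # arbitrary example values

# Covariance matrix with equal variances tau^2 and correlation rho
Sigma = tau**2 * np.array([[1.0, rho], [rho, 1.0]])
draws = rng.multivariate_normal([0.0, 0.0], Sigma, size=2_000_000)

# Approximate conditioning: keep second coordinates where the first is near x0
x0 = 1.0
sel = draws[np.abs(draws[:, 0] - x0) < 0.01, 1]

print(sel.mean())  # ~ rho * x0 = 0.6
print(sel.std())   # ~ tau * sqrt(1 - rho**2) = 0.64
```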

which is the way it is implemented in my code below. I’ve also reparameterized S, \sigma and \tau to avoid funnel-like behavior (the “raw” parameters in the code below), which improves computation.
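For readers unfamiliar with the “raw” trick: this non-centered parameterization samples a standard normal and scales it by the standard deviation, which leaves the implied distribution unchanged while decoupling the latent effect from its scale parameter. A minimal numerical illustration (the scale value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.7  # arbitrary example scale

# Centered parameterization: chi ~ Normal(0, sigma)
centered = rng.normal(0.0, sigma, size=1_000_000)

# Non-centered parameterization: chi_raw ~ Normal(0, 1), then chi = chi_raw * sigma
chi_raw = rng.normal(0.0, 1.0, size=1_000_000)
noncentered = chi_raw * sigma

print(centered.std(), noncentered.std())  # both ~ sigma = 1.7
```

The distributions are identical; the gain is purely geometric, since the sampler then explores chi_raw and sigma as (a priori) independent coordinates instead of a funnel.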

Mixing problem for correlation parameters
I do, however, have some issues with the correlation parameters, in particular \rho_\chi. Following suggestions, I’ve plotted parameters against energy to identify parameters that might benefit from some type of reparameterization, and while the trends vary somewhat across species (the analysis will eventually be performed for around 50 species), \rho_\chi almost consistently has a negative correlation with energy. Below is an example.

For \rho_\lambda, there is no apparent correlation with energy, but the trace plots are not looking great for either of the correlation parameters.

This analysis gave me an effective sample size of 100–200 for the correlation parameters, and there are other datasets that give even worse results. If I were only doing this once, I’d just increase the number of iterations and be done with it, but 5000 iterations take around 24 hours to run (the data currently has around 70000 reports from 11 years, 21 counties and 300–400 precincts), and with 50 species it’s worth spending some time to make the code more efficient. I will also compare a few different models and will likely repeat the analysis with additional data in the future.

Finally, a question
Does anyone have any ideas for useful reparameterization of the correlation parameters? Also happy for other input along the lines of “why don’t you just use [insert simpler and/or more robust approach], you numbnut?” but I want to stay away from spline territory.

data {
  int<lower=1> NrOfYears;
  int<lower=1> NrOfLans;
  int<lower=1> NrOfKretsar;
  // int<lower=0> NrOfCovs;
  int<lower=1> Reports;
  int<lower=0> leftout;
  int<lower=0> K[Reports];
  real<lower=0> A[Reports];
  real logrelA[Reports];
  int<lower=1> KretsNrSeqID[Reports];
  int<lower=1> LansSeqID[NrOfLans];
  int<lower=1> LanSeqID[Reports];
  int<lower=1> KretsarSeqID[NrOfKretsar];
  int<lower=1> KretsListLanSeqID[NrOfKretsar];
  int<lower=1> KretsYearSeq[NrOfKretsar];
  int<lower=0, upper=1> includealpha; // Boolean flag
  // int<lower=0, upper=1> usecov; // Boolean flag
  int<lower=0, upper=1> includegamma;
  int<lower=0, upper=1> includephi;
  int<lower=0, upper=1> AnalyzeWAICandLOO;
  int<lower=0, upper=1> doexactloo;
  // matrix[NrOfKretsar,NrOfCovs] COVMat;
  int<lower=0, upper=1> FirstKretsAppearance[NrOfKretsar];
  int<lower=0> FwdConnectionSeqID[NrOfKretsar,14];
}

parameters {
  real lambda_raw[NrOfLans,NrOfYears]; // raw county effect on average bagging rate
  real chi_raw[NrOfKretsar]; // for reparameterization of chi
  real omega1; // nationwide effect on logmu, year 1
  real upsilon[includealpha ? 1 : 0]; // nationwide effect on alpha
  real phi[includephi ? 1 : 0]; // effect of hunting area on per-area bagging rate
  real<lower=0> sigma; // standard deviation of precinct (HMP) effects on average bagging rate
  real<lower=0> tao; // standard deviation of county-level effects on average bagging rate
  real gamma[includegamma ? 1 : 0]; // effect of bagging intensity on alpha
  // vector[NrOfCovs] beta;

  real<lower=0> Somega; // standard deviation of between-year change of omega

  real omegachange_raw[NrOfYears-1]; // raw between-year change in nationwide effect on logmu

  real<lower=-1,upper=1> rholambda; // correlation, county effects
  real<lower=-1,upper=1> rhochi; // correlation, HMP effects
}

transformed parameters {
  real omega[NrOfYears]; // nationwide effect on average bagging rate
  real<lower=0> alpha[includealpha ? NrOfKretsar : 0]; // shape parameter of the Gamma-Poisson
  real logeffect[NrOfKretsar];
  real<lower=0> mu[NrOfKretsar]; // mean bagging rate in HMP
  real chi[NrOfKretsar]; // precinct (HMP) level effect on average bagging rate
  real lambda[NrOfLans,NrOfYears]; // county-level effect on average bagging rate
  // vector[usecov ? NrOfKretsar : 0] B;
  real omegachange[NrOfYears-1]; // between-year change in nationwide effect on logmu

  omega[1] = omega1;
  for (iy in 2:NrOfYears){
    omegachange[iy-1] = omegachange_raw[iy-1] * Somega;
    omega[iy] = omega[iy-1] + omegachange[iy-1];
  }

  // if (usecov){
  //   B = COVMat * beta;
  // }

  for (ik in 1:NrOfKretsar){
    chi[ik] = chi_raw[ik] * sigma; // reparameterization of chi for improved computation
  }
  for (il in 1:NrOfLans){
    for (iy in 1:NrOfYears){
      lambda[il,iy] = lambda_raw[il,iy] * tao; // reparameterization of lambda for improved computation
    }
  }

  for (ik in 1:NrOfKretsar){
    logeffect[ik] = omega[KretsYearSeq[ik]] + lambda[KretsListLanSeqID[ik],KretsYearSeq[ik]] + chi[ik];
    // if (usecov){
    //   logeffect[ik] += B[ik];
    // }
    mu[ik] = exp(logeffect[ik]);
    if (includealpha){
      if (includegamma){
        alpha[ik] = exp(upsilon[1] + gamma[1]*(lambda[KretsListLanSeqID[ik],KretsYearSeq[ik]] + chi[ik]));
      } else {
        alpha[ik] = exp(upsilon[1]);
      }
    }
  }
}
model {
  for (r in 1:Reports){
    if (r != leftout){
      // Number of killed animals in report r is negative binomial distributed,
      // defined from the mean (mu[KretsNrSeqID[r]] * A[r]) and shape (alpha[KretsNrSeqID[r]])
      // of the gamma distribution of the equivalent gamma-Poisson mixture
      if (includealpha){
        if (includephi){
          K[r] ~ neg_binomial_2(mu[KretsNrSeqID[r]] * A[r] * exp(logrelA[r] * phi[1]), alpha[KretsNrSeqID[r]]);
        } else {
          K[r] ~ neg_binomial_2(mu[KretsNrSeqID[r]] * A[r], alpha[KretsNrSeqID[r]]);
        }
      } else {
        if (includephi){
          K[r] ~ poisson(mu[KretsNrSeqID[r]] * A[r] * exp(logrelA[r] * phi[1]));
        } else {
          K[r] ~ poisson(mu[KretsNrSeqID[r]] * A[r]);
        }
      }
    }
  }

  for (ik in 1:NrOfKretsar){
    if (FirstKretsAppearance[ik]){
      chi_raw[ik] ~ normal(0,1); // reparameterization for improved computation
    } else {
      chi_raw[ik] ~ normal(rhochi*chi_raw[FwdConnectionSeqID[ik,1]], sqrt(1-rhochi^2));
    }
  }

  for (il in 1:NrOfLans){
    lambda_raw[il,1] ~ normal(0,1);
    for (iy in 2:NrOfYears){
      lambda_raw[il,iy] ~ normal(rholambda*lambda_raw[il,iy-1], sqrt(1-rholambda^2));
    }
  }

  if (includealpha){
    upsilon[1] ~ normal(0,3.0); // based on 95% coefficient of variation between 0.1 and 10
  }

  if (includegamma){
    gamma[1] ~ normal(0,1); // based on being ~95% sure that the shape parameter changes by less than a factor of four with twice the intensity; same as CoV doubling or halving
  }

  if (includephi){
    phi[1] ~ normal(0,0.5); // based on 95% between -0.585 and 0.585, i.e. at most a 300% increase in hunting rate with twice the area
  }

  omega1 ~ normal(-8.7,4.3);
  for (iy in 2:NrOfYears){
    omegachange_raw[iy-1] ~ normal(0,1);
  }
  Somega ~ cauchy(0,2.5);

  tao ~ lognormal(0.20,2.3);

  sigma ~ lognormal(-1.3,0.86);
}

generated quantities {
  vector[doexactloo ? 1 : 0] log_lik_exact;
  vector[AnalyzeWAICandLOO ? Reports : 0] log_lik; // can't be computed if doexactloo

  if (doexactloo){
    if (includealpha){
      if (includephi){
        log_lik_exact[1] = neg_binomial_2_lpmf(K[leftout] | mu[KretsNrSeqID[leftout]] * A[leftout] * exp(logrelA[leftout] * phi[1]), alpha[KretsNrSeqID[leftout]]);
      } else {
        log_lik_exact[1] = neg_binomial_2_lpmf(K[leftout] | mu[KretsNrSeqID[leftout]] * A[leftout], alpha[KretsNrSeqID[leftout]]);
      }
    } else {
      if (includephi){
        log_lik_exact[1] = poisson_lpmf(K[leftout] | mu[KretsNrSeqID[leftout]] * A[leftout] * exp(logrelA[leftout] * phi[1]));
      } else {
        log_lik_exact[1] = poisson_lpmf(K[leftout] | mu[KretsNrSeqID[leftout]] * A[leftout]);
      }
    }
  }

  if (AnalyzeWAICandLOO){
    for (r in 1:Reports){
      if (includealpha){
        if (includephi){
          log_lik[r] = neg_binomial_2_lpmf(K[r] | mu[KretsNrSeqID[r]] * A[r] * exp(logrelA[r] * phi[1]), alpha[KretsNrSeqID[r]]);
        } else {
          log_lik[r] = neg_binomial_2_lpmf(K[r] | mu[KretsNrSeqID[r]] * A[r], alpha[KretsNrSeqID[r]]);
        }
      } else {
        if (includephi){
          log_lik[r] = poisson_lpmf(K[r] | mu[KretsNrSeqID[r]] * A[r] * exp(logrelA[r] * phi[1]));
        } else {
          log_lik[r] = poisson_lpmf(K[r] | mu[KretsNrSeqID[r]] * A[r]);
        }
      }
    }
  }
}

Yeah these \sqrt{1 - \rho^2} terms can cause problems.

Can you change:

chi_raw[ik] ~ normal(rhochi*chi_raw[FwdConnectionSeqID[ik,1]],sqrt((1-rhochi^2)));


and:

chi[ik]=chi_raw[ik] * sigma;


to

chi_raw[ik] ~ normal(0, 1);


and

chi[ik] = rhochi*chi[FwdConnectionSeqID[ik,1]] + chi_raw[ik] * sqrt((1-rhochi^2)) * sigma;


and similarly for lambda? Is that possible here?
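A quick numerical check (outside Stan) that this move keeps the model unchanged: defining chi_t = rho*chi_{t-1} + z*sqrt(1-rho^2)*sigma with z ~ Normal(0,1) gives the same Normal(rho*chi_{t-1}, sigma*sqrt(1-rho^2)) conditional as before, and hence the same stationary Normal(0, sigma) marginal. A sketch with arbitrary rho and sigma values:

```python
import numpy as np

rng = np.random.default_rng(3)
rho, sigma = 0.7, 1.2  # arbitrary example values
T, n = 200, 20_000     # chain length, number of replicate chains

chi = rng.normal(0.0, sigma, size=n)  # start at the stationary distribution
for _ in range(T):
    z = rng.normal(0.0, 1.0, size=n)  # the standard-normal "chi_raw" draws
    chi = rho * chi + z * np.sqrt(1 - rho**2) * sigma

print(chi.mean())  # ~ 0
print(chi.std())   # ~ sigma = 1.2
```

The difference is only in which quantities are sampler parameters: here the standard-normal innovations are sampled and the correlated effects are deterministic transforms, so the posterior geometry no longer couples chi_raw to rho through the sqrt(1-rho^2) scale.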

Wow, that appears to have made a massive difference. I will run longer chains to investigate long-term patterns, but here is a comparison of my previous code and the above suggestion by @bbbales2. Thought it might be useful if someone else stumbles on this looking for ways to improve similar code.

Traceplot of rhochi, my previous code

Traceplot of rhochi, bbbales2’s transformation

Energy vs rhochi, my previous code

Energy vs rhochi, bbbales2’s transformation

Effective sample size went from 59 to 375. In addition, computation time for the same number of iterations appears to have been reduced by around 50%, which was unexpected to me. Is this because there are fewer calculations to evaluate the gradients?

Thank you so much @bbbales2 for this!


Good stuff!

Probably. The trick to Stan’s sampler (which evolved from NUTS) is that it can adapt how far it goes each MCMC draw. It must be U-turning faster with the new parameterization, and that means fewer gradient evaluations.

I’ve run some longer chains to make sure all model parameters have converged. I obtain an effective sample size up to 20 times larger for some datasets while still substantially reducing computation time. So, yeah, good stuff indeed! :)
