Nice histograms vs better loo - what should be the priority?

I feel more confident that the priors are in the right place if the posterior histograms are roughly bell-shaped - or at least going to zero at the top and bottom ends of the posterior range. If some of them end in a cliff at one end or the other, I assume that the priors should be expanded. But often expanding them like that ends up with somewhat worse loo - maybe 10 or 20 points worse. (Of course sometimes they have to be like that for constraints, identifiability, etc., but I don't mean those cases.)

Intuitively I feel the better-shaped posteriors are worth the trade-off. Loo is an estimate after all and could be exaggerated in the chopped-off case. But I have no real analytical reasoning for this position. Is there any? Is this a common thing to happen?

The models I am using here have identifiability issues - transforms of the parameters can give essentially the same model - but known solutions to that seem to end up eliminating some good models, so I am trying to control it with priors. That could be driving what is happening here. But maybe it is fairly common - I would like to know.

What kind of priors and models are you using? Can you show the Stan code?

It’s about fitting affine processes for interest rates. We are using one CIR-type process and two Vasiceks, which are correlated. These 3 processes evolve slowly and stochastically over time. The evolution distribution provides the prior for the t+1 fitted value given the t one. The fitted interest rate at each time for each maturity is a linear function of the 3 processes, where the coefficients C_j(tau) and D_j(tau) vary by tau but are constant over time. They are fairly complex functions of the risk-neutral (rn) parameters for the processes. The code starts by computing those Cs and Ds. Then the real-world parameters are computed by adding risk transforms lambda and psi to the rn parameters, and the processes are estimated at each time point t. Then the linear functions are computed.
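In symbols (my shorthand for what the Stan code below computes), the fitted rate for maturity tau at time t is

$$
y_{t,\tau} \approx \sum_{j=1}^{3}\left[C_j(\tau) + D_j(\tau)\,r_{j,t}\right] + \mathrm{corradj}(\tau) + \varepsilon_{t,\tau},
\qquad \varepsilon_{t,\tau} \sim \mathrm{N}(0,\sigma_y^2),
$$

where r_{1,t} is the CIR factor (rC in the code) and r_{2,t}, r_{3,t} are the two Vasicek factors (rV).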

We are not Stan experts and the code is pretty primitive; I’m sure it could be improved. There are known rotations of the parameters that can produce the same fits. We try to control that by using a less general model and a choice of priors that empirically avoids what appear to be alternate modes. The two Vasicek processes can be switched, and our only control for that is to force one of them to have a higher variance. Anyway, code for our most general model with closed-form C and D functions is below. More general models need solutions of ODEs for C and D. We have code for that too.
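(As an aside on the switching issue: a minimal sketch of a more direct fix, not what we actually run, would be to declare the two Vasicek volatilities as an ordered vector, e.g.

parameters {
positive_ordered[2] sigV; // forces sigV[1] <= sigV[2], removing the label-switching mode
}

Our code below instead parameterizes the difference ls3 = sig3 - sig2 with a positive lower bound, which has a similar effect.)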

data {
int<lower=0> N; //# rows
int<lower=0> U; //# columns
matrix[N,U] y;
real ts[U];
real opy; // observations per year
}
transformed data {

int z[U*N,2]; //index for rows and columns of each observation
int o; //# observations in y
o = 0;
for (n in 1:N) {
for (t in 1:U) {
o = o+1;
z[o,1] = n;
z[o,2] = t;
}
}
}

parameters {
real<lower=0.005, upper=0.8> kaprn1;
real<lower=0.005, upper=0.08> kaprn2;
real<lower=0.02, upper=0.0525> kaprn3;
real<lower=0.5, upper=2.2> omrn1;
real<lower=-40, upper=-7> omrn2;
real<lower=5, upper=45> omrn3;
real<lower=-0.6, upper=0.6> lb;
real<lower=-0.4, upper=0.75> ls2;
real<lower=0.02, upper=0.33> ls3; //diff sig3 - sig2
real<lower=-0.98, upper=-0.65> corr;
real<lower=-6, upper=-1> log_sigma_y; //log stdev of residuals
real<lower=-1.75, upper=-0.7> lam1;
real<lower=11, upper=40> lam2;
real<lower=-60, upper=-20> lam3;
real<lower=-1.25, upper=0.3> psi2;
real<lower=-3.5, upper=0> psi3;
real<lower=-6, upper=-0.5> ldel; //log of delta_0
real gam1;
vector<lower=0, upper=20>[N] rC; //CIR short rate
matrix<lower=-25, upper=25>[N,2] rV; //Vasicek short rate for each observed time point
}

transformed parameters {
real beta;
vector[3] kap;
vector[3] om;
vector[3] sig;
vector[3] kaprn; //rn for risk-neutral
vector[3] omrn;
vector[U] tau;
vector[3] lam;
matrix[3,U] C; //C(T)
matrix[3,U] D; //D(T)
vector[U] W;
real del;
real sigma_y;
real h;
vector[U] corradj;
vector[2] Cmv; //first CIR mean and variance
vector[N-1] Cmean;
vector[N-1] Cvar;
matrix[N-1,2] Vmean; // means after first one for Vasicek processes
vector[2] Vm; //first Vasicek mean
vector[2] Vvar; //first Vasicek variance
matrix[2,2] Sig; // covariance matrix for Vasicek
matrix[N,U] meen; // fitted value at each point

//Compute C and D functions

del = exp(ldel);
beta = exp(lb);
kaprn[1] = kaprn1;
omrn[1] = omrn1;
sig[1] = 1;

kaprn[2] = kaprn2;
omrn[2] = omrn2;
sig[2] = exp(ls2);

kaprn[3] = kaprn3;
omrn[3] = omrn3;
sig[3] = sig[2]+ls3;

for (t in 1:U) tau[t] = ts[t];

h = (kaprn[1]^2 + 2*beta)^0.5;
//W = 1/Q for CIR
for (t in 1:U) {
W[t] = 2*h-(kaprn1+h)*(1-exp(h*tau[t])); //tau[t] is tau
D[1,t] = -2*(1-exp(h*tau[t]))/W[t]/tau[t]+gam1; //CIR
D[2,t] = (1-exp(-kaprn2*tau[t]))/kaprn2/tau[t]; //Vasicek
D[3,t] = (1-exp(-kaprn3*tau[t]))/kaprn3/tau[t]; //Vasicek
}
for (t in 1:U) {
C[1,t] = -(omrn1/beta/tau[t])*(2*log(2*h/W[t])+tau[t]*kaprn1+tau[t]*h) + del;
C[2,t] = sig[2]^2*(-2+2*D[2,t]+kaprn[2]*tau[t]*D[2,t]^2)/4/kaprn[2]^2+omrn[2]/kaprn[2]*(1-D[2,t]);
C[3,t] = sig[3]^2*(-2+2*D[3,t]+kaprn[3]*tau[t]*D[3,t]^2)/4/kaprn[3]^2+omrn[3]/kaprn[3]*(1-D[3,t]);
}
for (t in 1:U) corradj[t] = (D[2,t]+D[3,t]-1+(exp(-tau[t]*(kaprn[2]+kaprn[3]))-1)/(kaprn[2]+kaprn[3])/tau[t])*corr*sig[2]*sig[3]/kaprn[2]/kaprn[3];

// Compute short rate processes

lam[1] = lam1;
kap[1] = kaprn1 - beta * lam1;
om[1] = omrn1;

lam[2] = lam2;
om[2] = omrn2 + sig[2] * lam2;
kap[2] = kaprn[2]-sig[2]*psi2 ;

lam[3] = lam3;
om[3] = omrn3 + sig[3] * lam3;
kap[3] = kaprn[3]-sig[3]*psi3;

Cmv[1] = om[1]/kap[1];
Cmv[2] = om[1]/kap[1]^2*beta/2; //variance om1*beta/(2*kap1^2)

for (n in 1:N-1) {
Cmean[n] = rC[n]/exp(kap[1]/opy) + om[1]*(1-exp(-kap[1]/opy))/kap[1]; // mean for CIR process
Cvar[n] = beta/kap[1]*(1-exp(-kap[1]/opy))*(rC[n]/exp(kap[1]/opy) + om[1]*(1-exp(-kap[1]/opy))/2)/kap[1];
} //gamma mean and variance for next CIR step, used in prior for next rC where step d = 1/opy

Vm[1] = om[2]/kap[2];
Vm[2] = om[3]/kap[3];
Vvar[1] = sig[2]^2/2/kap[2];
Vvar[2] = sig[3]^2/2/kap[3];

for (n in 1:N-1) {
Vmean[n,1] = rV[n,1]+(om[2]-kap[2]*rV[n,1])/opy;
Vmean[n,2] = rV[n,2]+(om[3]-kap[3]*rV[n,2])/opy;
} //mean for rV[n+1,i]
for (i in 1:2) Sig[i,i] = sig[i+1]^2/opy;
Sig[1,2] = corr*sig[3]*sig[2]/opy;
Sig[2,1] = Sig[1,2];

// Finally moments of the fit
for (n in 1:N) {
for (u in 1:U) {
meen[n,u] = D[1,u]*rC[n]+C[1,u]+D[2,u]*rV[n,1]+C[2,u]+D[3,u]*rV[n,2]+C[3,u]+corradj[u];
}
}
sigma_y = exp(log_sigma_y); //makes prior unbiased

//Priors not specified are uniform over their defined ranges, like non-negative reals for sigma_y; transformed parameters are simulated but don’t have priors.
}

model {
gam1 ~ double_exponential(0,0.01);
rC[1] ~ gamma(2*om[1]/beta,2*kap[1]/beta); // long term distribution of short rate
for (n in 2:N) rC[n] ~ gamma(Cmean[n-1]^2/Cvar[n-1],Cmean[n-1]/Cvar[n-1]); //in Stan mean of gamma is alpha/beta, etc
rV[1,1] ~ normal(Vm[1],Vvar[1]^0.5);
rV[1,2] ~ normal(Vm[2],Vvar[2]^0.5);
for (n in 2:N) {rV[n,] ~ multi_normal(Vmean[n-1,],Sig);}
for (u in 1:U) {for (n in 1:N) y[n,u] ~ normal(meen[n,u],sigma_y);}
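// (possible speedup, a sketch rather than what we ran: the double loop above
// could be replaced by one vectorized statement such as
//   to_vector(y) ~ normal(to_vector(meen), sigma_y);
// which is the same likelihood with a smaller autodiff tree)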
}

generated quantities {
vector[o] log_lik;
for (j in 1:o) log_lik[j] = normal_lpdf(y[z[j,1], z[j,2]] | meen[z[j,1], z[j,2]], sigma_y);
}

Don’t use uniform priors, or hard constraints more generally, unless the bounds represent true constraints (such as scale parameters being restricted to be positive, or correlations restricted to be between -1 and 1). See more under “Don’t use uniform priors” in the Prior Choice Recommendations wiki.

Now you have a huge number of lower/upper constraints with seemingly ad hoc bounds.

Use constraints only for true constraints, and use smooth priors to otherwise inform the likely ranges of the parameters.
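For example (a sketch only, with made-up numbers), instead of

real<lower=0.005, upper=0.8> kaprn1;

you could keep just the true constraint and express the likely range with a smooth prior:

parameters {
real<lower=0> kaprn1; // positivity is the only true constraint
}
model {
kaprn1 ~ lognormal(log(0.1), 1); // soft information replacing the ad hoc bounds
}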

Those are in response to another problem we were having - it takes several days to do just 1000 iterations on a PC, plus there seem to be numerous bad modes around. Pushing the priors to where the better parameters seem to be was our attempt at a workaround for those problems. But if we get posteriors that look too narrow by the shape of their histograms, we expand the priors. Usually that will get better histograms, but often with worse loo. That was the source of my original question - go for the better loo or the better-shaped histograms?

But seeing your answer also makes me think we might be able to steer away from the bad modes not with hard ranges and uniform priors, but with lighter-tailed priors that allow those points while making them less likely.

Yes, this is what we have often seen in other problems. With lighter-tailed priors that allow those points but make them less likely, it is still possible that the likelihood and prior are in conflict. You will then not see cliff edges, but you should compare the prior and posterior and check the sensitivity to prior changes. The last time I saw a similar prior-vs-loo problem (in the topic “Balancing loo performance and prior sensitivity analysis”), the reason was a typo. You should also check the loo diagnostics, as failing importance sampling can also sometimes explain this kind of behavior.
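One cheap way to do the prior-posterior comparison (a sketch, assuming the prior for the parameter does not depend on the data) is to draw the same quantity from its prior in generated quantities and overlay the two marginals, e.g. for gam1 in your model:

generated quantities {
real gam1_prior = double_exponential_rng(0, 0.01); // prior draw to overlay on the posterior of gam1
}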

Is there a good site with details and examples of what to look for in the loo diagnostics?

I have some output back - I had to make some big guesses on some of the priors so I think it will improve - but some observations:

  • Not getting cliff edges
  • Ran in about 1/3 of the time
  • But loo is much worse: elpd was about 1570, now down to around 1370 (LL is positive because interest rates are in a very small range)
  • But before, it seemed too high compared to competing models. The best other model is at about 1340. This model is supposed to be better, but I was never comfortable with it being that much better. Any ideas about why it was? Maybe we were finding a mode that was not representative of the population? (Assuming we don’t get back to the old range of loo when we tweak the priors. That seems unlikely.)

The loo package documentation has this.

Please let me know if you think there is something unclear or missing.

Lower/upper interval constraints use a logistic transformation, which can warp the parameter space in strange ways, especially when there is a “cliff edge”; this can lead to slow sampling. Getting rid of unnecessary constraints makes the posterior easier to sample.
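Concretely, for a parameter declared with lower=a, upper=b, Stan samples an unconstrained value $y$ and maps it as

$$
\theta = a + (b - a)\,\operatorname{logit}^{-1}(y),
$$

so if the posterior piles up against one of the bounds, $y$ is pushed far into the tail of the unconstrained space, which is hard geometry for the sampler.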

What do the loo diagnostics tell you for these two models?

Hi, I have run into some questions and my model is the same as the one in this topic. If you have time, please take a look at my question - thanks very much!
https://discourse.mc-stan.org/t/latent-variables-divergent-transitions-and-prior-selection/24125?u=paparize