Huge RHATs, Extreme Model Performance Variability Depending on Priors

Hi all,
I’ve used Stan for a while, but this is my first time posting here. The issues I’m currently experiencing I’ve never seen before. The following program run with the HMC sampling algorithm, adapt_delta=0.85, and 4000 iterates (2000 warmups) leads to a point mass for parameters (specifically \rho at 1, always) and different values of the log-posterior depending on starting values. The model generates 0 divergent transitions, but Rhats are on the order of 100 trillion for most parameters and 100 million for others.

data {

int<lower=1> T1;           // Total number of periods.
int<lower=1> T0;           // Number of periods for differenced values.
vector[T1] wtildeg;        // Wage-weighted prices (goods)
vector[T1] wtildes;        // Wage-weighted prices (services)
vector[T0] ptilde;         // Log price differences
vector[T0] qtilde;         // Log-consumption ratios differences
vector[2] lwtilde[T0];     // log(w / p_j) for each j in {g,s}
vector[2] KL[T0];          // log(K/L) for each firm

// Fixed Parameters
real nug_mean;
real nus_mean;
real omegag_beta;
real omegas_beta;
real alpha_alpha;
real alpha_betag;
real alpha_betas;
vector[3] zero_vec;

}

// PARAMETER DECLARATIONS
parameters {

real<lower=0,upper=1> rho;              // elasticity of substitution for c's
real<lower=0> nnug;
real<lower=0> nnus;
real<lower=0,upper=1> omegag;
real<lower=0,upper=1> omegas;
vector<lower=0,upper=1>[2] alpha;             // elasticities of substitution between K_f and L_f in production
cholesky_factor_corr[3] L_Omega_cov;          // cholesky for cov's of L_Omega
vector<lower=0>[3] L_sigma_omega;             // s.d.'s for differences shocks

}

// TRANSFORMATION BLOCK
transformed parameters{

matrix[3,3] L_Omega;                    // Combined variance/covariance matrix for zeta's
vector[3] expect[T0];                   // This term is the centering vector for the likelihood.
real<upper=0> nug;
real<upper=0> nus;

nug = - nnug;
nus = - nnus;

// Variance/covariance matrices
L_Omega = diag_pre_multiply(L_sigma_omega,L_Omega_cov);

for(t in 1:T0){
  expect[t][1] = (1-rho)/rho * qtilde[t] - (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t+1] * omegag / (1-omegag) )^(nug/(nug-1)))
  + (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t+1] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t] * omegag / (1-omegag) )^(nug/(nug-1)))
  - (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (1.0/rho) * ptilde[t];
  for(j in 1:2){
    expect[t][j+1] = lwtilde[t][j] - alpha[j] * KL[t][j];
  }
}

}

// MODEL BLOCK
model {

// Structural Parameter Priors
rho ~ beta(1.0,1.0);
omegag ~ beta(1.0,omegag_beta);
omegas ~ beta(1.0,omegas_beta);
nnug ~ lognormal(nug_mean,1.0);
nnus ~ lognormal(nus_mean,1.0);
alpha[1] ~ beta(alpha_alpha,alpha_betag);
alpha[2] ~ beta(alpha_alpha,alpha_betas);

// Variance/covariance draws
L_Omega_cov ~ lkj_corr_cholesky(2.0);
L_sigma_omega ~ cauchy(0.0,2.0);

// Likelihood Error
for(t in 1:T0){
  expect[t] ~ multi_normal_cholesky(zero_vec,L_Omega);
}

}

By theory, \rho \in (0,1) or \rho < 0. When I run a separate model with \rho < 0, everything converges properly with Rhats near 1 and high effective sample size:

// DATA BLOCK
data {

int<lower=1> T1;           // Total number of periods.
int<lower=1> T0;           // Number of periods for differenced values.
vector[T1] wtildeg;        // Wage-weighted prices (goods)
vector[T1] wtildes;        // Wage-weighted prices (services)
vector[T0] ptilde;         // Log price differences
vector[T0] qtilde;         // Log-consumption ratios differences
vector[2] lwtilde[T0];     // log(w / p_j) for each j in {g,s}
vector[2] KL[T0];          // log(K/L) for each firm

// Fixed Parameters
real nug_mean;
real nus_mean;
real omegag_beta;
real omegas_beta;
real alpha_alpha;
real alpha_betag;
real alpha_betas;
vector[3] zero_vec;

}

// PARAMETER DECLARATIONS
parameters {

real<lower=0> nrho;              // elasticity of substitution for c's
real<lower=0> nnug;
real<lower=0> nnus;
real<lower=0,upper=1> omegag;
real<lower=0,upper=1> omegas;
vector<lower=0,upper=1>[2] alpha;             // elasticities of substitution between K_f and L_f in production
cholesky_factor_corr[3] L_Omega_cov;          // cholesky for cov's of L_Omega
vector<lower=0>[3] L_sigma_omega;             // s.d.'s for differences shocks
}

// TRANSFORMATION BLOCK
transformed parameters{

matrix[3,3] L_Omega;                    // Combined variance/covariance matrix for zeta's
vector[3] expect[T0];                   // This term is the centering vector for the likelihood.
real<upper=0> nug;
real<upper=0> nus;
real<upper=0> rho;

rho = - nrho;
nug = - nnug;
nus = - nnus;

// Variance/covariance matrices
L_Omega = diag_pre_multiply(L_sigma_omega,L_Omega_cov);

for(t in 1:T0){
  expect[t][1] = (1-rho)/rho * qtilde[t] - (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t+1] * omegag / (1-omegag) )^(nug/(nug-1)))
  + (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t+1] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t] * omegag / (1-omegag) )^(nug/(nug-1)))
  - (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (1.0/rho) * ptilde[t];
  for(j in 1:2){
    expect[t][j+1] = lwtilde[t][j] - alpha[j] * KL[t][j];
  }
}

}

// MODEL BLOCK
model {

// Structural Parameter Priors
nrho ~ lognormal(-0.5,1);
omegag ~ beta(1.0,omegag_beta);
omegas ~ beta(1.0,omegas_beta);
nnug ~ lognormal(nug_mean,1.0);
nnus ~ lognormal(nus_mean,1.0);
alpha[1] ~ beta(alpha_alpha,alpha_betag);
alpha[2] ~ beta(alpha_alpha,alpha_betas);

// Variance/covariance draws
L_Omega_cov ~ lkj_corr_cholesky(2.0);
L_sigma_omega ~ cauchy(0.0,2.0);

// Likelihood Error
for(t in 1:T0){
  expect[t] ~ multi_normal_cholesky(zero_vec,L_Omega);
}

}

The two models above are simultaneous equations models. If I eliminate the 2nd and 3rd equations in the likelihood, and consider \rho \in (0,1), the program runs efficiently and converges quickly:

// DATA BLOCK
data {

int<lower=1> T1;           // Total number of periods.
int<lower=1> T0;           // Number of periods for differenced values.
vector[T1] wtildeg;        // Wage-weighted prices (goods)
vector[T1] wtildes;        // Wage-weighted prices (services)
vector[T0] ptilde;         // Log price differences
vector[T0] qtilde;         // Log-consumption ratios differences

// Fixed Parameters
real nug_mean;
real nus_mean;
real omegag_beta;
real omegas_beta;

}

// PARAMETER DECLARATIONS
parameters {

real<lower=0,upper=1> rho;              // elasticity of substitution for c's
real<lower=0> nnug;
real<lower=0> nnus;
real<lower=0,upper=1> omegag;
real<lower=0,upper=1> omegas;
real<lower=0> precision;

}

// TRANSFORMATION BLOCK
transformed parameters{

vector[T0] expect;                   // This term is the centering vector for the likelihood.
real<upper=0> nug;
real<upper=0> nus;
real<lower=0> sigma;

nug = - nnug;
nus = - nnus;
sigma = sqrt(1.0 / precision);

for(t in 1:T0){
  expect[t] = (1-rho)/rho * qtilde[t] - (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t+1] * omegag / (1-omegag) )^(nug/(nug-1)))
  + (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t+1] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (rho-nug)/rho/nug
  * log(omegag + (1-omegag) * (wtildeg[t] * omegag / (1-omegag) )^(nug/(nug-1)))
  - (rho-nus)/rho/nus
  * log(omegas + (1-omegas) * (wtildes[t] * omegas / (1-omegas) )^(nus/(nus-1)))
  + (1.0/rho) * ptilde[t];
}

}

// MODEL BLOCK
model {

// Structural Parameter Priors
rho ~ beta(1,1);
omegag ~ beta(1.0,omegag_beta);
omegas ~ beta(1.0,omegas_beta);
nnug ~ lognormal(nug_mean,1.0);
nnus ~ lognormal(nus_mean,1.0);

// Variance draw
precision ~ gamma(2.0,4.0);

// Likelihood Error
for(t in 1:T0){
  expect[t] ~ normal(0.0,sigma);
}

}

I find this behavior very curious. I am attaching two datasets (stan_agg_ge_naive.r and stan_agg_pe_naive.r). The first dataset should be used on the first two programs listed above and the second on the last program, without the simultaneous equations. Any help here is greatly appreciated.

  • Nick

stan_agg_ge_naive.r (11.7 KB)
stan_agg_pe_naive.r (5.9 KB)

2 Likes

The issues w.r.t. huge R-hat and low effective sample size are identical to those discussed here: Huge R-hat and low effective size in piecewise exponential model when using more than one predictor and interval

However, I do not have a piece-wise model with possible issues regarding differentiability. The posterior is continuous and differentiable in all parameters.

Is the model defined also for \rho=0? And if it is, is there possibility of numerical problems near 0? For example generalized Pareto distribution is defined with k=0, but the behavior near 0 for naive implementation is numerically unstable. See code example at https://mc-stan.org/users/documentation/case-studies/gpareto_functions.html#generalized-pareto-distribution

This is probably the world record.

2 Likes

So I ran the variational inference (VI) algorithm, instead. It has no issues converging, but shows that the mean of \rho \approx 0.975 with a standard deviation of 0.0018, so the posterior for \rho is very tight. The model (as currently coded) is not defined for \rho = 0, though theoretically it is. I don’t think that is the issue though. The issue is the pile up near 1 and for some reason, the HMC/NUTS algorithm cannot uncover that really tight posterior, whereas VI can. I even tried seeding the HMC/NUTS with VI means as starting values and still encountered the same problem, with huge Rhats and pile up at 1.

I know that VI is still in Beta, but is there a reason it may be more flexible for tight posteriors near support bounds? I’ve decided to stick with the VI for now, just because it seems to be working well.

I think this might be the first time that someone tells VI works better than HMC.

Now I have time to ask more questions

do you mean the chains doesn’t move at all?

Which code you are using to compute Rhats? If the chains are not moving at all and draws are equal, none of the codes I know should produce 100 trillion. If the chains are moving these values are still incredible, but would say the posterior is multimodal.

If it works well for \rho < 0, it seems to be in contradiction with VI result. What does Pareto k diagnostic say about the VI result (http://proceedings.mlr.press/v80/yao18a.html)

VI finds just one mode, so do you get the some mode with different random initializations?

That should show as divergences or rejections.

If all chains are initialized with the same starting values, then to get huge Rhat could be explained that you posterior really is leaning on 1, and with logistic transformation used to unconstrain rho, there is infinite range where the chains end up. So it seems HMC and diagnostics are really telling you something. Do you think \rho=1 (or \rho \rightarrow 1) is sensible? If not, try adding a 1 avoiding prior. It would help to understand what is the role of \rho.

2 Likes

Seriously! This is so bad it’s actually pretty cool.

Yes. For \rho, the chain is stuck at \rho = 1.

\hat{R} is computed in ShinyStan. The model is run in CmdStan and imported to R using read_stan_csv(). I then use launch_shinystan() and examine the posterior and its diagnostics with the ShinyStan API.

This is a structural economic model (macro). \rho governs the elasticity of substitution which is 1/(1-\rho). As \rho \to 1 the goods are perfect substitutes. \rho = 0 is Cobb-Douglas utility which we can rule out by theory for the particular application. Thus, either \rho < 0 or \rho \in (0,1). I estimate both models and compare them using Bayes’ factors and out-of-sample cross-validation. \rho \to 1 is a sensible value, but so is \rho < 0.

When I restrict \rho \in (0,1), Pareto-k is 4.71. With \rho < 0, Pareto-k is 1.58.

Yes.

It shows 0 divergences and no mixing, with huge \hat{R}.

1 Like

1 exactly in floating point representation or 1 as rounded in ShinyStan?

Ah, ShinyStan is using old copy pasted code from RStan 2015. The code has since been updated couple times. Can you report results also with rstan::monitor() or with posterior package?

I know you posted code and data, but it’s unlikely I have time to run until after two weeks (I can post these quick questions in middle of other things). Can you also report if Rhat is big for all parameters or some?

Which means the VI approximation is bad. If I counted correctly there are less than 15 parameters so teh bad behavior is not due to high dimensionality, but VI is underestimating the uncertainty or biased.

If \rho \to 1 is also likely then the logistic transformation in

real<lower=0,upper=1> rho;

is problematic. It would be useful to check what are the unconstrained values for \rho during the warmup. At the point when the unconstrained value is large enough \rho will round to 1 in floating point presentation and there will be numerical problems at that point and a bit before that.

I ran a second model in HMC/NUTS using a logistic transformation on \rho and changed the prior for the transformed variable to \mathcal{N}(0,10). I checked \widehat{R} in RStan 2.21.2. and got the highest values (6 to 53) for the terms in L_Omega_cov, the var/cov matrix with LKJ prior. Re-evaluating \widehat{R} for the original model (without logistic transform) in HMC/NUTS using RStan 2.21.2, I get the same results with \rho having an \widehat{R} \approx 1.31 and the var/cov terms having large values. In both models \widehat{R} \leq 3.2 for all parameters except var/cov terms, and <2 for \rho, \nu_g, \nu_s, \omega_g, and \omega_s. In both models, the effective sample size is <5 for all parameters, run with 4000 warm up draws and 4000 post warm up draws divided over 4 parallel chains (2000 total draws incl. warm up per chain).

To check the warm-up values I ran the model in RStan using:

stan_fit = stan(file=“model.stan”,data=data1,seed=262627901,chains=1,cores=1,save_warmup=TRUE)

With summary(stan_fit) I get \widehat{R} near 1 with this model only. Are warmup draws included in \widehat{R} calculations?

Further, there is a bug in RStan 2.21.2. When running

rstan::extract(stan_fit, “rho”, inc_warmup = TRUE),

the output still does not include the warmup values, even though it appears that summary statistics may be taking them into consideration. It is thus not possible to examine (as far as I can tell) the warmup values using standard methods.

Not 1 exactly, when I run the model in RStan. \rho has a mean of 0.9986 with an S.D. of 0.00128.

Like I said in the previous post, when warmups are included rstan::summary(model) reports \widehat{R} \approx 1 and higher effective sample size, suggesting the model has converged, though I suspect that this is because the summary statistics are taking into account the warmup draws.

This indicates the chains are stuck. This can happen either due to multimodality, or due to the numerical inaccuracies.

They should not, and are not used unless you explicitly include them.

What are the mean and sd for different chains for the variable in the unconstrained space?