Benchmarking with inequality constraints on the true value in Stan

1) THE STAN MODEL

=====================================================================

data {
  // ----------------------------
  // Basic dimensions
  // ----------------------------
  int<lower=1> L;                // number of areas
  vector[L] c;                   // lower-bound c_i
  real<lower=0> a;               // sum(theta_i) < a

  // Sample sizes
  int<lower=2> n[L];

  // Direct estimates
  real theta_hat[L];  // sample mean in each county
  real S2[L];         // sample variance in each county

  // Design for mean model
  int<lower=1> pBeta;
  matrix[L, pBeta] X_beta;

  // Design for variance model
  int<lower=1> pGamma;
  matrix[L, pGamma] X_gamma;
}

parameters {
  // ----------------------------
  // Hyperparameters
  // ----------------------------
  vector[pBeta] beta;        // regression coeffs for mean
  real<lower=0> delta2;      // variance for mean prior

  vector[pGamma] gamma;      // regression coeffs for log-precision
  real<lower=0> alpha;       // shape param for tau

  // ----------------------------
  // Area-level parameters
  // ----------------------------
  // tau_i = 1/sigma_i^2 > 0
  vector<lower=0>[L] tau;

  // theta_i  is unconstrained here; I will do the constraints in model block
  vector[L] theta;
}

transformed parameters {
  // sigma_i^2 = 1 / tau_i
  vector[L] sigma2;
  for (i in 1:L) {
    sigma2[i] = 1 / tau[i];
  }
}

model {
  // =================================================================
  //  0) Enforce constraints:  theta_i >= c_i and sum(theta_i) < a
  // =================================================================
  for (i in 1:L) {
    if (theta[i] < c[i]) {
      target += negative_infinity();
    }
  }
  if (sum(theta) >= a) {
    target += negative_infinity();
  }

  // =================================================================
  //  1) Hyperpriors (weak or vague) on beta, delta^2, gamma, alpha
  // =================================================================
  // e.g. wide normal for beta
  target += normal_lpdf(beta  | 0, 100);

  // delta^2 ~ InvGamma(0.001, 0.001)
  target += inv_gamma_lpdf(delta2 | 0.001, 0.001);

  // wide normal for gamma
  target += normal_lpdf(gamma | 0, 100);

  // alpha ~ Gamma(0.5, 0.5)
  target += gamma_lpdf(alpha | 0.5, 0.5);

  // =================================================================
  //  2) Priors: tau_i = sigma_i^-2 ~ Gamma(alpha/2, [alpha * exp(-x_i gamma)]/2)
  // =================================================================
  for (i in 1:L) {
    real rate_i = 0.5 * alpha * exp(-dot_product(X_gamma[i], gamma));
    target += gamma_lpdf(tau[i] | alpha/2, rate_i);
  }

  // =================================================================
  //  3) Priors: theta_i ~ Normal( X_beta[i]*beta, delta^2 )
  // =================================================================
  for (i in 1:L) {
    target += normal_lpdf(theta[i] | dot_product(X_beta[i], beta), sqrt(delta2));
  }

  // =================================================================
  //  4) Likelihood: Double Shrinkage
  //     (a) theta_hat[i] ~ Normal(theta[i], sigma2[i])
  //     (b) (n_i - 1)*S2[i]/sigma2[i] ~ Gamma( shape=(n_i-1)/2, rate=1/2 )
  // =================================================================
  for (i in 1:L) {
    // (a) direct estimate
    target += normal_lpdf(theta_hat[i] | theta[i], sqrt(sigma2[i]));

    // (b) sample variance
    //     (n_i-1)*S2[i]/sigma2[i] ~ Chi^2_{(n_i-1)} => Gamma( (n_i-1)/2, 1/2 )
    real shape_i = 0.5 * (n[i] - 1);
    real rate_i  = 0.5;
    target += gamma_lpdf((n[i] - 1)*S2[i]/sigma2[i] | shape_i, rate_i);
  }
}

2) FITTING THE MODEL IN STAN

=====================================================================

pBeta   <- 1
X_beta  <- matrix(1, nrow = L, ncol = 1)
pGamma  <- 1
X_gamma <- matrix(1, nrow = L, ncol = 1)

stanData <- list(
  L         = L,
  c         = simData$c_i,
  a         = a,
  n         = simData$n_i,
  theta_hat = simData$theta_hat,
  S2        = simData$S2,
  pBeta     = pBeta,
  X_beta    = X_beta,
  pGamma    = pGamma,
  X_gamma   = X_gamma
)

init_fun <- function(chain_id = 1) {
  leftover         <- stanData$a - sum(stanData$c)
  leftover_portion <- leftover * 0.5 / stanData$L
  init_theta       <- stanData$c + leftover_portion

  list(
    theta  = init_theta,
    tau    = rep(1, stanData$L),
    beta   = as.array(rep(0, stanData$pBeta)),   # force dimension [1]
    gamma  = as.array(rep(0, stanData$pGamma)),  # likewise
    alpha  = 1,
    delta2 = 1
  )
}

fit <- stan(
  file    = "nandram_double_shrinkage.stan",
  data    = stanData,
  chains  = 2,
  iter    = 10000,
  warmup  = 5000,
  seed    = 123,
  control = list(adapt_delta = 0.999, max_treedepth = 20),
  init    = init_fun   # pass the function here
)

I am getting these warnings after running and I don’t know why:
Warning messages:
1: There were 10000 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
2: Examine the pairs() plot to diagnose sampling problems

3: The largest R-hat is 2.64, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess

1) This means the simulation of the Hamiltonian diverged. The Hamiltonian should be conserved. It almost always indicates that there are regions of low and high curvature and the first-order simulation of the Hamiltonian we use is not stable.

3, 4, 5) These indicate the model hasn’t converged. It’s usually not going to converge when all the transitions are divergent.

In looking at your code, the transformed parameter code doesn’t match the doc—I think you missed a square root if you are really treating tau as the precision.

vector[L] sigma = inv_sqrt(tau);

This won’t work in Stan. Stan assumes the model has support everywhere (i.e., always has a finite log density). You have to build the constraints into the parameter declarations. For example,

vector<upper=c>[N] theta;

will ensure that theta[n] < c[n]. The constraint on the sum is trickier. We don’t have an easy way to do that. Because theta is not constrained to be positive, you can enforce the other constraint this way:

parameters {
  vector<upper=c[1:N-1]>[N - 1] theta_prefix;
  real<upper=fmin(c[N], a - sum(theta_prefix))> theta_last;
  ...
transformed parameters {
  vector[N] theta = append_row(theta_prefix, theta_last);
  ...

The priors look like they come from BUGS:

inv_gamma_lpdf(delta2 | 0.001, 0.001);

These are problematic in cases where there’s not much data because they pull the posterior into the tails. @andrewgelman talks about this in this paper and covers this specific gamma case:

http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf
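
For example, one weakly informative alternative along those lines is to put the prior on the scale rather than on the variance. A minimal sketch (the half-Cauchy with scale 5 is just one common choice, and it is the one adopted in the revised model later in this thread):

parameters {
  real<lower=0> delta;    // scale of the prior on theta, not the variance
}
model {
  delta ~ cauchy(0, 5);   // half-Cauchy, since delta is declared <lower=0>
}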

Also, a lot of the code can be sped up with vectorization, but I recommend getting it working first.

Thank you for the help. I appreciate this.

I also see you have vector<upper=c>[N] theta; but I am dealing with a lower bound here where theta >= c. How do I go about it?

Here’s one idea:

parameters {
  simplex[L+1] theta_plus;
}
transformed parameters {
  vector[L] theta = c + (a - sum(c))*theta_plus[1:L];
}

Thank you.
Is there a paper that can be used to back these ideas?

I don’t think so—it’s just straightforward algebra: each theta[i] = c[i] + (a - sum(c)) * theta_plus[i] >= c[i] because theta_plus[i] >= 0 and a > sum(c), and sum(theta) = sum(c) + (a - sum(c)) * (1 - theta_plus[L+1]) < a whenever theta_plus[L+1] > 0. These constraints are implemented in different ways in different systems—for example, you can implement HMC to bounce off of boundary constraints if they’re simple, or you can do what Stan does and transform to sample over an unconstrained space. The Stan User’s Guide has examples of implementing custom constraints in Stan for things like a triangle density, and that kind of thing.
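
For the simple lower-bound part of your constraint, here is a minimal sketch of the manual transform-plus-Jacobian approach the User’s Guide describes (it handles only theta[i] >= c[i], not the sum constraint; theta_raw is an illustrative name, and this is essentially what a declared lower bound does under the hood):

parameters {
  vector[L] theta_raw;                     // unconstrained
}
transformed parameters {
  vector[L] theta = c + exp(theta_raw);    // theta[i] >= c[i] by construction
}
model {
  // Jacobian adjustment for theta = c + exp(theta_raw):
  // log |d theta[i] / d theta_raw[i]| = theta_raw[i]
  target += sum(theta_raw);
  // ... then put the prior and likelihood on theta as before
}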

Here is my revised model:

data {
  //----------------------------------------------------------------
  // Basic dimensions & data
  //----------------------------------------------------------------
  int<lower=1> L;             // number of areas
  vector[L] c;                // lower bounds: each theta[i] >= c[i]
  real<lower=sum(c)> a;       // total must be < a

  // Double-shrinkage specifics
  int<lower=2> n[L];
  real theta_hat[L];
  real S2[L];

  // Mean-model design
  int<lower=1> pBeta;
  matrix[L, pBeta] X_beta;

  // Variance-model design
  int<lower=1> pGamma;
  matrix[L, pGamma] X_gamma;
}

parameters {
  //----------------------------------------------------------------
  // (A) (L+1)-Dim Simplex:
  //   We declare 'simplex[L+1] alpha_plus;' so alpha_plus sums to 1,
  //   then only use the FIRST L entries to define theta.
  //----------------------------------------------------------------
  simplex[L+1] alpha_plus;

  //----------------------------------------------------------------
  // (B) Hierarchical parameters for the double-shrinkage model
  //----------------------------------------------------------------
  vector[pBeta] beta;    // mean-model coefficients
  real<lower=0> delta;   // scale for prior on theta
  vector[pGamma] gamma;  // log-precision model
  real<lower=0> alpha_tau;    // shape for tau
  vector<lower=0>[L] tau;     // area-level precision
}

transformed parameters {
  //----------------------------------------------------------------
  // Construct theta so that it respects constraints:
  //   theta[i] >= c[i], and sum(theta) < a
  // Using:  theta[i] = c[i] + (a - sum(c)) * alpha_plus[i]
  // Because the (L+1)th entry of alpha_plus ensures sum(alpha_plus[1:L]) < 1.
  //----------------------------------------------------------------
  vector[L] theta;
  {
    real sum_c = sum(c);
    for (i in 1:L) {
      theta[i] = c[i] + (a - sum_c) * alpha_plus[i];
    }
    // The (L+1)th component of alpha_plus is not directly used,
    // but it ensures sum(alpha_plus[1:L]) < 1 almost surely.
  }
}

model {
  //----------------------------------------------------------------
  // 1) By default, alpha_plus is uniform over the simplex[L+1].
  //    (No explicit prior needed.  You COULD do alpha_plus ~ dirichlet(...);)
  //----------------------------------------------------------------

  //----------------------------------------------------------------
  // 2) Priors on the hierarchical parameters
  //----------------------------------------------------------------
  beta ~ normal(0, 5);
  delta ~ cauchy(0, 5);     // half-Cauchy for scale
  gamma ~ normal(0, 5);
  alpha_tau ~ gamma(10, 1);

  // area-level precision:
  // tau[i] ~ Gamma(alpha_tau/2, rate_i=0.5*alpha_tau*exp(-X_gamma[i]*gamma))
  for (i in 1:L) {
    real rate_i = 0.5 * alpha_tau * exp(-dot_product(X_gamma[i], gamma));
    tau[i] ~ gamma(alpha_tau / 2, rate_i);
  }

  //----------------------------------------------------------------
  // 3) Prior that shrinks theta around X_beta*beta with scale=delta
  //    i.e.  theta[i] ~ Normal( X_beta[i]*beta, delta )
  //----------------------------------------------------------------
  for (i in 1:L) {
    target += normal_lpdf(theta[i]
                          | dot_product(X_beta[i], beta),
                            delta);
  }

  //----------------------------------------------------------------
  // 4) Double-shrinkage likelihood
  //    (a) direct estimates: theta_hat[i] ~ Normal(theta[i], 1/sqrt(tau[i]))
  //    (b) sample variance: (n[i]-1)*S2[i]*tau[i] ~ chi^2_(n[i]-1)
  //----------------------------------------------------------------
  target += normal_lpdf(theta_hat | theta, inv_sqrt(tau));

  for (i in 1:L) {
    real shape_i = 0.5 * (n[i] - 1);
    real rate_i  = 0.5;
    target += gamma_lpdf( (n[i]-1) * S2[i] * tau[i]
                          | shape_i, rate_i );
  }
}

generated quantities {
  //----------------------------------------------------------------
  // For convenience: store the final theta & its sum
  //----------------------------------------------------------------
  vector[L] theta_out;
  real sum_theta;
  {
    real sum_c = sum(c);
    for (i in 1:L) {
      theta_out[i] = c[i] + (a - sum_c) * alpha_plus[i];
    }
    sum_theta = sum(theta_out);
    // By construction, sum_theta <= a almost surely (strictly < a measure-zero).
  }
}

I have made the necessary corrections as per the suggestions. Is this done correctly?

Nobody is going to be able to tell from just reading your model code. The best thing to do is test it with simulated data.
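
For example, one way to generate fake data without leaving Stan is a small fixed-parameter program that draws theta_hat and S2 from the sampling model at known theta and tau. This is only a sketch (run it with algorithm = fixed_param); the names theta_hat_sim and S2_sim are illustrative:

data {
  int<lower=1> L;
  int<lower=2> n[L];
  vector[L] theta;                 // "true" area means, chosen by hand
  vector<lower=0>[L] tau;          // "true" area precisions
}
generated quantities {
  real theta_hat_sim[L];
  real S2_sim[L];
  for (i in 1:L) {
    // direct estimate: theta_hat[i] ~ Normal(theta[i], 1/sqrt(tau[i]))
    theta_hat_sim[i] = normal_rng(theta[i], inv_sqrt(tau[i]));
    // sample variance: (n[i]-1) * S2[i] * tau[i] ~ chi^2_(n[i]-1)
    S2_sim[i] = chi_square_rng(n[i] - 1) / ((n[i] - 1) * tau[i]);
  }
}

If the posterior from your model recovers the theta and tau you used to simulate, that is good evidence the implementation matches the intended model.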

Your likelihood will be a lot faster as

theta ~ normal(X_beta * beta, delta);

And the prior as

tau ~ gamma(0.5 * alpha_tau, 0.5 * alpha_tau * exp(-X_gamma * gamma));

Matrix multiply is much more efficient than pulling out rows and multiplying them.

Similarly, you can simplify the transformed parameters as

vector[L] theta = c + (a - sum(c)) * alpha_plus[1:L];

What the simplex constraint ensures is that sum(alpha_plus) = 1 to within double precision. With computer arithmetic, notions like “almost surely” derived from math are wrong because we can overflow or underflow with finite probability.