High ESS for parameters, low ESS for transformed parameters?

I'm working on a model that uses a Cholesky parameterization to model correlated parameters, and I'm seeing something strange: all the raw parameters of the model have good rhat/ess, but some transformed parameters have poor rhat/ess. Is this a problem for interpretation, or is it not something to worry about given that all the parameters have good rhat/ess? (My hunch is that I do need to worry about this.)

Here’s an example output of the parameters. All have good-enough rhat/ess (this was run for 4,000 iterations).

#> # A tibble: 5 × 10
#>   variable          mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
#>   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
#> 1 eta_od[343,1,2]  0.662  0.661 0.589  0.592  -0.320  1.62  1.00    5252.    3258.
#> 2 eta_od[343,2,2] -0.209 -0.229 0.989  0.981  -1.85   1.41  1.00    8411.    2923.
#> 3 logit_rho        4.93   4.86  0.642  0.617   4.00   6.04  1.00    1036.    1825.
#> 4 log_sigma_o     -3.24  -3.24  0.0446 0.0456 -3.31  -3.17  1.00    1589.    2403.
#> 5 log_sigma_d     -3.34  -3.33  0.0487 0.0485 -3.42  -3.26  1.00    1698.    2451.

The covariance matrix is derived from logit_rho, log_sigma_o, and log_sigma_d in the transformed parameters block. The rhat/ess for the covariance matrix is pretty similar to the raw parameters, which makes sense.

#> # A tibble: 4 × 10
#>   variable         mean  median       sd      mad      q5     q95  rhat ess_bulk ess_tail
#>   <chr>           <dbl>   <dbl>    <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
#> 1 Sigma_od[1,1] 0.00154 0.00153 0.000138 0.000138 0.00133 0.00178  1.00    1589.    2403.
#> 2 Sigma_od[2,1] 0.00137 0.00137 0.000115 0.000114 0.00119 0.00157  1.00    1490.    2304.
#> 3 Sigma_od[1,2] 0.00137 0.00137 0.000115 0.000114 0.00119 0.00157  1.00    1490.    2304.
#> 4 Sigma_od[2,2] 0.00127 0.00127 0.000124 0.000122 0.00108 0.00149  1.00    1698.    2451.

Another value derived in the generated quantities block, beta_od, however, has a really low ess. I’m not sure why this is the case, since it’s derived from raw parameters with far higher ess. In pseudocode, beta_od = cholesky_decompose(Sigma_od) * eta_od;

#> # A tibble: 2 × 10
#>   variable             mean   median     sd    mad      q5     q95  rhat ess_bulk ess_tail
#>   <chr>               <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
#> 1 beta_od[343,1,2]  0.00738  0.00746 0.0233 0.0236 -0.0313  0.0449  1.01     265.    1562.
#> 2 beta_od[343,2,2] -0.0581  -0.0581  0.0238 0.0236 -0.0973 -0.0192  1.02     219.    1333.

beta_od has a low ess_bulk and a relatively high ess_tail, which may indicate multi-modality.
The rhat value reported is actually the max of rhat_bulk and rhat_tail. You can investigate further using the posterior package directly, as in the sketch below.
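
A minimal sketch of what that could look like, assuming a CmdStanR-style fit (the fit object and the variable name here are placeholders, not from the original model):

library(posterior)

# pull the draws for one element as an iterations x chains matrix
m <- extract_variable_matrix(fit$draws(), "beta_od[343,1,2]")

ess_bulk(m)  # rank-normalized bulk ESS
ess_tail(m)  # ESS for the 5% and 95% quantile estimates
rhat(m)      # the reported rhat, i.e. max of bulk and tail rank-normalized split-Rhat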


Interesting. I'm not seeing multi-modality when I plot the draws, and the parameter pairs plots show the expected, vaguely uncorrelated Gaussian blobs.

The pattern of tail ESS running higher than bulk ESS is also present at the parameter level, so my assumption is that whatever is going on stems from logit_rho, log_sigma_o, and log_sigma_d.

Sorry for the late response.

What is going on is that ESS is a linear measure that's not preserved through non-linear transforms. So there's no reason to expect the ESS of \alpha and \alpha^2 to be the same, and you'll usually see Stan report a higher ESS for \alpha (the posterior mean) than for \alpha^2 (the varying component of the posterior variance).
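
A toy illustration of that point (an AR(1) chain standing in for sampler output; not the original model, and the size and direction of the gap will depend on the sampler and the function):

set.seed(1)

# a strongly autocorrelated chain with a standard normal stationary distribution
n <- 10000
alpha <- numeric(n)
for (i in 2:n) {
  alpha[i] <- 0.95 * alpha[i - 1] + sqrt(1 - 0.95^2) * rnorm(1)
}

# ESS for estimating E[alpha] vs. E[alpha^2]: same draws, different ESS
posterior::ess_bulk(alpha)
posterior::ess_bulk(alpha^2)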

A bulk ESS of 200+ should be fine for downstream estimation. The standard error on a posterior expectation (the posterior mean) is \text{sd} / \sqrt{\text{ESS}}, so at \text{ESS} = 200 the error on your mean estimate is 1 / \sqrt{200}, or roughly 1/14 of a posterior standard deviation. Given that the posterior standard deviation is your uncertainty, knowing the posterior mean much more precisely doesn't usually buy you much.
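
Continuing the toy chain above, the same arithmetic via the posterior package (mcse_mean reports the Monte Carlo standard error of the mean directly):

1 / sqrt(200)                                  # ~0.07 of a posterior sd

posterior::mcse_mean(alpha)                    # Monte Carlo SE of the posterior mean
sd(alpha) / sqrt(posterior::ess_mean(alpha))   # roughly the same quantity, by hand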

Out of curiosity, what’s the transform that leads you to beta_od?


Thanks @Bob_Carpenter! That makes sense — I suppose I was mostly surprised that there was a relatively large difference in ESS between beta_od and eta_od/Sigma_od, though the non-linear transformations are probably where the difference lies.

I was working through correlated random walks: beta_od is an array of T matrices, each with dimensions 2 \times S, where S is the number of states in the random walk. All T elements share a common covariance matrix \Sigma_{od}.

I ended up scrapping this approach a few days later in favor of a simpler method that was better suited for the application anyway (hindsight being 20/20: looking at this function now, I realize I could have passed the Cholesky factor into the function rather than recalculating it on every call!).

  array[] matrix random_walk(array[] vector beta0_od,
                             array[] matrix eta_od,
                             matrix Sigma_od) {
    // instantiate random walk matrices: T walks, each a 2 x S matrix,
    // with eta_od[t] holding the 2 x (S - 1) innovations for walk t
    int T = size(beta0_od);
    int S = cols(eta_od[1]) + 1;
    array[T] matrix[2, S] beta_od;

    // convert covariance matrix to Cholesky factor
    matrix[2, 2] L = cholesky_decompose(Sigma_od);

    // build random walks: correlate the innovations, then accumulate
    for (t in 1:T) {
      beta_od[t, :, 1] = beta0_od[t];
      beta_od[t, :, 2:S] = L * eta_od[t];
      for (s in 2:S) {
        beta_od[t, :, s] += beta_od[t, :, s - 1];
      }
    }

    return beta_od;
  }

In some cases one doesn’t even need the transform to be nonlinear to see ESS go down for the particular margins one is examining. For example, suppose that exploration is efficient in the direction of y = x but inefficient in the direction of y = -x. Then you would see moderately low ESS on the margins for both x and y, but if you compute a new variable giving the projection of the samples onto y = -x (a linear transform), this new variable would have much lower (univariate) ESS. Of course this phenomenon would necessarily be accompanied by the other transforms having higher ESS (in this case the projection onto y = x).

set.seed(1)

# draw 10000 strongly autocorrelated samples from a standard normal 
# via Metropolis-Hastings
mh_step <- function(x1, delta, pdf){
  x2 <- x1 + rnorm(1, 0, delta)
  alpha <- pdf(x2)/pdf(x1)
  if(runif(1) < alpha){
    return(x2)
  } else {
    return(x1)
  }
}

mh_vec <- c(rnorm(1), rep(NA, 9999))
for(i in 2:10000){
  mh_vec[i] <- mh_step(mh_vec[i - 1], .1, dnorm)
}

# draw 10000 iid samples from a standard normal
iid_vec <- rnorm(10000)

# Let the mh_samples represent the projection of samples onto y = x and let
# the iid samples represent the projection of samples onto y = -x.  Compute the
# x, y coordinates via 45-degree clockwise rotation
M <- matrix(c(mh_vec, iid_vec), nrow = 2, byrow = TRUE)
theta <- - pi / 4
rotation_matrix <- matrix(
  c(cos(theta), -sin(theta), sin(theta), cos(theta)), 
  nrow = 2, 
  byrow = TRUE)
xy_samples <- rotation_matrix %*% M

# observe middling ESS for both x and y
posterior::ess_bulk(xy_samples[1,])
posterior::ess_bulk(xy_samples[2,])

# compute a linear combination of the xy samples
# The below combination recovers the mh samples
transform_matrix <- matrix(c(cos(- pi / 4), sin(- pi / 4)), nrow = 1)
linear_transformed_samples <- transform_matrix %*% xy_samples |>
  as.vector()

# observe lower ESS for the linear combination than for either x or y
posterior::ess_bulk(linear_transformed_samples)

# but note that there are other linear combinations of x and y with much higher ESS
# the below combination recovers the iid samples
transform_matrix_2 <- matrix(c(cos(pi / 4), sin(pi / 4)), nrow = 1)
linear_transformed_samples_2 <- transform_matrix_2 %*% xy_samples |>
  as.vector()
posterior::ess_bulk(linear_transformed_samples_2)
