Non centered parameterization on variance parameter


#1

I’m following up on this discussion on the old stan-users mailing list regarding non centered parameterization of variance parameters:
https://groups.google.com/d/topic/stan-users/zOjAeJC4x_E/discussion

Basically Julian found that using non-centered parameterization on on the variance parameter caused subtle differences in the parameter estimates. It seemed like it was never resolved, and I came on the thread when I ran into the same thing.

It seems to me that the centered and non centered versions are actually different models, and it seems like this would be an issue beyond Justin’s example. Using Justin’s example:

t looks like (up to additive constants)
log(p(hp | dat1, dat2)) = -nlog(sig1) - dat1^2/2sig1^2 -nlog(sig2) - dat2^2/2sig2^2 -2 log(hp)+log(sig1)-hpsig1 -2log(hp)+log(sig2)-hpsig2

log(p(hp_alt | dat1, dat2)) = -nlog(c1_althp_alt) - ndat1^2/2(c1_althp_alt)^2 -log(c2_althp_alt) - dat2^2/2(c2_althp_alt)^2 +log(c1_alt)-c1_alt +log(c2_alt)-c2_alt

log(p(hp_alt | dat1, dat2)) = -dat1^2/2(c1_althp_alt)^2 - dat2^2/2(c2_althp_alt)^2 -2log(hp_alt)-c1_alt -c2_alt

so if I did my math right, log(p(hp | dat1, dat2)) and log(p(hp_alt | dat1, dat2)) will differ by -n*(log(sig1)+log(sig2))

So, I believe the reparamaterization is equivalent to imposing the gamma(2+n,epsilon) prior from Chung 2015 on the original model?

It also seems like this would be the case with normal distributions as well? My understanding from the mailing list was that the ‘Matt Trick’ was just a computational tool to speed up convergence, but working through this example it seems like it is implicitly putting a weakly informative prior on the data as well.


#2

I couldn’t quite follow the math. The non-centered reparameterizations aren’t broken, though. If I have this:

parameter {
  real Y_std;
...
transformed parameters {
  real Y = mu + Y_std * sigma;
...
model {
  Y_std ~ normal(0, 1);

I get the same distribution for Y as if I had done this:

parameters {
  real Y;
...
model {
  Y ~ normal(mu, sigma);

The “trick” is that Stan in the first case defines a density over Y_raw, which is better behaved than the distribution over Y, which is where the speedup comes from. Modulo problems with mixing and/or convergence, they will give you the same answer.


#3

I should’ve also said that you want a transform that will leave the distribution over quantities of interest unchanged, but let Stan sample from a distribution that looks more like an isotropic Gaussian than it would in the original parameterization. I don’t know how to do that for variance parameters—it would depend on the prior. You can do lognormal the same way you do normal, but I don’t know how to reparameterize something like a gamma.


#4

You can do it for any location-scale family distribution (“half” distribution in this case really).

data {
  real<lower=0> prior_location;
  real<lower=0> prior_scale;
}
parameters {
  real<lower=0> sigma_raw;
}
transformed parameters {
  real<lower=0> sigma = prior_location + prior_scale * sigma_raw;
}
model {
  target += normal_lpdf(sigma_raw | 0, 1); // or cauchy or student_t
  ...
}

For an exponential distribution you can do something different than non-centering but similar in terms of the rescaling:
data {
  real<lower=0> prior_rate;  
}
parameters {
  real<lower=0> scale_raw;
}
transformed parameters {
  real<lower=0> scale = (1 / prior_rate) * scale_raw;  // prior_scale=1/prior_rate
}
model {
  target += exponential_lpdf(scale_raw | 1);
  ...
}

#5

A word of caution: this works only if the distribution behaves appropriately under the given scaling. For example, we use centering/non-centering on Gaussians because if

x ~ normal(0, 1)

then

mu + sigma * x ~ normal(mu, sigma)

This is a special property of Gaussians and does not hold for generally for any distribution.

The exponential case also works because of the nice scaling properties of exponential distributions.


#6

Sorry about the formulas in the original post. It was a bit of a mess. I’ve cleaned it up here.

From a geometric perspective, I don’t understand why we don’t need to add the jacobian of the transformation when we are multiplying by a sampled parameter?

From, an algebraic perspective, it seems like it would be required as well?

Here is the math worked out from Julian’s original example.


#7

Because we are not transforming densities.

The centered and non-centered transformations are better thought about as different models rather than different transformations. They induce posteriors over different spaces with different densities.

More formally, the non-centered implementation shifts the transformation into function expectations where we don’t need Jacobians. For example, if we wanted to compute the mean of a group-level parameter in the centered parameterization we’d compute

E[theta] = \int d(theta) pi(theta | data) theta

whereas in the non-centered parameterization we’d write

E[theta] = \int d(theta_tilde) pi(theta_tilde | data) f(theta_tilde)

where f is the map from the non-centered space to the centered space.


#8

Thanks! I had been thinking of it as more of a computational trick to speed up convergence, rather than actually changing the model that I was estimating. Just to be clear, since the two models are different, it means that we shouldn’t expect hp and hp_alt to have the same posterior distribution in Julian’s model?

If we still wanted to estimate the original model, would it be possible to add the Jacobian’s in to get the same original model back? Or would that eliminate the efficiency gained from the reparamaterization?


#9

That would eliminate the efficiency. The utility of non-centering is that it drastically improves the geometry of the corresponding posterior distribution which facilitates fitting – and it’s not just about speed but also eliminating bias. You might want to check out http://mc-stan.org/documentation/case-studies/divergences_and_bias.html and https://arxiv.org/abs/1312.0906.

The posteriors for hp and hp_alt will be different, but the posteriors for hp and the function alt_to_nom(hp_alt), and hence any dependent parts of the model, will be the same.


#10

Thanks for the links. It looks like the \tau in the divergence case study is equivalent to the hp and hp_alt in in Justin’s model. How do you define the function “alt_to_nom?” I would have thought they should be the same, just as there is no need to transform the \tau in the Non-centered parameterization.

Here is the description of the model
<img src="/uploads/mc_stan/original/1X/08bc4ef5cfad06a6f34a4fd1e144856eb956516c.png" width=298" height=“250”>

The stan code from Justin’s original post:

data{
  int n;
  vector[n] dat1;
  vector[n] dat2;
}
parameters{
  real<lower=0> hp;
  real<lower=0> hp_alt;
  real<lower=0> sig1;
  real<lower=0> sig2;
  real<lower=0> c1_alt;
  real<lower=0> c2_alt;
  
}
transformed parameters{
  real sig1_alt;
  real sig2_alt;
  
  sig1_alt = c1_alt*hp_alt;
  sig2_alt = c2_alt*hp_alt; 
}
model{
  // Standard version
  
  dat1 ~ normal(0, sig1);
  dat2 ~ normal(0, sig2);
  
  sig1 ~ gamma(2,1/hp);
  sig2 ~ gamma(2,1/hp);
  
  // Alternate version
  
  dat1 ~ normal(0, sig1_alt);
  dat2 ~ normal(0, sig2_alt);
  
  c1_alt ~ gamma(2, 1);
  c2_alt ~ gamma(2, 1);
 }

And the output from the model, with hp, and hp_alt being slightly, but significantly different

           mean    se_mean            sd    2.5%     25%     50%     75%   97.5%   n_eff Rhat
**hp         2.633   0.001**       2.608   0.706   1.337   1.974   3.061   8.501 3131450    1
**hp_alt     2.618   0.002**       2.648   0.698   1.329   1.963   3.046   8.447 1613607    1
sig1         3.398   0.000         0.820   2.232   2.828   3.257   3.805   5.392 4184178    1
sig2         1.933   0.000         0.489   1.246   1.593   1.846   2.172   3.128 4206156    1
c1_alt       1.897   0.001         1.097   0.400   1.097   1.687   2.468   4.587 2640897    1
c2_alt       1.091   0.000         0.663   0.219   0.612   0.955   1.420   2.746 2686321    1
sig1_alt     3.397   0.000         0.827   2.218   2.818   3.255   3.814   5.400 7500000    1
sig2_alt     1.933   0.000         0.494   1.239   1.588   1.845   2.177   3.135 7500000    1
lp__       -58.512   0.001         1.726 -62.730 -59.425 -58.185 -57.245 -56.152 2659432    1

#11

Those are definitely different, especially in posterior sd, which looks like it must be due to some kind of tail behavior because all the quantiles reported in the 95% interval are close.

I don’t understand the gamma well enough to understand the reparameterization (how was it derived?), but could you be missing a Jacobian term in the inversion somewhere? All these flat priors are dangerous.


#12

He’s just taking the scaling of the Gamma variates outside of the Gamma distribution, that’s fine. I think this is something computational rather than mathematical. All those R-hat values are 1 so this is from a single chain.

@aaronjg Did you see if the traces take excursions like when you try to sample directly from the student’s t without reparameterizing? Did you look at the N_eff/iteration for more chains (e.g.-10)?

I’m not dismissing what you are seeing, I think there could be some real cautions we should be pointing out about re-parameterizing scale parameters.


#13

OK, I tried it myself from scratch. Here’s the data simulation:

N <- 200
sigma <- 1.5
y <- rnorm(N, 0, sigma)

And here’s the Stan model with the improper priors on theta1 (direct) and theta2 (transformed):

data {
  int N;
  vector[N] y;
}
parameters {
  real<lower=0> theta1;
  real<lower=0> sigma1;

  real<lower=0> theta2;
  real<lower=0> sigma2_unit;
}
transformed parameters {
  real sigma2 = sigma2_unit / theta2;
}
model {
  sigma1 ~ gamma(2, 1 / theta1);
  y ~ normal(0, sigma1);

  sigma2_unit ~ gamma(2, 1);
  y ~ normal(0, sigma2);
}

I fit it directly, increasing the target acceptance rate to remove divergence warnings:

fit <- stan("gamma-scale.stan",
            data = list(N = N, y = y),
            iter = 20000,
            control = list(stepsize = 0.01, adapt_delta = 0.95));

And the results are pretty close:

        mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
sigma1 1.488   0.001 0.075 1.349 1.436 1.485 1.537 1.643 22192    1
sigma2 1.480   0.000 0.074 1.343 1.429 1.478 1.529 1.633 38327    1

But, the real key to seeing what’s going on is to look at he effect of improper prior,

                mean se_mean      sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta1        20.396   5.147 703.422    0.402    1.064    2.145    5.093   62.424 18678    1

As theta1 gets larger, sigma1 gets closer to zero. You can see from the huge standard deviation that there are very long tails here.

@betanalpha might have some idea if the first parameterization is well-behaved enough to fit.


#14

I should add that as you crank the sample size down, things get worse. This was for N = 10 rather than N = 200:

        mean se_mean     sd   1%    99% n_eff Rhat
theta1 21.25    3.32 394.25 0.35 187.16 14131    1
sigma1  1.86    0.00   0.51 1.10   3.55 17960    1
sigma2  1.65    0.00   0.39 1.02   2.93 31127    1

The posterior standard deviation is quite high, so the se_mean may be what’s off here; but usually that’s only a problem with high correlations—maybe it’s also a problem with high non-linearities?

I would assume that sigma2 is the one being estimated correctly. But with an improper prior, it’s impossible to check with fake data, because you don’t have a generative process.


#15

Nice! Lo and behold flat priors once again prove to be informative as well as nonsensical…


#16

Actually, if anybody is looking at this as an example of diagnosing bad behavior due to tails look at:

  1. trace plot of theta1 over iterations
  2. x-axis: theta1, y-axis: sigma1, z-axis: log-ensity

The really limited exploration of long tails you see in both is typical of this kind of problem (I just ran it for 1000 iterations total, you don’t need to do 10k to see this, just do 10 chains of 1000)


#17

Also, the summary statistics look ok despite this problem, even calculating R-hat with 10 chains.


#18

One more version, sampler working with log(theta3) instead of theta directly helps to tuck in the tail a little, it looks like it’s (almost?) as good as theta2.

data {
  int N;
  vector[N] y;
}
parameters {
  real<lower=0> theta1;
  real<lower=0> sigma1;

  real<lower=0> theta2;
  real<lower=0> sigma2_unit;

  real<lower=0> theta3_log;
  real sigma3;
}
transformed parameters {
  real sigma2 = sigma2_unit / theta2;
  real theta3 = exp(theta3_log);
}

model {
  sigma1 ~ gamma(2, 1 / theta1);
  y ~ normal(0, sigma1);

  sigma2_unit ~ gamma(2, 1);
  y ~ normal(0, sigma2);

  sigma3 ~ gamma(2, 1 / theta3);
  y ~ normal(0, sigma3);
  target += 1/theta3;

}

Run script:

library(rstan)
N <- 200
sigma <- 1.5
y <- rnorm(N, 0, sigma)
fit <- stan(file='problem.stan',  data = list(N = N, y = y),
 iter = 1000, control = list(stepsize = 0.01, adapt_delta = 0.95), chains=10);
shinystan::launch_shinystan(fit)

#19

The worrying aspect is the lack of divergences and the relatively high n_eff estimated for theta1.


#20

Isn’t that the wrong Jacobian adjustment? I think you need

exp'(theta3_log) = exp(theta3_log) = theta3

so that the log Jacobian adjustment is

target += theta3;