Non centered parameterization on variance parameter

aaronjg · May 17, 2017, 7:38pm

I’m following up on this discussion on the old stan-users mailing list regarding non centered parameterization of variance parameters:
https://groups.google.com/d/topic/stan-users/zOjAeJC4x_E/discussion

Basically Julian found that using non-centered parameterization on on the variance parameter caused subtle differences in the parameter estimates. It seemed like it was never resolved, and I came on the thread when I ran into the same thing.

It seems to me that the centered and non centered versions are actually different models, and it seems like this would be an issue beyond Justin’s example. Using Justin’s example:

t looks like (up to additive constants)
log(p(hp | dat1, dat2)) = -nlog(sig1) - dat1^2/2sig1^2 -nlog(sig2) - dat2^2/2sig2^2 -2 log(hp)+log(sig1)-hpsig1 -2log(hp)+log(sig2)-hpsig2

log(p(hp_alt | dat1, dat2)) = -nlog(c1_althp_alt) - ndat1^2/2(c1_althp_alt)^2 -log(c2_althp_alt) - dat2^2/2(c2_althp_alt)^2 +log(c1_alt)-c1_alt +log(c2_alt)-c2_alt

log(p(hp_alt | dat1, dat2)) = -dat1^2/2(c1_althp_alt)^2 - dat2^2/2(c2_althp_alt)^2 -2log(hp_alt)-c1_alt -c2_alt

so if I did my math right, log(p(hp | dat1, dat2)) and log(p(hp_alt | dat1, dat2)) will differ by -n*(log(sig1)+log(sig2))

So, I believe the reparamaterization is equivalent to imposing the gamma(2+n,epsilon) prior from Chung 2015 on the original model?

It also seems like this would be the case with normal distributions as well? My understanding from the mailing list was that the ‘Matt Trick’ was just a computational tool to speed up convergence, but working through this example it seems like it is implicitly putting a weakly informative prior on the data as well.

Bob_Carpenter · May 18, 2017, 10:23pm

I couldn’t quite follow the math. The non-centered reparameterizations aren’t broken, though. If I have this:

parameter {
  real Y_std;
...
transformed parameters {
  real Y = mu + Y_std * sigma;
...
model {
  Y_std ~ normal(0, 1);

I get the same distribution for Y as if I had done this:

parameters {
  real Y;
...
model {
  Y ~ normal(mu, sigma);

The “trick” is that Stan in the first case defines a density over Y_raw, which is better behaved than the distribution over Y, which is where the speedup comes from. Modulo problems with mixing and/or convergence, they will give you the same answer.

Bob_Carpenter · May 18, 2017, 10:27pm

I should’ve also said that you want a transform that will leave the distribution over quantities of interest unchanged, but let Stan sample from a distribution that looks more like an isotropic Gaussian than it would in the original parameterization. I don’t know how to do that for variance parameters—it would depend on the prior. You can do lognormal the same way you do normal, but I don’t know how to reparameterize something like a gamma.

jonah · May 19, 2017, 12:14am

You can do it for any location-scale family distribution (“half” distribution in this case really).

data {
  real<lower=0> prior_location;
  real<lower=0> prior_scale;
}
parameters {
  real<lower=0> sigma_raw;
}
transformed parameters {
  real<lower=0> sigma = prior_location + prior_scale * sigma_raw;
}
model {
  target += normal_lpdf(sigma_raw | 0, 1); // or cauchy or student_t
  ...
}

For an exponential distribution you can do something different than non-centering but similar in terms of the rescaling:

data {
  real<lower=0> prior_rate;  
}
parameters {
  real<lower=0> scale_raw;
}
transformed parameters {
  real<lower=0> scale = (1 / prior_rate) * scale_raw;  // prior_scale=1/prior_rate
}
model {
  target += exponential_lpdf(scale_raw | 1);
  ...
}

betanalpha · May 19, 2017, 12:50am

A word of caution: this works only if the distribution behaves appropriately under the given scaling. For example, we use centering/non-centering on Gaussians because if

x ~ normal(0, 1)

then

mu + sigma * x ~ normal(mu, sigma)

This is a special property of Gaussians and does not hold for generally for any distribution.

The exponential case also works because of the nice scaling properties of exponential distributions.

aaronjg · May 19, 2017, 1:19am

Sorry about the formulas in the original post. It was a bit of a mess. I’ve cleaned it up here.

From a geometric perspective, I don’t understand why we don’t need to add the jacobian of the transformation when we are multiplying by a sampled parameter?

From, an algebraic perspective, it seems like it would be required as well?

Here is the math worked out from Julian’s original example.

betanalpha · May 19, 2017, 6:26pm

Because we are not transforming densities.

The centered and non-centered transformations are better thought about as different models rather than different transformations. They induce posteriors over different spaces with different densities.

More formally, the non-centered implementation shifts the transformation into function expectations where we don’t need Jacobians. For example, if we wanted to compute the mean of a group-level parameter in the centered parameterization we’d compute

E[theta] = \int d(theta) pi(theta | data) theta

whereas in the non-centered parameterization we’d write

E[theta] = \int d(theta_tilde) pi(theta_tilde | data) f(theta_tilde)

where f is the map from the non-centered space to the centered space.

aaronjg · May 19, 2017, 6:37pm

Thanks! I had been thinking of it as more of a computational trick to speed up convergence, rather than actually changing the model that I was estimating. Just to be clear, since the two models are different, it means that we shouldn’t expect hp and hp_alt to have the same posterior distribution in Julian’s model?

If we still wanted to estimate the original model, would it be possible to add the Jacobian’s in to get the same original model back? Or would that eliminate the efficiency gained from the reparamaterization?

betanalpha · May 19, 2017, 6:47pm

That would eliminate the efficiency. The utility of non-centering is that it drastically improves the geometry of the corresponding posterior distribution which facilitates fitting – and it’s not just about speed but also eliminating bias. You might want to check out http://mc-stan.org/documentation/case-studies/divergences_and_bias.html and https://arxiv.org/abs/1312.0906.

The posteriors for hp and hp_alt will be different, but the posteriors for hp and the function alt_to_nom(hp_alt), and hence any dependent parts of the model, will be the same.

aaronjg · May 19, 2017, 8:17pm

Thanks for the links. It looks like the \tau in the divergence case study is equivalent to the hp and hp_alt in in Justin’s model. How do you define the function “alt_to_nom?” I would have thought they should be the same, just as there is no need to transform the \tau in the Non-centered parameterization.

Here is the description of the model
<img src="//cdck-file-uploads-canada1.s3.dualstack.ca-central-1.amazonaws.com/flex030/uploads/mc_stan/original/1X/08bc4ef5cfad06a6f34a4fd1e144856eb956516c.png" width=298" height=“250”>

The stan code from Justin’s original post:

data{
  int n;
  vector[n] dat1;
  vector[n] dat2;
}
parameters{
  real<lower=0> hp;
  real<lower=0> hp_alt;
  real<lower=0> sig1;
  real<lower=0> sig2;
  real<lower=0> c1_alt;
  real<lower=0> c2_alt;
  
}
transformed parameters{
  real sig1_alt;
  real sig2_alt;
  
  sig1_alt = c1_alt*hp_alt;
  sig2_alt = c2_alt*hp_alt; 
}
model{
  // Standard version
  
  dat1 ~ normal(0, sig1);
  dat2 ~ normal(0, sig2);
  
  sig1 ~ gamma(2,1/hp);
  sig2 ~ gamma(2,1/hp);
  
  // Alternate version
  
  dat1 ~ normal(0, sig1_alt);
  dat2 ~ normal(0, sig2_alt);
  
  c1_alt ~ gamma(2, 1);
  c2_alt ~ gamma(2, 1);
 }

And the output from the model, with hp, and hp_alt being slightly, but significantly different

           mean    se_mean            sd    2.5%     25%     50%     75%   97.5%   n_eff Rhat
**hp         2.633   0.001**       2.608   0.706   1.337   1.974   3.061   8.501 3131450    1
**hp_alt     2.618   0.002**       2.648   0.698   1.329   1.963   3.046   8.447 1613607    1
sig1         3.398   0.000         0.820   2.232   2.828   3.257   3.805   5.392 4184178    1
sig2         1.933   0.000         0.489   1.246   1.593   1.846   2.172   3.128 4206156    1
c1_alt       1.897   0.001         1.097   0.400   1.097   1.687   2.468   4.587 2640897    1
c2_alt       1.091   0.000         0.663   0.219   0.612   0.955   1.420   2.746 2686321    1
sig1_alt     3.397   0.000         0.827   2.218   2.818   3.255   3.814   5.400 7500000    1
sig2_alt     1.933   0.000         0.494   1.239   1.588   1.845   2.177   3.135 7500000    1
lp__       -58.512   0.001         1.726 -62.730 -59.425 -58.185 -57.245 -56.152 2659432    1

Bob_Carpenter · May 22, 2017, 7:38pm

Those are definitely different, especially in posterior sd, which looks like it must be due to some kind of tail behavior because all the quantiles reported in the 95% interval are close.

I don’t understand the gamma well enough to understand the reparameterization (how was it derived?), but could you be missing a Jacobian term in the inversion somewhere? All these flat priors are dangerous.

sakrejda · May 22, 2017, 8:00pm

He’s just taking the scaling of the Gamma variates outside of the Gamma distribution, that’s fine. I think this is something computational rather than mathematical. All those R-hat values are 1 so this is from a single chain.

@aaronjg Did you see if the traces take excursions like when you try to sample directly from the student’s t without reparameterizing? Did you look at the N_eff/iteration for more chains (e.g.-10)?

I’m not dismissing what you are seeing, I think there could be some real cautions we should be pointing out about re-parameterizing scale parameters.

Bob_Carpenter · May 22, 2017, 9:55pm

OK, I tried it myself from scratch. Here’s the data simulation:

N <- 200
sigma <- 1.5
y <- rnorm(N, 0, sigma)

And here’s the Stan model with the improper priors on theta1 (direct) and theta2 (transformed):

data {
  int N;
  vector[N] y;
}
parameters {
  real<lower=0> theta1;
  real<lower=0> sigma1;

  real<lower=0> theta2;
  real<lower=0> sigma2_unit;
}
transformed parameters {
  real sigma2 = sigma2_unit / theta2;
}
model {
  sigma1 ~ gamma(2, 1 / theta1);
  y ~ normal(0, sigma1);

  sigma2_unit ~ gamma(2, 1);
  y ~ normal(0, sigma2);
}

I fit it directly, increasing the target acceptance rate to remove divergence warnings:

fit <- stan("gamma-scale.stan",
            data = list(N = N, y = y),
            iter = 20000,
            control = list(stepsize = 0.01, adapt_delta = 0.95));

And the results are pretty close:

        mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
sigma1 1.488   0.001 0.075 1.349 1.436 1.485 1.537 1.643 22192    1
sigma2 1.480   0.000 0.074 1.343 1.429 1.478 1.529 1.633 38327    1

But, the real key to seeing what’s going on is to look at he effect of improper prior,

                mean se_mean      sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta1        20.396   5.147 703.422    0.402    1.064    2.145    5.093   62.424 18678    1

As theta1 gets larger, sigma1 gets closer to zero. You can see from the huge standard deviation that there are very long tails here.

@betanalpha might have some idea if the first parameterization is well-behaved enough to fit.

Bob_Carpenter · May 22, 2017, 9:59pm

I should add that as you crank the sample size down, things get worse. This was for N = 10 rather than N = 200:

        mean se_mean     sd   1%    99% n_eff Rhat
theta1 21.25    3.32 394.25 0.35 187.16 14131    1
sigma1  1.86    0.00   0.51 1.10   3.55 17960    1
sigma2  1.65    0.00   0.39 1.02   2.93 31127    1

The posterior standard deviation is quite high, so the se_mean may be what’s off here; but usually that’s only a problem with high correlations—maybe it’s also a problem with high non-linearities?

I would assume that sigma2 is the one being estimated correctly. But with an improper prior, it’s impossible to check with fake data, because you don’t have a generative process.

sakrejda · May 22, 2017, 10:09pm

Nice! Lo and behold flat priors once again prove to be informative as well as nonsensical…

sakrejda · May 22, 2017, 10:26pm

Actually, if anybody is looking at this as an example of diagnosing bad behavior due to tails look at:

trace plot of theta1 over iterations
x-axis: theta1, y-axis: sigma1, z-axis: log-ensity

The really limited exploration of long tails you see in both is typical of this kind of problem (I just ran it for 1000 iterations total, you don’t need to do 10k to see this, just do 10 chains of 1000)

sakrejda · May 22, 2017, 10:28pm

Also, the summary statistics look ok despite this problem, even calculating R-hat with 10 chains.

sakrejda · May 22, 2017, 10:44pm

One more version, sampler working with log(theta3) instead of theta directly helps to tuck in the tail a little, it looks like it’s (almost?) as good as theta2.

data {
  int N;
  vector[N] y;
}
parameters {
  real<lower=0> theta1;
  real<lower=0> sigma1;

  real<lower=0> theta2;
  real<lower=0> sigma2_unit;

  real<lower=0> theta3_log;
  real sigma3;
}
transformed parameters {
  real sigma2 = sigma2_unit / theta2;
  real theta3 = exp(theta3_log);
}

model {
  sigma1 ~ gamma(2, 1 / theta1);
  y ~ normal(0, sigma1);

  sigma2_unit ~ gamma(2, 1);
  y ~ normal(0, sigma2);

  sigma3 ~ gamma(2, 1 / theta3);
  y ~ normal(0, sigma3);
  target += 1/theta3;

}

Run script:

library(rstan)
N <- 200
sigma <- 1.5
y <- rnorm(N, 0, sigma)
fit <- stan(file='problem.stan',  data = list(N = N, y = y),
 iter = 1000, control = list(stepsize = 0.01, adapt_delta = 0.95), chains=10);
shinystan::launch_shinystan(fit)

Bob_Carpenter · May 23, 2017, 7:35pm

The worrying aspect is the lack of divergences and the relatively high n_eff estimated for theta1.

Bob_Carpenter · May 23, 2017, 7:39pm

Isn’t that the wrong Jacobian adjustment? I think you need

exp'(theta3_log) = exp(theta3_log) = theta3

so that the log Jacobian adjustment is

target += theta3;

Topic		Replies	Views
Non-centered parameterization with gamma priors for variance Modeling	6	1806	September 30, 2017
Non centered parametrization of lognormal Distribution Modeling specification	4	1634	March 13, 2021
Non-Centered parameterizations Modeling	4	502	April 8, 2021
Non Centered Parameterization in GP Model Modeling	18	2175	June 20, 2017
Partial non-centered parametrizations in Stan Modeling techniques	7	2833	December 30, 2018

Non centered parameterization on variance parameter

Related topics