Fitting a simple normal model in Stan conditional on sufficient statistics & Jeffreys prior?

Likelihood1: Ybar~N(mu,sigma^2/n); s^2~gamma((n-1)/2,(n-1)/2/sigma^2)
Likelihood2: yi ~ N(mu, sigma^2), i = 1:n
Jeffreys’ Prior: p(mu,sigma^2) propto 1/sigma^2 (same as “flat” prior…I guess*)
Objective: Fit in Stan with Likelihood1 and get the same posterior as the usual Likelihood2 (both employing Jeffreys prior)
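
Written out (with the Gamma in shape–rate form), Likelihood1 is

\bar{Y} \mid \mu, \sigma^2 \sim \mathrm{N}\!\left(\mu, \frac{\sigma^2}{n}\right),
\qquad
S^2 \mid \sigma^2 \sim \mathrm{Gamma}\!\left(\frac{n-1}{2}, \frac{n-1}{2\sigma^2}\right)
\quad\Longleftrightarrow\quad
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.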

Comment: I understand that the Jeffreys prior is undesirable. I am doing this to replicate the results obtained using a GPQ method (essentially frequentist) which is based on sufficient statistics. My contention is that the GPQ sample and the posterior sample will both be sampled from the posterior distribution. To do this I need to use the Jeffreys prior, which is implied in the GPQ approach. I can get this result using Likelihood2, but I would like to demonstrate the case with Likelihood1 and so far I have failed. I think it is a coding issue; I do not know how to express this in Stan.

  • I worry that the choice of likelihood may dictate the form of Jeffreys prior. If so, it may well be that there is no prior choice that would render the posterior with Likelihood1 equivalent to that with Likelihood2 + Jeffreys prior. If true, it would be valuable for me to know that, because I am struggling greatly to achieve this objective.

Any help greatly appreciated!!! :-) Dave


You’re only going to be able to write down the posterior for the variance if you have a conjugate prior, which is an inverse gamma for variance (or gamma for precision). So you’re still going to need to code up the sampling for that.

@mike-lawrence provides an example here: PSA: using stantargets for SBC(-esque) checks, & big speedups for big gaussian likelihoods

Your scheme looks right other than that Stan uses a scale parameterization, not a variance parameterization. So that means replacing

y ~ normal(mu, sigma);
target += -2 * log(sigma);   // p(mu, sigma) propto 1 / sigma^2.

with

mean(y) ~ normal(mu, sigma / sqrt(n));
variance(y) ~ gamma((n - 1) / 2, (n - 1) / 2 / sigma^2);
target += -2 * log(sigma);

Here, it’s better to just compute mean(y) and variance(y) and (n - 1) / 2 once and store them. Also, I’d be careful to check that the variance required here is the sample variance (divides by n - 1) rather than the MLE (divides by n).
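
A minimal sketch of that advice, with the statistics computed once in transformed data (the data names n and y here are only illustrative):

data {
  int<lower=0> n;
  vector[n] y;
}
transformed data {
  real ybar = mean(y);                 // sufficient statistic for mu
  real<lower=0> s2 = variance(y);      // sample variance (divides by n - 1)
  real half_nm1 = (n - 1) / 2.0;       // computed once; 2.0 avoids integer division
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  ybar ~ normal(mu, sigma / sqrt(n));
  s2 ~ gamma(half_nm1, half_nm1 / square(sigma));
  target += -2 * log(sigma);           // p(mu, sigma) propto 1 / sigma^2
}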


Thanks Bob!! I will try this (also checked out Mike’s link) and will report back. :-) Dave

Thank you again Bob. I tried (my (mis)interpretation of) your suggestion to estimate the normal mu and sigma based on sufficient statistics. Bottom line - I am still seeing a difference in the estimate of sigma relative to that obtained based on the raw data. Likely I have incorrectly expressed the corresponding reference prior when using sufficient statistics. My worry is that it may not be possible, when estimating from sufficient statistics, to identify a reference prior for sigma that corresponds to the usual Jeffreys reference prior used when estimating from the raw data.

Below I provide my R and Stan codes (separate models for each method) as well as the estimates (you will see there is a difference in sigma posterior mean). Any thoughts greatly appreciated!!

Here is the Stan code used for estimation based on the full raw data vector:

// estimation of mu and sigma based on Y
data {
  int<lower=0> nY;
  vector[nY] Y;
}

parameters {
  real mu;        // implied reference prior
  real log_sigma; // implied reference prior
}

transformed parameters {
  real<lower=0> sigma = exp(log_sigma);
}

model {
  Y ~ normal(mu,sigma);
}

Here is the Stan code used for estimation based on sufficient statistics (note the transformed parameter statement has been commented out)

// Estimation of mu and sigma based on Ybar and S2
data {
  int<lower=0> nY;
  real Ybar;
  real<lower=0> S2;
}

parameters {
  real mu;
  real<lower=0> sigma;
}

transformed parameters {
//  real<lower=0> sigma = exp(log_sigma);
}

model {
  Ybar ~ normal(mu, sigma / sqrt(nY));
  S2 ~ gamma((nY - 1) / 2, (nY - 1) / 2 / sigma^2);
  target += -2 * log(sigma);
}

Below is the R code used to execute both of the above models

require(rstan) # Needed for Bayesian nonlinear model
options(mc.cores = parallel::detectCores()) # recommended by Stan
rstan_options(auto_write = TRUE) # avoids recompilation

# Raw data
Y <- c(13.98251, 14.62196, 13.88309, 13.77684, 14.70978,
       14.11597, 14.70279, 13.83011, 14.61396, 15.32342)

###### Estimation of mu and sigma based on raw data
###### Assume Jeffreys reference priors 

# Define data for Stan
dataList = list(
  Y=Y, nY =length(Y)
)

# Define the parameters for Stan
parameters = c("mu","sigma")

# Translate model to C++ and compile to DSO
StabDSO <- stan_model( 'Bob C example.stan' ) 

# Obtain sample of n.chains*(iter - warmup)/thin posterior draws
n.chains <- 4
options (mc.cores = n.chains)
stanFit <- sampling( object=StabDSO , 
                     pars=parameters,
                     data = dataList , 
                     chains = n.chains ,
                     iter = 100000 , 
                     warmup = 20000 , 
                     thin = 1,verbose=TRUE) 

stan.sum<-summary(stanFit,probs=c(0.025,0.5,0.975))
round(stan.sum$summary,digits=3)

###### Estimation of mu and sigma based on sufficient statistics
###### Assume Jeffreys reference priors 

# Define data for Stan
dataList = list(
  nY =length(Y), Ybar=mean(Y),  S2=var(Y)
)

# Define the parameters for Stan
parameters = c("mu","sigma")

# Translate model to C++ and compile to DSO
StabDSO <- stan_model( 'Bob C example SS.stan' ) 

# Obtain sample of n.chains*(iter - warmup)/thin posterior draws
n.chains <- 4
options (mc.cores = n.chains)
stanFit <- sampling( object=StabDSO , 
                     pars=parameters,
                     data = dataList , 
                     chains = n.chains ,
                     iter = 100000 , 
                     warmup = 20000 , 
                     thin = 1,verbose=TRUE) 

stan.sum<-summary(stanFit,probs=c(0.025,0.5,0.975))
round(stan.sum$summary,digits=3)

Below are the estimates obtained based on the raw data vector Y (note these estimates agree exactly with the GPQ algorithm I am trying to duplicate in Stan; this GPQ algorithm, however, is based on sampling distributions of the sufficient statistics, not the Y vector)

        mean se_mean    sd   2.5%    50%  97.5%    n_eff Rhat
mu    14.356   0.000 0.183 13.992 14.356 14.719 154587.6    1
sigma  **0.559**   0.000 0.152  0.351  0.531  0.935 135685.9    1
lp__   1.135   0.003 1.101 -1.842  1.472  2.210 100395.1    1

Below are the estimates based on the sufficient statistics Ybar and S2. Notice that the posterior mean of sigma (0.527) is lower than the estimate based on the Y vector (0.559). I expect these estimates to be identical.

        mean se_mean    sd   2.5%    50%  97.5%    n_eff Rhat
mu    14.356   0.000 0.173 14.012 14.356 14.702 159543.8    1
sigma  **0.527**   0.000 0.143  0.331  0.500  0.881 136521.0    1
lp__   8.422   0.003 1.105  5.462  8.761  9.496 103504.7    1

Well, likely no one cares but I have been struggling mightily with the above despite my miserably deficient math stat and Stan skills.

Amazingly, Bob’s suggestion failed miserably. Sorry Bob, likely I misunderstood your recommendation. I don’t really know how Stan decides to model the prior when the declaration for a variable implies a flat prior and the target += implies something else.

Mike Lawrence’s post was very helpful but Mike was using a Weibull prior and I needed (because of my objective) to use a Jeffreys reference prior.

I’ve tried everything my pea brain could dream up. I found something that comes close to agreeing with Likelihood2. It is not an EXACT match and I do not understand why. I thought I would post it and maybe some statistical archeologist in the far future will find it… Below are the Stan code and output. Compare the sigma posterior mean and 95% interval to those obtained with the Likelihood2 code and output above.

// Estimation of mu and sigma based on Ybar and S2
data {
  int<lower=0> nY;
  real Ybar;
  real<lower=0> S2;
}

parameters {
  real mu;
  real logsigma;
}

transformed parameters {
  real<lower=0> sigma = exp(logsigma);
  real<lower=0> sigma2 = sigma^2;
}

model {
  Ybar ~ normal(mu, sigma / sqrt(nY));
  S2 ~ gamma((nY - 1) / 2, (nY - 1) / 2 / sigma2);
}

        mean se_mean    sd   2.5%    50%  97.5%     n_eff Rhat
mu    14.356   0.000 0.187 13.982 14.356 14.730 141864.72    1
sigma  0.567   0.000 0.167  0.345  0.534  0.982 128984.54    1
lp__   7.648   0.004 1.120  4.613  7.993  8.738  96417.28    1

In the first Stan program the parameter is log_sigma and the transformed parameter is sigma = exp(log_sigma), so the implicit reference prior is on the standard deviation, not the variance? That is, you want to put a Jeffreys prior on (mu, sigma), not on (mu, sigma^2)?

I think this means the prior on sigma is \propto 1/\sigma rather than \propto 1/\sigma^2, and the second Stan program should have target += -log(sigma); instead of target += -2 * log(sigma);.
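
Working through the change of variables, a flat prior on log_sigma in the first program implies

p(\sigma) \propto \left|\frac{d \log\sigma}{d\sigma}\right| = \frac{1}{\sigma},
\qquad
p(\sigma^2) = p(\sigma)\,\frac{d\sigma}{d(\sigma^2)} \propto \frac{1}{\sigma}\cdot\frac{1}{2\sigma} \propto \frac{1}{\sigma^2},

so the first program's implicit prior is p(\mu, \sigma) \propto 1/\sigma. Since the real<lower=0> sigma declaration in the second program already contributes a flat density on sigma, matching it takes only the single factor target += -log(sigma);.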

Also, when I ran the “Bob C example SS.stan” code, I got this warning:

Found int division:
  (nY - 1) / 2
Values will be rounded towards zero. If rounding is not desired you can
write the division as:
  (nY - 1) / 2.0
If rounding is intended please use the integer division operator %/%.

So I replaced:

S2 ~ gamma((nY - 1) / 2, (nY - 1) / 2 / sigma^2);

with

S2 ~ gamma((nY - 1) / 2.0, (nY - 1) / 2.0 / sigma^2);

PS. Your adverbs are rather strong. Oftentimes omitting adverbs altogether is for the best.

That shouldn’t be a problem with sufficient statistics—they should give you the same impact on likelihood or they wouldn’t work as sufficient statistics.
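
Concretely, for the normal the likelihood factors through the sufficient statistics (here s^2 is the sample variance, dividing by n - 1):

\prod_{i=1}^{n} \mathrm{N}(y_i \mid \mu, \sigma^2)
= (2\pi\sigma^2)^{-n/2}
\exp\!\left(-\frac{(n-1)s^2 + n(\bar{y}-\mu)^2}{2\sigma^2}\right).

As a function of (mu, sigma), this matches the product of the normal term for the sample mean and the gamma term for s^2 up to a constant that doesn't involve the parameters.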

Definitely not within the standard error of each other! Interesting that mu is estimated correctly but its standard deviation is underestimated exactly where sigma is underestimated.

I think this is why the models vary.

Every sampling statement and target increment statement just adds to the log posterior. So when we write

parameters { 
  real<lower=0> sigma;
}

a couple things happen. First, we log transform sigma to unconstrained space for sampling. We then convert back, but implicitly apply the change-of-variables adjustment for the inverse exp() transform. This gives sigma a uniform distribution on (0, infinity). If you then write

model {
  sigma ~ exponential(1);
}

then we just add log exponential(sigma | 1) to the target density. This acts like multiplying the density, and since we started with a uniform, the result is that sigma will have an exponential(1) distribution.
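
In symbols, with u = log(sigma) as the unconstrained parameter Stan actually samples, the target is

\log p(u) = \log p(\sigma) + \log\left|\frac{d\sigma}{du}\right| = \log p(\sigma) + \log \sigma, \qquad \sigma = e^{u},

so with nothing in the model block the implied density on sigma is constant, i.e., the improper uniform on (0, infinity); each sampling or target += statement you add then multiplies that density.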

Thanks! I’ve been doing a lot of Python where / just does real arithmetic everywhere.

Don’t worry—I wasn’t offended!

Here are sample cases that work for the sufficient statistics. This is the result for improper flat priors:

normal-suff.stan

data {
  int N;
  vector[N] y;
}
transformed data {
  real mean_y = mean(y);
  real<lower=0> var_y = variance(y);
  real nm1_over2 = (N - 1) / 2.0;
  real sqrt_N = sqrt(N);
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mean_y ~ normal(mu, sigma / sqrt_N);
  var_y ~ gamma(nm1_over2, nm1_over2 / sigma^2);
  // target += -2 * log(sigma);  // uncomment to add prior
}

normal.stan

data {
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
  // target += -2 * log(sigma);  // uncomment to add prior
}

And here’s the simulation:

import numpy as np
import cmdstanpy as csp

N = 1000
mu = -1.3
sigma = 3.2
y = np.random.normal(size=1000, loc=mu, scale=sigma)
data = {'N': N, 'y': y}

for path in ['normal.stan', 'normal-suff.stan']:
    print(f"{path=}")
    model = csp.CmdStanModel(stan_file = path)
    fit = model.sample(data = data)
    print(fit.summary())

and here are the results:

             Mean      MCSE    StdDev          5%         50%        95%    N_Eff  N_Eff/s     R_hat
lp__  -1685.47000  0.021881  0.957243 -1687.39000 -1685.19000 -1684.5400  1913.85  18763.2  1.000860
mu       -1.35067  0.001863  0.102809    -1.51859    -1.35087    -1.1801  3045.70  29859.8  0.999494
sigma     3.27866  0.001193  0.073259     3.16046     3.27908     3.3995  3770.53  36966.0  1.000570

             Mean      MCSE    StdDev          5%         50%         95%    N_Eff  N_Eff/s    R_hat
lp__   1421.68000  0.022094  0.965087  1419.84000  1421.97000  1422.61000  1908.04  31800.6  1.00051
mu       -1.35280  0.001598  0.101669    -1.52219    -1.35371    -1.18389  4047.69  67461.4  1.00009
sigma     3.27941  0.001188  0.073899     3.15968     3.27808     3.40235  3869.01  64483.5  1.00019

Now let’s say you want to add in your prior of

target += -2 * log(sigma);

Just drop that into both model blocks and you get this:

             Mean      MCSE    StdDev          5%         50%         95%    N_Eff  N_Eff/s     R_hat
lp__  -1698.39000  0.026115  1.059670 -1700.4400 -1698.07000 -1697.42000  1646.45  15387.4  1.000660
mu       -1.23748  0.001791  0.105948    -1.4132    -1.23756    -1.06151  3501.20  32721.5  1.000070
sigma     3.31041  0.001309  0.074760     3.1909     3.30999     3.43573  3260.24  30469.5  0.999488

             Mean      MCSE    StdDev          5%         50%         95%    N_Eff  N_Eff/s     R_hat
lp__   1408.80000  0.025063  1.010110  1406.87000  1409.10000  1409.74000  1624.37  25783.6  1.001920
mu       -1.23186  0.001805  0.103880    -1.40280    -1.23301    -1.05956  3311.27  52559.9  0.999797
sigma     3.31053  0.001326  0.073496     3.19558     3.30775     3.43517  3074.04  48794.3  1.000210

Sorry—I didn’t keep the same data between the two different simulations. But the point is you can add the prior to both models and you still get the same result.

OK Bob, that did it!! I used your 2 Stan programs (normal and normal-suff) above with my data, and your model code does produce identical estimates whether using the raw data or the sufficient statistics. I ran each Stan code 3 different ways:

  1. With the target += statement commented out (I assume this places a uniform prior distribution on sigma)
  2. Using target += -2 * log(sigma); (I assume this places a prior proportional to 1/sigma^2 on sigma)
  3. Using target += -1 * log(sigma); (I assume this places a uniform (flat) prior distribution on log(sigma); see the note after this list)
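
(Note, following the change-of-variables discussion above: with the real<lower=0> sigma declaration already contributing a flat density on sigma, the three increments imply, respectively,

p(\sigma) \propto 1, \qquad p(\sigma) \propto \frac{1}{\sigma^2}, \qquad p(\sigma) \propto \frac{1}{\sigma},

and the last, flat on log(sigma), is the same prior as p(\mu, \sigma^2) \propto 1/\sigma^2 from my original post.)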

As it turns out, with version 3 (the flat prior on log(sigma)), both of your Stan models produce the same estimates that are obtained with the GPQ algorithm I am trying to model (it amounts to sampling alternately from the chi-square and then conditionally from the normal - I think it is used someplace in BDA3). So you helped me achieve this objective and I am grateful (it is also my birthday today :-). I’m also grateful to Mike Lawrence and desislava. BTW - I apologize for my shameful use of adverbs and beg forgiveness. I have great respect for the contributors of this support tool. THANK YOU.
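
For the record (and if I have BDA3 right), the GPQ draws I am matching are generated exactly as the posterior under p(\mu, \sigma^2) \propto 1/\sigma^2:

\sigma^2_{(g)} = \frac{(n-1)s^2}{X_{(g)}}, \quad X_{(g)} \sim \chi^2_{n-1};
\qquad
\mu_{(g)} \mid \sigma^2_{(g)} \sim \mathrm{N}\!\left(\bar{y}, \; \sigma^2_{(g)}/n\right).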

Glad it worked. I just added a section to the User’s Guide Efficiency Chapter on how to do this for the normal and the Poisson cases (we already had the bernoulli/binomial in place).

No worries about the language—I wasn’t offended.

Great. I need to check it out. I obviously need a better grounding in how Stan 1. builds its log likelihood (which includes the prior components), and 2. selects prior draws (especially from a uniform(-Inf, +Inf) prior). Is there a good reference for these? Thanks so much for your help!!


The Stan Reference Manual is the definitive guide, but the short answers are:

  1. It transforms all constrained parameters to unconstrained space for sampling. It then applies the inverse transform and the Jacobian change-of-variables correction to the log density. The transforms we use are listed in a chapter of the reference manual. That allows the user to work in the natural parameter space. All distribution statements (e.g., a ~ foo(b)) and target increment statements (e.g., target += foo_lpdf(a | b)) increment the target log density. That’s really all there is to it. Generated quantities don’t contribute to the target density—they’re for efficient posterior predictive inference w/o autodiff.

  2. We initialize uniform(-2, 2) on the unconstrained scale. So if you have a positive-constrained parameter, it’s log transformed, so that means initialization is (non-uniform) between (exp(-2), exp(2)). If you add offsets and multipliers to unconstrained parameters, they get used for initialization. So if I write real<offset=mu, multiplier=sigma> alpha, then initialization will be (-2, 2) on the unconstrained scale, but then we multiply by sigma and offset by mu to initialize uniformly in the interval mu + sigma * (-2, 2).
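
For example, a minimal sketch (the data variables here are made up for illustration):

data {
  real mu;              // location passed in as data
  real<lower=0> sigma;  // scale passed in as data
}
parameters {
  // sampled unconstrained and initialized uniform(-2, 2) there,
  // so alpha starts uniformly in mu + sigma * (-2, 2)
  real<offset=mu, multiplier=sigma> alpha;
}
model {
  alpha ~ normal(mu, sigma);
}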