Setting up priors correctly

I am recreating some model results from a paper, and am unsure whether I'm setting up the model properly in my Stan code.

The outcome is defined as:

Y ~ N(\theta_pbs, \sigma^2)

and the prior distributions for the parameters are:

[\theta_pbs] ~ N(\theta_0, \tau^2)

[\sigma^2] ~ IG(\alpha, \beta)

and the paper states that the following prior values are used:

\theta_0 = 0
\tau = 1
\alpha = 0.001
\beta = 0.001

My Stan code is set up as shown below.

data {
  int<lower=0> N_pbs;
  vector[N_pbs] y_pbs;
}
parameters {
  real theta_0;
  real tau;
  real sigma;
  real theta_pbs;
  real<lower=0.000001> alpha;
  real<lower=0.000001> beta;
  // Setting lower bounds here to restrict alpha, beta to positive real numbers
}
model {
  y_pbs ~ normal(theta_pbs, pow(sigma, 2));
  theta_pbs ~ normal(theta_0, pow(tau, 2));
  sigma ~ inv_gamma(alpha, beta);
}

The data, y_pbs, is just simulated from a normal distribution in R with mean 0 and standard deviation 0.8.


y_pbs = rnorm(n = 400, mean = 0, sd = 0.8)

I then run the sampling in R using rstan, with the following setup:

data = list(
   y_pbs = y_pbs, 
   N_pbs = length(y_pbs), 
   theta_0 = 0, 
   tau = 1, 
   alpha = 0.001, 
   beta = 0.001
)

mod_test <- stan_model(file = "first_stan.stan")

res <- rstan::sampling(mod_test,
                data = data,
                chains = 4,
                iter = 2000,
                seed = seed,
                cores = 4)

I ended up with the following warnings:

Warning messages:
1: There were 67 divergent transitions after warmup.
2: There were 3760 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10.
3: Examine the pairs() plot to diagnose sampling problems
4: The largest R-hat is 2.76, indicating chains have not mixed. Running the chains for more iterations may help.
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. Running the chains for more iterations may help.
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. Running the chains for more iterations may help.

I just want to make sure that I'm setting up the priors and model correctly, to determine whether the issues are related to the model specification itself.

Thanks for any help in advance.

The model seems a little strange. Maybe I can clarify some things.

  • As I understand it, your model describes the following scenario: y_{pbs} \sim \mathcal{N}(\theta_{pbs}, \sigma), with \theta_{pbs} \sim \mathcal{N}(\theta_{0}, \tau). It appears to be hierarchical by design, but a normal whose mean is itself normally distributed just marginalizes to another normal with standard deviation \sqrt{\sigma^{2}+\tau^{2}}, and as for the mean, \mathbb{E}(y_{pbs}) = \mathbb{E}(\theta_{pbs}) = \theta_{0}. If you want a Gaussian mixture or partial pooling, you need more information than is currently provided.

  • When you write y ~ normal(mu, sigma), the second argument is entered in Stan as a standard deviation, not a variance. So in effect your squaring means that your parameters sigma and tau are modelling the square root of the standard deviation. I am not sure that is what you want, and it may also cause issues for the sampler. On that note, I would recommend adding <lower=0> to sigma and tau to make sure the sampler is aware that \{\sigma, \tau\} \in \mathbb{R}_{>0}.

  • The paper gives 0.001 as prior values for alpha and beta, but in your code you only give them lower bounds of 0.000001. This doesn't make sense to me, and I think it is generally poor practice to use hard constraints like these. Why would you need these specific cutoffs? If you expect alpha and beta to be small, you can model them on the log scale and then exponentiate them in the transformed parameters or model block. As your code reads now, alpha and beta cannot go below 0.000001, which I have a feeling may not be what you're actually looking for.

  • There is no prior for tau, which seems a little strange since it is a variance hyperparameter. Adding a lower bound as previously mentioned, plus a proper prior, will help the fitting, and you may also want to consider a non-centered parametrization for theta_pbs. For example, using a dummy variable theta_pbs_raw:

parameters {
  real theta_pbs_raw; // dummy variable
}
transformed parameters {
  real theta_pbs = theta_pbs_raw * tau + theta_0; // scaling and shifting theta_pbs_raw
}
model {
  theta_pbs_raw ~ std_normal();
  tau ~ exponential(1); // just an example prior for a standard deviation, you can change it
  theta_0 ~ ?; // not sure what your expectation for the mean is, but put a prior here
  y_pbs ~ normal(theta_pbs, sigma);
}
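To illustrate the first point numerically, here is a small simulation (Python with NumPy, purely for illustration and not part of the original thread) showing that drawing y in two stages, first theta_pbs and then y given theta_pbs, is marginally the same as a single normal with standard deviation \sqrt{\sigma^2 + \tau^2}:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau, theta_0 = 0.8, 1.0, 0.0
n = 200_000

# Draw theta_pbs ~ N(theta_0, tau), then y ~ N(theta_pbs, sigma)
theta_pbs = rng.normal(theta_0, tau, size=n)
y = rng.normal(theta_pbs, sigma)

# Marginally, y ~ N(theta_0, sqrt(sigma^2 + tau^2))
print(y.mean())                       # close to theta_0 = 0
print(y.std())                        # close to sqrt(0.8^2 + 1^2) ~= 1.281
print((sigma**2 + tau**2) ** 0.5)
```

The empirical standard deviation of y matches \sqrt{\sigma^2 + \tau^2}, not \sigma, which is why this two-level setup collapses to a single normal unless more structure (e.g. multiple groups for partial pooling) is added.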

Hopefully this provides enough direction. Regarding priors for variance parameters, there are great posts such as this one for you to study, other forum posts discussing why hard constraints like the ones on your alpha and beta are not recommended, and general information about variance priors here, here, and here.

Thanks for the help, that is a good point about the marginal variance just being the sum of the two variances. Regarding the lower bounds for alpha and beta, I was basically trying to bound them to real numbers greater than 0 (to meet the requirements of the inverse gamma distribution). I was under the impression that <lower=0> was inclusive, is that incorrect?

I’ll take these comments into consideration and check out the links you provided as well. Thanks!

In practice it may be inclusive because of underflow; in theory it is exclusive for real numbers. Since you want the sampler to explore the distribution, which may have mass around 0.000001 rather than being hard-limited at 0.000001, the constraint is not good for sampling, even though it does act as (part of) a prior.
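Stan maps a parameter declared with <lower=L> to the unconstrained scale via the transform L + exp(x), so in exact arithmetic the bound is exclusive. A tiny numeric sketch (Python, just to illustrate the underflow point above) shows how floating point makes it effectively inclusive:

```python
import math

L = 0.000001   # the lower bound used for alpha/beta in the original code
x = -800.0     # a very negative value on the unconstrained scale

# Stan-style lower-bound transform: constrained = L + exp(unconstrained)
constrained = L + math.exp(x)

# exp(-800) underflows to 0.0, so the constrained value lands exactly on L
print(constrained == L)  # True
```

So the sampler can land exactly on the bound in floating point, even though the mathematical transform never reaches it.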

I quote from the manual (the second link I sent):

hard constraints are not recommended, for two reasons:
(a) Except when there are logical or physical constraints, it is very
unusual for you to be sure that a parameter will fall inside a specified
range, and (b) The infinite gradient induced by a hard constraint can cause
difficulties for Stan’s sampling algorithm. As a consequence, we recommend
soft constraints rather than hard constraints; for example, instead of
constraining an elasticity parameter to fall between 0 and 1, leave it
unconstrained and give it a normal(0.5, 0.5) prior distribution.