 # Bounded parameters in hierarchical models: Which hyperpriors to use

I have a lot of models involving bounded parameters. Usually, these are either bounded as positive or over the interval [0,1]. Since I have data from many participants, I usually try to use these models as part of a hierarchical modeling approach.

However, so far I have not found any usable (and maybe citable) recommendations on which hierarchical priors to use for these kinds of models.

With purely positive parameters, I usually use a log-normal for the individual parameters (or a normal that is then exponentiated, i.e. Matt’s trick). The priors for the parameters are a normal (on \mu) and a Cauchy (on \sigma). So something like this:

data {
int N;
}

parameters {
real mu_theta;
real<lower=0> sigma_theta;

vector[N] theta_z;
}

transformed parameters {
vector[N] theta = exp(theta_z*sigma_theta + mu_theta);
}

model {
mu_theta      ~ normal(0,1);
sigma_theta   ~ cauchy(0,1);

theta_z ~ std_normal();
}

generated quantities {
real mean_theta = exp(mu_theta + sigma_theta^2/2);
real var_theta  = exp(sigma_theta^2-1)*exp(2*mu_theta+sigma_theta^2);
}


However, with this approach, I get extremely spread out means, which don’t really fit my data (usually I can tell my means are much lower). I.e. the above gives me a prior distribution on the mean with a 95% HDI between 0.21 and 2.7E114 (some values also go to infinity, so mean is not defined). If I restrict \sigma so I don’t get usable values, then mean and var are correlated with r=.8, which also seems bad to me.

So another approach I have tried is to put the priors on mean and variance, and then calculating the distribution parameters based on these:

data {
int N;
}

parameters {
real<lower=0> mean_theta;
real<lower=0> var_theta;
vector<lower=0>[N] theta;
}

transformed parameters {
real  mu_theta = log(mean_theta/sqrt(1+var_theta/mean_theta^2));
real  sigma_theta = sqrt(log(1+var_theta/mean_theta^2));
}

model {
mean_theta   ~ exponential(1);
var_theta  ~ cauchy(0,1);

theta ~ lognormal(mu_theta,sigma_theta);
}


While this gives me much more usable values on mean, variance, and the individual parameter (also mean can now be controlled better), it also gives me warnings about bad mixing of the chains. I also tried similar approaches with priors on the group-level mode or median, with similar results.

I found a similar problem with parameters that are restricted to the interval [0,1]. In other models (e.g. StanDDM) I often found the approach to sample from a normal distribution and then transform the resulting values to the interval using either a logit or probit transformation. However, since the resulting logit-normal can become multimodal depending on the mean and variance of the normal, I found that this can lead to the sampler getting stuck (I assume this creates some kind of funnel geometry in the posterior). Also, with this approach, the resulting mean of the individual parameters is not even defined analytically, and high means become unlikely. This is bad, because it puts a prior on the group mean, that I often don’t want or which I want to control better.

An approach I have come up with instead is to sample the parameters of a beta distribution instead (the limits on \alpha and \beta ensure the distribution stays unimodal):

data {
int N;
}

parameters {
real<lower=1> alpha_theta;
real<lower=1> beta_theta;
vector<lower=0,upper=1>[N] theta;
}

transformed parameters {
real  mean_theta = alpha_theta/(alpha_theta+beta_theta);
real  var_theta =  alpha_theta*beta_theta/((alpha_theta+beta_theta)^2*(alpha_theta+beta_theta+1));
}

model {
alpha_theta ~ exponential(1);
beta_theta  ~ exponential(1);

theta ~ beta(alpha_theta,beta_theta);
}


While this approach seems to work fairly well in most cases, it severely limits my ability to add priors on the group-level mean of \theta.

I have by now even though of generalizing the triangular distribution as

f_{c,k}(x)=\begin{cases} \frac{(k+1)x^k}{c^k} \text{ for } x \leq c\\ \frac{(k+1)(1-x)^k}{(1-c)^k} \text{ for } x > c \end{cases}

so that I can set a prior on the group-level mode c\in[0,1] and use k>0 as a kind of centrality parameter. But I am a bit worried about inventing my own distribution without understanding what it might do the model.

Are there any good guidelines about using these kinds of constrained parameters in hierarchical models, that avoid the kind of problems I found above (i.e. too high values and/or funnels in the model). So far I have not found anything that really worked well for me.

I wonder if it is possible to code these boundaries like constraints instead of link functions: Horseshoe and regression coefficients with bounds ?

That way instead of putting a prior and transforming, you are putting the prior are the transformed thing.