# Looking for a way to specify a truncated lognormal prior to improve performance

Hi,

I am fairly new to Stan, and would like to know if it’s possible to require a lognormal prior to only return values within a specified range (i.e. specify a truncated lognormal distribution when that is not one of the available distributions listed in the Stan manual). I think I might have part of the answer to my question, but could use a little push. (The use case is also kind of interesting!) The full model background and details follow, but let me start with the meat of the question. I have some parameters that look like this:

parameters{
...
real<upper=2.2> alpha_8_1;
real<lower=2.2> alpha_8_4;
}


and in the model section I have:

model{
...
mu_alpha ~ normal(0,4); // weak but not uninformative
sigma_beta ~ cauchy(0,5); // weak but not uninformative
alpha_8_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_4 ~ lognormal(mu_alpha, sigma_alpha);
...
}


As you can see, in each iteration of the model, it’s completely possible for a draw from the prior for alpha_8_1, alpha_8_4, or both to lie outside the range allowed by the parameter’s constraint. Now if I remember this correctly and it applies to Hamiltonian Monte Carlo as well as ordinary MCMC, then in calculating the probably of a candidate jump getting accepted or rejected I’m multiplying by the prior probability, so that means that a particular jump might be more likely (100% likely?) to be rejected if the prior has a value completely outside the acceptable range for that variable. In other words, I would be seeing lots of autocorrelation between subsequent jumps within a chain, and potentially not just for parameter alpha_8_1 depending on the implementation (i.e. if parameters are accepted or rejected separately or in blocks). Here’s what I’m seeing (see the traceplot for alpha[14,1] below):

(Let’s ignore the within-chain label switching that’s causing the anomalous chain 4 for alpha[14,2] for the moment.) I suspect that I’m seeing autocorrelation in alpha[14,1] because of these issues with the priors.

I think that I could deal with the lower bound for alpha_8_4 by writing something like this:

   alpha_8_4 ~ lognormal(mu_alpha, sigma_alpha) + 2.2;


Is that right?

But then how do I specify the prior for alpha_8_1 to avoid getting draws greater than the threshold, 2.2? I think I would need to set the prior for that parameter to a truncated lognormal distribution, but I didn’t see any specification for such a distribution in the Stan manual. I can manufacture one, though, right? Since a lognormal distribution with given location and scale parameters can be written in terms of a normal distribution, then if I knew the exact location and scale parameters for my desired truncated lognormal distribution, I think I could probably calculate the necessary upper bound for the truncated normal distribution that would
generate my desired truncated lognormal distribution, and then actually implement that as Stan code. R code follows, as I’m still used to translating this into Stan code:

   X <- exp(mu + sigma*Z)


where X is the truncated lognormal variable, mu and sigma are respectively the location and scale parameters for the truncated lognormal distribution, and Z is a truncated normal variable that can be specified in Stan. But how in Stan would I compute the upper bound to use in specifying Z when mu and sigma are themselves hyperparameters?

That’s my question. Here’s more background if any of that doesn’t make sense:

MODEL

I’m fitting a five-dimensional compensatory two-parameter logistic IRT model. Right now I’m generating true model parameters and making sure I can recover them, and I’ve successfully done so for 1-D, 2-D, and 3-D models, and for one 5-D model in which the ability parameters are uncorrelated. Here I’ve specified that they are correlated, and I use spherical priors for the theta parameters to specify the units as one type of identifying constraint. (The model is still subject to rotational invariance, so the parameters that I get back from the model won’t match the “true” parameters and even if everything is correct, I will have to rotate the coordinate axes to recover the true parameters as per Reckase (2009).) In lower dimensions, I observe label switching between chains:

Here’s an example from a three-dimensional 2PL model where three distinct alpha parameters, one for each dimension, can be observed. It’s straightforward (albeit slow) to solve this problem by “relabeling” the chains, i.e. copying observations around in the resulting Stan model object. In five dimensions, the posterior distributions of the alpha parameters appear to be so highly multimodal that I start seeing label switching within chains as well as between them:

The true values of alpha[15,1], alpha[15,2], alpha[15,3], alpha[15,4] and alpha[15,5] are 1.312 ,1.367, 0.618 ,2.653 ,and 0.671 respectively, and the parameter values in the example above are close to but not exactly these values due to rotational invariance. In this case, alpha[15,4] has a posterior mean of about 3.0, but different chains capture this parameter during different iterations of the model. To deal with the label switching within chains, my current strategy is to inspect plots like this one, and look for chains in which there appears to be a clear break between draws for a parameter associated with one specific dimension of the difficulty parameters for a particular item and those associated with all the other dimensions for that item. In other words, I’m looking for a horizontal “line” containing as much white as possible that can separate the various chains. In this case I might assign a constraint to alpha[15,4] to force it to always be above about 2.5, and similarly constrain alpha[15,1], alpha[15,2], alpha[15,3], and alpha[15,5] to be less than 2.5. My current full model, badly written to roll something out quickly, is as follows:

// Useful resource: https://rfarouni.github.io/assets/projects/BayesianMIRT/BayesianMIRT.html
data{
int<lower=1> n_respondents;
int<lower=1> n_items;
int<lower=0,upper=1> Y[n_respondents,n_items];
int<lower=1> D; // number of dimensions: required to be 5 in this case
}
transformed data{
cov_matrix[D] Sigma;
// From p. 335 of the manual: indices for unconstrained alpha parameters. Badly written code here,
// just want to write something that works and then improve.
int<lower=1, upper=38> idxs[165,2] = { {1, 1}, {1, 2}, {1, 3}, {1, 4}, {1, 5},
{2, 1}, {2, 2}, {2, 3}, {2, 4}, {2, 5},
{3, 1}, {3, 2}, {3, 3}, {3, 4}, {3, 5},
{5, 1}, {5, 2}, {5, 3}, {5, 4}, {5, 5},
{6, 1}, {6, 2}, {6, 3}, {6, 4}, {6, 5},
{7, 1}, {7, 2}, {7, 3}, {7, 4}, {7, 5},
{9, 1}, {9, 2}, {9, 3}, {9, 4}, {9, 5},
{11, 1}, {11, 2}, {11, 3}, {11, 4}, {11, 5},
{12, 1}, {12, 2}, {12, 3}, {12, 4}, {12, 5},
{13, 1}, {13, 2}, {13, 3}, {13, 4}, {13, 5},
{14, 1}, {14, 2}, {14, 3}, {14, 4}, {14, 5},
{15, 1}, {15, 2}, {15, 3}, {15, 4}, {15, 5},
{16, 1}, {16, 2}, {16, 3}, {16, 4}, {16, 5},
{17, 1}, {17, 2}, {17, 3}, {17, 4}, {17, 5},
{18, 1}, {18, 2}, {18, 3}, {18, 4}, {18, 5},
{19, 1}, {19, 2}, {19, 3}, {19, 4}, {19, 5},
{21, 1}, {21, 2}, {21, 3}, {21, 4}, {21, 5},
{22, 1}, {22, 2}, {22, 3}, {22, 4}, {22, 5},
{23, 1}, {23, 2}, {23, 3}, {23, 4}, {23, 5},
{24, 1}, {24, 2}, {24, 3}, {24, 4}, {24, 5},
{25, 1}, {25, 2}, {25, 3}, {25, 4}, {25, 5},
{26, 1}, {26, 2}, {26, 3}, {26, 4}, {26, 5},
{27, 1}, {27, 2}, {27, 3}, {27, 4}, {27, 5},
{28, 1}, {28, 2}, {28, 3}, {28, 4}, {28, 5},
{29, 1}, {29, 2}, {29, 3}, {29, 4}, {29, 5},
{30, 1}, {30, 2}, {30, 3}, {30, 4}, {30, 5},
{31, 1}, {31, 2}, {31, 3}, {31, 4}, {31, 5},
{32, 1}, {32, 2}, {32, 3}, {32, 4}, {32, 5},
{33, 1}, {33, 2}, {33, 3}, {33, 4}, {33, 5},
{34, 1}, {34, 2}, {34, 3}, {34, 4}, {34, 5},
{35, 1}, {35, 2}, {35, 3}, {35, 4}, {35, 5},
{36, 1}, {36, 2}, {36, 3}, {36, 4}, {36, 5},
{37, 1}, {37, 2}, {37, 3}, {37, 4}, {37, 5} };
row_vector [D] mu_theta = rep_row_vector(0.0, D);
Sigma = rep_matrix(0.0, D, D);
for(i in 1:D){
Sigma[i,i] = 1.0;
} // Identity matrix as per Reckase (2009) and Chun Wang (2015)

}
parameters{
vector [D] theta [n_respondents];
vector<lower=0>[n_items*D-25] alpha_raw;
real<lower=2.0> alpha_8_4;
real<upper=2.2> alpha_8_1;
real<upper=2.2> alpha_8_2;
real<upper=2.2> alpha_8_3;
real<upper=2.2> alpha_8_5;
real<lower=2.15> alpha_10_1;
real<upper=2.15> alpha_10_2;
real<upper=2.15> alpha_10_3;
real<upper=2.15> alpha_10_4;
real<upper=2.15> alpha_10_5;
real<lower=1.7> alpha_38_3;
real<upper=1.7> alpha_38_1;
real<upper=1.7> alpha_38_2;
real<upper=1.7> alpha_38_4;
real<upper=1.7> alpha_38_5;
real<lower=1.6> alpha_4_5;
real<upper=1.6> alpha_4_1;
real<upper=1.6> alpha_4_2;
real<upper=1.6> alpha_4_3;
real<upper=1.6> alpha_4_4;
real<lower=1.4> alpha_20_2;
real<upper=1.4> alpha_20_1;
real<upper=1.4> alpha_20_3;
real<upper=1.4> alpha_20_4;
real<upper=1.4> alpha_20_5;
real mu_alpha;
// mu_alpha doesn't have to be lower-bounded at zero because only
// exp(mu_alpha) has to be positive.

vector[n_items] beta;
real<lower=0> sigma_alpha;

real<lower=0> sigma_beta;
real mu_beta;
}
transformed parameters {
matrix[n_items, D] alpha;
for(i in 1:165){
alpha[idxs[i,1], idxs[i,2]] = alpha_raw[i];
}
alpha[4,1] = alpha_4_1;
alpha[4,2] = alpha_4_2;
alpha[4,3] = alpha_4_3;
alpha[4,4] = alpha_4_4;
alpha[4,5] = alpha_4_5;
alpha[8,1] = alpha_8_1;
alpha[8,2] = alpha_8_2;
alpha[8,3] = alpha_8_3;
alpha[8,4] = alpha_8_4;
alpha[8,5] = alpha_8_5;
alpha[10,1] = alpha_10_1;
alpha[10,2] = alpha_10_2;
alpha[10,3] = alpha_10_3;
alpha[10,4] = alpha_10_4;
alpha[10,5] = alpha_10_5;
alpha[20,1] = alpha_20_1;
alpha[20,2] = alpha_20_2;
alpha[20,3] = alpha_20_3;
alpha[20,4] = alpha_20_4;
alpha[20,5] = alpha_20_5;
alpha[38,1] = alpha_38_1;
alpha[38,2] = alpha_38_2;
alpha[38,3] = alpha_38_3;
alpha[38,4] = alpha_38_4;
alpha[38,5] = alpha_38_5;
}
model{

// hyperpriors
sigma_alpha ~ cauchy(0,5);
mu_alpha ~ normal(0,4); // weak but not uninformative
sigma_beta ~ cauchy(0,5); // weak but not uninformative
mu_beta ~ normal(0.0, 4); // weak but not uninformative

// priors
for(i in 1:n_respondents){
theta[i] ~ multi_normal(mu_theta, Sigma);  // Reckase 2009
}
beta ~ normal(mu_beta,sigma_beta);
alpha_raw ~ lognormal(mu_alpha, sigma_alpha); // to_vector(alpha_raw)?
// Stan forum question: How do I do something like a truncated lognormal here?
alpha_4_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_4_2 ~ lognormal(mu_alpha, sigma_alpha);
alpha_4_3 ~ lognormal(mu_alpha, sigma_alpha);
alpha_4_4 ~ lognormal(mu_alpha, sigma_alpha);
alpha_4_5 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_2 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_3 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_4 ~ lognormal(mu_alpha, sigma_alpha);
alpha_8_5 ~ lognormal(mu_alpha, sigma_alpha);
alpha_10_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_10_2 ~ lognormal(mu_alpha, sigma_alpha);
alpha_10_3 ~ lognormal(mu_alpha, sigma_alpha);
alpha_10_4 ~ lognormal(mu_alpha, sigma_alpha);
alpha_10_5 ~ lognormal(mu_alpha, sigma_alpha);
alpha_20_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_20_2 ~ lognormal(mu_alpha, sigma_alpha);
alpha_20_3 ~ lognormal(mu_alpha, sigma_alpha);
alpha_20_4 ~ lognormal(mu_alpha, sigma_alpha);
alpha_20_5 ~ lognormal(mu_alpha, sigma_alpha);
alpha_38_1 ~ lognormal(mu_alpha, sigma_alpha);
alpha_38_2 ~ lognormal(mu_alpha, sigma_alpha);
alpha_38_3 ~ lognormal(mu_alpha, sigma_alpha);
alpha_38_4 ~ lognormal(mu_alpha, sigma_alpha);
alpha_38_5 ~ lognormal(mu_alpha, sigma_alpha);

// likelihood
// More efficient to write loops column by column as matrices are stored internally in column-major order (Stan manual p. 327)
for(j in 1:n_items){
for(i in 1:n_respondents){
real p;
p = inv_logit(row(alpha,j)*theta[i]+beta[j]);
Y[i,j] ~ bernoulli(p);
}
}
}
generated quantities{
vector[n_items] log_lik[n_respondents];
vector[n_items] pY[n_respondents];
for(j in 1:n_items){
for(i in 1:n_respondents){
real p;
pY[i,j] = inv_logit(row(alpha,j)*theta[i]+beta[j]);
log_lik[i,j] = bernoulli_lpmf(Y[i,j] | pY[i,j]);
}
}
}


Note that to get a model like this one to work, I need to force the initial values to satisfy the constraints. Here’s some R code for that:

# For initial value generation.
generate.alpha.with.upper.bound <- function(meanlog, sdlog, upper){
result <- Inf
while(result > upper){
result <- rlnorm(1, meanlog=meanlog, sdlog=sdlog)
}
return(result)
}

# For initial value generation.
generate.alpha.with.lower.bound <- function(meanlog, sdlog, lower){
result <- -Inf
while(result < lower){
result <- rlnorm(1, meanlog=meanlog, sdlog=sdlog)
}
return(result)
}
set.seed(12345)
inits <- list()
for(i in 1:4){
chain.inits <- list()
chain.inits$alpha_4_1 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.6) chain.inits$alpha_4_2 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.6)
chain.inits$alpha_4_3 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.6) chain.inits$alpha_4_4 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.6)
chain.inits$alpha_4_5 <- generate.alpha.with.lower.bound(meanlog=0, sdlog=1, lower=1.6) chain.inits$alpha_8_1 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.2)
chain.inits$alpha_8_2 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.2) chain.inits$alpha_8_3 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.2)
chain.inits$alpha_8_4 <- generate.alpha.with.lower.bound(meanlog=0, sdlog=1, lower=2.0) chain.inits$alpha_8_5 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.2)
chain.inits$alpha_10_1 <- generate.alpha.with.lower.bound(meanlog=0, sdlog=1, lower=2.15) chain.inits$alpha_10_2 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.15)
chain.inits$alpha_10_3 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.15) chain.inits$alpha_10_4 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.15)
chain.inits$alpha_10_5 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=2.15) chain.inits$alpha_38_1 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.7)
chain.inits$alpha_38_2 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.7) chain.inits$alpha_38_3 <- generate.alpha.with.lower.bound(meanlog=0, sdlog=1, lower=1.7)
chain.inits$alpha_38_4 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.7) chain.inits$alpha_38_5 <- generate.alpha.with.upper.bound(meanlog=0, sdlog=1, upper=1.7)
inits[[i]] <- chain.inits
}


Here are the simulated data I’m working with:

5D_IRT_data.csv (105.8 KB)

(If you’d like to critique anything else of course, please go ahead since I’m new and that might help someone else here as well.) Thanks very much for any insight you could provide!

Best,

Richard

Your parameters that have lognormal priors need <lower = 0> bounds, as well as any additional truncation you wish to specify. See here.

The MCMC sampler does not draw from the prior distribution. Additionally, I don’t see how the traceplots that follow demonstrate the latter point?

parameters {
real <lower = 2.2> alpha_8_4;
}


already constraints alpha_8_4 to be greater than 2.2.

parameters {
real <lower = 0, upper = 2.2> alpha_8_1;
}

model {
alpha_8_1 ~ lognormal(mu_alpha, sigma_alpha);
}


The declared bounds will constrain alpha_8_1 to be between [0, 2.2]. Placing the lognormal distribution on alpha_8_1 just changes the shape of the prior from uniform between [0, 2.2] to having truncated lognormal between [0, 2.2].

You shouldn’t need to do this once you change the parameter declarations to have appropriate bounds.

Higher level thoughts: relabeling quickly becomes infeasible, you need to reparametrize / break the symmetry in the model in order to ensure accurate estimation. Also I’m not sure what you mean by rotational invariance?

Quite right! Rookie mistake there. I should have recognized that was the bug, instead I specified initial values myself. In this case the chains quickly converged to positive alpha values that never crossed zero, so I ended up getting back my simulated parameter values without issue. But yes, that is definitely a bug. It’s possible that the bug may have extended the required burn-in time for the chains, but the traceplots show alpha parameters with draws that are never dropping below zero for any samples. So my original question about improving performance is still valid, I’d say.

What I think the traceplots are showing is less-than-ideal autocorrelation. My understanding is that when that happens, it indicates candidate jumps that are getting rejected by the sampler. If I’m remembering the Metropolis algorithm correctly (see here for example), the acceptance probabilities have the prior probabilities as factors. So if p(\alpha)=0 because lognormal(mu_alpha, sigma_alpha) draws a value that does not satisfy the constraint on that parameter, then the entire acceptance probability becomes zero, and we’ll see two consecutive draws with the same value in the traceplot, i.e. autocorrelation. So I agree with you that currently I’m already drawing values for, say, alpha_8_1 from a truncated lognormal distribution, but the problem is that I think I’m doing so inefficiently.

Quite right! Let me share some new plots. Do you see the constraints on alpha_20_1 through alpha_20_5 in the full model code? Again, there should be a lower bound of zero on those parameters, but I got lucky this time and here are the resulting traceplots:

In this traceplot, you can see how the upper bound of 1.4 on alpha[20,1] and the lower bound of 1.4 on alpha[20,2] are likely slightly biasing the posterior means for those parameters (downwards for alpha[20,1] and upwards for alpha[20,2]. There seemed to be no way around it in my case, because the posterior for dimension 2 overlapped with those for the other dimensions even more for the other 37 items. Here’s one problem that creates:

See the odd peak in the draws for alpha[32,2] around the 700th draw and corresponding trough for alpha[32,1]? That’s label switching going on: I think the posterior draw for alpha[20,1] approached 1.4 from below, the posterior draw for alpha[20,1] approached 1.4 from above, and during the next jump alpha[20,1] jumped above 1.4 and dimension 1 was relabeled as dimension 2, and alpha[20,2] simultaneously jumped below 1.4 and dimension 2 was relabeled as dimension 1. The chains continued merrily for 50 draws before the reverse happened. So interesting!

This is a really cool problem that specifically shows up for multidimensional IRT
models, and Reckase (2009), Multidimensional Item Response Theory, gives it good coverage. The probability of observing the correct response on a test is a function of the alpha, beta, and theta parameters. Now let’s say we’ve graphed a respondent’s abilities in five dimensions. Well, we can move the origin, rotate the coordinate axes any number of ways, or change the scale for any of the coordinate axes, and there will always be another set of parameter values that will produce exactly the same probabilities of correct response to the items. Reckase (2009) also gives his readers an algorithm for moving the origin and rotating / rescaling the coordinate axes to map one set of responses into a coordinate system captured by another set of responses (it’s basically a non-orthogonal Procrustes problem although I couldn’t find a Wiki page for the non-orthogonal variant).

I threw chain 2 out of the analysis and recovered my simulated parameter values after doing the Procrustes transformation. In particular, after the rotation the posterior means for alpha[15,] are 1.3896907 1.5301941 0.8472825 2.5893580 0.8807151, and the simulated values were [1.312 1.367 0.618 2.653 0.671]. Not bad! 93.7% of the alpha parameters had true values within the (frequentist) 95% posterior confidence intervals. I suspect that some accuracy was lost during the Procrustes transformation, and the small amount of bias visible in the traceplots for alpha[20,] could add a little more inaccuracy. And maybe fixing those bugs in the code would help.

So it’s nice to get this far! But I’d still very much love to improve the model performance. I really appreciate any ideas.

When the parameters have the appropriate constraints declared on them, Stan will never propose a value outside the feasible region (outside of those constraints). There is a transformation to the unconstrained scale going on under the hood that means [a, b] gets mapped to (-\infty, \infty) for the purposes of sampling, and then back to [a, b] for the purposes of inference, so you won’t ever see a rejection due to a value being proposed from outside [a, b].

Have you thought about defining these parameters in terms of positive_ordered types?

Excuse me for taking awhile to get back. I don’t think that would fix the problem in question, would it? Let’s say I have two variables, \alpha_1 and \alpha_2 corresponding to two dimensions of a finite mixture or IRT model, with 0 < \alpha_1 < \alpha_2. What I’m saying is that during estimation, two chains switch labels exactly at the same time that we get a draw with \alpha^*_1 > \alpha^*_2, such that the draws get accepted and during the next iteration of the chain we still have \alpha_1 < \alpha_2, since dimensions 1 and 2 have swapped places. But even though positive_ordered types don’t seem like an obvious fix in theory, maybe it would work in practice. I’ll put that on the back burner and post back if it works, but in the meantime I’m still wondering how to specify a truncated lognormal prior. Thanks for the responses!

If you have a constraint in the model that 0 < \alpha_1 < \alpha_2, then you want to declare

positive_ordered alpha;


That will enforce the constraint. If you add independent priors, say half-normal,

alpha ~ normal(0, 1);


then the positive ordered alpha has the same posterior as taking two positive values,

parameters {
vector<lower = 0> alpha_raw;
...
model {
alpha_raw ~ normal(0, 1);
...
transformed parmaeters {
positive_ordered alpha = sort_asc(alpha_raw);


For a trunctaed lognormal prior, there’s a chapter in the manual on truncation. If you want beta to have a lognormal truncated below by 0.3 say, you just do this:

real<lower = 0.3> beta;
...
beta ~ lognormal(mu, sigma) [0.3, ];


You con’t need the truncation brackets if mu and sigma are both constants.