Hi everyone,
The general approach to making models faster is, where possible, to have the parameters be closer to multivariate normal.
Suppose I have the following toy model:
data{
int n;
vector[n] y;
}
parameters{
real mu;
real<lower=0> sig;
}
model{
y ~ lognormal(mu, sig);
}
Presumably the following model is equivalent but much easier to sample from. Is this right?
data{
int n;
vector[n] y;
}
transformed data{
vector[n] log_y;
for(i in 1:n) log_y[i] = log(y[i]);
}
parameters{
real mu;
real<lower=0> sig;
}
model{
log_y ~ normal(mu, sig);
}
Is there any reason not to do what’s listed above in general? It seems to me that in almost all circumstances the lognormal() sampling statement is convenient but likely to lead to more work for NUTS.
Thanks
Everything gets unconstrained in the background anyway – the key, as far as I understand it, is to find ways of specifying the model so that the unconstrained parameters do not vary their curvature / relation to other parameters too much across different points in the space.
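For example, here is a rough hand-written sketch of what that unconstraining amounts to for a <lower=0> parameter like sig in the toy model above (an illustration of the idea, not Stan's internal code):

data {
  int n;
  vector[n] y;
}
parameters {
  real mu;
  real log_sig;                 // unconstrained stand-in for real<lower=0> sig
}
transformed parameters {
  real<lower=0> sig = exp(log_sig);
}
model {
  target += log_sig;            // Jacobian: log |d/d(log_sig) exp(log_sig)| = log_sig
  y ~ lognormal(mu, sig);
}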
The approach you proposed is unlikely to help much. The important part for Stan’s sampler is how the parameters behave, and what you have shown gives exactly the same posterior for the same parameter values: the lognormal density differs from the normal density of log(y) only by a Jacobian term that does not depend on mu or sig. There might be a small performance gain from precomputing log_y, but that is likely to be negligible.
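To make that explicit, here is a sketch (not part of either model above) of the same likelihood written with target +=, so the constant term is visible:

model {
  // lognormal_lpdf(y | mu, sig) == normal_lpdf(log(y) | mu, sig) - sum(log(y))
  // the -sum(log(y)) piece involves only data, so it shifts the log density
  // by a constant and has no effect on the posterior geometry
  target += normal_lpdf(log(y) | mu, sig) - sum(log(y));
}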
It would be a different story if y were a parameter. Let’s look at 3 cases:
A)
parameters{
real mu;
real<lower=0> sig;
real<lower=0> y;
}
model {
y ~ lognormal(mu, sig);
//do something more with y
}
B)
parameters{
real mu;
real<lower=0> sig;
real log_y;
}
transformed parameters {
real y = exp(log_y);
}
model {
log_y ~ normal(mu, sig);
//do something more with y
}
C)
parameters{
real mu;
real<lower=0> sig;
real log_y_raw;
}
transformed parameters {
real y = exp(log_y_raw * sig + mu);
}
model {
log_y_raw ~ normal(0, 1);
//do something more with y
}
Here A) and B) are basically equivalent, because for parameters with a lower bound, Stan does exactly this log transform under the hood. However, in many cases C) would be preferable to both A) and B), as log_y_raw is less tangled with mu and sig. This is also called the “non-centered parametrization”.
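As a concrete sketch of the same idea for a whole vector of latent values (the names here are illustrative, not from the models above):

data {
  int<lower=0> n;
}
parameters {
  real mu;
  real<lower=0> sig;
  vector[n] log_y_raw;
}
transformed parameters {
  // each y[i] is lognormal(mu, sig), but the sampler sees n independent standard normals
  vector[n] y = exp(mu + sig * log_y_raw);
}
model {
  log_y_raw ~ std_normal();
  // do something more with y
}

In general, whether C) actually helps depends on how strongly the rest of the model informs y; the non-centered form tends to do better when that information is weak.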
Does that make sense?