Initialization failed: relation between priors and bounds

I am having problems in the initialization phase of a collection of models. I have a generalized extreme value (GEV) likelihood for my data, and I am modeling some of its parameters as time series (sometimes I consider a parameter constant, sometimes autoregressive of order 1, and sometimes I include a seasonality term), which I believe is where the problem lies.

The following model (an example from the set) fails to initialize. The location parameter follows an autoregressive model of order 1 (AR(1)) with a seasonality term of order 12; the scale parameter is an AR(1).

```stan
#include ../gev_vvv.stan
data {
    int<lower=0> N;
    vector<lower=0>[N] y;
    array[N] int idxs;
    real<lower=3> nu; // Number of sigmas used to keep parameters away from zero
}
parameters {
// Location
    real<lower=0> locm;
    vector<lower=0>[N] loct;
    real<lower=0> sigmal;
    real<lower=0, upper=1> Phil;
    real<lower=0, upper=1> Laml;
// Scale
    real<lower=0> escm;
    vector<lower=0>[N] esct;
    real<lower=0> sigmae;
    real<lower=0, upper=1> Phie;
// Shape
    real<lower=-.5, upper=.5> form;
}
transformed parameters {
    vector<lower=-.5, upper=.5>[N] fort;

// LOCATION
    real<lower=0, upper=1> sigmaol = nu * sigmal / sqrt((1 - Phil^2) * (1 - Laml^2)) / locm;
    real<lower=0> sigmasl = sqrt(sigmal^2 / (1 - Laml^2));
// SCALE
    real<lower=0, upper=1> sigmaoe = nu * sigmae / sqrt((1 - Phie^2)) / escm;
// SHAPE
    real<lower=0, upper=1> formplushalf = form + .5;

    for (i in 1:N) {
        fort[i] = form;
    }
}
model {
// LOCATION
    locm ~ gamma(5, 1./20);
    sigmaol ~ beta(2, 4);
    Phil ~ beta(2, 4);
    Laml ~ beta(2, 4);
// SCALE
    escm ~ gamma(4, 1./10);
    sigmaoe ~ beta(2, 4);
    Phie ~ beta(2, 4);
// SHAPE
    formplushalf ~ beta(4, 4);

// LOCATION
    loct[1] ~ normal(locm, sigmaol * locm / nu);
    loct[2:13] ~ normal(Phil * loct[1:12] + (1. - Phil) * locm, sigmasl);
    loct[14:N] ~ normal(Phil * loct[13:(N-1)] + Laml * loct[2:(N-12)] - Phil * Laml * loct[1:(N-13)] + (1. - Phil - Laml + Phil * Laml) * locm, sigmal);
// SCALE
    esct[1] ~ normal(escm, sigmaoe * escm / nu);
    esct[2:N] ~ normal(Phie * esct[1:(N-1)] + (1. - Phie) * escm, sigmae);
// SHAPE

    y[idxs] ~ gev(loct[idxs], esct[idxs], fort[idxs]);

// LOCATION
    target += -log(sigmaol) + log(sigmal);
// SCALE
    target += -log(sigmaoe) + log(sigmae);
// SHAPE

}
```

I am afraid that I may have severely misunderstood some of the concepts related to bounds, priors, and changes of variables, and I cannot wrap my head around the problem, so I would really appreciate a helping hand.

My thinking goes as follows:

When I define a parameter as an AR(1), this implies that I have the following parameters:
(1) the average of the process (\mu_m, locm in the code)
(2) the autocorrelation coefficient (\Phi_l, Phil in the code)
(3) the standard deviation of the noise process (\sigma_l, sigmal in the code).

In the example I have an additional parameter introduced by the seasonal component, but its behavior is similar, so I exclude it from this explanation.

The standard deviation of the AR(1) is \sigma^* = \dfrac{\sigma_l}{\sqrt{1 - \Phi^2}}, and assuming my AR(1) process will always stay within 5\sigma^* of the mean[^1], I can define \sigma_{ol} = \dfrac{5 \sigma_l}{\mu_{m}\sqrt{1 - \Phi^2}}, bound it to be between 0 and 1, and give it a beta prior distribution.
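For reference, the derivation I am relying on, assuming the usual mean-reverting AR(1) form (my notation), is

$$
\begin{aligned}
x_t &= (1 - \Phi)\,\mu_m + \Phi\, x_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_l^2),\\
\operatorname{Var}(x_t) &= \Phi^2 \operatorname{Var}(x_{t-1}) + \sigma_l^2 \quad\Longrightarrow\quad \operatorname{Var}(x_t) = \frac{\sigma_l^2}{1 - \Phi^2} \text{ at stationarity.}
\end{aligned}
$$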

[^1]: I know this is not always true, but I believe it may be a good enough approximation (I may be wrong, though).

If I give a prior and bounds to \mu_m, \Phi, and \sigma_{ol}, does Stan automatically sample \sigma_l in such a way that all bounds are respected? Or, since \sigma_l has no upper bound, can it reach arbitrarily large values despite all the specified priors?

I do not understand why, when Stan tries to initialize \sigma_{ol}, it produces values larger than 1, which in my mind should be impossible since the variable is bounded between 0 and 1 and assigned a beta prior. What have I not understood here? [I know the answer is probably _so many things_, but I would really appreciate a minimal set here :D]

Also, since I am giving a prior to \sigma_{ol} instead of to \sigma_{l}, I am including the log determinant of the Jacobian of the inverse transformation. Is that the correct procedure? I ask because in some cases the models seem to initialize if I don't include the determinant, which seems counterintuitive to me.

I hope to have exposed my case clearly, but if not, please feel free to require as many clarifications as needed.

I really appreciate any help and hint that you can give me.


If I understand correctly, \sigma_{ol} is one of the transformed parameters, so Stan isn’t explicitly initializing this at all. Stan only initializes variables declared in the parameters block. To clarify, the initialization procedure that Stan does automatically behind the scenes is as follows:

  1. For every variable in the parameters block:
    1a. Generate an initial value by sampling uniformly from (-2, 2) on the unconstrained scale (at least, that's the default range if you don't supply a value to cmdstan's init argument).
    1b. If the variable is unbounded, use that value; if the variable is bounded, impose the bounds by applying a transform that achieves the desired bounds, and automatically add the Jacobian associated with that transform so that the user can supply a prior for the resulting bounded variable.

  2. For every variable in the transformed parameters block:
    2a. Compute the variable given its relation to the initialized parameters.
    2b. If the variable is bounded check that the value of the variable as computed falls within the bounds; if the check fails, terminate the initialization attempt and start back at 1 (generating new random inits for all parameters).

So as a user you're responsible for making sure that the way you derive transformed parameters from the parameters yields values that meet the bounds you declare for the TP. Since there's a re-try loop, there's a bit of wiggle-room here: initialization can succeed even when the P->TP relationship doesn't fully guarantee that the TP satisfies its bounds, so long as there's a sufficiently high likelihood that it will. But that can be fragile workflow-wise, and I find it easiest to only ever use transforms I know are guaranteed to yield the desired bounding behaviour.
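As a minimal sketch of the failure mode (a toy model, not yours): the declared bounds on a TP are *checked*, not *enforced*, so nothing prevents its defining expression from landing outside them:

```stan
parameters {
  real<lower=0> a;
  real<lower=0> b;
}
transformed parameters {
  // The bound check happens here at every evaluation, including during
  // initialization. Since a > b is perfectly possible under the random
  // inits, this TP can violate its upper bound, in which case Stan
  // rejects the initialization attempt and retries.
  real<lower=0, upper=1> ratio = a / b;
}
model {
  a ~ gamma(2, 2);
  b ~ gamma(2, 2);
}
```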

To discern where your transforms might be going awry, try removing the bounds on the TPs and adding print statements immediately after their computation to show the parameter values and the TP values; that should reveal which TPs fall outside your expected bounds. From there you can check that you did your computations as intended, and possibly motivate either different bounds or different computations.
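Applied to your model, that might look something like this (a sketch of just the location piece):

```stan
transformed parameters {
    // Bounds removed while debugging; the print() shows which random inits
    // push sigmaol outside (0, 1) during the failed initialization attempts.
    real sigmaol = nu * sigmal / sqrt((1 - Phil^2) * (1 - Laml^2)) / locm;
    print("sigmal = ", sigmal, ", Phil = ", Phil, ", Laml = ", Laml,
          ", locm = ", locm, " -> sigmaol = ", sigmaol);
}
```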


Thank you very much @mike-lawrence. Indeed, the problem is that I thought that Stan also initialized transformed parameters. The transformation I was making did not ensure that the bounds were respected and from there the initialization problem arose.

If I understand the implications of the initialization procedure correctly, it only makes sense to impose priors on parameters, am I right? That is, it does not serve any purpose to impose a prior on a transformed parameter, or does it?

Finally, is the prior supposed to respect the bounds imposed on a given parameter? From your explanation of the initialization procedure, it seems that Stan initializes the chain to some values irrespective of the prior, which does not make much sense to me, so I imagine I am still missing some nuance here.

Again, thank you very much for your help and your prompt response.

Under the hood, the Stan syntax that we call “putting a prior on a parameter” just means “increment the target density by some specific function of the parameter”. Thus:

Imposing priors on transformed parameters has meaning in a Stan model, because incrementing the target by a function of a transformed parameter modifies the prior that is being placed on the untransformed parameters. The most straightforward example involves placing a prior on a transformed parameter while implementing the relevant Jacobian adjustment. This creates a prior distribution that behaves as if you had declared the transformed parameter to be a parameter and placed a prior on it (note, however, that the details of the sampling algorithm will not be the same; the sampling always happens on the unconstrained scale, over the parameters declared in the parameters block).
Without Jacobian adjustments, "putting priors" on transformed parameters (when the transform isn't linear and one-to-one) increments the target in ways that don't correspond to our notion of "sampling the transformed parameter from the probability distribution implied by the sampling statement". In general this is inadvisable unless you really know what you're doing.
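To make the first point concrete, here is a minimal sketch (a toy prior, not your model) of placing a prior on a transformed parameter together with its Jacobian adjustment:

```stan
parameters {
    real<lower=0> sigma;
}
transformed parameters {
    real<lower=0> sigma_sq = square(sigma);
}
model {
    // Prior on the transformed parameter...
    sigma_sq ~ gamma(2, 2);
    // ...plus the log absolute Jacobian of the map sigma -> sigma^2,
    // i.e. log|d(sigma^2)/d(sigma)| = log(2 * sigma).
    target += log(2) + log(sigma);
    // Together these behave as if sigma_sq were a parameter with a
    // gamma(2, 2) prior, though the sampler still works on sigma's
    // unconstrained scale.
}
```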

If the prior doesn’t respect the bounds, then you just get a different prior. For example, if we have a parameter bounded below by zero, and we write a line of Stan code that looks as though we are placing a standard normal prior on the parameter, we have simply given the variable a half-normal prior. The prior gets truncated according to the constraint. Again, this is because under the hood, the syntax that we call “putting a prior on a parameter” just means “increment the target density by some specific function of the parameter”. The target density that respects the bounds gets incremented, and there’s no density elsewhere to increment.

Yes, this is what Stan does. As a consequence, while it is fine to declare a prior that doesn’t respect a constraint (see above), it is not a good idea to fail to declare a constraint that the prior respects. For example, if the prior is lognormal, make sure that the parameter is declared with a lower bound of zero.
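A sketch of that situation:

```stan
parameters {
    // Without this lower bound, the random inits could propose negative
    // values, lognormal_lpdf would evaluate to negative infinity, and
    // initialization would fail.
    real<lower=0> tau;
}
model {
    tau ~ lognormal(0, 1);
}
```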
There are reasons for ignoring the prior during initialization, including the computational problems that can arise when inits are sampled from the tails of overly weak priors, the fact that Stan supports improper priors, the fact that Stan supports non-generative priors that, depending on the geometry, can be difficult to simulate from, and probably a bunch more. Perhaps the best reason is that the default initialization has proven successful in practice across a relatively broad range of models.


Thank you very much @jsocolar. Your explanations have been very clarifying. With yours and @mike-lawrence's help, I believe I have a better grasp of what Stan is doing and the implications it has for the model I am implementing.

In a more general setting, I am still unsure when a change of variables should be preferred over a reparameterization. In the cases that I can imagine, very similar to the ones shown in chapter 22 of the user manual, they seem to be equivalent procedures, mostly differing in their expressiveness, but this impression may be quite biased by my ignorance of the subject. Are there situations where one or the other approach is more suitable?
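For instance, if I have understood the manual correctly, these two formulations should target the same distribution for \theta (a sketch; the names are mine):

```stan
// Change of variables: declare theta on the constrained scale, put the
// prior on log(theta), and add the Jacobian of the transform.
parameters {
    real<lower=0> theta;
}
model {
    log(theta) ~ normal(0, 1);
    target += -log(theta); // log|d log(theta) / d theta| = -log(theta)
}
```

```stan
// Reparameterization: declare the unconstrained scale directly and
// recover theta deterministically; no Jacobian is needed.
parameters {
    real log_theta;
}
transformed parameters {
    real<lower=0> theta = exp(log_theta);
}
model {
    log_theta ~ normal(0, 1);
}
```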

Once again, thank you very much for your help.