Non-linear model with AutoRegressive random effects and divergences

Dear Community,

It is my first post ever on this post.

I am trying to use Stan in order to fit a non-linear model for count data collected along times t=1,…T, over regions g=1,…,G.
Without delving too much into the details, the modeling assumption is that:

Y_{gt}\sim Poi(\mu_g(t))


\log(\mu_g(t))=lOff_g + \theta_{gt} + log( b_g + r_g \cdot s_g \cdot h_g \cdot \exp(h_g(p_g-t)) \cdot (1+\exp(h_g(p_g-t))^{-s_g-1})

with lOff_g is a constant offset. The multiplying random effects have an AR prior:

\theta_{gt} \sim N(\rho \cdot \theta_{gt-1}, \sigma)

At the current moment there is no dependence among regions g, so the subscript g can be practically ignored.

Before going and fit my model on real data, I always make some first tries on data simulated from the model itself.
The fit, obtained by simulating from the posterior marginal, looks always good. You can see it here on one region:

However, unless I specify very tight priors for the parameters, I always get many divergences.
Furthermore, even when avoiding divergences using tight priors, the parameters’ chains show poor convergence (high R-hat). Here you can see the pairs() plot from Stan, on some selected parameters (from the curve first, from the random effects after).

PairRich PairThetas

I know the non-linear model I use looks pretty complex, but as far as I do not introduce the AutoRegressive random effects theta, everything goes fine.
Any suggestion on how I can improve the results I get from Stan?

Please, find below the Stan code.

functions {
  real log_RichLin(real lr, real lh, real pe, real ls, real ti) {
    real h = exp(lh);
    real s = exp(ls);
    real lout;
    real lhpt = log(1+exp(h * (pe-ti)));
    if (is_inf(lhpt))
        lhpt = h * (pe-ti);
    lout = lr + ls + lh + h*(pe-ti) - (s+1) * lhpt;

    return lout;

data {
        // Data block
        int<lower=0> N; // Number of obs
        int Y[N]; // Vector of Obs

        int<lower=0> Nreg;  // Number of regions
        int<lower=0> Ntimes;  // Number of times
        real<lower=0> t[Ntimes]; // Times
        vector[N] lOff;   //Offset
        int rId[N]; // Associate obs to region
        int tId[N]; // Associate obs to time

parameters {  
    // Parameters block
        // Baseline parameter
        real logbase[Nreg];
        // Richard's parameter
        real logr[Nreg];
        real logh[Nreg];
        real p[Nreg];
        real logs[Nreg];
        vector[Nreg] theta[Ntimes];
        real<lower=0> taut;
        real<lower=-1, upper=1> rho;

transformed parameters {
        // Transformed parameters block
        vector[N] lm;
        real<lower=0> sigmat;
        // SD from Precision
        sigmat = 1/taut;
        // Mean function
        for (i in 1:N) {
            lm[i] = theta[tId[i]][rId[i]]*sigmat + 
                    log(exp(logbase[rId[i]]) +
                    exp(log_RichLin(logr[rId[i]], logh[rId[i]], p[rId[i]], logs[rId[i]], t[tId[i]])));
        lm = lOff + lm;

    model {
        // Model block
        // Baseline priors
        logbase ~ normal(0, 10);
        // Richards Priors
        logr ~ normal(0, 10);
        logh ~ normal(0, 10);
        p ~ normal(t[Ntimes]/2, t[Ntimes]/(2*1.96));
        logs ~ normal(0, 10);
        // Random effects
        theta[1] ~ normal(0, 1);
        for (i in 2:Ntimes){
          theta[i] ~ normal(rho*theta[i-1], 1);
        taut ~ gamma(2, 2);

        // likelihood
        Y ~ poisson_log(lm);

    generated quantities{
        // Output block
        real Y_pred[N];
        vector[N] log_lik;
        Y_pred = poisson_log_rng(lm);  // Posterior predictive distribution
        for (i in 1:N){
              log_lik[i] = poisson_log_lpmf(Y[i] | lm[i]);
1 Like

sorry for not getting to you earlier, your question is relevant and well written.

Especially great that you found the exact component that breaks your model.

That is often a sign that the model is not well identified - i.e. there are many very different model configurations that are consistent with the data. A common case is that say two parameters enter the likelihood only as their sum. So as long as the sum is correct, the predictions look good, but the individual parameters are not determined and can make prolems for the sampler.

The pairs plots show a few worrying patterns. One is between logs[1] and logh[1] (and partially also with p[1]

Note sure about the cause for now, but I’ll note there are several things that seem to differ between the mathematical description and the code:

  • the exp(log_RichLin) probably should be just log_RichLin.
  • You seem to assume that b_g = 0

It is often useful to reparametrize AR terms via a set of standard normal distributions, so instead of:

       theta[1] ~ normal(0, 1);
       for (i in 2:Ntimes){
          theta[i] ~ normal(rho*theta[i-1], 1);

you would have something like (code not tested, just a sketch):

theta_raw ~ normal(0,1);

theta[1] = theta_raw[1] * sigma_t;
 for (i in 2:Ntimes){
          theta[i] = theta[i - 1] * rho + theta_raw * sigma_t;

Additional suggestions that are likely not the main cause, but could make your model work better:

  • log(exp(logbase[rId[i]]) is probably equivalent to just logbase[rId[i]]
    • There are several other cases of this “more than exponential” effects - lh goes through 3 exponentiations before affecting mu (one in log_RichLin, then when computing lm and finally in the poisson_log. Even a minute change in the parameter is likely to propagate to enormous change in the outcome, so normal(0,10) prior is likely extremely wide for this parameter.
  • A more stable way to compute log(1+exp(...) is provided by log1p_exp
  • Some of the computations might be better expressed with log_sum_exp (but not completely sure about this…)

Hope at least some of this helps. If not, or if the advice is unclear, please ask for further clarifications.

Dear Martin,

Thank you for your answer! You are totally right about most of the points you raise.
I soon realized that the weird dependence between logs, logh and p could be a big problem. However, any other parameterization does not seem to improve on that (things usually get worse).
Maybe, I should consider fixing one and cross-validate

  • About the first point on the exp(log_RichLin) and b_g = 0. Here I think the current version, even if a bit convoluted, is okay. I want log(b_g+RichLin), but I rather compute both therms on the log-scale (the first for domain>0 reasons, the second for numerical ones). Hence I write: log( exp(logb_g) + exp(logRichLin) ). That said, I could use log_sum_exp() here, as you suggested, for numerical stability. I will try!

  • I don’t know how I did not think about the non-centered parameterization for the AR. Thanks for pointing this out.

  • You are right on the priors. Most of the parameters I use are exponentiated, thus the priors I was using were extremely diffuse. I settled for some tighter ones (sd=1), and this helps a lot! (especially considering most of the parameters are in (0,1) generally)

  • I will definitely use log1p_exp() in the logRichlin() function. Thanks.

That is all for now.
Thanks again for your suggestions, I will keep you posted!

1 Like