Divergent transition fitting truncated gamma

I am trying to translate this model from INLA to Stan and evolve it to adapt to my dataset (hurdle model with a spatial Gaussian process).
I will tackle my model step-by-step.
Since my data are truncated, I am now trying to fit the parameters of a Y sample drawn from a censored Gamma distribution with \alpha (shape) and the \beta (rate parameter) written in terms of \mu = E[Y]:
\beta = \alpha/ \mu
because of \mu is needed to later model the GP. For now, \mu is written as a simple linear model : \mu=\beta_0 + \beta_1* X
I start with this toy dataset:

set.seed(999)

N = 100
X = runif(N, 0, 2)

alpha = 10
beta0 = 0.2
beta1 = 0.5
mu = beta0 + beta1 * X

Y = rgamma(N, rate = alpha / mu, shape = alpha)

U = 1
L = 0.25

Y[Y < L] = L
Y[Y > U] = U

I coded the Stan model as:

data {
    int<lower = 1> N;
    real<lower = 0> L;
    real<lower = L> U;
    vector<lower = L, upper = U>[N] Y; // dependent variable
    vector[N] X; // independent variable

}

parameters {
    real beta0;
    real beta1; 
    real<lower = 0> alpha; //shape parameter
}

transformed parameters {
    vector[N] rate = alpha * inv(beta0 + beta1  *X );
}

model {
    beta0 ~ std_normal();
    beta1 ~ std_normal();
    alpha ~ gamma(0.01, 0.01);
    for (n in 1:N) {
         Y[n] ~ gamma(alpha, rate[n]) T[L, U];
    }
}

At the very beginning I get some errors (the following fits all):

Chain 2: Rejecting initial value:
Chain 2:   Error evaluating the log probability at the initial value.
Chain 2: Exception: gamma_lpdf: Inverse scale parameter is -2.10891, but must be > 0!  (in 'model77e750880e39_Gamma_regression_truncated' at line 26)

with, at the end:

Warning messages:
1: There were 61 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

So, here are my questions:

  • is it a better way to formulate the model and amend the divergent transitions?
  • I use a Gamma distribution because of my data (hourly rainfall) must positive. Do you think a truncated normal distribution could be more stable to sample from?

Finally, here’s the pairs plot:

Thanks in advance

1 Like

If it understand it correctly, “truncated” normal is not the same as “half-normal”. Truncation is an assumption about the data. E.g. the range of the data is actually -100 to +100 but it’s truncated at zero so I only get to see 0-100. However, what you’re fitting when using Stan’s truncation feature is the conventional distribution not some folded or limited version of it.

Meanwhile, the half normal distribution is literally a folded, non-symmetric normal distribution. So far as I know Stan doesn’t have a half-normal distribution out of the box. However, looking at the wikipedia page for it I think it’d be straightforward to write your own halfnormal_lpdf if you wanted to go that route.

Here is a little bit about truncation from the Stan documentation:

Also I find this really useful:

It has a whole list of distributions with support on [0,\infty)

1 Like

If the rate ends up being negative at initialization then your model is wrong. The combination beta0 + beta1*X needs to be restricted to be positive otherwise you’re running off the edge of parameter space and that breaks HMC.

Does the following work?

data {
    int<lower = 1> N;
    real<lower = 0> L;
    real<lower = L> U;
    vector<lower = L, upper = U>[N] Y; // dependent variable
    vector[N] X; // independent variable

}

parameters {
    real beta0;
    real beta1; 
    real<lower = 0> alpha; //shape parameter
}

transformed parameters {
    vector[N] rate = exp(beta0 + beta1 * X);
}

model {
    beta0 ~ std_normal();
    beta1 ~ std_normal();
    alpha ~ std_normal();
    Y ~ gamma(alpha, rate[n]) T[L, U];
}
1 Like

Looking at your model a bit. In your formula \mu=E[Y] is a scalar. Then \mu=\beta_0+\beta_1X and \frac{\mu-\beta_0}{\beta_1}=X which fits best when \frac{\mu-\beta_0}{\beta_1}=E[X]. So \beta_0=E[Y]-\beta_1E[X], which is one equation with two unknowns. So far as I can tell, this model is not well identified because any number of choices for \beta_0,\beta_1 may fit equally well.

thanks, @emiruz and @krzysztofsakrejda.

Definitely yes. When we have short-time, intense precipitation events we can see 2 physical phenomena that inform the measure:

  • considering the whole spatial domain, we have no precipitation in a subdomain (true 0) and precipitation > 0 in the remaining part. This will be accounted in the spatial hurdle model (actually solved with INLA)
  • the rain-gauge measures a false-zero if precipitation is lower than a threshold (depending on the rai-gauge model) up to a maximum value (depending on the rain-gauge volume). This should be accounted for the “truncation part”.

I used the Gamma distribution because of its positive support, but AFAIK there is no reason not to use another distribution.

I played around with my model, and in fact, it is so. The model is super-sensible to the priors I use.
With the following true values: \alpha=9, \beta_0 = 0.5, beta_1 = 1.5 using informative priors I get:

           mean   se_mean       sd       2.5%        25%        50%       75%
beta0 -0.2025142 0.8181889 1.202111 -1.8391312 -0.9389400 -0.5984882 0.1568568
beta1  1.4486941 0.9071915 1.326416  0.1566004  0.4759683  0.8317741 2.0635211
alpha  1.7304700 1.6061429 2.323058  0.2101610  0.3408593  0.4708945 1.5270991
         97.5%    n_eff     Rhat
beta0 2.152733 2.158647 3.962268
beta1 4.148511 2.137774 4.062345
alpha 6.974082 2.091952 4.830636
Warning messages:
1: There were 2823 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems
 
3: The largest R-hat is 1.97, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess.

@krzysztofsakrejda: I followed your advice and yes, now the model behaves better although

Rejecting initial value:
Chain 4:   Error evaluating the log probability at the initial value.
Chain 4: Exception: gamma_lpdf: Inverse scale parameter is inf, but must be finite!  (in 'model77e74486159a_Gamma_regression_truncated' at line 27)

Now my question is: since there is no theoretical need to prefer a gamma distribution over other distributions, do you think it is a good idea to use a censored normal distribution to fit my measure?

I’d let what fits better and has more theoretical basis be your guide. Consider trying both.

I would worry however that your model isn’t identifiable. If B0 and B1 are important parameters going forward you’re baking in an element of arbitrariness . Further, you may have trouble fitting your model at all as the amount of data increases.

I think this is not correct - the Stan’s truncation feature will take the original distribution and rescale it so that it integrates to 1 over the truncated range. Taking a normal (centered at 0 for simplicity) truncated to range (0, Inf), the reduced range integrates to 0.5 so Stan will renormalize the PDF by a factor of 2, which gets you exactly the formula for a half normal PDF.

I believe this is better modelled as censored data - the difference can be subtle, but might also be very important :-)

I would say Gamma or log-normal are the most common choices for data with positive support, so all else being equal I would start with those. The relevant Wiki pages often provide you with ways to interpret parameters/meaning of a distribution - does any of that match your understanding of the phenomena? (e.g. log normal occurs when your error is multiplicative). After that, posterior predictive checks are IMHO the way to go.

Hope that helps and clarifies more than confuses :-)
Also thanks to @emiruz and @krzysztofsakrejda for their help!

1 Like

Thank you for the correction. So I now understand that in Stan truncated data is handled by truncated distributions. I.e. the feature isn’t there to make it possible to use truncated data with non truncated distributions. Actually I see now that the latter case could be handled by treating the data as censored instead.

1 Like

I have to figure out if truncation or censoring work with my rainfall data.
The upper limit is clearly censored.
I have some doubts about the lower limit. If I have 0.8 mm of real precipitation and the detection threshold is set to 1 mm, then the recorded value is set to 0: a “false 0”.
In addition, I have "true zero"s in the spatial domain, where no precipitation occurs and I do not know a-priori how many “false 0” and “true 0” I have. I only know upper and lower limits for my data. So is 1 mm a truncation or a censoring limit when dealing whit a semi-continuous model?

I believed I could expand my model where the “true 0” could be accounted by an Hurdle model translating the INLA example. Doing this scares me now a bit, but I have tackle this at a certain point.
In the meantime I want to experiment with censoring/truncation.
I started from this simple example:


N = 100

sigma = 0.5
beta0 = 2.00
beta1 = 1.00

X = runif(N, 0, 1)
mu = beta0 + beta1 * X

Y = rnorm(N, mean = mu, sd = sigma)

U = quantile(Y, prob = 0.9)
L = quantile(Y, prob = 0.1)

idx_L = which(Y < L)
N_cens_L = length(idx_L)

idx_U = which(Y > U)
N_cens_U = length(idx_U)

Y_obs = Y[-c(idx_L, idx_U)]
N_obs = length(Y_obs)
X = X[-c(idx_L, idx_U)]

I tried to fit \sigma, \beta_0 and \beta_1 with the following model

data {
    int<lower = 0> N_obs;
    int<lower = 0> N_cens_L;
    int<lower = 0> N_cens_U;
    real L;
    real U;
    vector[N_obs] Y_obs;
    vector[N_obs] X;

}

parameters {
    real beta0;
    real beta1; 
    real<lower = 0> sigma;
}

transformed parameters {
    vector[N_obs] mu = beta0 + beta1 * X;
}

model {
    beta0 ~ std_normal();
    beta1 ~ std_normal();
    sigma ~ std_normal();
    target += normal_lpdf(Y_obs | mu, sigma);
    target += N_cens_L * normal_lcdf(L | mu, sigma);
    target += N_cens_U * normal_lccdf(U | mu, sigma);
}

while \beta_0 and \beta_1 are good, \sigma is really far away from the true value:

         mean     se_mean        sd       2.5%          25%       50%
beta0 2.1544023 0.008035144 0.3384525  1.4964456  1.918614083 2.1562701
beta1 0.4093501 0.014395205 0.5989443 -0.7532349 -0.001578269 0.4083911
sigma 7.2906466 0.010005218 0.4686896  6.3980343  6.960850901 7.2833842
            75%    97.5%    n_eff     Rhat
beta0 2.3864785 2.827676 1774.223 1.000516
beta1 0.8172559 1.580011 1731.160 1.000455
sigma 7.5976424 8.263921 2194.409 1.00121

What I am missing? I followed pedantically the manual

The way I like to think about this is:

  • censoring: a value is drawn from the full distribution, when the value is outside the bounds, the value at the bound is reported
  • truncation: a value is drawn from the full distribution, when the value is outside the bounds, it is discarded and a new value is drawn - repeat until the value fits within the bounds

So I would say you are observing is censoring. And the censoring bound is the detection limit (so 0 actually means < 1mm). Hurdle models are AFAIK closely related, but I don’t understand them enough to be of much help on the details.

I see several possible problems with your code. In order of importance as I perceive it:

  1. target += N_cens_L * normal_lcdf(L | mu, sigma); Is very weird - the normal_lcdf is vectorized and will compute the sum of lcdf values for all the (80) mus
  2. You hide the X values of the censored observations from the model, although those are informative.

I would guess you want to compute mu_cens_L and mu_cens_R vectors using the respective X values and then have target += normal_lcdf(L | mu_cens_L).

Does that make sense?

Hey, how did this work out? I am always curious about these Gamma distribution models and their performance.