Divergent transitions in hierarchical model

Can anyone suggest what I could change with the below code to make it more stable.

Problem: I have n=1800 units and n_obs = 69 of them fail, and n_cens = 1731 are still working (and so are right-censored). I have some simulated data and have obtained estimates for the true values via maximum likelihood. I would now like to obtain these estimates by using a Bayesian approach.

I have tried many times with vague priors with no success. The error message I keep getting is x divergent transitions after warmup. I have now placed a very narrow prior on sig (around its true value), but I still get divergent transitions.

I have tried increasing adapt-delta, lowering stepsize, and increasing max_treedepth, but I still get divergent transitions.

rho, sig1, sig2, beta, eta, mu0, and sigma0 are all unknown parameters; however, I inputted them as data to try and find the source of the divergent transitions.

When running the code with rho, sig1, sig2, beta, eta, mu0, and sigma0 (and sig) treated as unknown parameters, my code took 4 days to run with “control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20” and 2000 iterations with one chain. I still had divergent transitions.

I have come to the conclusion (from reading similar posts) that my model must have something fundamentally wrong with it, but I just cannot see what it is. Could it be that I am defining my own matrix and not using cov_matrix or something? The matrix I am entering is positive definite so I doubt that is the case.

Perhaps the parameterization of the Weibull distribution that I am using is causing this to be unstable? But I have used this parameterization before and it has been fine (not in a hierarchical model).

I have 1800 units and 1-70 observations for each unit. For example, unit one has 35 observations, and unit 2 has 26. I enter the data in the following form, tt = 1,2,…,35, 1,2,…,26, w_ind = 1,1,1,…,1, 2,2,…,2 (35 1’s and 26 2’s), y_tt = y_1,1, y_1,2, y_1,3, …,y_1,35, y_2,1,… (I added this paragraph to show that the data was entered correctly and is not the issue).

 data {
int<lower=1> n; // Sample size
int<lower=1> n_cens; // Right censored sample size
int<lower=1> n_obs; // Failed unit sample size
int<lower=1> N; // Total number of times considered (all times for all units)
real<lower=0> tt[N]; //All times for all units
real y_tt[N]; //All of the use-rates
int w_ind[N]; //Indices for w
vector[2] w_M; //Mean vector for w
vector[n_obs] x_DFD; //Use rate at failure-times for the units that failed
real <lower = 0> sig1;
real <lower = 0> sig2;
real <lower = -1, upper = 1>rho;
real <lower = 0> mu0;
real <lower = 0> sigma0;
real eta;
real Beta;
vector[n_obs] u_obs;
vector[n_cens] u_cens;
matrix[2,2] I2; //Covariance matrix for w
}
parameters {
real <lower = 0> sig;
vector[2] w[n]; // A length-n array of length 2 vectors
}
model {
//Defining covariance matrix, cholesky decomposition and mean and variance of mixed effects model
matrix[2,2] Sigma;
matrix[2,2] L;
real Mu[N];
real Sig[N];
vector[2] w2[n]; // A length-n array of length 2 vectors
//Covariance matrix
Sigma[1,1] = sig1^2;
Sigma[1,2] = rho*sig1*sig2;
Sigma[2,1] = rho*sig1*sig2;
Sigma[2,2] = sig2^2;
//Cholesky decomposition
L = cholesky_decompose(Sigma);
for(i in 1:n){
w2[i] = L*w[i];
}
//Parameters used in mixed effects model
for(i in 1:N){
Mu[i] = eta + w2[w_ind[i]][1] + w2[w_ind[i]][2]*log(tt[i]);
Sig[i] = sig1^2 + log(tt[i])*(2*rho*sig1*sig2 + sig2^2*log(tt[i])) + sig^2;
}
//Likelihood
target += weibull_lpdf(u_obs| 1/sigma0, exp(mu0));
target += Beta*x_DFD;
target += weibull_lccdf(u_cens|1/sigma0, exp(mu0));
target += normal_lpdf(y_tt|Mu, Sig);
// Prior:
w ~ multi_normal(w_M, I2);
sig ~ gamma(0.05, 1);
}


Diagnostic plots for when I ran the full model with all parameters (i.e. when I did not input rho, sig1, sig2, beta, eta, mu0, and sigma0). p.pdf (32.6 KB) p2.pdf (36.7 KB) p3.pdf (37.5 KB)

w ~ multi_normal(w_M, I2);

w2[i] = L*w[i];


What’s the relationship between w and w2?

target += Beta*x_DFD;


What’s the Beta*x_DFD term?

Sig[i] = sig1^2 + log(tt[i])*(2*rho*sig1*sig2 + sig2^2*log(tt[i])) + sig^2;

target += normal_lpdf(y_tt|Mu, Sig);


normal_lpdf is parameterized in terms of standard deviation, not variance. (multi_normal is in terms of covariance – so it’s a bit different)

1 Like

So, instead of drawing from multi_normal(w_M, Sigma), [w_M = c(0,0)], I draw from a multi_normal(w_M, I2), where I2 is the 2 \times 2 identity matrix, and multiply these draws by the Cholesky decomposition of Sigma. I.e. just a multivariate reparameterizations.

The Beta*x_DFD term is part of the likelihood. The likelihood for the model I am consider is given by

L(\theta_T, \theta_X \mid \text{Data}) = L(\theta_T \mid \text{Failure-time Data, Covariate History}) \times L(\theta_X \mid \text{Covariate History}) where

L(\theta_T \mid \text{Failure-time Data, Covariate History}) \\ = \prod_{i=1}^n\{\exp[\beta x_i(t_i)]f_0(u[t_i;\beta, x_i(t_i)], \theta_0)\}^{\delta_i} \times \{1 - F_0(u[t_i;\beta, x_i(t_i)], \theta_0)\}^{1-\delta_i},

and
L(\theta_X \mid \text{Covariate History}) = \prod_{i=1}^n\bigg\{\prod_{t_{ij} \leq t_i} f_{\text{NOR}}[x_i(t_{ij}) - \eta - Z_i(t_{ij})w_i; \sigma^2]\bigg\}

\delta_i is 1 for the units that fail and 0 for the censored units. Hence the likelihood can be written as

\prod_{\text{failed units}}\{\exp[\beta x_i(t_i)]f_0(u[t_i;\beta, x_i(t_i)], \theta_0)\} \times \prod_{\text{censored units}} \{1 - F_0(u[t_i;\beta, x_i(t_i)], \theta_0)\}.

Stan takes the log likelihood. The log-likelihood is given by

\sum_{\text{failed units}}\{\beta x_i(t_i) + f_0(u[t_i;\beta, x_i(t_i)], \theta_0)\} + \sum_{\text{censored units}} \{1 - F_0(u[t_i;\beta, x_i(t_i)], \theta_0)\},

where f_0 and F_0 are the pdf and cdf of the Weibull distribution. In addition, the log-likelihood for the L(\theta_X \mid \text{Covariate History}) is just the sum of the logs of the normal densities.

Ah, thanks, I will take the square root of Sig in the likelihood.

Oh okay. That looks right. It might be more convenient to do:

matrix[2, n] w;

matrix[2, n] w2;

w2 = L * w;

to_vector(w) ~ normal(0, 1);


From the traceplots it looks like sig is the thing that’s most weird. Any chance it’s in the wrong place or anything?

1 Like

I checked my calculations and I think sig is in the right place.

For now, I have changed

target += normal_lpdf(y_tt|Mu, Sig);


to

target += normal_lpdf(y_tt|Mu, sqrt(Sig));


This could make a huge difference (I hope).

Regarding

to_vector(w) ~ normal(0, 1);


Does this create n 2-vectors, with each element drawn from a standard normal distribution? Then I guess

to_row_vector(w) ~ normal(0, 1);


would create 2 n-vectors, with each elements drawn from a standard normal distribution (I do not want this, but I am just curious about how the syntax works).

Perhaps

to_vector(w) ~ std_normal;


would be slightly better, too.

It just flattens the matrix into a vector and puts a normal(0, 1) prior on everything. to_row_vector would have the same effect.

Hopefully, but I doubt it. There’s probably something wrong still.

Next thing is to probably check correlations in things. In particular, since sig is going crazy, is it correlated with any of the ws?

By the way, in the model you showed, only sig and w are being estimated. Does this still fail?

1 Like

By the way, in the model you showed, only sig and w are being estimated. Does this still fail?

Yes, the diagnostics plot for sig looks like it did in the diagnostic plots that I showed. Although, when I increased adapt_delta, and max_treedepth, and lowered stepsize, the model still failed, but the diagnostic plot did not look as bad (I need to wait for my code to finish running to show this), and this may imply that I just need to run for more iterations.

Next thing is to probably check correlations in things. In particular, since sig is going crazy, is it correlated with any of the ws?

The covariate process model that includes sig and the w’s is

X_i(t_{ij}) = \eta + Z_i(t_{ij})w_i + \epsilon_{ij},

where \eta is the mean, Z_i(t_{ij}) = [1, \log(t_{ij})], w_i = (w_{0i}, w_{1i})' \sim N(0,\Sigma_{w}), \epsilon_{ij} \sim N(0, \sigma^2), and

\begin{pmatrix} \sigma^2_1& \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma^2_2 \end{pmatrix}.

It is assumed that cov(w_i, \epsilon_{ij}) = 0 \forall i,j.

Or are you asking me to check (from the Stan output) if sig is correlated with any of the w’s and this may (somehow) be causing me issues?

Since you’ve had no luck with these parameters so far, it’ll be fair to just leave them at their defaults for now. Any divergences you see might help you figure out where the problem is.

Yeah, the output. If sigma is acting crazy and we don’t know why, then maybe something else is acting crazy and from there we can work backwards to identify the problem in the model.

1 Like

Yeah, the output. If sigma is acting crazy and we don’t know why, then maybe something else is acting crazy and from there we can work backwards to identify the problem in the model.

Okay, I will check this.

Just for reference, I have attached a diagnostics plot after running the code in the first post, with 2000 iterations and the default control parameters, but with

target += normal_lpdf(y_tt|Mu, Sig);


replaced with

target += normal_lpdf(y_tt|Mu, sqrt(Sig));


There were 17 divergent transitions after warmup.

Diagnostics plot: pp.pdf (30.8 KB)

Also this is a super super sharp spike near zero. Any chance you could replace this with something like normal(0, 0.05) or something just in case there’s something weird with the gamma?

1 Like

Yes, okay. I will try that.

My initial thought was, since sig cannot be less than zero, I should use a distribution with support (0, \infty). But, I have recently seen some models with normal priors on standard deviations.

I guess since

real <lower = 0> sig;


Stan will reject any proposed value that is less than zero, but may signal a warning?

real <lower = 0> sig;


Stan will use a transform from (-inf, inf) to (0, inf) to avoid any need for rejections, etc: https://mc-stan.org/docs/2_21/reference-manual/variable-transforms-chapter.html

1 Like

After running the code with (for 2000 iterations)

sig ~ normal(0, 0.05);


it appears that sig has converged. The mean of the returned samples, however, is approximately sig^2 and not sig. The true value of sig is 0.05, and hence the true value of sig^2 is 0.0025 or 2.5e-0.3. I was not expecting this because I define sig in the parameter block and not sig^2. I also use sig^2 in the Sig equation and not sig.

I have attached a diagnostics plot. I have also included the sample quantiles for sig.

              mean   se_mean          sd          2.5%           25%           50%           75%         97.5% n_eff     Rhat
sig   2.344140e-03 0.0000548  0.00184017  9.868000e-05  8.245100e-04  2.001670e-03  3.382450e-03  6.721040e-03  1128 1.000072


Diagnostics plot:converge.pdf (33.0 KB)

I performed a similar analysis, but this time I treated rho as an unknown parameter and treated sig as constant.

From a diagnostic plot it appears that rho has converged. The samples, however, are not close to the true value. I have attached a diagnostics plot. I have also included the sample quantiles for rho. The true value of rho is 0.2.

             mean    se_mean          sd         2.5%          25%          50%          75%        97.5% n_eff     Rhat
rho    -0.9437974 0.00003636  0.00038049   -0.9445539   -0.9440402   -0.9437844   -0.9435419   -0.9431056   109 1.004652


rho.pdf (23.5 KB)

One thought I have had is that (a reminder below)

L(\theta_T, \theta_X \mid \text{Data}) = L(\theta_T \mid \text{Failure-time Data, Covariate History}) \times L(\theta_X \mid \text{Covariate History}),

where \theta_T = (\beta, \mu_0, \sigma_0), and \theta_X = (\rho, \sigma, \sigma_1, \sigma_2, \eta). That is, the covariate data (y_tt, and x_DFD in code) was first generated using the \theta_X parameters, and then the failure-time data (u_obs and u_cens) was generated afterwards. The failure-time data does not depend on \theta_X and the covariate data does not depend on \theta_T. I estimated these parameters separately using MLE and recovered the true parameter values.

However, even though I have programmed independent likelihood functions and independent priors in Stan, the joint posterior for (\theta_T, \theta_X) will introduce some dependencies between the \theta_T and \theta_X parameters, and hence the marginal samples will not be the same as they would be assuming complete independence. I think this makes sense but I am not sure if it would cause rho to be so far from the true value.

I know that

p(X \mid D)p(Y \mid D) \propto p(D \mid X)p(D \mid Y)p(X)p(Y)

if the prior and the likelihood functions are separable. But, I am not sure if Stan is somehow causing the parameters to be correlated.

Edit: The same argument applies when treating sig1 as an unknown parameter and fixing all other parameters. I have attached a diagnostic plot for sig1, and the quantiles of the samples obtained. The true value for sig1 is 0.46.

              mean    se_mean          sd          2.5%           25%           50%           75%         97.5% n_eff      Rhat
sig1  1.463012e-01 0.00003008  0.00090603  1.445781e-01  1.457293e-01  1.462948e-01  1.469484e-01  1.480195e-01   907 0.9982059


Diagnostic plot for sig1: sig1.pdf (25.4 KB)

Wait I wasn’t reading this correctly before. This is the posterior right? Why is it parameters given data? I think what we actually have is:

p(\text{Data} | \theta_T, \theta_X) p(\theta_T)p(\theta_X)

And we’ve plugged this all into Stan to generate samples from:

p(\theta_T, \theta_X | \text{Data})

Is this in line with the notation you’re using?

Since both \theta_T and \theta_X go into generating the data, the data will inform both of them. There can be all sorts of interactions here. Presumably they’d be driven by situations where the data can be either explained by a certain choice of \theta_T or a certain choice of \theta_X. That’s all baked into Bayes rule and we don’t have any control here.

Yes, that is what we have.

L(\text{Data} \mid \theta_T, \theta_X) \equiv p(\text{Data} \mid\theta_T, \theta_X),

both notations are used for the likelihood function.

This is interesting. I will attempt to run the model adding more and more parameters and see if I get any divergent transitions. I could simulate some data from Stan and see if it looks like the actual data, to see if the parameters Stan outputs are reasonable.

Is Stan case sensitive? Will it know Sig and sig are different?

Also, could there be an issue with 1/sigma0 in the Weibull likelihoods? I.e. when a proposed value close to zero is chosen.

Well I think going to simpler model and less data is where you want to go :D. More divergences is just more confusion.

Yes

These are different

Is sigma0 estimated? How small is sigma0?

Okay, yes. I will try and simulate some data with a simpler model with one parameter to begin with.

Currently I am just substituting sigma0 with its true value (so this should not be an issue for now), but I will need to estimate it when I can sort the issues with the simpler models. The true value of sigma0 is 8.2, but I was wondering if certain proposed values of sigma, during estimation, could cause divergent transitions.

Got it. I’d guess no, but I could be wrong. I don’t have experience with Weibull’s. This looks like a different situation than what gives rise to divergences with regular hierarchical models though.

Update: I have calculated the correlation between sig and each component of each random effects vector (there are 1800 random effects vectors, w = (w_1, w_2)). This could indicate that sig = f(w_1) or sig = f(w_2).

I could fit a linear regression to see if sig = f(w_1, w_2).

I also calculated the correlation between each random effects vector, i.e. cor(w_{i1}, w_{i2}), for i = 1, \dots, 1800. This could indicate that w_2 = f(w_1).

The range of correlations between sig and the components of the random effects vectors is [-0.1091405, 0.1046078], with an average correlation of 0.0001907212.

The range of correlations between each random effects vector is [-0.1159629, 0.1389009], with an average correlation of -0.002469418.

These numbers lead me to believe that there is no issue with sigma being correlated with the random effects, and that the elements of the random effects are themselves not correlated.

I also tried putting a very narrow prior on sig (around its true value of 0.05)

model {
...
sig ~ normal(0.05, 0.01);
}


I also set the initial value of sig to be 0.05 so that the chain would start at the true value.

Most of the density under this prior is between 0.02 and 0.08, yet Stan still returns

              mean    se_mean          sd          2.5%           25%           50%           75%         97.5% n_eff      Rhat
sig   2.336760e-03 0.00004405  0.00176344  1.135100e-04  8.845000e-04  1.935080e-03  3.440960e-03  6.495730e-03  1602 0.9992026


This is the strangest result so far. How can Stan be giving most of the posterior density to regions that have close to zero density in the sig prior? There are no warnings, and the trace plots all seem to suggest the chains have mixed well.

Stan has no problem simulating mu0, Beta, and sigma0, when sig, sig1, sig2, rho, and eta are fixed (w is still a parameter in this scenario).

Edit: I decided to print sig after each iteration (I chose to do 10 iterations as I wanted to see what was happening at the beginning and the code was printing too quickly with a higher number of iterations). I choose the prior to be

  sig ~ normal(0.05, 0.01);


and I set the initial value to be 0.05. The printed results are shown below.

SAMPLING FOR MODEL 'reparameterized2_check' NOW (CHAIN 1).
Chain 1: sig: 0.05

Chain 1: sig: 0.05

Chain 1:
Chain 1: Gradient evaluation took 0.01 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 100 seconds.
Chain 1:
Chain 1:
Chain 1: WARNING: No variance estimation is
Chain 1:          performed for num_warmup < 20
Chain 1:
Chain 1: sig: 0.05

Chain 1: sig: 2.54598e+124

Chain 1: sig: 0.05

Chain 1: sig: 2.9852e+123

Chain 1: sig: 0.05

Chain 1: sig: 1.31739e+030

Chain 1: sig: 0.05
....


Stan seems to be proposing very large values for some reason. Also, since I chose 10 iterations, why are there many sig values between 50% and 60%? What are all of these values? I thought each iteration would contain one value. There are only multiple values for sig in the warmup phase. I guess this has something to do with Stan trying to get closer to the densest regions of the target distribution.

Chain 1: Iteration: 5 / 10 [ 50%]  (Warmup)
Chain 1: sig: 0.0584591

Chain 1: sig: 0.0585334

Chain 1: sig: 0.058548

...
Chain 1: sig: 1.20778
...

Chain 1: sig: 0.00448044

Chain 1: Iteration: 6 / 10 [ 60%]  (Sampling)
Chain 1: sig: 1.20778

Chain 1: sig: 0.488425

Chain 1: Iteration: 7 / 10 [ 70%]  (Sampling)
Chain 1: sig: 1.20778

Chain 1: sig: 0.473162

Chain 1: Iteration: 8 / 10 [ 80%]  (Sampling)
Chain 1: sig: 1.20778

Chain 1: sig: 0.484718

Chain 1: Iteration: 9 / 10 [ 90%]  (Sampling)
Chain 1: sig: 1.20778

Chain 1: sig: 0.48561

Chain 1: Iteration: 10 / 10 [100%]  (Sampling)
Chain 1: sig: 1.20778



There are a lot of values in-between 50% and 60%, including 1.20778. This value seems to be the starting value for all of the sampling iterations. Each sampling iteration has one proposal that appears to be rejected.