Divergent transitions in hierarchical model

Chain 1: Iteration: 5 / 10 [ 50%]  (Warmup)

You’ll see lots of wild behaviour if you run Stan with a short warmup or none at all. Do these problems go away after a longer warmup?

This is normal. If the data are informative about that parameter, then it’s totally reasonable for the posterior to overwhelm the prior.

The only advice I can really give at this point is to simplify the model. As I understand it, there are two pieces:

p(X_X | \theta_X)

where X_X is the covariate history, with parameters \theta_X, and

p(X_T | \theta_T, X_X)

where X_T are the failure times and \theta_T are the parameters attached to that model.

So is there a way to just do the first model (p(X_X | \theta_X)) for now?

I have considered a simpler model (one without the \theta_X part) and recovered its parameters. That was when I only had the parameters \theta_T and treated the covariate history as constant and known at all times.

The problem does seem to lie in the \theta_X = (\eta, \sigma_1, \sigma_2, \rho, \sigma) parameter vector. So, I will attempt to recover those parameters without considering the \theta_T part of the model.

I am not sure. I imagine that the initial proposals for sig will still be very large. The code runs so quickly that I cannot see what is being printed.


I considered a simpler model, which has the following form:

X_i(t_{ij}) = \eta + w_{0i} + w_{1i}\log(t_{ij}) + \varepsilon_{ij},

where \varepsilon_{ij} \sim N(0, \sigma^2) and w_i = (w_{0i}, w_{1i})^\top \sim N(0, \Sigma_w), with

\Sigma_{w} = \begin{pmatrix} \sigma^2_1 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma^2_2 \end{pmatrix}.

In other words, X_i(t_{ij}) \sim N(\eta + w_{0i} + w_{1i}\log(t_{ij}), \; \sigma_1^2 + \rho\sigma_1\sigma_2\log(t_{ij}) + \log(t_{ij})[\rho\sigma_1\sigma_2 + \sigma^2_2\log(t_{ij})] + \sigma^2).
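
For reference, the two cross terms in that variance combine, so the variance is just a quadratic in \log(t_{ij}):

\sigma_1^2 + 2\rho\sigma_1\sigma_2\log(t_{ij}) + \sigma_2^2\log^2(t_{ij}) + \sigma^2.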

I simulated a small dataset from this model. This dataset contains 50 units (i = 1,\dots,50), and each unit has 100 measurements (j = 1,\dots,100). The true parameter values are

\eta = 1, \sigma_1 = 0.46, \sigma_2 = 0.22, \rho = 0.20, \sigma = 0.05.
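
For concreteness, here is a minimal sketch of how such a dataset could be generated in Stan itself. This is my reconstruction, not the code actually used: the true values are hard-coded, the times tt and unit indices w_ind are assumed to be supplied as data, and the program would be run with the fixed-parameter algorithm (e.g. rstan's algorithm = "Fixed_param").

data {
  int<lower=1> n;      // number of units
  int<lower=1> N;      // total number of measurements
  real<lower=0> tt[N]; // all measurement times
  int w_ind[N];        // unit index for each measurement
}
generated quantities {
  matrix[2, n] w_sim;  // simulated random effects
  real y_sim[N];       // simulated use-rates
  {
    matrix[2, 2] L;    // Cholesky factor of Sigma_w at the true values
    L[1, 1] = 0.46;                          // sigma_1
    L[1, 2] = 0;
    L[2, 1] = 0.20 * 0.22;                   // rho * sigma_2
    L[2, 2] = 0.22 * sqrt(1 - square(0.20)); // sigma_2 * sqrt(1 - rho^2)
    for (i in 1:n) {
      vector[2] z;
      vector[2] wi;
      z[1] = normal_rng(0, 1);
      z[2] = normal_rng(0, 1);
      wi = L * z;       // correlated random effects for unit i
      w_sim[1, i] = wi[1];
      w_sim[2, i] = wi[2];
    }
    for (k in 1:N)
      y_sim[k] = normal_rng(1 + w_sim[1, w_ind[k]]
                              + w_sim[2, w_ind[k]] * log(tt[k]), 0.05);
  }
}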

I used the following Stan code:

data {
  int<lower=1> n;      // Sample size (number of units)
  int<lower=1> N;      // Total number of times considered (all times for all units)
  real<lower=0> tt[N]; // All times for all units
  real y_tt[N];        // All of the use-rates
  int w_ind[N];        // Indices for w
}
parameters {
  real<lower=0> sig;
  real<lower=0> sig1;
  real<lower=0> sig2;
  real<lower=-1, upper=1> rho;
  real eta;
  matrix[2, n] w;
}
transformed parameters {
  // Covariance matrix of the random effects and its Cholesky factor
  cov_matrix[2] Sigma;
  matrix[2, 2] L;
  matrix[2, n] w2;

  Sigma[1, 1] = sig1^2;
  Sigma[1, 2] = rho * sig1 * sig2;
  Sigma[2, 1] = rho * sig1 * sig2;
  Sigma[2, 2] = sig2^2;

  // Cholesky decomposition; w2 = L * w gives the correlated random effects
  L = cholesky_decompose(Sigma);
  w2 = L * w;
}

model {
  // Per-observation mean and standard deviation of the mixed effects model
  real Mu[N];
  real Sig[N];

  for (i in 1:N) {
    Mu[i] = eta + w2[1, w_ind[i]] + w2[2, w_ind[i]] * log(tt[i]);
    // Sig is the marginal standard deviation at time tt[i]
    Sig[i] = sqrt(sig1^2 + log(tt[i]) * (2 * rho * sig1 * sig2 + sig2^2 * log(tt[i])) + sig^2);
  }

  // Likelihood
  target += normal_lpdf(y_tt | Mu, Sig);

  // Priors
  eta ~ normal(1, 0.1);
  to_vector(w) ~ normal(0, 1);
  sig ~ normal(0.05, 0.1);
  sig1 ~ normal(0.46, 0.1);
  sig2 ~ normal(0.22, 0.1);
  rho ~ normal(0.2, 0.1);
}

I also set the initial values (using one chain) to be the true values. I have posted the output below.

Inference for Stan model: covariate_history.
1 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=500.

                    mean    se_mean         sd          2.5%           25%           50%           75%         97.5% n_eff      Rhat
sig           0.00147989 0.00004558 0.00109131    0.00008564    0.00056554    0.00131792    0.00210129    0.00407314   573 1.0050062
sig1          0.12062747 0.00025085 0.00375407    0.11334060    0.11832019    0.12053361    0.12286424    0.12849077   224 0.9980990
sig2          0.03797979 0.00005621 0.00089803    0.03612970    0.03738746    0.03803156    0.03854388    0.03971250   255 0.9980047
rho          -0.64260523 0.00136347 0.01910576   -0.67802336   -0.65486925   -0.64397021   -0.62984465   -0.60499704   196 0.9986664
eta           0.94631716 0.00090889 0.01255101    0.92417353    0.93723272    0.94619426    0.95472676    0.97200233   191 0.9980701
Sigma[1,1]    0.01456505 0.00006055 0.00090766    0.01284609    0.01399967    0.01452835    0.01509562    0.01650988   225 0.9980670
Sigma[1,2]   -0.00295015 0.00001621 0.00023883   -0.00345336   -0.00308949   -0.00294962   -0.00279064   -0.00249462   217 0.9981057
Sigma[2,1]   -0.00295015 0.00001621 0.00023883   -0.00345336   -0.00308949   -0.00294962   -0.00279064   -0.00249462   217 0.9981057
Sigma[2,2]    0.00144327 0.00000423 0.00006811    0.00130536    0.00139782    0.00144640    0.00148563    0.00157708   259 0.9980188
L[1,1]        0.12062747 0.00025085 0.00375407    0.11334060    0.11832019    0.12053361    0.12286424    0.12849077   224 0.9980990
L[1,2]        0.00000000        NaN 0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000   NaN       NaN
L[2,1]       -0.02442087 0.00008574 0.00125753   -0.02688328   -0.02521050   -0.02444180   -0.02360880   -0.02193476   215 0.9983378
L[2,2]        0.02907222 0.00001766 0.00034375    0.02838163    0.02884897    0.02907822    0.02929589    0.02969408   379 1.0009660
lp__       3641.15104392 0.44610895 6.74073912 3626.35592137 3637.30179038 3641.17944491 3645.82128203 3653.33882154   228 0.9990915

The samples are, once again, not near the true values (apart from eta). After putting Sigma, L and w2 in the transformed parameters block, I am able to see these parameters in the Stan output.

It appears that something is going wrong with the Cholesky decomposition. There are NaN values in the L[1,2] row of the output.

In addition, I used the parameter samples for \eta, \sigma_1, \sigma_2, \rho, and \sigma to make some predictions. Specifically, I forecast the covariate process of a new unit that has just joined the cohort, and compared these forecasts to forecasts made with the true parameter values.

I have attached a plot of the forecasts:

forecasts.pdf (7.3 KB)

The forecasts from the Stan samples are much narrower compared to the forecasts from the true parameter values.

Aside: when writing

sqrt(vector)

does Stan take the square root of each element of the vector?

Those NaNs are for the standard error and effective sample size only. The value never varies (L[1,2] is fixed at zero), which confuses the sampling diagnostics.
You can avoid the decomposition by constructing the factor directly

transformed parameters {
  cov_matrix[2] Sigma;
  matrix[2, 2] L;
  matrix[2, n] w2;
  // Build the Cholesky factor of Sigma directly from the parameters
  L[1, 1] = sig1;
  L[1, 2] = 0;
  L[2, 1] = rho * sig2;
  L[2, 2] = sig2 * sqrt(1 - square(rho));
  Sigma = tcrossprod(L); // Sigma = L * L'
  w2 = L * w;            // correlated random effects
}
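
(A side note on the design: with to_vector(w) ~ normal(0, 1) in the model block, the columns of w2 = L * w are marginally N(0, \Sigma), which is the standard non-centered parameterization. Building L directly also avoids the positive-definiteness check inside cholesky_decompose.)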

Yes, it’s element-wise.
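
A quick illustration (the variable names are just for the example):

vector[3] v = [4.0, 9.0, 16.0]';
vector[3] s = sqrt(v); // s = (2, 3, 4)'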


Ah yes, L is a lower triangular matrix. Thanks!

Is the problem here that the covariance of w is being included twice?

I think you either sample w, in which case your model looks like:

X_i(t_{ij}) \sim N(\eta + w_{0i} + w_{1i}\log(t_{ij}), \sigma^2)
w_i \sim N(0, \Sigma_{w})

Or you integrate out your w's, in which case your likelihood looks like:

X_i(t_{ij}) \sim N(\eta, \text{something} + \sigma^2)

Presumably you should do the second, because then you don't have to worry about the w's. I'd just double-check the indexing.

But it might not be possible in your expanded model.

Anyway, in just this limited case, I think what's happening is: in your first model p(X \mid w) is normal and p(w) is normal as well, so you know that together they're jointly normal (you can look this up; it's a special property of the normal, not something that works in general).

To integrate out w you want the expression for the jointly normal p(X, w). You can get this by working backwards from here: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions. I think what you wrote out (X_i(t_{ij}) \sim N(\eta, \sigma_1^2 + \rho\sigma_1\sigma_2\log(t_{ij}) + \log(t_{ij})[\rho\sigma_1\sigma_2 + \sigma^2_2\log(t_{ij})] + \sigma^2)) is probably right, but double-check.

Because that ends up being one big multivariate normal distribution, integrating out w just means taking the parts of the mean and covariance that involve only X (https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Marginal_distributions).
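
Concretely, for this model (writing z_{ij} = (1, \log(t_{ij}))^\top, notation introduced just for this step):

w_i \sim N(0, \Sigma_w) and X_i(t_{ij}) \mid w_i \sim N(\eta + z_{ij}^\top w_i, \sigma^2) together give the marginal

X_i(t_{ij}) \sim N(\eta, \; z_{ij}^\top \Sigma_w z_{ij} + \sigma^2),

and z_{ij}^\top \Sigma_w z_{ij} = \sigma_1^2 + 2\rho\sigma_1\sigma_2\log(t_{ij}) + \sigma_2^2\log^2(t_{ij}), which matches the expression above.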


It is interesting. My first attempt at this problem used

model {
  // Likelihood
  target += normal_lpdf(y_tt | Mu, sig);
}

(notice sig and not Sig). This code returned the correct parameter values. But I thought I had forgotten part of the variance, and so I rewrote the code to

model {
  // Likelihood
  target += normal_lpdf(y_tt | Mu, Sig);
}

Going back to the older code (which is also what you suggested), it does return the right values.
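
For completeness, here is a sketch of the full corrected model block in the sampled-w (conditional) form, assuming the data, parameters, and transformed parameters blocks stay as posted above:

model {
  real Mu[N];

  // Conditional mean: the random effects enter the mean, so the
  // observation noise is just sig (not the marginal sd Sig)
  for (i in 1:N)
    Mu[i] = eta + w2[1, w_ind[i]] + w2[2, w_ind[i]] * log(tt[i]);

  // Likelihood
  target += normal_lpdf(y_tt | Mu, sig);

  // Priors
  eta ~ normal(1, 0.1);
  to_vector(w) ~ normal(0, 1); // standard normals; w2 = L * w has covariance Sigma
  sig ~ normal(0.05, 0.1);
  sig1 ~ normal(0.46, 0.1);
  sig2 ~ normal(0.22, 0.1);
  rho ~ normal(0.2, 0.1);
}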

The marginal distribution for X_i(t_{ij}) is given by

X_i(t_{ij}) \sim N(\eta, \sigma_1^2 + \rho\sigma_1\sigma_2\log(t_{ij}) + \log(t_{ij})[\rho\sigma_1\sigma_2 + \sigma_2^2\log(t_{ij})] + \sigma^2),

and the conditional distribution is given by

X_i(t_{ij}) \mid w_i \sim N(\eta + w_{0i} + w_{1i}\log(t_{ij}), \sigma^2),

and the posterior distribution is given by

p(\theta, w \mid X) \propto \left[\prod_{i=1}^n p(X_i \mid w_i, \theta)\, p(w_i \mid \theta)\right] p(\theta),

which requires the conditional distribution for X_i, and the priors for w_i and \theta = (\eta, \sigma_1, \sigma_2, \rho, \sigma).

Okay, I think the problem is solved.

I have accepted your solution. Thank you for the help. I appreciate it.