Max treedepth warnings in hierarchical multivariate model

Hi all. I’ve been working on the model below for a few days now as an attempt to implement the models described here - nested-domain multivariate hierarchical models. With simulated data the model is more or less recovering the correct answers; however, in its current form I’m getting max treedepth warnings on virtually every iteration. After increasing max_treedepth to 12, I’m still getting these warnings on 950 iterations out of 1000, although the traceplots look a bit better. It’s also very slow. I’ve tried a few different things, including reparametrisation and altering the priors, but nothing has helped.

Stuff I tried that didn’t help is commented out in the code below. If I remove the beta_oc, beta0_dom and beta_dom terms the model behaves very well, but then I have no nesting of outcomes in domains, which is the whole point! That makes me think the problem lies with those terms, although I’ve been unable to figure out exactly what it is.
The model code is below, and I’ve included a file for simulating data.

Any suggestions folks? Thanks in advance!

Sim data: gen_data_y3_x3.R (1011 Bytes)

Stan model:

data{
    int<lower=0> NvarsX;     // num independent variables
    int<lower=0> NvarsY;     // num dependent variables
    int<lower=0> N;          // Total number of rows
    int<lower=0> N_ind;      // Number of individuals/participants
    int<lower=2> N_dom;      // Number of domains in which outcomes are nested
    
    int indiv[N];   // data to identify individuals
    matrix[N, NvarsX] x;     // data for independent vars
    vector[NvarsY] y [N];    // data for dependent vars
    int dom[NvarsY];  // data to define outcome domains
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma;      //Residual SD of outcome variables
    
    matrix[NvarsY, N_ind] Zbeta0_ind;         //Intercepts at individual level
    matrix[NvarsX, N_ind] Zbeta_ind [NvarsY]; //Coefficients at the individual level
    vector<lower=0> [NvarsY] Tau0;           //Random effect for individual intercepts
    matrix<lower=0> [NvarsY, NvarsX] Tau;    //Random effect for individual coefficients

    vector[NvarsY] Beta0;         // Level 2 intercepts for each Y
    vector[NvarsX] Beta [NvarsY]; // Level 2 coefficients for each X vs each Y
    
    vector[NvarsY] beta_oc;      // Outcome specific random effect
    //vector[NvarsY] zb_oc;
    //vector[NvarsY] Tau_oc;
    
    vector[N_dom] beta0_dom;      // Domain specific random intercept
    vector[N_dom] beta_dom;      // Domain specific random effect
    vector<lower = 0> [NvarsY] sd_oc;        // SD for outcome effect
    vector<lower = 0> [N_dom] sd_dom;       // SD for domain effect
    vector<lower = 0> [N_dom] sd0_dom;       // SD for domain intercept
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;
    matrix[NvarsY, N_ind] beta0_ind;
    matrix[NvarsX, N_ind] beta_ind [NvarsY];

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
    
    // Define Individual level betas
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            // Non centered specification for component in ( )
            beta0_ind[dv, s] = ( Zbeta0_ind[dv, s] * Tau0[dv] + Beta0[dv] ) +
                                        beta0_dom[ dom[dv] ];
            for (iv in 1:NvarsX){
                // Non centered specification for component in ( )
                beta_ind[dv, iv, s] = ( Zbeta_ind[dv, iv, s] * Tau[dv, iv] + Beta[dv, iv] ) +
                                        beta_oc[dv] + beta_dom[ dom[dv] ];
                                        //(zb_oc[dv] * Tau_oc[dv] + beta_oc[dv] ) + beta_dom[ dom[dv] ];
                                        
            }
        }
    }

    // Compute the linear predictor mu for each row and outcome
    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = beta0_ind[dv, indiv[i]] +
                                    dot_product( to_vector( beta_ind[dv, 1:NvarsX, indiv[i]] ), x[i, 1:NvarsX] ); 
        }
    }
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ cauchy(0, 5);
    Tau0 ~ cauchy(0, 2);
    to_vector(Tau) ~ cauchy(0,2);
    to_vector(beta_oc) ~ normal(0, sd_oc);
    //to_vector(zb_oc) ~ normal(0, 1);
    //to_vector(Tau_oc) ~ normal(0, 2);
    to_vector(beta_dom) ~ normal(0, sd_dom);
    to_vector(beta0_dom) ~ normal(0, sd0_dom);
    to_vector(sd_oc) ~ normal(0, 0.5);
    to_vector(sd_dom) ~ normal(0, 0.5);
    to_vector(sd0_dom) ~ normal(0, 0.5);
    
    
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            Zbeta0_ind[dv, s] ~ normal(0, 1);
            Zbeta_ind[dv, 1:NvarsX, s] ~ normal(0, 1);
        }
    }
    to_vector(Beta0) ~ cauchy(0, 5);
    //for( dv in 1:NvarsY){
    //    to_vector(Beta[dv, 1:NvarsX]) ~ cauchy(0, 2);
    //}


    //likelihood
    for (i in 1:N){
        y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    }
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(Omega, L_sigma);  // use Omega, not its Cholesky factor L_Omega
}


Might be worth trying the QR reparameterization. Have a look at section 9.2 “The QR Reparameterization” of the 2.17.0 manual. Treedepth issues often come up in regression problems where you have tight correlations in the posterior. The sampler has to take small steps to avoid divergences, so you get treedepth problems because it needs to take a lot of these steps to actually explore the posterior.

For simple models ($y \sim \mathcal{N}(ax + b, \sigma)$), just centering your data can help you avoid these issues.
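
For reference, a minimal single-outcome sketch along the lines of the manual’s QR example (generic names N, K, theta, beta rather than the variables from the model above; priors omitted for brevity):

data{
    int<lower=0> N;          // number of rows
    int<lower=0> K;          // number of predictors
    matrix[N, K] x;
    vector[N] y;
}
transformed data{
    matrix[N, K] Q_ast;
    matrix[K, K] R_ast;
    matrix[K, K] R_ast_inverse;
    // thin and scale the QR decomposition
    Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
    R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
    R_ast_inverse = inverse(R_ast);
}
parameters{
    real alpha;
    vector[K] theta;         // coefficients on the Q_ast scale
    real<lower=0> sigma;
}
model{
    y ~ normal(Q_ast * theta + alpha, sigma);
}
generated quantities{
    vector[K] beta;
    beta = R_ast_inverse * theta;  // coefficients back on the original x scale
}

The sampler then explores the roughly uncorrelated theta space instead of the correlated beta space.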

Hey Ben, that should be easy to spot in the pairs plot, right?

In addition to what Ben said:
The model uses an LKJ(1) prior on a correlation matrix of dimension 20 and cauchy(0, 5) priors on the scales, with only 40 observations. Is this space larger than the existing universe?

Recall that I said the model behaves well and these problems don’t occur if I leave out the beta_oc and beta_dom terms, so I don’t think that’s the issue. Nevertheless, I changed the scale to cauchy(0, 2) and it made no difference.

Thanks Ben. I just attempted this and it also made no difference. My attempted implementation is below - perhaps I did something wrong? (I didn’t worry too much about generating the betas at the end, since I didn’t need them to see whether this was helping.)

Assuming I implemented that correctly … any other ideas?

data{
    int<lower=0> NvarsX;     // num independent variables
    int<lower=0> NvarsY;     // num dependent variables
    int<lower=0> N;          // Total number of rows
    int<lower=0> N_ind;      // Number of individuals/participants
    int<lower=2> N_dom;      // Number of domains in which outcomes are nested
    
    int indiv[N];   // data to identify individuals
    matrix[N, NvarsX] x;     // data for independent vars
    vector[NvarsY] y [N];    // data for dependent vars
    int dom[NvarsY];  // data to define outcome domains
}

transformed data{
    matrix[N, NvarsX] Q_ast;
    matrix[NvarsX, NvarsX] R_ast;
    matrix[NvarsX, NvarsX] R_ast_inverse;
    // thin and scale the QR decomposition
    Q_ast = qr_Q(x)[, 1:NvarsX] * sqrt(N - 1);
    R_ast = qr_R(x)[1:NvarsX, ] / sqrt(N - 1);
    R_ast_inverse = inverse(R_ast);
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma;      //Residual SD of outcome variables
    
    matrix[NvarsY, N_ind] Ztheta0_ind;         //Intercepts at individual level
    matrix[NvarsX, N_ind] Ztheta_ind [NvarsY]; //Coefficients at the individual level
    vector<lower=0> [NvarsY] Tau0;           //Random effect for individual intercepts
    matrix<lower=0> [NvarsY, NvarsX] Tau;    //Random effect for individual coefficients

    vector[NvarsY] Theta0;         // Level 2 intercepts for each Y
    vector[NvarsX] Theta [NvarsY]; // Level 2 coefficients for each X vs each Y
    
    vector[NvarsY] theta_oc;      // Outcome specific random effect
    //vector[NvarsY] zb_oc;
    //vector[NvarsY] Tau_oc;
    
    vector[N_dom] theta0_dom;      // Domain specific random intercept
    vector[N_dom] theta_dom;      // Domain specific random effect
    vector<lower = 0> [NvarsY] sd_oc;        // SD for outcome effect
    vector<lower = 0> [N_dom] sd_dom;       // SD for domain effect
    vector<lower = 0> [N_dom] sd0_dom;       // SD for domain intercept
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;
    matrix[NvarsY, N_ind] theta0_ind;
    matrix[NvarsX, N_ind] theta_ind [NvarsY];

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
    
    // Define Individual level thetas
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            // Non centered specification for component in ( )
            theta0_ind[dv, s] = ( Ztheta0_ind[dv, s] * Tau0[dv] + Theta0[dv] ) +
                                        theta0_dom[ dom[dv] ];
            for (iv in 1:NvarsX){
                // Non centered specification for component in ( )
                theta_ind[dv, iv, s] = ( Ztheta_ind[dv, iv, s] * Tau[dv, iv] + Theta[dv, iv] ) +
                                        theta_oc[dv] + theta_dom[ dom[dv] ];
                                        //(zb_oc[dv] * Tau_oc[dv] + theta_oc[dv] ) + theta_dom[ dom[dv] ];
                                        
            }
        }
    }

    // Compute the linear predictor mu for each row and outcome
    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = theta0_ind[dv, indiv[i]] +
                                dot_product( to_vector( theta_ind[dv, 1:NvarsX, indiv[i]] ), Q_ast[i, 1:NvarsX] ); 
        }
    }
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ cauchy(0, 5);
    Tau0 ~ cauchy(0, 2);
    to_vector(Tau) ~ cauchy(0,2);
    to_vector(theta_oc) ~ normal(0, sd_oc);
    //to_vector(zb_oc) ~ normal(0, 1);
    //to_vector(Tau_oc) ~ normal(0, 2);
    to_vector(theta_dom) ~ normal(0, sd_dom);
    to_vector(theta0_dom) ~ normal(0, sd0_dom);
    to_vector(sd_oc) ~ normal(0, 0.5);
    to_vector(sd_dom) ~ normal(0, 0.5);
    to_vector(sd0_dom) ~ normal(0, 0.5);
    
    
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            Ztheta0_ind[dv, s] ~ normal(0, 1);
            Ztheta_ind[dv, 1:NvarsX, s] ~ normal(0, 1);
        }
    }
    to_vector(Theta0) ~ normal(0, 2);
    for( dv in 1:NvarsY){
        to_vector(Theta[dv, 1:NvarsX]) ~ normal(0, 2);
    }


    //likelihood
    for (i in 1:N){
        y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    }
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(Omega, L_sigma);  // use Omega, not its Cholesky factor L_Omega
    
    //vector[NvarsY] beta_oc = R_ast_inverse * theta_oc;
    //vector[NvarsY] beta_dom;
    //beta_dom = R_ast_inverse * theta_dom;
}
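
For completeness, since the betas weren’t generated above: the Q_ast-scale coefficient vectors theta_ind[dv, , s] (which already have theta_oc and theta_dom folded in) are what map back to the x scale via R_ast_inverse - theta_oc and theta_dom themselves have length NvarsY rather than NvarsX, so they aren’t direct candidates for that transform. An untested sketch of the recovery:

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    matrix[NvarsX, N_ind] beta_ind [NvarsY];  // individual-level coefficients on the x scale
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(Omega, L_sigma);
    for (dv in 1:NvarsY){
        for (s in 1:N_ind){
            beta_ind[dv, 1:NvarsX, s] = R_ast_inverse * theta_ind[dv, 1:NvarsX, s];
        }
    }
}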

Looks right to me. Bummer.

@andre.pfeuffer is right. Probably best to look at the posterior samples you do have and see if anything weird is going on.

Maybe there’s an unidentifiability between beta_oc, beta_dom, or Beta or something. Put ’em in pairs plots and look around.

I assume @andre.pfeuffer was probably suggesting that you try a normal, or something with lighter tails

Yeah, I’ve been doing a bit of that alright, but I remain confused about where the problem lies 😝

Ah, got you. I’ll bear that in mind. I’m going to take a step back to a simpler model that worked - I think I tried to add too much at once, and I’ll go slower this time. I’m rethinking how to do it somewhat. Thanks for your help @bbbales2 and @andre.pfeuffer

Ok, so I removed some of the terms to keep things simpler, and I tightened up some of the priors, including the LKJ. This has brought some improvement - I’m now only getting about 250 max treedepth warnings per 1000 iterations, and it runs a bit faster. The revised code is below, and below that is a pairs plot for a few parameters. Looks like maybe an identifiability issue? Any suggestions to improve it?

data{
    int<lower=0> NvarsX;     // num independent variables
    int<lower=0> NvarsY;     // num dependent variables
    int<lower=0> N;          // Total number of rows
    int<lower=0> N_ind;      // Number of individuals/participants
    int<lower=2> N_dom;      // Number of domains in which outcomes are nested

    int<lower=1> indiv[N];   // data to identify individuals
    matrix[N, NvarsX] x;     // data for independent vars
    vector[NvarsY] y [N];    // data for dependent vars
    int dom[NvarsY];         // data to define outcome domains
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma;      //Residual SD of outcome variables
    
    matrix[NvarsY, N_ind] Zbeta0_ind;         //Intercepts at individual level
    matrix[NvarsX, N_ind] Zbeta_ind [NvarsY]; //Coefficients at the individual level
    vector<lower=0> [NvarsY] Tau0;           //Random effect for individual intercepts
    matrix<lower=0> [NvarsY, NvarsX] Tau;    //Random effect for individual coefficients

    vector[NvarsY] Beta0;         // Level 2 intercepts for each Y
    vector[NvarsX] Beta [NvarsY]; // Level 2 coefficients for each X vs each Y
    
    vector[NvarsY] beta_dom;            // Domain specific random effect
    vector<lower = 0> [N_dom] sd_dom;   // SD for domain effect - betas pooled per domain

}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;
    matrix[NvarsY, N_ind] beta0_ind;
    matrix[NvarsX, N_ind] beta_ind [NvarsY];

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
    
    // Define individual-level betas - non-centered specification
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            beta0_ind[dv, s] = Zbeta0_ind[dv, s] * Tau0[dv] + Beta0[dv];
            for (iv in 1:NvarsX){
                beta_ind[dv, iv, s] = Zbeta_ind[dv, iv, s] * Tau[dv, iv] + Beta[dv, iv] +
                                        beta_dom[dv];
            }
        }
    }

    // Compute the linear predictor mu for each row and outcome
    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = beta0_ind[dv, indiv[i]] +
                                    dot_product( to_vector( beta_ind[dv, 1:NvarsX, indiv[i]] ), x[i, 1:NvarsX] ); 
        }
    }
}

model{
    // Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0,2.5);
    Tau0 ~ cauchy(0,5);
    to_vector(Tau) ~ cauchy(0,5);
    
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            Zbeta0_ind[dv, s] ~ normal(0,1);
            Zbeta_ind[dv, 1:NvarsX, s] ~ normal(0,1);
        }
    }
    
    to_vector(Beta0) ~ cauchy(0, 2);
    for( dv in 1:NvarsY){
        to_vector(Beta[dv, 1:NvarsX]) ~ normal(0, 1);
        
        beta_dom[dv] ~ normal(0, sd_dom[ dom[dv] ]);
    }
    
    to_vector(sd_dom) ~ normal(0, 0.1);

    // Likelihood
    for (i in 1:N){
        y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    }
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(Omega, L_sigma);  // use Omega, not its Cholesky factor L_Omega
}

Okay, I figured out the source of that problem.

As specified above, the relevant parts of the model were essentially trying to calculate $(b_1 + b_2 + b_3)x$. The problem is that while my mind wants to think of each parameter as being multiplied by $x$, I was forgetting the order of operations: Stan does the addition first and then multiplies by the sum $B = b_1 + b_2 + b_3$, so the subterms of $B$ are unresolvable unless each of them is multiplied by $x$ in turn. So I rewrote the model to be, in essence, $b_1 x + b_2 x + b_3 x$, and that has resolved that issue. But of course I now have a new issue causing its own max treedepth warnings 🙄.
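
In code terms, the change amounts to the following (paraphrasing the relevant lines; the full model is below):

    // Before: beta_dom was folded into every individual-level coefficient,
    // so the likelihood only ever saw the summed coefficient:
    beta_ind[dv, iv, s] = Zbeta_ind[dv, iv, s] * Tau[dv, iv] + Beta[dv, iv] + beta_dom[dv];
    mu[i, dv] = beta0_ind[dv, indiv[i]] +
                dot_product( to_vector( beta_ind[dv, 1:NvarsX, indiv[i]] ), x[i, 1:NvarsX] );

    // After: each random effect enters mu as its own term multiplied by x:
    beta_ind[dv, iv, s] = Zbeta_ind[dv, iv, s] * Tau[dv, iv] + Beta[dv, iv];
    mu[i, dv] = beta0_ind[dv, indiv[i]] +
                dot_product( to_vector( beta_ind[dv, 1:NvarsX, indiv[i]] ), x[i, 1:NvarsX] ) +
                sum( beta_dom[dv] * x[i, 1:NvarsX] ) +
                sum( beta_oc[dv] * x[i, 1:NvarsX] );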

The new problem causing the max treedepth warnings is more of a Neal’s funnel situation, seen here between the variables beta_dom and sd_dom:

I’m having trouble reparameterising the relevant line here:

        beta_dom[dv] ~ normal(0, sd_dom[ dom[dv] ]);

Note that while there are (in this case) 3 beta_doms, there are only 2 sd_doms (reflecting the nesting of outcomes in domains). Any suggestions for how to reparameterise this? (I’ve tried and failed.) Full code below.

data{
    int<lower=0> NvarsX;     // num independent variables
    int<lower=0> NvarsY;     // num dependent variables
    int<lower=0> N;          // Total number of rows
    int<lower=0> N_ind;      // Number of individuals/participants
    int<lower=2> N_dom;      // Number of domains in which outcomes are nested

    int<lower=1> indiv[N];   // data to identify individuals
    matrix[N, NvarsX] x;     // data for independent vars
    vector[NvarsY] y [N];    // data for dependent vars
    int dom[NvarsY];         // data to define outcome domains
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma;      //Residual SD of outcome variables
    
    matrix[NvarsY, N_ind] Zbeta0_ind;         //Intercepts at individual level
    matrix[NvarsX, N_ind] Zbeta_ind [NvarsY]; //Coefficients at the individual level
    vector<lower=0> [NvarsY] Tau0;           //Random effect for individual intercepts
    matrix<lower=0> [NvarsY, NvarsX] Tau;    //Random effect for individual coefficients

    vector[NvarsY] Beta0;         // Level 2 intercepts for each Y
    vector[NvarsX] Beta [NvarsY]; // Level 2 coefficients for each X vs each Y
    
    vector[NvarsY] beta_dom;        // Domain specific random effect
    vector<lower = 0> [N_dom] sd_dom;   // SD for domain effect - betas pooled per domain

    row_vector[NvarsY] beta_oc;        // Outcome specific random effect
    vector<lower = 0> [NvarsY] sd_oc; // SD for outcome effect - betas pooled per outcome
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;
    matrix[NvarsY, N_ind] beta0_ind;
    matrix[NvarsX, N_ind] beta_ind [NvarsY];

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
    
    // Define individual-level betas - non-centered specification
    for (dv in 1:NvarsY){
        for (s in 1:N_ind){

            beta0_ind[dv, s] = Zbeta0_ind[dv, s] * Tau0[dv] + Beta0[dv];
            for (iv in 1:NvarsX){
                beta_ind[dv, iv, s] = (Zbeta_ind[dv, iv, s] * Tau[dv, iv] + Beta[dv, iv]);// +
                                       // beta_dom[dv];
            }
        }
    }

    // Compute the linear predictor mu for each row and outcome
    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = beta0_ind[dv, indiv[i]] +
                            dot_product( to_vector( beta_ind[dv, 1:NvarsX, indiv[i]] ), x[i, 1:NvarsX] ) +
                            sum( beta_dom[dv] * x[i, 1:NvarsX]) +
                            sum( beta_oc[dv] * x[i, 1:NvarsX]); 
        }
    }
}

model{
    // Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0,2.5);
    Tau0 ~ cauchy(0,5);
    to_vector(Tau) ~ cauchy(0,5);
    
    for (s in 1:N_ind){
        for (dv in 1:NvarsY){
            Zbeta0_ind[dv, s] ~ normal(0,1);
            Zbeta_ind[dv, 1:NvarsX, s] ~ normal(0,1);
        }
    }
    
    to_vector(Beta0) ~ cauchy(0, 2);
    for( dv in 1:NvarsY){
        to_vector(Beta[dv, 1:NvarsX]) ~ normal(0, 1);
        
        beta_dom[dv] ~ normal(0, sd_dom[ dom[dv] ]);
        beta_oc[dv] ~ normal(0, sd_oc[dv]);
    }
    
    to_vector(sd_dom) ~ normal(0, 0.1);
    to_vector(sd_oc) ~ normal(0, 0.1);


    // Likelihood
    for (i in 1:N){
        y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    }
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(Omega, L_sigma);  // use Omega, not its Cholesky factor L_Omega
}

beta_dom[dv] ~ normal(0, sd_dom[ dom[dv] ]);

You can write this statement (and the others like it) in a non-centered parameterisation:

parameters{
    vector[NvarsY] beta_dom_raw;
}
transformed parameters{
    vector[NvarsY] beta_dom = beta_dom_raw .* sd_dom[dom];
}
model{
    beta_dom_raw ~ normal(0, 1);
}

See discussions of centered vs. non-centered parameterisations here and elsewhere.
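
Fleshed out, an untested sketch applying the same idea to both random effects in the model above (assuming the declarations from the previous post; beta_oc gets the same treatment via sd_oc):

parameters{
    vector[NvarsY] beta_dom_raw;       // standard-normal raw parameters
    row_vector[NvarsY] beta_oc_raw;
    vector<lower=0> [N_dom] sd_dom;
    vector<lower=0> [NvarsY] sd_oc;
}
transformed parameters{
    // implies beta_dom[dv] ~ normal(0, sd_dom[ dom[dv] ])
    vector[NvarsY] beta_dom = beta_dom_raw .* sd_dom[dom];
    // implies beta_oc[dv] ~ normal(0, sd_oc[dv])
    row_vector[NvarsY] beta_oc = beta_oc_raw .* to_row_vector(sd_oc);
}
model{
    beta_dom_raw ~ normal(0, 1);
    beta_oc_raw ~ normal(0, 1);
    sd_dom ~ normal(0, 0.1);
    sd_oc ~ normal(0, 0.1);
}

The sampler then explores independent standard normals and the scale multiplication happens deterministically, which takes the funnel geometry out of the prior.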

Note that these are warnings, not errors (I edited the thread title accordingly). A large number of max treedepth warnings can indicate a serious problem with your model, but sometimes the posterior just happens to be difficult, in which case the warnings indicate slower mixing while the inference results can still be valid. Use Rhat and n_eff to diagnose your chains for mixing problems; these may help you find out which parameters are related to the max treedepth exceedances.

Thanks @andre.pfeuffer. Indeed I know about the non-centered parameterisation and had tried it to no avail.

Noted! Fair point, thanks for fixing it.

Yup, I had realised this and was having trouble finding the source of the problem. This model is considerably harder than anything I had tried before, so there has been a learning curve for me. I’ve come to the conclusion that this model won’t work as I’ve tried to implement it. Because I was trying to add terms to allow for collinearity of the outcomes when that is already included via the LKJ Cholesky term - something I had neglected to consider - my efforts were doomed from the start.
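
In symbols: the likelihood is

$$y_i \sim \mathcal{N}(\mu_i, \Sigma), \qquad \Sigma = \operatorname{diag}(\sigma)\,\Omega\,\operatorname{diag}(\sigma),$$

so the correlations between the NvarsY outcomes already live in $\Omega$ (via L_Omega), and shared additive effects such as beta_oc and beta_dom in $\mu_i$ offer a second, largely confounded route to the same correlation structure.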
