Hi all,
I am attempting to specify a model that approximates the underlying data-generating process of several traits related to human growth. Let x be a scalar independent variable and Y be a vector of observed continuous responses indexed by k = 1, 2, ..., K. I assume the response vector for individual n is distributed as y_n ~ N(h(x_n), \Sigma(x_n)), where h(x) is a vector-valued mean function and \Sigma(x) is a covariance built from per-trait standard deviations and a shared correlation matrix. The mean is a length-K vector with components h_k(x) = a_k*x^r_k + b_k. For the standard deviations I assume linear scaling in x: \sigma_k(x) = s_k*(1 + \kappa_k*x). Essentially, each continuous response has its own parameter vector [a_k, r_k, b_k, s_k, \kappa_k], with the correlation structure \Sigma shared across responses.
To start, I tested the model below on simulated data with K = 1, 2, and 6. In all cases I am able to recover my original parameters. However, when I switch over to real data (where x = age and Y = continuous growth traits, i.e. bone lengths in mm), I run into several issues: MANY divergences, R-hat values indicating a lack of convergence, poor mixing, exceeded tree depth, and low ESS. After going through the forums, the tutorials, and reading some of the work by Betancourt, I am pretty sure the main issue is the geometric complexity of the posterior I am after. I have increased adapt_delta and the tree depth, but I still run into issues, which tells me the model may be poorly written or too complex. Now, I know the difference between the simulated data and the “real” data rests on the fact that there is significant conditional dependence between the individual response variables, the data are heteroskedastic, scaling with age (hence the sd function), and according to the shinystan output there is significant autocorrelation in all chains that is not present with the simulated data. It is also possible that the correlation structure itself changes with x. Again, I presume this is because of the complex structure of the “real” data.
QUESTION
How can I reparameterize my model so that it is logically equivalent but has simpler posterior geometry? Any advice would be wonderful. Essentially, my domain expertise suggests that each trait in Y depends on age x, that the traits are heteroskedastic and conditionally dependent, and that the correlation structure may change with x. The long-term goal is to scale this up to large K (> 50), to include ordinal responses with a probit link, and to handle cases where some elements of the Y vector are missing. It may also be that the multivariate normal is an inappropriate choice; I do plan on testing the multivariate generalization of the exponential power distribution (Gómez-Villegas and Marín, 1998), but I am struggling to write this custom distribution in Stan based on a Cholesky parameterization for \Sigma… But, I am not there yet! Any advice or suggestions would be wonderful!
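(On the multivariate exponential power side question mentioned above, here is a rough sketch of how that log density might look as a Stan user-defined function with a Cholesky factor. The function name `multi_ep_cholesky_lpdf` and the extra kurtosis parameter `beta > 0` are my own; the normalizing constant is written from the Gómez-Villegas and Marín parameterization, so please check it against the paper before relying on it.)

```stan
functions {
  // Sketch: log density of the K-variate exponential power distribution
  // with location mu, Cholesky factor L of the covariance Sigma, and
  // kurtosis parameter beta > 0. beta = 1 should recover the
  // multivariate normal. Verify the constant before serious use.
  real multi_ep_cholesky_lpdf(vector y, vector mu, matrix L, real beta) {
    int K = rows(y);
    // quadratic form q = (y - mu)' Sigma^{-1} (y - mu) via triangular solve
    vector[K] z = mdivide_left_tri_low(L, y - mu);
    real q = dot_self(z);
    // log of the normalizing constant
    real log_k = log(K) + lgamma(K / 2.0)
                 - 0.5 * K * log(pi())
                 - lgamma(1 + K / (2.0 * beta))
                 - (1 + K / (2.0 * beta)) * log2();
    // log|Sigma|^{-1/2} = -sum(log(diag(L)))
    return log_k - sum(log(diagonal(L))) - 0.5 * pow(q, beta);
  }
}
```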
data{
  //multivariate outcomes
  int<lower=1> N;              // # of individuals
  int<lower=1> K;              // # of responses
  vector[K] y[N];              // vector of responses per individual
  //predictors
  real X[N];                   // age
}
parameters{
  vector<lower=0>[K] a;        // power-law scale
  vector<lower=0>[K] r;        // power-law exponent
  vector[K] b;                 // intercept
  vector<lower=0>[K] s_scale;  // baseline sd
  vector<lower=0>[K] kappa;    // linear sd scaling with age
  // Cholesky factor of the correlation matrix
  cholesky_factor_corr[K] L_Omega;
}
transformed parameters{
  vector[K] mu[N];
  vector[K] s[N];
  // note: each Sigma[i] holds the Cholesky factor of the covariance,
  // not the covariance itself
  matrix[K,K] Sigma[N];
  for(i in 1:N){
    for(k in 1:K){
      mu[i,k] = a[k]*X[i]^(r[k]) + b[k];
      s[i,k] = s_scale[k]*(1 + kappa[k]*X[i]);
    }
    Sigma[i] = diag_pre_multiply(s[i], L_Omega);
  }
}
model{
  a ~ gamma(4,1);
  r ~ gamma(4,1);
  b ~ normal(0,100);
  kappa ~ gamma(4,1);
  s_scale ~ gamma(4,1);
  L_Omega ~ lkj_corr_cholesky(1);
  for(i in 1:N){
    y[i] ~ multi_normal_cholesky(mu[i], Sigma[i]);
  }
}
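One concrete direction worth trying (a sketch under my own assumptions, not a drop-in replacement): in power-law models, funnel-like geometry often comes from the strong posterior correlation between a_k and r_k when x is far from 1, and it is greatly reduced if age is rescaled by a fixed reference value so the base of the power is O(1). Here `x_ref` is a data-supplied reference age (e.g. the sample median); note that a and kappa then change meaning (a_k becomes the expected increment at x = x_ref), so the priors need rethinking:

```stan
data {
  int<lower=1> N;
  int<lower=1> K;
  vector[K] y[N];
  real X[N];
  real<lower=0> x_ref;       // reference age, e.g. the sample median
}
parameters {
  vector<lower=0>[K] a;      // now: mean increment at x = x_ref
  vector<lower=0>[K] r;
  vector[K] b;
  vector<lower=0>[K] s_scale;
  vector<lower=0>[K] kappa;
  cholesky_factor_corr[K] L_Omega;
}
model {
  a ~ gamma(4, 1);
  r ~ gamma(4, 1);
  b ~ normal(0, 10);         // tighter than normal(0, 100); rescale y if needed
  kappa ~ gamma(4, 1);
  s_scale ~ gamma(4, 1);
  L_Omega ~ lkj_corr_cholesky(2);
  for (i in 1:N) {
    vector[K] mu;
    vector[K] s;
    for (k in 1:K) {
      // (X/x_ref)^r stays near 1 over the data range, which decorrelates
      // a and r in the posterior compared with a * X^r
      mu[k] = a[k] * pow(X[i] / x_ref, r[k]) + b[k];
      s[k]  = s_scale[k] * (1 + kappa[k] * X[i] / x_ref);
    }
    y[i] ~ multi_normal_cholesky(mu, diag_pre_multiply(s, L_Omega));
  }
}
```

Two related, cheap things to try alongside this: standardize the traits themselves (bone lengths in mm can be large, so gamma(4,1) priors on a and s_scale may be far from the data scale), and drop the per-individual Sigma array from transformed parameters (as above) so the Cholesky factor is built only where it is used, which also avoids storing N K-by-K matrices in the output.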