Transformed parameters converge

Hi,

I have some trouble understanding the information i get from the summary table. My parameters have \hat{R} values far away from unity while my transformed parameters appears to behave quite well.
Below I have a more in depth explanation of my approach, followed by the implemented .stan model and finally the summary table.

Actual question

If anyone could tell me why only the transformed parameters behave well and why approach 3 so much slower than approach 1 and 2, it would be greatly appreciated.
Before anyone states it I am aware of the tree depth being saturated. But using method 1 or 2 never gets above leapfrog step three.

In depth

I am trying to solve the following inverse problem: \textbf{d}=\textbf{Gm}
Here \textbf{d} (size n) is observations, \textbf{G} (size n\times m) data kernel and \textbf{m} (size m) the model.
The parameters in \textbf{m} are dependent and from some simulations, \textbf{M} (Each column is a realization of \textbf{m}), I have an approximate covariance matrix \Sigma_M.

I have sampled the following ways:

1. Each element in \textbf{m} being independent and normally distributed
2. \textbf{m} following multivariate normal distribution using \Sigma_M
3. Lastly projecting \textbf{m} into eigenspace, using the eigen-decomposition of \Sigma_M.

With method 3 I ran into a problem.
I sample \textbf{q} in eigen-space and project it back to determine the likelihood.
\textbf{m} = \textbf{Tq}.*\sigma_M + \mu_M
Here \textbf{T} is the tranformation matrix to and from eigen-space, .* elementwise multiplication, \sigma_M and \mu_M standard deviation and mean of each row in \textbf{M}, respectively.
The scaling with \sigma_M and \mu_M is done because \textbf{M} is standardized before determining \Sigma_M.

I found that this approach worked really bad. And realized that much of the information was “trapped” in \mu_M which was kept constant.
I then tried giving \mu_M a distribution. Which resulted in \textbf{m} having \hat{R}\approx1, but not \mu_M and \textbf{q}.
And also it required way to much computational time when compared to method 2.

Model

data {
int<lower=0> L;	// Length of data-array
int<lower=0> n;	// SH-degree
int<lower=0> mMax;	// Amount of model parameters
int<lower=0> p;	// Amount of model truncated parameters
vector[L] d;		// Declare size of data-array
matrix[L, mMax] G;	// Declare size of design matrix
vector[mMax] igrfMean;
vector[mMax] igrfStd;
vector[mMax] orgStd;
matrix[mMax, p] T;	// Transfer matrix
}

parameters {
vector[p] q;	// Declare model array
vector[mMax] mu; // Mean variable
}

transformed parameters {
vector[mMax] m;
m = ((T*q) .* orgStd)+(mu .* igrfStd + igrfMean);
}

model {
vector[L] res;		// Declare residual array

// Prior distribution of eigenspace
for (i in 1:p) {
q[i] ~ normal(0, 1);
}

// Prior distribution of mean
for (i in 1:mMax) {
mu[i] ~ normal(0, 1);
}

// Likelihood
res = d-G*m;
res ~ normal(0,1);
}


Output

WARNING:pystan:n_eff / iter below 0.001 indicates that the effective sample size has likely been overestimated
WARNING:pystan:Rhat above 1.1 or below 0.9 indicates that the chains very likely have not mixed
WARNING:pystan:3859 of 4000 iterations saturated the maximum tree depth of 10 (96.5 %)
WARNING:pystan:Run again with max_treedepth larger than 10 to avoid saturation
WARNING:pystan:Chain 4: E-BFMI = 0.0599
WARNING:pystan:E-BFMI below 0.2 indicates you may need to reparameterize your model
Inference for Stan model: None_0a54820ed44399940f2d61834cc82db1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
q[1]    1.71    0.04   0.06   1.67   1.67   1.68   1.75   1.82      2  69.36
q[2]    4.59    0.03   0.04   4.51   4.56   4.61   4.62   4.62      2  73.53
q[3]   -1.34    0.02   0.03  -1.39  -1.36  -1.33  -1.33  -1.33      2  73.05
q[4]   -0.26    0.03   0.04  -0.29  -0.29  -0.28  -0.24  -0.19      2  79.27
q[5]   -7.73    0.01   0.02  -7.74  -7.74  -7.74  -7.72   -7.7      2  68.54
q[6]    1.52    0.04   0.06   1.48   1.48   1.49   1.56   1.63      2   66.2
q[7]   -6.04    0.03   0.05  -6.07  -6.07  -6.06   -6.0  -5.95      2  73.13
mu[1]  -0.18     0.9   1.28   -1.8  -1.25  -0.34   0.92   1.75      2 1249.7
mu[2]    2.5    3.34   4.73  -0.68  -0.61   0.01    6.1  10.67      2 1245.3
mu[3]   4.35    4.37   6.19  -1.03    0.2   1.77   8.96   14.9      2 502.12
mu[4] 122.34  115.22  163.1   8.89  17.22  38.15 235.88 418.68      2  68.87
mu[5]   1.86    3.16   4.47  -1.57  -1.08  -0.28   5.22   9.55      2 2346.3
mu[6]   6.41    7.78  11.01  -1.07  -0.42   0.61  14.39  25.45      2 1439.7
mu[7]  -0.39     0.9   1.27  -1.63  -1.61  -0.63   0.88   1.34      2 1396.7
mu[8]  -2884   62.72  88.78  -2945  -2941  -2929  -2822  -2723      2  74.26
m[1]  -5.2e4  3.4e-5 2.1e-3 -5.2e4 -5.2e4 -5.2e4 -5.2e4 -5.2e4   3869    1.0
m[2]  3138.7  4.1e-5 2.6e-3 3138.7 3138.7 3138.7 3138.7 3138.7   3934    1.0
m[3]  680.29  4.1e-5 2.5e-3 680.29 680.29 680.29  680.3  680.3   3800    1.0
m[4]  -172.9  2.6e-5 1.6e-3 -172.9 -172.9 -172.9 -172.9 -172.9   3816    1.0
m[5]   -3074  3.1e-5 1.9e-3  -3074  -3074  -3074  -3074  -3074   3695    1.0
m[6]  5004.7  3.0e-5 1.9e-3 5004.7 5004.7 5004.7 5004.7 5004.7   3931    1.0
m[7]  -715.2  3.5e-5 2.2e-3 -715.2 -715.2 -715.2 -715.2 -715.2   4094    1.0
m[8]  1198.8  3.4e-5 2.2e-3 1198.8 1198.8 1198.8 1198.8 1198.8   4253    1.0
lp__  -5.2e4  201.52  285.2 -5.2e4 -5.2e4 -5.2e4 -5.1e4 -5.1e4      2  72.49

This could mean that your data only identify \mathbf{m} well, but the decomposition into q and \mu is not unique. I suspect this is actually the case - for any given \mathbf{m,T,q}, \sigma_M, you can solve for \mu_M. So ignoring prior - you can choose any \mathbf{q} and find \mu_M to get the \mathbf{m} you need.

2 Likes

n_eff = 2 is certain indication of multimodal distribution for those quantities

1 Like

Thanks you!
I do see your point and think you are entirely correct. I had similar concerns when doing it.

How would you then interpret it if \mu_M was kept constant and only q was being sampled, but the result still was the same? q, having an \hat{R} larger than 1.1 but m having \hat{R}=1.

How are you so certain about that from n_{eff} being equal to two?

Alternative would be clustering due to sticking chains. n_eff in Stan uses within and between chain variance information to diagnose this. Depending on the exact algorithm version you are using the exact value varies, but any value approximately equal or less than the number of chains indicates the chains are sampling from non-overlapping sets.

2 Likes

That there is some non-identifiability/multi modality issue left :-) I would check the output of pairs or the “Explore” tab in shinystan for possible collinearities between the variables and then think hard about the math and double check that my code does what I think it does.