I have simulated data that resembles some real data I am working with.
The data come from a simple hierarchical model with a constant intercept and slope and a random intercept term. More specifically, I have simulated data from
Y_{i}(t_{ij}) = (\eta_1 + w_i) +\eta_2t_{ij} + \varepsilon_{ij},
where w_i \sim N(0,\sigma^2_1) and \varepsilon_{ij} \sim N(0,\sigma^2). In context, I have i=1,\dots,n units and j=1,\dots,n_i recordings of a covariate for each unit.
The data are simulated using (\eta_1, \eta_2, \sigma_1, \sigma) = (-1.22, 0.004, 1, 0.0001). I have attached a plot of the data below.
y.pdf (170.7 KB)
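For concreteness, here is a sketch of the simulation in Python/NumPy. The parameter values are the ones stated above; the number of units, the number of recordings per unit, and the integer time grid are illustrative assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(1)
eta1, eta2, sigma1, sigma = -1.22, 0.004, 1.0, 0.0001
n, n_i = 40, 500  # assumed: 40 units, 500 recordings each (~20 000 obs)

unit = np.repeat(np.arange(n), n_i)   # unit index for each recording (w_ind)
t = np.tile(np.arange(n_i), n)        # integer recording times (tt)
w = rng.normal(0.0, sigma1, size=n)   # random intercepts w_i ~ N(0, sigma1^2)

# Y_i(t_ij) = (eta1 + w_i) + eta2 * t_ij + eps_ij,  eps_ij ~ N(0, sigma^2)
y = eta1 + w[unit] + eta2 * t + rng.normal(0.0, sigma, size=n * n_i)
```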
The Stan code that I am using (shown below) takes a very long time (about 10 hours) to estimate the model parameters and returns the following warnings:
5: The largest R-hat is 2.14, indicating chains have not mixed.
Running the chains for more iterations may help.
6: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help.
7: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help.
The effective sample sizes for all parameters are low. I ran the code for 6000 iterations (1500 warmup) and the effective sample sizes were (91, 20, 317, 22) for (\eta_1, \eta_2, \sigma_1, \sigma), respectively. I also had to increase the maximum treedepth to 30 to avoid an additional warning.
Is it the very small values of \sigma_1 and \sigma that are causing the issues?
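One way to see why the small \sigma might matter is a rough back-of-envelope calculation: the likelihood-implied posterior scale of the slope is roughly \sigma / (\sqrt{N}\,\mathrm{sd}(t)), which with \sigma = 10^{-4} is many orders of magnitude smaller than the N(0,1) prior scale, so the sampler must take extremely small steps (hence the large treedepth). A sketch, assuming for illustration N = 20\,000 and times 0–499 as above:

```python
import numpy as np

sigma, N = 1e-4, 20_000
t = np.tile(np.arange(500), 40)  # assumed time grid, as in the sketch above

# Rough likelihood-implied posterior SD for the slope eta2
post_sd_eta2 = sigma / (np.sqrt(N) * t.std())

# Mismatch between the prior scale (1) and the posterior scale
ratio = 1.0 / post_sd_eta2
```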
data {
  int<lower=1> n;                  // number of units
  int<lower=1> N;                  // total number of recordings across all units
  real y[N];                       // all of the y values
  int<lower=1, upper=n> w_ind[N];  // unit index for each recording
  int tt[N];                       // recording times for all units
}
parameters {
  real<lower=0> sig;   // residual SD
  real<lower=0> sig1;  // random-intercept SD
  real eta1;           // intercept
  real eta2;           // slope
  vector[n] w;         // standardised random intercepts (non-centred)
}
model {
  vector[N] Mu;
  for (i in 1:N)
    Mu[i] = eta1 + eta2 * tt[i] + sig1 * w[w_ind[i]];
  target += normal_lpdf(y | Mu, sig);
  // priors
  eta1 ~ normal(0, 1);
  eta2 ~ normal(0, 1);
  w ~ normal(0, 1);
  sig ~ normal(0, 1);
  sig1 ~ normal(0, 1);
}
I have tried multiplying the response, y, by different factors to increase the estimated value of \eta_2 (for example, changing the response from metres to centimetres), but this does not help.
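An alternative to multiplying y alone is to put both the time covariate and the response on unit scale before fitting, and then transform the estimates back afterwards. A minimal sketch, assuming arrays y and t as in the model (the helper name is hypothetical):

```python
import numpy as np

def standardize(x):
    """Center x and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(), x.mean(), x.std()

# Example: times 0..499 repeated for 40 units, as in the simulated data.
t = np.tile(np.arange(500), 40)
t_std, t_mean, t_sd = standardize(t)

# Fit the Stan model to the standardised t (and similarly standardised y),
# then recover the original-scale parameters afterwards:
#   eta2_orig = eta2_std * y_sd / t_sd
#   eta1_orig = eta1_std * y_sd + y_mean - eta2_orig * t_mean
```

On unit scale the N(0, 1) priors are also far better matched to the parameters the sampler actually explores.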
I have 40 units and around 20 000 observations in total. However, even fitting smaller subsets of the data takes a very long time and returns the same warnings. The lme function in R estimates the parameters accurately in about a second.