I have some questions concerning developing models with scalability and estimate stability. I am a beginner building my first Stan model. My background is similar to an economics PhD, though not the highly mathematical / econometrics type. My intent is to eventually estimate the model with large datasets of company data (eg, those on the WRDS platform). The nature of the data would be like over 8000 firms in the cross section of each quarter/year and each firm has its series ranging from several years to over 30 years (x 4 for quarters). So there can be over 150,000 observations in the pooled sample.
Given the nature of the data, it seems natural to think in the direction of Hierarchical Partial Pooling like this post by @Bob_Carpenter . The large number of firms and the partial pooling mean that scalability and high dimension in the parameter space are issues to keep in mind in developing the model.
The model will have different components. For example, a linear regression model explaining the reported company earnings (r) using a potentially large list of predictor variables. A second component on misreporting, which includes the decision to misreport or not (M = 1 or 0) and the extent of misreporting (m). At the current infant stage of developing this model, I have not yet connected these variables in a meaningful way. I merely let them be disconnected stochastic variables and focus on finding a way to ensure that as overly simplified as it is, the model can scale up (with reliable estimates) as the number of observations grow.
In this model setup, estimating reliably the parameters for r and M is not a problem. For m, I assume it follows an exponential kernel Gaussian Process, like the example by @betanalpha and this one in the Stan documentation. The idea is to say that the institutional setting of a firm pretty much constrains the maximum extent the firm can misreport the earnings at each point in time of an accounting period. And if the firm somehow decides to misreport (based on the costbenefit calculation at a particular moment), it would go all the way to the maximum extent. Since the institutional constraints on m at different time points are likely to be more similar when the time points are closer, this motivates the assumption of a GP.
I tried estimating the model in a cluster for different amount of observations (~300 to ~1000) with results summarized in the table further below. The sessionInfo() is as follows:
R version 3.5.1 (20180702)
Platform: x86_64redhatlinuxgnu (64bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)
Originally, I worked with my laptop and used the default 2000 iterations for a chain. The run time was over 2 hours when I tried ~500 observations. So I reduced to only 300 iterations and also moved the execution to a cluster. The check_all_diagnostics() of the stan_utility.R by @betanalpha did not complain despite the much smaller number of iterations per chain:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0 of 600 iterations ended with a divergence (0%)
0 of 600 iterations saturated the maximum tree depth of 10 (0%)
EBFMI indicated no pathological behavior
n times  rho est. (=5.5)  alpha est. (=3)  sigma est. (=2)  

(of 99 obs)  total obs  user time  system  elapsed  lp__  n_eff  Rhat  n_eff  Rhat  
3  297  283.122  0.553  108.181  81.181  5.542  286  1.007  1.545  246  1.022  1.879 
3  (2nd try, 1 core only)  268.678  0.042  268.701  145.994  4.823  193  1.029  3.598  250  1.029  2.176 
4  396  493.266  0.302  199.813  157.424  4.286  227  1.008  2.952  238  1.007  2.082 
4  (2nd try)  566.725  0.38  223.265  186.286  4.818  433  0.999  2.245  234  1.006  2.12 
5  495  1131.484  1.287  405.866  120.645  5.439  285  1.004  2.813  225  1.017  1.88 
5  (2nd try)  893.744  0.601  325.718  197.346  5.855  379  1.002  3.977  253  1.002  1.983 
6  594  1746.771  1.021  664.03  199.892  4.318  226  1.022  1.659  343  1.003  1.996 
7  693  2438.372  3.689  892.547  210.669  5.336  218  1.013  4.853  151  1.017  1.992 
8  792  3602.494  1.562  1484.676  278.444  4.863  496  1.005  2.188  417  1.001  2.075 
9  891  4889.948  2.702  1885.201  271.386  6.737  264  1.017  3.049  189  1.013  2.052 
10  990  6110.68  0.895  2194.017  329.724  3.607  332  1.01  1.181  194  1.029  2.028 
I do not report here the n_eff and Rhat for sigma because its estimate seems to be very stable. I am puzzled by the lack of estimate stability for rho and especially alpha as the number of observations grow ā ie, not clearly moving closer to the true parameters (rho=5.5, alpha=3, sigma=2)? I used the following priors:
rho ~ inv_gamma(5,20)
alpha ~ gamma(1.5,0.5)
sigma ~ gamma(1,0.5)
My questions are:

For the n=3 case (ie, with 99*3 = ~300 observations), I made two attempts; similary for n=4 or 5. It seems to me that because both the model and the number of observations are identical, the log density lp__ is comparable across the two attempts.
 Is it correct to believe that the set of estimates with a lower lp__ is necessarily the set ācloserā to the true parameters (rho=5.5, alpha=3, sigma=2)?

When the amount of data doubles/triples from n=3 to n=6/n=9, the execution time (for user) jumps to a multiple of ~6 or ~17 times, respectively.
 Is this exponential growth in execution time normally expected? To what extent it could be significantly reduced by smarter coding of the Stan model or some implementation options not yet used by me as a beginner?

Suppose I could use whatever smarter options recommended in the replies to this post.
 How large could the scale of such a model be? If not for over 150,000 observations (8000+ firms, times 20+ years), is there any workaround to overcome the scalability limitation (eg, instead of using all the observations, merely study the distribution of estimates from subsamples of manageable sizes?

If the model could be scaled up for a large amount of data (much beyond the ~1000 observations I tried),
 is it reasonable to expect that the estimates would become highly stable, unlike the small scales I tried?
Would appreciate any advice and pointers to related beginnerfriendly documentation, discussion, blog posts, etc.
Many thanks.
============================
Attached below are the Stan models:
fit1 < stan(file = "misreport.stan", data = sim_EM_data,
control = list(adapt_delta = 0.90), iter = 300)
misreport.stan:
data {
int<lower=0> N; // number of observations (for different CEOs/firms)
int<lower=0> K; // number of coefficents (/predictors)
vector[N] r; // dependent variable (reported company earnings, scaled by Rev, total revenue)
matrix[N,K] Z; // predictor variables (main components of earnings, all scaled by Rev:
// COGS2Rev, SGA2Rev, RnD2Rev with Rev2Rev = unit vector as the intercept)
int<lower=0,upper=1> M[N]; // M = 1 for the decision to misreport; = 0 if report honestly
// forecasts
int<lower=0> N_new; // number of predictions
matrix[N_new,K] Z_new; //
// GP data
int<lower=0> G; // number of observed points from the realized GP function
real t[G]; // ie, range of function input (ie, x of f(x)) to the realized GP function f(x)
vector[G] m; // misreporting extent (ie, the function value y = f(x) of the realized GP function)
}
transformed data {
}
parameters {
vector[K] b; // coefficients of the predictor variables Z
real<lower=0> sd_r; // sd of reported earnings
real<lower=0,upper=1> integrity; // integrity level of the society affecting the decision to misreport or not
// est'd coefficients for the GP kernel parameters
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
transformed parameters {
real ilogodds; // 'log odds' of the society integrity level
ilogodds = logit(integrity);
}
model {
matrix[G, G] Kcov;
matrix[G, G] Lcov;
r ~ normal(Z*b, sd_r);
M ~ bernoulli_logit(ilogodds);
rho ~ inv_gamma(5,20);#30); // (5,5) (8.91924, 34.5805); rho_T = 5.5
alpha ~ gamma(1.5,0.5); // gamma(3,1) normal(0, 3); // (0, 2) (0,1) alpha_T = 3
sigma ~ gamma(1,0.5); // gamma(2,1) normal(0, 2); // (0, 1.5) (0,1) sigma_T = 2
// https://mcstan.org/docs/2_18/stanusersguide/fitgpsection.html
// https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
Kcov = cov_exp_quad(t, alpha, rho) + diag_matrix(rep_vector(square(sigma), G));
// ^
Lcov = cholesky_decompose(Kcov);
m ~ multi_normal_cholesky(rep_vector(0, G), Lcov); // rep_vector(0, G) is the zero mean assumed for the GP
}
generated quantities {
vector[N_new] r_new;
for (n in 1:N_new)
r_new[n] = normal_rng(Z_new[n] * b, sd_r);
}
=======================================
sim_GP_data < list(rho=rho_T,
alpha=alpha_T,
sigma=sigma_T,
N=G_total,
x=t_total)
simu_fit < stan("simu_GP.stan", data=sim_GP_data, iter=1,
chains=1, algorithm="Fixed_param")
simu_GP.stan:
data {
int<lower=1> N;
real x[N];
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
transformed data {
/* Note: added a nugget or jitter of 10^ā10 so the marginal covariance matrix before computing its Cholesky decomposition in order to stabilize the numerical calculations. Often the nugget is taking to be square root of the floating point precision, which would be 10ā8 for double precision calculations.
https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html
*/
matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e10, N));
matrix[N, N] L_cov = cholesky_decompose(cov);
}
parameters {}
model {}
generated quantities {
vector[N] f = multi_normal_cholesky_rng(rep_vector(0, N), L_cov);
vector[N] y;
for (n in 1:N)
y[n] = normal_rng(f[n], sigma);
}