I have some questions concerning developing models with scalability and estimate stability. I am a beginner building my first Stan model. My background is similar to an economics PhD, though not the highly mathematical / econometrics type. My intent is to eventually estimate the model with large datasets of company data (eg, those on the WRDS platform). The nature of the data would be like over 8000 firms in the cross section of each quarter/year and each firm has its series ranging from several years to over 30 years (x 4 for quarters). So there can be over 150,000 observations in the pooled sample.
Given the nature of the data, it seems natural to think in the direction of Hierarchical Partial Pooling like this post by @Bob_Carpenter . The large number of firms and the partial pooling mean that scalability and high dimension in the parameter space are issues to keep in mind in developing the model.
The model will have different components. For example, a linear regression model explaining the reported company earnings (r) using a potentially large list of predictor variables. A second component on misreporting, which includes the decision to misreport or not (M = 1 or 0) and the extent of misreporting (m). At the current infant stage of developing this model, I have not yet connected these variables in a meaningful way. I merely let them be disconnected stochastic variables and focus on finding a way to ensure that as overly simplified as it is, the model can scale up (with reliable estimates) as the number of observations grow.
In this model setup, estimating reliably the parameters for r and M is not a problem. For m, I assume it follows an exponential kernel Gaussian Process, like the example by @betanalpha and this one in the Stan documentation. The idea is to say that the institutional setting of a firm pretty much constrains the maximum extent the firm can misreport the earnings at each point in time of an accounting period. And if the firm somehow decides to misreport (based on the cost-benefit calculation at a particular moment), it would go all the way to the maximum extent. Since the institutional constraints on m at different time points are likely to be more similar when the time points are closer, this motivates the assumption of a GP.
I tried estimating the model in a cluster for different amount of observations (~300 to ~1000) with results summarized in the table further below. The sessionInfo() is as follows:
R version 3.5.1 (2018-07-02)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)
Originally, I worked with my laptop and used the default 2000 iterations for a chain. The run time was over 2 hours when I tried ~500 observations. So I reduced to only 300 iterations and also moved the execution to a cluster. The check_all_diagnostics() of the stan_utility.R by @betanalpha did not complain despite the much smaller number of iterations per chain:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0 of 600 iterations ended with a divergence (0%)
0 of 600 iterations saturated the maximum tree depth of 10 (0%)
E-BFMI indicated no pathological behavior
n times | rho est. (=5.5) | alpha est. (=3) | sigma est. (=2) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
(of 99 obs) | total obs | user time | system | elapsed | lp__ | n_eff | Rhat | n_eff | Rhat | |||
3 | 297 | 283.122 | 0.553 | 108.181 | -81.181 | 5.542 | 286 | 1.007 | 1.545 | 246 | 1.022 | 1.879 |
3 | (2nd try, 1 core only) | 268.678 | 0.042 | 268.701 | -145.994 | 4.823 | 193 | 1.029 | 3.598 | 250 | 1.029 | 2.176 |
4 | 396 | 493.266 | 0.302 | 199.813 | -157.424 | 4.286 | 227 | 1.008 | 2.952 | 238 | 1.007 | 2.082 |
4 | (2nd try) | 566.725 | 0.38 | 223.265 | -186.286 | 4.818 | 433 | 0.999 | 2.245 | 234 | 1.006 | 2.12 |
5 | 495 | 1131.484 | 1.287 | 405.866 | -120.645 | 5.439 | 285 | 1.004 | 2.813 | 225 | 1.017 | 1.88 |
5 | (2nd try) | 893.744 | 0.601 | 325.718 | -197.346 | 5.855 | 379 | 1.002 | 3.977 | 253 | 1.002 | 1.983 |
6 | 594 | 1746.771 | 1.021 | 664.03 | -199.892 | 4.318 | 226 | 1.022 | 1.659 | 343 | 1.003 | 1.996 |
7 | 693 | 2438.372 | 3.689 | 892.547 | -210.669 | 5.336 | 218 | 1.013 | 4.853 | 151 | 1.017 | 1.992 |
8 | 792 | 3602.494 | 1.562 | 1484.676 | -278.444 | 4.863 | 496 | 1.005 | 2.188 | 417 | 1.001 | 2.075 |
9 | 891 | 4889.948 | 2.702 | 1885.201 | -271.386 | 6.737 | 264 | 1.017 | 3.049 | 189 | 1.013 | 2.052 |
10 | 990 | 6110.68 | 0.895 | 2194.017 | -329.724 | 3.607 | 332 | 1.01 | 1.181 | 194 | 1.029 | 2.028 |
I do not report here the n_eff and Rhat for sigma because its estimate seems to be very stable. I am puzzled by the lack of estimate stability for rho and especially alpha as the number of observations grow ā ie, not clearly moving closer to the true parameters (rho=5.5, alpha=3, sigma=2)? I used the following priors:
rho ~ inv_gamma(5,20)
alpha ~ gamma(1.5,0.5)
sigma ~ gamma(1,0.5)
My questions are:
-
For the n=3 case (ie, with 99*3 = ~300 observations), I made two attempts; similary for n=4 or 5. It seems to me that because both the model and the number of observations are identical, the log density lp__ is comparable across the two attempts.
- Is it correct to believe that the set of estimates with a lower lp__ is necessarily the set ācloserā to the true parameters (rho=5.5, alpha=3, sigma=2)?
-
When the amount of data doubles/triples from n=3 to n=6/n=9, the execution time (for user) jumps to a multiple of ~6 or ~17 times, respectively.
- Is this exponential growth in execution time normally expected? To what extent it could be significantly reduced by smarter coding of the Stan model or some implementation options not yet used by me as a beginner?
-
Suppose I could use whatever smarter options recommended in the replies to this post.
- How large could the scale of such a model be? If not for over 150,000 observations (8000+ firms, times 20+ years), is there any work-around to overcome the scalability limitation (eg, instead of using all the observations, merely study the distribution of estimates from subsamples of manageable sizes?
-
If the model could be scaled up for a large amount of data (much beyond the ~1000 observations I tried),
- is it reasonable to expect that the estimates would become highly stable, unlike the small scales I tried?
Would appreciate any advice and pointers to related beginner-friendly documentation, discussion, blog posts, etc.
Many thanks.
============================
Attached below are the Stan models:
fit1 <- stan(file = "misreport.stan", data = sim_EM_data,
control = list(adapt_delta = 0.90), iter = 300)
misreport.stan:
data {
int<lower=0> N; // number of observations (for different CEOs/firms)
int<lower=0> K; // number of coefficents (/predictors)
vector[N] r; // dependent variable (reported company earnings, scaled by Rev, total revenue)
matrix[N,K] Z; // predictor variables (main components of earnings, all scaled by Rev:
// COGS2Rev, SGA2Rev, RnD2Rev with Rev2Rev = unit vector as the intercept)
int<lower=0,upper=1> M[N]; // M = 1 for the decision to misreport; = 0 if report honestly
// forecasts
int<lower=0> N_new; // number of predictions
matrix[N_new,K] Z_new; //
// GP data
int<lower=0> G; // number of observed points from the realized GP function
real t[G]; // ie, range of function input (ie, x of f(x)) to the realized GP function f(x)
vector[G] m; // misreporting extent (ie, the function value y = f(x) of the realized GP function)
}
transformed data {
}
parameters {
vector[K] b; // coefficients of the predictor variables Z
real<lower=0> sd_r; // sd of reported earnings
real<lower=0,upper=1> integrity; // integrity level of the society affecting the decision to misreport or not
// est'd coefficients for the GP kernel parameters
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
transformed parameters {
real ilogodds; // 'log odds' of the society integrity level
ilogodds = logit(integrity);
}
model {
matrix[G, G] Kcov;
matrix[G, G] Lcov;
r ~ normal(Z*b, sd_r);
M ~ bernoulli_logit(ilogodds);
rho ~ inv_gamma(5,20);#30); // (5,5) (8.91924, 34.5805); rho_T = 5.5
alpha ~ gamma(1.5,0.5); // gamma(3,1) normal(0, 3); // (0, 2) (0,1) alpha_T = 3
sigma ~ gamma(1,0.5); // gamma(2,1) normal(0, 2); // (0, 1.5) (0,1) sigma_T = 2
// https://mc-stan.org/docs/2_18/stan-users-guide/fit-gp-section.html
// https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
Kcov = cov_exp_quad(t, alpha, rho) + diag_matrix(rep_vector(square(sigma), G));
// ^
Lcov = cholesky_decompose(Kcov);
m ~ multi_normal_cholesky(rep_vector(0, G), Lcov); // rep_vector(0, G) is the zero mean assumed for the GP
}
generated quantities {
vector[N_new] r_new;
for (n in 1:N_new)
r_new[n] = normal_rng(Z_new[n] * b, sd_r);
}
=======================================
sim_GP_data <- list(rho=rho_T,
alpha=alpha_T,
sigma=sigma_T,
N=G_total,
x=t_total)
simu_fit <- stan("simu_GP.stan", data=sim_GP_data, iter=1,
chains=1, algorithm="Fixed_param")
simu_GP.stan:
data {
int<lower=1> N;
real x[N];
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
transformed data {
/* Note: added a nugget or jitter of 10^ā10 so the marginal covariance matrix before computing its Cholesky decomposition in order to stabilize the numerical calculations. Often the nugget is taking to be square root of the floating point precision, which would be 10ā8 for double precision calculations.
https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html
*/
matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
matrix[N, N] L_cov = cholesky_decompose(cov);
}
parameters {}
model {}
generated quantities {
vector[N] f = multi_normal_cholesky_rng(rep_vector(0, N), L_cov);
vector[N] y;
for (n in 1:N)
y[n] = normal_rng(f[n], sigma);
}