Scalability and Estimate Stability questions

I have some questions about developing models with scalability and estimate stability in mind. I am a beginner building my first Stan model. My background is similar to an economics PhD, though not of the highly mathematical / econometrics type. My intent is eventually to estimate the model on large datasets of company data (e.g., those on the WRDS platform). The data would cover over 8,000 firms in the cross section of each quarter/year, and each firm has a series ranging from several years to over 30 years (times 4 for quarters), so the pooled sample can exceed 150,000 observations.

Given the nature of the data, it seems natural to think in the direction of hierarchical partial pooling, like this post by @Bob_Carpenter . The large number of firms and the partial pooling mean that scalability and the high dimensionality of the parameter space are issues to keep in mind while developing the model.

The model will have several components: for example, a linear regression explaining the reported company earnings (r) with a potentially large list of predictor variables, and a second component on misreporting, covering both the decision to misreport or not (M = 1 or 0) and the extent of misreporting (m). At this early stage of developing the model, I have not yet connected these variables in a meaningful way. They are simply disconnected stochastic variables, and my focus is on making sure that, overly simplified as it is, the model can scale up (with reliable estimates) as the number of observations grows.

In this model setup, estimating the parameters for r and M reliably is not a problem. For m, I assume it follows a Gaussian process with an exponentiated quadratic kernel, like the example by @betanalpha and this one in the Stan documentation. The idea is that the institutional setting of a firm largely constrains the maximum extent to which the firm can misreport earnings at each point in time of an accounting period, and if the firm decides to misreport (based on its cost-benefit calculation at that moment), it goes all the way to that maximum. Since the institutional constraints on m at different time points are likely to be more similar when the time points are closer, a GP is a natural assumption.

I tried estimating the model on a cluster for different numbers of observations (~300 to ~1000), with the results summarized in the table further below. The sessionInfo() is as follows:

 R version 3.5.1 (2018-07-02)  
 Platform: x86_64-redhat-linux-gnu (64-bit)   
 Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)  

Originally I worked on my laptop with the default 2000 iterations per chain; the run time was over 2 hours when I tried ~500 observations. So I reduced this to 300 iterations per chain and moved the execution to a cluster. check_all_diagnostics() from the stan_utility.R by @betanalpha did not complain despite the much smaller number of iterations per chain:

 n_eff / iter looks reasonable for all parameters  
 Rhat looks reasonable for all parameters  
 0 of 600 iterations ended with a divergence (0%)  
 0 of 600 iterations saturated the maximum tree depth of 10 (0%)  
 E-BFMI indicated no pathological behavior
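
For reference, the check above was invoked roughly like this (stan_utility.R is the file from the linked case studies):

source("stan_utility.R")   # provides check_all_diagnostics()
check_all_diagnostics(fit1)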
| n (of 99 obs) | total obs | user (s) | system (s) | elapsed (s) | lp__ | rho est. (true 5.5) | rho n_eff | rho Rhat | alpha est. (true 3) | alpha n_eff | alpha Rhat | sigma est. (true 2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 297 | 283.122 | 0.553 | 108.181 | -81.181 | 5.542 | 286 | 1.007 | 1.545 | 246 | 1.022 | 1.879 |
| 3 (2nd try, 1 core only) | 297 | 268.678 | 0.042 | 268.701 | -145.994 | 4.823 | 193 | 1.029 | 3.598 | 250 | 1.029 | 2.176 |
| 4 | 396 | 493.266 | 0.302 | 199.813 | -157.424 | 4.286 | 227 | 1.008 | 2.952 | 238 | 1.007 | 2.082 |
| 4 (2nd try) | 396 | 566.725 | 0.38 | 223.265 | -186.286 | 4.818 | 433 | 0.999 | 2.245 | 234 | 1.006 | 2.12 |
| 5 | 495 | 1131.484 | 1.287 | 405.866 | -120.645 | 5.439 | 285 | 1.004 | 2.813 | 225 | 1.017 | 1.88 |
| 5 (2nd try) | 495 | 893.744 | 0.601 | 325.718 | -197.346 | 5.855 | 379 | 1.002 | 3.977 | 253 | 1.002 | 1.983 |
| 6 | 594 | 1746.771 | 1.021 | 664.03 | -199.892 | 4.318 | 226 | 1.022 | 1.659 | 343 | 1.003 | 1.996 |
| 7 | 693 | 2438.372 | 3.689 | 892.547 | -210.669 | 5.336 | 218 | 1.013 | 4.853 | 151 | 1.017 | 1.992 |
| 8 | 792 | 3602.494 | 1.562 | 1484.676 | -278.444 | 4.863 | 496 | 1.005 | 2.188 | 417 | 1.001 | 2.075 |
| 9 | 891 | 4889.948 | 2.702 | 1885.201 | -271.386 | 6.737 | 264 | 1.017 | 3.049 | 189 | 1.013 | 2.052 |
| 10 | 990 | 6110.68 | 0.895 | 2194.017 | -329.724 | 3.607 | 332 | 1.01 | 1.181 | 194 | 1.029 | 2.028 |

I do not report the n_eff and Rhat for sigma here because its estimate seems to be very stable. I am puzzled by the lack of estimate stability for rho and especially alpha as the number of observations grows: the estimates do not clearly move closer to the true parameters (rho=5.5, alpha=3, sigma=2). I used the following priors:

  rho ~ inv_gamma(5,20)  
  alpha ~ gamma(1.5,0.5)  
  sigma ~ gamma(1,0.5)  
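
As a quick sanity check (done separately in R, not part of the Stan fit), draws from these priors are roughly centered on the true values:

# rough prior check: compare prior draws with rho = 5.5, alpha = 3, sigma = 2
set.seed(1)
quantile(1 / rgamma(1e5, shape = 5, rate = 20), c(0.05, 0.5, 0.95))  # rho ~ inv_gamma(5, 20), prior mean 5
quantile(rgamma(1e5, shape = 1.5, rate = 0.5), c(0.05, 0.5, 0.95))   # alpha ~ gamma(1.5, 0.5), prior mean 3
quantile(rgamma(1e5, shape = 1, rate = 0.5), c(0.05, 0.5, 0.95))     # sigma ~ gamma(1, 0.5), prior mean 2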

My questions are:

  1. For the n=3 case (i.e., with 99*3 = ~300 observations), I made two attempts; similarly for n=4 and n=5. It seems to me that, because both the model and the number of observations are identical, the log density lp__ should be comparable across the two attempts.

    • Is it correct to believe that the set of estimates with a lower lp__ is necessarily the set "closer" to the true parameters (rho=5.5, alpha=3, sigma=2)?
  2. When the amount of data doubles/triples from n=3 to n=6/n=9, the user execution time grows by a factor of ~6 or ~17, respectively.

    • Is this super-linear growth in execution time normally expected? To what extent could it be significantly reduced by smarter coding of the Stan model, or by implementation options I have not yet used as a beginner?
  3. Suppose I adopt whatever smarter options are recommended in the replies to this post.

    • How large could such a model be scaled? If not to over 150,000 observations (8000+ firms times 20+ years), is there any workaround to overcome the scalability limitation (e.g., instead of using all the observations, merely studying the distribution of estimates from subsamples of manageable size; see the rough sketch after this list)?
  4. If the model could be scaled up to a large amount of data (much beyond the ~1000 observations I tried),

    • is it reasonable to expect that the estimates would become highly stable, unlike at the small scales I tried?
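
To make the subsample idea in question 3 concrete, here is the kind of thing I have in mind (a rough, untested sketch; it assumes sim_EM_data carries the elements named in the data block attached below, with G = N as in the current toy model):

library(rstan)

# fit the same model on one random subsample and return the posterior means
# of the GP hyperparameters
fit_subsample <- function(idx) {
  d <- sim_EM_data
  d$N <- length(idx); d$r <- d$r[idx]; d$Z <- d$Z[idx, , drop = FALSE]
  d$M <- d$M[idx]
  d$G <- length(idx); d$t <- d$t[idx]; d$m <- d$m[idx]   # G = N here, so the same indices apply
  fit <- stan(file = "misreport.stan", data = d,
              control = list(adapt_delta = 0.90), iter = 300)
  summary(fit, pars = c("rho", "alpha", "sigma"))$summary[, "mean"]
}

# spread of the estimates across 10 subsamples of ~500 observations
set.seed(1)
est <- replicate(10, fit_subsample(sample(sim_EM_data$N, 500)))
apply(est, 1, quantile, probs = c(0.05, 0.5, 0.95))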

Would appreciate any advice and pointers to related beginner-friendly documentation, discussion, blog posts, etc.

Many thanks.

============================
Attached below are the Stan models:

fit1 <- stan(file = "misreport.stan", data = sim_EM_data, 
             control = list(adapt_delta = 0.90), iter = 300)

misreport.stan:

data {
  int<lower=0> N; // number of observations (for different CEOs/firms)
  int<lower=0> K; // number of coefficients (/predictors)
  vector[N] r; // dependent variable (reported company earnings, scaled by Rev, total revenue)
  matrix[N,K] Z; // predictor variables (main components of earnings, all scaled by Rev: 
                      // COGS2Rev, SGA2Rev, RnD2Rev with Rev2Rev = unit vector as the intercept)
  int<lower=0,upper=1> M[N];   // M = 1 for the decision to misreport; = 0 if report honestly  

// forecasts  
  int<lower=0> N_new; // number of predictions
  matrix[N_new,K] Z_new; // 
  
// GP data
  int<lower=0> G; // number of observed points from the realized GP function
  real t[G];     // ie, range of function input (ie, x of f(x)) to the realized GP function f(x)
  vector[G] m;  // misreporting extent (ie, the function value y = f(x) of the realized GP function)
}
transformed data {
}
parameters {
  vector[K] b; // coefficients of the predictor variables Z
  real<lower=0> sd_r; // sd of reported earnings
  real<lower=0,upper=1> integrity;     // integrity level of the society affecting the decision to misreport or not

// est'd coefficients for the GP kernel parameters
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
transformed parameters {
  real ilogodds;     // 'log odds' of the society integrity level
  ilogodds = logit(integrity);   
}
model {
  matrix[G, G] Kcov;
  matrix[G, G] Lcov;

  r ~ normal(Z*b, sd_r);
  M ~ bernoulli_logit(ilogodds); 

  rho ~ inv_gamma(5, 20);    // also tried: inv_gamma(5, 30), (5, 5), (8.91924, 34.5805); true rho_T = 5.5
  alpha ~ gamma(1.5, 0.5);   // also tried: gamma(3, 1), normal(0, 3), (0, 2), (0, 1); true alpha_T = 3
  sigma ~ gamma(1, 0.5);     // also tried: gamma(2, 1), normal(0, 2), (0, 1.5), (0, 1); true sigma_T = 2

// https://mc-stan.org/docs/2_18/stan-users-guide/fit-gp-section.html
// https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
  Kcov = cov_exp_quad(t, alpha, rho) + diag_matrix(rep_vector(square(sigma), G));  // observation noise variance sigma^2 on the diagonal
  Lcov = cholesky_decompose(Kcov);

  m ~ multi_normal_cholesky(rep_vector(0, G), Lcov);   // rep_vector(0, G) is the zero mean assumed for the GP
}
generated quantities {
  vector[N_new] r_new;
  for (n in 1:N_new)
    r_new[n] = normal_rng(Z_new[n] * b, sd_r);
}

=======================================

sim_GP_data <- list(rho=rho_T, 
                  alpha=alpha_T, 
                  sigma=sigma_T, 
                  N=G_total, 
                  x=t_total)

simu_fit <- stan("simu_GP.stan", data=sim_GP_data, iter=1,
            chains=1, algorithm="Fixed_param")  

simu_GP.stan:

data {
  int<lower=1> N;
  real x[N];

  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
transformed data {
/* Note: a nugget or jitter of 10^-10 is added to the marginal covariance matrix before computing its Cholesky decomposition, in order to stabilize the numerical calculations. Often the nugget is taken to be the square root of the floating point precision, which would be 10^-8 for double precision calculations.
 https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html
*/
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
  matrix[N, N] L_cov = cholesky_decompose(cov);
}
parameters {}
model {}
generated quantities {
  vector[N] f = multi_normal_cholesky_rng(rep_vector(0, N), L_cov);
  vector[N] y;
  for (n in 1:N)
    y[n] = normal_rng(f[n], sigma);
}

The quick answer is no, don't read too much into optimums. Here's more if you want: Typical Sets and the Curse of Dimensionality

Have you looked at posterior intervals for rho, alpha, and sigma? I'd only be alarmed by the variation if the uncertainty in those variables was really really small (and consistently did not include the parameters you used to generate the data with). Also, an Rhat like 1.029 is suspicious. You want these to be really close to 1.0 (1.01 is proposed as a rule-of-thumb in the new Rhat stuff: Maybe it's time to let the old ways die; or We broke R-hat so now we have to fix it. | Statistical Modeling, Causal Inference, and Social Science)
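
To look at the intervals, something like this (assuming your fit object is fit1, as in the R call you attached):

# posterior intervals for the GP hyperparameters
print(fit1, pars = c("rho", "alpha", "sigma"), probs = c(0.05, 0.5, 0.95))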

Is G the number of data points (N)? Or to be clearer, is the number of points in the Gaussian process scaling with the number of observations?

If so the Cholesky decomposition is going to be a problem. That is expensive (assuming the matrix is GxG, it scales like G^3).
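
Just to illustrate that scaling with a standalone R sketch (nothing to do with your model; any squared-exponential kernel matrix will do):

# time a dense Cholesky factorization as G grows; the cost is roughly O(G^3),
# so doubling G costs about 8x
for (G in c(250, 500, 1000, 2000)) {
  x <- seq_len(G)
  K <- exp(-outer(x, x, "-")^2 / (2 * 50^2)) + diag(1e-6, G)  # SE kernel + jitter
  cat(sprintf("G = %4d: %.3f s\n", G, system.time(chol(K))["elapsed"]))
}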

The as N → infinity sorta stuff usually assumes a fixed number of parameters. If the number of parameters scales with the data, then you probably don't have any guarantees.

This sounds a lot like something you wanna write out as some sort of regression model, especially considering you have a ton of data. Linear models n' such are the things that are gonna scale. Have you seen this book: http://www.stat.columbia.edu/~gelman/arm/ ? It's a lot of creative regression stuff. The models are available as Stan models too: ARM Models · stan-dev/example-models Wiki · GitHub .

Thanks for the information, @bbbales2

In this overly simplified version, yes, G = N. In the future, G << N (perhaps with G/N extremely close to, but bigger than, zero). G is the number of observed misreporting extents, which are sparse (e.g., facts revealed in accounting scandal investigations). Eventually, I hope to rewrite the Stan model for the G << N case, with (N - G) additional latent variables to be estimated as well (see the rough sketch below).
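
Very roughly, the GP part would then look something like this (just an untested sketch of the idea, patterned after the latent-variable GP example in the Stan documentation; the regression part and the link to M are omitted):

data {
  int<lower=1> N;            // all firm-time points
  int<lower=1, upper=N> G;   // points with an observed misreporting extent
  real t[N];                 // inputs for all N points
  vector[G] m;               // observed extents (assumed to be the first G points)
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;             // N standard-normal latent variables
}
transformed parameters {
  vector[N] f;               // latent GP function values at all N points
  {
    matrix[N, N] Lcov
      = cholesky_decompose(cov_exp_quad(t, alpha, rho)
                           + diag_matrix(rep_vector(1e-9, N)));
    f = Lcov * eta;          // non-centered parameterization
  }
}
model {
  rho ~ inv_gamma(5, 20);
  alpha ~ gamma(1.5, 0.5);
  sigma ~ gamma(1, 0.5);
  eta ~ normal(0, 1);
  m ~ normal(f[1:G], sigma); // only the G observed points enter the likelihood
}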

Does this sound sensible to you? If so, would G << N (but with N - G latent variables) hurt or help scalability?

So if I understand correctly, I need something like "the number of parameters" growing much more slowly than N, so that ("the number of parameters" / N) goes to zero as N goes to infinity.

If so, in the case of G << N with (N - G) additional latent variables, would the additional latent variables also act like many new parameters, preventing estimate stability even when N becomes huge?

A related question: could the posterior predictive distribution be reasonably useful (i.e., with stable estimates of the GP parameters) even though the estimates of the latent variables (i.e., the misreporting that was never revealed) are not stable? (Estimates of the latent variables are not required to forecast the future misreporting extent well; only the estimates of the GP parameters matter, right?)

Thanks for suggesting the ARM models. Will surely look in that direction as well.

I guess I was referencing this, which is the extent of my mathematical knowledge on the issue: Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert) | Statistical Modeling, Causal Inference, and Social Science .

The more practical answer is, hierarchical models mostly end up with as many parameters as data points (or something like that). The hierarchical part of the hierarchical model is what keeps the estimates of all those parameters stable (or from doing crazy stuff).

I've never dealt with something this size, but my prior is that hundreds of parameters is hard, and I'm not sure I've really ever gone into thousands (though people do). If you're dealing with data like this, see where you can get with simple regressions. Maybe in the end stack on a GP if there's a place you think it'd be useful. The GP would be hard to scale.
