Scalability and Estimate Stability questions

I have some questions about developing models with scalability and estimate stability in mind. I am a beginner building my first Stan model. My background is similar to an economics PhD, though not of the highly mathematical / econometrics type. My intent is eventually to estimate the model on large datasets of company data (e.g., those on the WRDS platform). The data would cover over 8,000 firms in the cross section of each quarter/year, and each firm has a series ranging from several years to over 30 years (times 4 for quarters), so the pooled sample can exceed 150,000 observations.

Given the nature of the data, it seems natural to think in the direction of hierarchical partial pooling, like this post by @Bob_Carpenter . The large number of firms and the partial pooling mean that scalability and the high dimensionality of the parameter space are issues to keep in mind while developing the model.

The model will have several components: for example, a linear regression explaining the reported company earnings (r) with a potentially large list of predictor variables, and a second component on misreporting, covering both the decision to misreport or not (M = 1 or 0) and the extent of misreporting (m). At this early stage of developing the model, I have not yet connected these variables in a meaningful way. They are simply disconnected stochastic variables, and my focus is on making sure that, overly simplified as it is, the model can scale up (with reliable estimates) as the number of observations grows.

In this model setup, estimating the parameters for r and M reliably is not a problem. For m, I assume it follows a Gaussian process with an exponentiated quadratic kernel, like the example by @betanalpha and this one in the Stan documentation. The idea is that the institutional setting of a firm largely constrains the maximum extent to which the firm can misreport earnings at each point in time of an accounting period, and if the firm decides to misreport (based on its cost-benefit calculation at that moment), it goes all the way to that maximum. Since the institutional constraints on m at different time points are likely to be more similar when the time points are closer, a GP is a natural assumption.

I tried estimating the model on a cluster for different numbers of observations (~300 to ~1000), with the results summarized in the table further below. The sessionInfo() is as follows:

 R version 3.5.1 (2018-07-02)  
 Platform: x86_64-redhat-linux-gnu (64-bit)   
 Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)  

Originally I worked on my laptop with the default 2000 iterations per chain; the run time was over 2 hours when I tried ~500 observations. So I reduced this to 300 iterations per chain and moved the execution to a cluster. check_all_diagnostics() from the stan_utility.R by @betanalpha did not complain despite the much smaller number of iterations per chain:

 n_eff / iter looks reasonable for all parameters  
 Rhat looks reasonable for all parameters  
 0 of 600 iterations ended with a divergence (0%)  
 0 of 600 iterations saturated the maximum tree depth of 10 (0%)  
 E-BFMI indicated no pathological behavior
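
For reference, the check above was invoked roughly like this (stan_utility.R is the file from the linked case studies):

source("stan_utility.R")   # provides check_all_diagnostics()
check_all_diagnostics(fit1)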
| n (of 99 obs) | total obs | user (s) | system (s) | elapsed (s) | lp__ | rho est. (true 5.5) | rho n_eff | rho Rhat | alpha est. (true 3) | alpha n_eff | alpha Rhat | sigma est. (true 2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 297 | 283.122 | 0.553 | 108.181 | -81.181 | 5.542 | 286 | 1.007 | 1.545 | 246 | 1.022 | 1.879 |
| 3 (2nd try, 1 core only) | 297 | 268.678 | 0.042 | 268.701 | -145.994 | 4.823 | 193 | 1.029 | 3.598 | 250 | 1.029 | 2.176 |
| 4 | 396 | 493.266 | 0.302 | 199.813 | -157.424 | 4.286 | 227 | 1.008 | 2.952 | 238 | 1.007 | 2.082 |
| 4 (2nd try) | 396 | 566.725 | 0.38 | 223.265 | -186.286 | 4.818 | 433 | 0.999 | 2.245 | 234 | 1.006 | 2.12 |
| 5 | 495 | 1131.484 | 1.287 | 405.866 | -120.645 | 5.439 | 285 | 1.004 | 2.813 | 225 | 1.017 | 1.88 |
| 5 (2nd try) | 495 | 893.744 | 0.601 | 325.718 | -197.346 | 5.855 | 379 | 1.002 | 3.977 | 253 | 1.002 | 1.983 |
| 6 | 594 | 1746.771 | 1.021 | 664.03 | -199.892 | 4.318 | 226 | 1.022 | 1.659 | 343 | 1.003 | 1.996 |
| 7 | 693 | 2438.372 | 3.689 | 892.547 | -210.669 | 5.336 | 218 | 1.013 | 4.853 | 151 | 1.017 | 1.992 |
| 8 | 792 | 3602.494 | 1.562 | 1484.676 | -278.444 | 4.863 | 496 | 1.005 | 2.188 | 417 | 1.001 | 2.075 |
| 9 | 891 | 4889.948 | 2.702 | 1885.201 | -271.386 | 6.737 | 264 | 1.017 | 3.049 | 189 | 1.013 | 2.052 |
| 10 | 990 | 6110.68 | 0.895 | 2194.017 | -329.724 | 3.607 | 332 | 1.01 | 1.181 | 194 | 1.029 | 2.028 |

I do not report the n_eff and Rhat for sigma here because its estimate seems to be very stable. I am puzzled by the lack of estimate stability for rho and especially alpha as the number of observations grows: the estimates do not clearly move closer to the true parameters (rho=5.5, alpha=3, sigma=2). I used the following priors:

  rho ~ inv_gamma(5,20)  
  alpha ~ gamma(1.5,0.5)  
  sigma ~ gamma(1,0.5)  
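
As a quick sanity check (done separately in R, not part of the Stan fit), draws from these priors are roughly centered on the true values:

# rough prior check: compare prior draws with rho = 5.5, alpha = 3, sigma = 2
set.seed(1)
quantile(1 / rgamma(1e5, shape = 5, rate = 20), c(0.05, 0.5, 0.95))  # rho ~ inv_gamma(5, 20), prior mean 5
quantile(rgamma(1e5, shape = 1.5, rate = 0.5), c(0.05, 0.5, 0.95))   # alpha ~ gamma(1.5, 0.5), prior mean 3
quantile(rgamma(1e5, shape = 1, rate = 0.5), c(0.05, 0.5, 0.95))     # sigma ~ gamma(1, 0.5), prior mean 2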

My questions are:

  1. For the n=3 case (i.e., with 99*3 = ~300 observations), I made two attempts; similarly for n=4 and n=5. It seems to me that, because both the model and the number of observations are identical, the log density lp__ should be comparable across the two attempts.

    • Is it correct to believe that the set of estimates with a lower lp__ is necessarily the set "closer" to the true parameters (rho=5.5, alpha=3, sigma=2)?
  2. When the amount of data doubles/triples from n=3 to n=6/n=9, the user execution time grows by a factor of ~6 or ~17, respectively.

    • Is this super-linear growth in execution time normally expected? To what extent could it be significantly reduced by smarter coding of the Stan model, or by implementation options I have not yet used as a beginner?
  3. Suppose I adopt whatever smarter options are recommended in the replies to this post.

    • How large could such a model be scaled? If not to over 150,000 observations (8000+ firms times 20+ years), is there any workaround to overcome the scalability limitation (e.g., instead of using all the observations, merely studying the distribution of estimates from subsamples of manageable size; see the rough sketch after this list)?
  4. If the model could be scaled up to a large amount of data (much beyond the ~1000 observations I tried),

    • is it reasonable to expect that the estimates would become highly stable, unlike at the small scales I tried?
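
To make the subsample idea in question 3 concrete, here is the kind of thing I have in mind (a rough, untested sketch; it assumes sim_EM_data carries the elements named in the data block attached below, with G = N as in the current toy model):

library(rstan)

# fit the same model on one random subsample and return the posterior means
# of the GP hyperparameters
fit_subsample <- function(idx) {
  d <- sim_EM_data
  d$N <- length(idx); d$r <- d$r[idx]; d$Z <- d$Z[idx, , drop = FALSE]
  d$M <- d$M[idx]
  d$G <- length(idx); d$t <- d$t[idx]; d$m <- d$m[idx]   # G = N here, so the same indices apply
  fit <- stan(file = "misreport.stan", data = d,
              control = list(adapt_delta = 0.90), iter = 300)
  summary(fit, pars = c("rho", "alpha", "sigma"))$summary[, "mean"]
}

# spread of the estimates across 10 subsamples of ~500 observations
set.seed(1)
est <- replicate(10, fit_subsample(sample(sim_EM_data$N, 500)))
apply(est, 1, quantile, probs = c(0.05, 0.5, 0.95))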

Would appreciate any advice and pointers to related beginner-friendly documentation, discussion, blog posts, etc.

Many thanks.

============================
Attached below are the Stan models:

fit1 <- stan(file = "misreport.stan", data = sim_EM_data, 
             control = list(adapt_delta = 0.90), iter = 300)

misreport.stan:

data {
  int<lower=0> N; // number of observations (for different CEOs/firms)
  int<lower=0> K; // number of coefficients (/predictors)
  vector[N] r; // dependent variable (reported company earnings, scaled by Rev, total revenue)
  matrix[N,K] Z; // predictor variables (main components of earnings, all scaled by Rev: 
                      // COGS2Rev, SGA2Rev, RnD2Rev with Rev2Rev = unit vector as the intercept)
  int<lower=0,upper=1> M[N];   // M = 1 for the decision to misreport; = 0 if report honestly  

// forecasts  
  int<lower=0> N_new; // number of predictions
  matrix[N_new,K] Z_new; // 
  
// GP data
  int<lower=0> G; // number of observed points from the realized GP function
  real t[G];     // ie, range of function input (ie, x of f(x)) to the realized GP function f(x)
  vector[G] m;  // misreporting extent (ie, the function value y = f(x) of the realized GP function)
}
transformed data {
}
parameters {
  vector[K] b; // coefficients of the predictor variables Z
  real<lower=0> sd_r; // sd of reported earnings
  real<lower=0,upper=1> integrity;     // integrity level of the society affecting the decision to misreport or not

// est'd coefficients for the GP kernel parameters
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
transformed parameters {
  real ilogodds;     // 'log odds' of the society integrity level
  ilogodds = logit(integrity);   
}
model {
  matrix[G, G] Kcov;
  matrix[G, G] Lcov;

  r ~ normal(Z*b, sd_r);
  M ~ bernoulli_logit(ilogodds); 

  rho ~ inv_gamma(5, 20);    // also tried: inv_gamma(5, 30), (5, 5), (8.91924, 34.5805); true rho_T = 5.5
  alpha ~ gamma(1.5, 0.5);   // also tried: gamma(3, 1), normal(0, 3), (0, 2), (0, 1); true alpha_T = 3
  sigma ~ gamma(1, 0.5);     // also tried: gamma(2, 1), normal(0, 2), (0, 1.5), (0, 1); true sigma_T = 2

// https://mc-stan.org/docs/2_18/stan-users-guide/fit-gp-section.html
// https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
  Kcov = cov_exp_quad(t, alpha, rho) + diag_matrix(rep_vector(square(sigma), G));  // observation noise variance sigma^2 on the diagonal
  Lcov = cholesky_decompose(Kcov);

  m ~ multi_normal_cholesky(rep_vector(0, G), Lcov);   // rep_vector(0, G) is the zero mean assumed for the GP
}
generated quantities {
  vector[N_new] r_new;
  for (n in 1:N_new)
    r_new[n] = normal_rng(Z_new[n] * b, sd_r);
}

=======================================

sim_GP_data <- list(rho=rho_T, 
                  alpha=alpha_T, 
                  sigma=sigma_T, 
                  N=G_total, 
                  x=t_total)

simu_fit <- stan("simu_GP.stan", data=sim_GP_data, iter=1,
            chains=1, algorithm="Fixed_param")  

simu_GP.stan:

data {
  int<lower=1> N;
  real x[N];

  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
transformed data {
/* Note: a nugget or jitter of 10^-10 is added to the marginal covariance matrix before computing its Cholesky decomposition, in order to stabilize the numerical calculations. Often the nugget is taken to be the square root of the floating point precision, which would be 10^-8 for double precision calculations.
 https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html
*/
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(1e-10, N));
  matrix[N, N] L_cov = cholesky_decompose(cov);
}
parameters {}
model {}
generated quantities {
  vector[N] f = multi_normal_cholesky_rng(rep_vector(0, N), L_cov);
  vector[N] y;
  for (n in 1:N)
    y[n] = normal_rng(f[n], sigma);
}

The quick answer is no, don't read too much into optimums. Here's more if you want: Typical Sets and the Curse of Dimensionality

Have you looked at posterior intervals for rho, alpha, and sigma? I'd only be alarmed by the variation if the uncertainty in those variables was really really small (and consistently did not include the parameters you used to generate the data with). Also, an Rhat like 1.029 is suspicious. You want these to be really close to 1.0 (1.01 is proposed as a rule-of-thumb in the new Rhat stuff: Maybe it's time to let the old ways die; or We broke R-hat so now we have to fix it. | Statistical Modeling, Causal Inference, and Social Science)
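
To look at the intervals, something like this (assuming your fit object is fit1, as in the R call you attached):

# posterior intervals for the GP hyperparameters
print(fit1, pars = c("rho", "alpha", "sigma"), probs = c(0.05, 0.5, 0.95))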

Is G the number of data points (N)? Or to be clearer, is the number of points in the Gaussian process scaling with the number of observations?

If so the Cholesky decomposition is going to be a problem. That is expensive (assuming the matrix is GxG, it scales like G^3).
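
Just to illustrate that scaling with a standalone R sketch (nothing to do with your model; any squared-exponential kernel matrix will do):

# time a dense Cholesky factorization as G grows; the cost is roughly O(G^3),
# so doubling G costs about 8x
for (G in c(250, 500, 1000, 2000)) {
  x <- seq_len(G)
  K <- exp(-outer(x, x, "-")^2 / (2 * 50^2)) + diag(1e-6, G)  # SE kernel + jitter
  cat(sprintf("G = %4d: %.3f s\n", G, system.time(chol(K))["elapsed"]))
}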

The as N → infinity sorta stuff usually assumes a fixed number of parameters. If the number of parameters scales with the data, then you probably don't have any guarantees.

This sounds a lot like something you wanna write out as some sort of regression model, especially considering you have a ton of data. Linear models n' such are the things that are gonna scale. Have you seen this book: http://www.stat.columbia.edu/~gelman/arm/ ? It's a lot of creative regression stuff. The models are available as Stan models too: ARM Models · stan-dev/example-models Wiki · GitHub .

Thanks for the information, @bbbales2

In this overly simplified version, yes, G = N. In the future, G << N (perhaps with G/N extremely close to, but bigger than, zero). G is the number of observed misreporting extents, which are sparse (e.g., facts revealed in accounting scandal investigations). Eventually, I hope to rewrite the Stan model for the G << N case, with (N - G) additional latent variables to be estimated as well (see the rough sketch below).
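
Very roughly, the GP part would then look something like this (just an untested sketch of the idea, patterned after the latent-variable GP example in the Stan documentation; the regression part and the link to M are omitted):

data {
  int<lower=1> N;            // all firm-time points
  int<lower=1, upper=N> G;   // points with an observed misreporting extent
  real t[N];                 // inputs for all N points
  vector[G] m;               // observed extents (assumed to be the first G points)
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;             // N standard-normal latent variables
}
transformed parameters {
  vector[N] f;               // latent GP function values at all N points
  {
    matrix[N, N] Lcov
      = cholesky_decompose(cov_exp_quad(t, alpha, rho)
                           + diag_matrix(rep_vector(1e-9, N)));
    f = Lcov * eta;          // non-centered parameterization
  }
}
model {
  rho ~ inv_gamma(5, 20);
  alpha ~ gamma(1.5, 0.5);
  sigma ~ gamma(1, 0.5);
  eta ~ normal(0, 1);
  m ~ normal(f[1:G], sigma); // only the G observed points enter the likelihood
}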

Does this sound sensible to you? If so, would G << N (but with N - G latent variables) hurt or help scalability?

So if I understand correctly, I need something like "the number of parameters" growing much more slowly than N, so that ("the number of parameters" / N) goes to zero as N goes to infinity.

If so, in the case of G << N with (N - G) additional latent variables, would the additional latent variables also act like many new parameters, preventing estimate stability even when N becomes huge?

A related question: could the posterior predictive distribution be reasonably useful (i.e., with stable estimates of the GP parameters) even though the estimates of the latent variables (i.e., the misreporting that was never revealed) are not stable? (Estimates of the latent variables are not required to forecast the future misreporting extent well; only the estimates of the GP parameters matter, right?)

Thanks for suggesting the ARM models. Will surely look in that direction as well.

I guess I was referencing this, which is the extent of my mathematical knowledge on the issue: Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert) | Statistical Modeling, Causal Inference, and Social Science .

The more practical answer is, hierarchical models mostly end up with as many parameters as data points (or something like that). The hierarchical part of the hierarchical model is what keeps the estimates of all those parameters stable (or from doing crazy stuff).

I've never dealt with something this size, but my prior is that hundreds of parameters is hard, and I'm not sure I've really ever gone into thousands (though people do). If you're dealing with data like this, see where you can get with simple regressions. Maybe in the end stack on a GP if there's a place you think it'd be useful. The GP would be hard to scale.
