Optimization rules

Dear all,

Story:
I am learning about the factor analysis (FA) model in Stan, based on Rick Farouni's FA post: FA-Farouni. I set out to compare these Stan results with those of the FA model in ASreml, which uses an Average Information (AI) algorithm to optimize the likelihood (REML-AI-algorithm, p. 50). ASreml gives a much better approximation of the covariance matrix (\Sigma = \Lambda \Lambda' + \Psi) than I can get with Stan, while the \Lambda and \Psi estimates of ASreml and Stan are very comparable. It seems that the minor differences in the mean estimates of \Lambda and \Psi have a larger effect on \Sigma than I would like.

Question:
Is it possible to change a Stan setting that makes the optimization more precise? If I were to program a Newton optimization myself, I could set the convergence tolerance to 0.1 or to 0.0001; the latter would take longer but would be more precise (e.g. 1.2 vs. 1.2341). I wonder if there is something similar when fitting a Stan model.
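To illustrate the kind of setting I mean, here is a toy sketch with R's stats::optim (nothing to do with Stan; the objective and numbers are invented):

# toy objective whose minimum lies at 1.2341
fn <- function(x) (x - 1.2341)^2
# the convergence tolerance (reltol) trades run time for precision
optim(par = 0, fn = fn, method = "BFGS", control = list(reltol = 1e-8))$par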
(I realize there may be other issues, and I am looking for answers in all kinds of places)

Honestly, I would place my bets that either:

  • the way you implemented the model in Stan is somewhat different than the implementation in ASreml, or
  • the solution from ASreml is the less correct one

To be more specific:

What do you mean by “better”? Does it mean that the credible/confidence intervals are tighter? Or a better fit to simulated data? In either case, it is possible that the wide intervals actually capture the relevant uncertainty and ASreml is overconfident, or that the simulation differs from the Stan model (especially check the priors), in which case you should change either your priors or your simulation.

To some extent, this is achieved by increasing adapt_delta towards 1, but if your diagnostics are OK, it is unlikely this will bring any noticeable change, as the sampler is very likely already at the “optimum”.
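For example, in rstan this looks like the following (compiled_model and stan_data are placeholders for your own objects):

library(rstan)
# refit with a stricter acceptance target (the default adapt_delta is 0.8)
fit <- sampling(compiled_model, data = stan_data,
                control = list(adapt_delta = 0.99))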

Looking at the simulation code in Rick Farouni’s post, I think it might indeed be a problem with the priors and hyperpriors, as the Stan model there has Cauchy priors on top of Cauchy hyperpriors. This means that, a priori, quite large values of L_t and psi are considered plausible. You can try this on your own:

N <- 10000
# prior predictive draws: Cauchy with a Cauchy location and half-Cauchy scale hyperprior
sim_L_t <- rcauchy(N, rcauchy(N, 0, 1), abs(rcauchy(N, 0, 1)))
mean(abs(sim_L_t) > 10)
[1] 0.19
mean(abs(sim_L_t) > 100)
[1] 0.0297

Note: is 10 a lot for factor analysis? I don’t know; it looks like it, but I’m not really knowledgeable about factor analysis…

It is possible that since you do not have a lot of data (totally guessing here, not sure how much data is a lot for such a model), the fat tails of the prior still have non-negligible influence on the posterior and skew your inferences.

You can check whether this is the case by tightening the priors, for example by changing all the Cauchy priors to normals and seeing how that changes your posterior.
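For comparison, the analogous prior predictive check with normal hyperpriors puts essentially no mass on extreme loadings:

N <- 10000
# normal with a normal location and half-normal scale hyperprior
sim_L_t_norm <- rnorm(N, rnorm(N, 0, 1), abs(rnorm(N, 0, 1)))
mean(abs(sim_L_t_norm) > 10)  # effectively 0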


Thank you for the replies and for thinking along with me! You have certainly given me something to think about; I am learning a lot about REML, Bayesian methods, and Stan.

The way I understand it now, the FA model attempts to summarize the covariance matrix of the data, \Sigma_d, in \Lambda and \Psi. The estimates of ASreml seem to do a better job at that: \Sigma_{as} = \Lambda \Lambda' + \Psi \approx \Sigma_d, where the error between \Sigma_d and \Sigma_{as} is lower than that of the Bayesian counterpart (\Sigma_{ba}). I am not sure if over-fitting is a problem here… Thanks for the tip on adapt_delta; in some model variations Stan gave me warnings about this.
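A sketch of the kind of comparison I mean, with Y, Lambda and Psi standing in for the data matrix and the point estimates:

Sigma_d   <- cov(Y)                      # covariance of the observed data
Sigma_hat <- Lambda %*% t(Lambda) + Psi  # model-implied covariance
norm(Sigma_d - Sigma_hat, type = "F")    # size of the discrepancy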

Your point about the priors is interesting. The simple simulation you showed made me think about the priors. The lambda values shown below are typical as far as I know, so these priors do seem to give too much weight to excessively high values. Currently I work with small data sets of about 200 observations, but in reality the number of observations can go into the thousands. Yet I have not seen such large (>10) estimates for lambda.

Continuing on the priors: imagine the matrix \Lambda with k = 2. With the hierarchical prior below, the L_t estimates in column 1 are related a priori to the L_t estimates in column 2, since they share the same hyperparameters. Why they should be related is not so evident to me, even though we expect the two vectors jointly to ‘reconstruct’ \Sigma_d. An example based on the agridat yan.winterwheat data in R:

model {
  // one set of hyperparameters shared by every loading in L_t
  mu_lt ~ normal(0, 1);
  sigma_lt ~ normal(0, 1);  // a scale, so declared with <lower=0>
  L_t ~ normal(mu_lt, sigma_lt);
}
> Lambda
           [,1]       [,2]
 [1,] 0.4935566 0.00000000
 [2,] 0.3704199 0.19782069
 [3,] 0.3504931 0.23606015
 [4,] 0.3378872 0.20372654
 [5,] 0.5061964 0.03497362
 [6,] 0.2353057 0.26467081
 [7,] 0.3704607 0.06143004
 [8,] 0.1245585 0.25115701
 [9,] 0.2765927 0.28534212

Note that L[5,2] and L[7,2] are very close to 0, whereas ASreml estimates them to be negative. So I tried an ad hoc solution.

model {
  // M = length of L_t, so each value in L_t gets its own N(0, 1) prior
  for (c in 1:M) {
    L_t[c] ~ normal(0, 1);
  }
}
> Lambda
           [,1]       [,2]
 [1,] 0.6525018  0.0000000
 [2,] 0.5152577  0.1321970
 [3,] 0.4909657  0.1899327
 [4,] 0.4532226  0.1757425
 [5,] 0.7614079 -0.3423222
 [6,] 0.2913932  0.3467770
 [7,] 0.5523000 -0.3473909
 [8,] 0.1154314  0.3891267
 [9,] 0.3600153  0.3501128

Now L[5,2] and L[7,2] are allowed to be negative. I don’t want to say that any of these priors is better than the others, but I find the sign-switching interesting. The first result would imply that loadings 5 and 7 are very weak on the second factor, but in “reality” they are strong in the opposite direction.
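As an aside on why signs can switch so cheaply (a small check of my own, using the Lambda printed above): the likelihood only sees \Lambda through \Lambda \Lambda' + \Psi, and that product is unchanged when an entire column of \Lambda flips sign, so column signs are pinned down only by constraints and priors:

D <- diag(c(1, -1))  # flip the sign of column 2
all.equal(Lambda %*% t(Lambda), (Lambda %*% D) %*% t(Lambda %*% D))  # TRUE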

So I will take some time to wrestle with the priors…! I am starting to wonder if any prior can be considered ‘weak’ in this type of model.