Not sure at all if this is the reason why you are encountering the problems you flag, but is it possible that the conventional Poisson model is not the most appropriate specification here? Could there be, for example, excess zeroes and non-linearities in the data that need to be accommodated?
Sadykova, D., Scott, B.E., Dominicis, M.D., Wakelin, S.L., Sadykov, A., & Wolf, J. (2017). Bayesian joint models with INLA exploring marine mobile predator-prey and competitor species habitat overlap. Ecology and Evolution, 7(14), 5212-5226. doi: 10.1002/ece3.3081
You are perfectly right, and a negative binomial might even be sufficient.
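For instance, a minimal sketch of that swap (phi is a new overdispersion parameter, not in the code above, and its prior is an arbitrary placeholder):

  // in the parameters block:
  real<lower=0> phi; // overdispersion: Var(Y) = mu + mu^2 / phi

  // in the model block, replacing the Poisson likelihood:
  phi ~ gamma(2, 0.1); // placeholder prior, to be tuned
  for(i in 1:N)
    Y[i,] ~ neg_binomial_2(mu[i], phi);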
However, I notice something: I have only one hyperparameter shared by all of the loadings. Would it not be better for each factor's loadings to have their own variance?
Example: loadings[factor[d]] ~ normal(0, sd[d])
I am still wondering how to code that efficiently, given that my non-zero off-diagonal loadings are stored in a single vector…
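One way to code it (a minimal sketch, assuming the column-major packing of L_lower used in the models below, where column j holds P-j free loadings, and a hypothetical vector<lower=0>[D] sd_lower of per-factor scales):

  // in the model block: per-factor scale for the packed loadings vector
  {
    int pos;
    pos = 1;
    for(j in 1:D) {
      segment(L_lower, pos, P - j) ~ normal(0, sd_lower[j]);
      pos = pos + (P - j);
    }
  }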
I thought about that because of papers from Ovaskainen's team (2016, 2017), which implement two very interesting features:
1. Sparse infinite Bayesian factor models. This approach cannot be directly implemented in Stan, but could potentially be approximated by shrinkage or marginalization, I think, even if I do not understand the subtleties of these subjects.
2. Determining an underlying structure for the factors (spatial, in the 2016 paper) or for the loadings (dependent on environmental variables, in the 2017 paper).
I would love to be able to implement that in the Stan language and exploit its powerful sampler.
data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  int<lower=1> M;
  M = D*(P-D) + D*(D-1)/2; // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1; //Sd of factors = 1
  }
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; //Intercept per species
  vector[M] L_lower; //Lower-diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Cholesky factor of the correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Sd of the row intercepts
  real<lower=0> sigma_b0; //Sd of the species intercepts
  real<lower=0> mu_lower; //Mean of the lower-diagonal loadings
  real<lower=0> sigma_lower; //Sd of the lower-diagonal loadings
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of loadings
  matrix[D,D] Ld; //Cholesky factor of the covariance matrix between factors
  matrix[N,P] temp; //Intermediate predictor
  matrix<lower=0>[N,P] mu; //Predicted values
  // Cholesky factor of the factor covariance matrix
  Ld = diag_pre_multiply(FS_sd, Rho);
  {
    int idx2; //Index for the lower-diagonal loadings
    idx2 = 0;
    // Constraints to allow identifiability of loadings
    for(i in 1:(D-1)) { for(j in (i+1):D) { lambda[i,j] = 0; } } //Zeros in the upper triangle
    for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on the diagonal
    for(j in 1:D) {
      for(i in (j+1):P) {
        idx2 = idx2 + 1;
        lambda[i,j] = L_lower[idx2];
      }
    }
  }
  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);
}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercepts hyperprior
  mu_lower ~ gamma(2,0.1); //Mean of lower loadings
  sigma_lower ~ gamma(2,0.1); //Sd of lower loadings
  eta ~ gamma(2,0.1); //Parameter of the LKJ prior for the FS correlation structure
  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  L_diag ~ normal(0, 0.1); //Prior for diagonal elements
  L_lower ~ normal(mu_lower, sigma_lower); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho
  // Likelihood and latent factor scores
  for(i in 1:N){
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  //cov_matrix[P] cov_lambda;
  //cov_lambda = lambda * lambda';
}
I modified your code minimally, changing only what I think is absolutely necessary to get rid of the BFMI warning. Nevertheless, I believe that your hyperpriors need refinement.
You essentially moved the fixed factor parameters into the transformed data block to avoid sampling them, didn't you?
While reading your new version, I just realized that I had left a strictly positive hyperprior on mu_lower, carried over from another version in which I constrained the loadings to be positive: a great deal of trouble could have come from this misspecification!
for(i in 1:(D-1)) { for(j in (i+1):D) { lambda[i,j] = 0; } } //Zeros in the upper triangle
for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on the diagonal
for(j in 1:D) {
  for(i in (j+1):P) {
    idx2 = idx2 + 1;
    lambda[i,j] = L_lower[idx2];
  }
}
Isn't the above just very similar to the Bartlett decomposition of the Wishart distribution? Check the matrix A there, with L unit diagonal (thus redundant), or L can be used as a scale matrix. If so, we do not need mu_lower and sigma_lower.
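For reference, the textbook statement of the Bartlett decomposition (a standard fact, not from this thread): if W ~ Wishart(n, V) with V = L Lᵀ, then W = L A Aᵀ Lᵀ, where A is lower triangular with A[i,i] = sqrt(c_i), c_i ~ chi-square(n − i + 1), and A[i,j] ~ normal(0, 1) for i > j. When V = I, the factor L is the identity and drops out, leaving only the normal and chi-square draws.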
Lastly, I would change the sigmas' priors from gamma to Cauchy:
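Something along these lines (a sketch only; the scale 2.5 is an arbitrary choice, not from the thread, and the existing <lower=0> declarations make these half-Cauchy):

  sigma_a ~ cauchy(0, 2.5); //half-Cauchy given the <lower=0> constraint
  sigma_b0 ~ cauchy(0, 2.5);
  sigma_lower ~ cauchy(0, 2.5);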
OK, the Bartlett decomposition really looks like what is implemented in the JAGS code from boral (normal prior for the lower triangle, zeros on the upper). However, they do not sample the diagonal from a chi-squared distribution, but from a truncated normal…
Thank you very much for these insights, I would not have guessed that by myself!
## JAGS model written for boral version 1.5 on 2018-04-20 09:43:59 ##
model {
  ## Data Level ##
  for(i in 1:n) {
    for(j in 1:p) { eta[i,j] <- inprod(lv.coefs[j,2:(num.lv+1)],lvs[i,]) + row.coefs.ID1[row.ids[i,1]] }
    for(j in 1:p) { y[i,j] ~ dpois(exp(lv.coefs[j,1] + eta[i,j])) }
  }
  ## Latent variables ##
  for(i in 1:n) { for(k in 1:num.lv) { lvs[i,k] ~ dnorm(0,1) } }
  ## Process level and priors ##
  for(j in 1:p) { lv.coefs[j,1] ~ dnorm(0,0.1) } ## Separate species intercepts
  for(i in 1:n.ID[1]) { row.coefs.ID1[i] ~ dnorm(0, pow(row.sigma.ID1,-2)) }
  row.sigma.ID1 ~ dunif(0,30)
  for(i in 1:(num.lv-1)) { for(j in (i+2):(num.lv+1)) { lv.coefs[i,j] <- 0 } } ## Constraints to 0 on upper diagonal
  for(i in 1:num.lv) { lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,) } ## Sign constraints on diagonal elements
  for(i in 2:num.lv) { for(j in 2:i) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## Free lower diagonals
  for(i in (num.lv+1):p) { for(j in 2:(num.lv+1)) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## All other elements
}
That's OK in Stan too; you can sample from a truncated normal as well.
You may also consider using an LKJ correlation distribution instead of a Bartlett decomposition of the Wishart.
lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,)
Be aware that this would be normal(0, sqrt(1/0.1)) in Stan, since JAGS parameterizes the normal by its precision; don't forget to use a <lower=0> constraint for the truncation.
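A minimal sketch of the Stan equivalent (with a zero location, the <lower=0> constraint handles the truncation, and the truncation's normalizing constant is fixed, so no explicit T[0,] is needed):

  parameters {
    vector<lower=0>[D] L_diag; //truncated at zero via the constraint
  }
  model {
    L_diag ~ normal(0, sqrt(1 / 0.1)); //JAGS dnorm(0, 0.1)I(0,)
  }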
Also, I suspect the gammas correspond to the square root of an inverse gamma in Stan, since JAGS places them on precisions.
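To spell out that conversion (a standard fact about the JAGS parameterization, not from the thread): JAGS's dgamma prior sits on the precision tau = 1/sigma^2; if tau ~ gamma(a, b), then sigma^2 = 1/tau ~ inv_gamma(a, b), so the standard deviation sigma = 1/sqrt(tau) is the square root of an inverse-gamma draw.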
Oh, I didn't know about these differences between JAGS and Stan…
I have literally implemented the Bartlett decomposition in my code, and it works very well: no divergences, and the Rhats and effective sample sizes look great for all loadings! The observed covariance matrix seems to be well approximated. I will quickly put together a case study with simulated data!
data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  int<lower=1> M;
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  M = D*(P-D) + D*(D-1)/2; // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1; //Sd of factors = 1
  }
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; //Intercept per species
  vector[M] L_lower; //Lower-diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Cholesky factor of the correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Sd of the row intercepts
  real<lower=0> sigma_b0; //Sd of the species intercepts
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of loadings
  matrix[D,D] Ld; //Cholesky factor of the covariance matrix between factors
  matrix[N,P] temp; //Intermediate predictor
  matrix<lower=0>[N,P] mu; //Predicted values
  // Cholesky factor of the factor covariance matrix
  Ld = diag_pre_multiply(FS_sd, Rho);
  {
    int idx2; //Index for the lower-diagonal loadings
    idx2 = 0;
    // Constraints to allow identifiability of loadings
    for(i in 1:(D-1)) { for(j in (i+1):D) { lambda[i,j] = 0; } } //Zeros in the upper triangle
    for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on the diagonal
    for(j in 1:D) {
      for(i in (j+1):P) {
        idx2 = idx2 + 1;
        lambda[i,j] = L_lower[idx2];
      }
    }
  }
  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);
}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercepts hyperprior
  eta ~ gamma(2,0.1); //Parameter of the LKJ prior for the FS correlation structure
  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  for(d in 1:D) L_diag[d] ~ chi_square(M-d); //Bartlett-style prior (the exact decomposition puts the chi-square on L_diag[d]^2)
  L_lower ~ normal(0,1); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho
  // Likelihood and latent factor scores
  for(i in 1:N){
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  matrix[P,P] cov_lambda; //Species covariance implied by the loadings
  cov_lambda = multiply_lower_tri_self_transpose(lambda);
}
If L is square and lower triangular, then as long as it has positive diagonal entries, L * L' will be positive definite. Otherwise, you can use a non-square L for lower ranks. I don't know if there's a way to force it to be semi-definite of unknown rank.
If someone wants to code up an efficient Wishart and inverse Wishart for Cholesky factors of covariance matrices, that’d be great. We have the correlation-based LKJ prior for Cholesky factors of correlation matrices, but that’s it for efficient priors.
The efficient Wishart for Cholesky factors of covariance matrices is already in the Stan manual pg. 350.
The inverse Wishart is easy to calculate; to sample, one has to use the transpose of the inverse.
Hi!
This operation just aims to create a covariance matrix with estimated correlations between factors but a diagonal fixed to one. This is the Φ of the classical factor analysis formulation. But I just realized that directly using a constrained correlation matrix should work, without this unnecessary operation!
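A minimal sketch of that simplification (since FS_sd is fixed to 1 in transformed data, Ld = diag_pre_multiply(FS_sd, Rho) reduces to Rho, so Ld can be dropped entirely):

  for(i in 1:N)
    FS[i,] ~ multi_normal_cholesky(FS_mu, Rho);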