I am fitting a hierarchical model with an asymmetric link for binary classification of some text categorization data in Stan.

I am interested in using shrinkage priors such as the Laplace, the horseshoe and others, with a flexible structure (as normal scale mixtures) like the one used in "Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction" (http://ceur-ws.org/Vol-1218/bmaw2014_paper_8.pdf) and, for the horseshoe case, in the paper “On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior” (https://arxiv.org/abs/1610.05559).
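As a rough sketch of what a normal scale mixture prior usually looks like (the symbols $\beta_j$, $\lambda_j$ and $\tau$ are notation assumed here, and the exact hyperpriors in the two papers may differ): each coefficient gets a normal prior whose scale is the product of a global scale and a local scale, and the choice of distribution for the local scale determines the shrinkage prior.

```latex
\begin{aligned}
\beta_j \mid \lambda_j, \tau &\sim \mathcal{N}\!\left(0,\ \tau^2 \lambda_j^2\right), \\
\text{horseshoe:}\quad \lambda_j &\sim \mathrm{C}^{+}(0, 1), \\
\text{Laplace:}\quad \lambda_j^2 &\sim \mathrm{Exponential}(\text{rate} = 1/2)
\ \Longrightarrow\ \beta_j \mid \tau \sim \mathrm{Laplace}(0, \tau).
\end{aligned}
```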

I have n = 5485 observations and k = 17388 predictors.

The X matrix is a tf-idf matrix for documents (as used in text categorization), and it is very sparse. Its rows represent documents and its columns represent words. The values in the matrix are normalized.

When fitting the following model, I get a lot of divergent transitions with the horseshoe or Laplace prior. The shrinkage itself works, though, leaving between 9 and 15 non-zero coefficients (the horseshoe always outperforms the Laplace).

I tried to reduce the divergent transitions by setting adapt_delta=0.9999 and by reparameterizing the model, but it was not enough. Could these divergent transitions be false positives?

I don’t fully understand your model, but there are a few strategies to try:

Test the model on simulated data - e.g. draw parameters from priors then simulate data exactly according to your model. Does the model recover the parameters?

Simplify your model to the bare bones and then test on data simulated from this simple model. Once it works without divergences and reliably recovers the true parameters, add another layer of complexity (e.g. start with simple regression on real values, then add priors, link function, the skew parameter for your link, …)

Note that this is a lot of work, but in my experience it pays off in figuring out bugs that are very hard to catch otherwise.
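To make the first strategy concrete, here is a minimal Python sketch of the simulation step only. It is not your model: it assumes a plain logistic link instead of your asymmetric one, a horseshoe prior with a fixed global scale, and made-up sizes and sparsity. All names (`half_cauchy`, `inv_logit`, `simulate`) are hypothetical.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Draw from half-Cauchy(0, scale) using the inverse CDF of the Cauchy."""
    return scale * abs(math.tan(math.pi * (rng.random() - 0.5)))


def inv_logit(eta):
    """Numerically stable logistic function (assumed link, not the asymmetric one)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate(n=200, k=20, tau=0.1, seed=1):
    """Draw parameters from the priors, then simulate data from the model."""
    rng = random.Random(seed)
    # Horseshoe: beta_j ~ normal(0, tau * lambda_j), lambda_j ~ half-Cauchy(0, 1)
    beta = [rng.gauss(0.0, tau * half_cauchy(rng)) for _ in range(k)]
    alpha = rng.gauss(0.0, 2.0)  # intercept, assumed normal(0, 2) prior
    X, y = [], []
    for _ in range(n):
        # Sparse non-negative rows, loosely mimicking tf-idf (about 10% non-zero)
        row = [rng.random() if rng.random() < 0.1 else 0.0 for _ in range(k)]
        eta = alpha + sum(b * x for b, x in zip(beta, row))
        X.append(row)
        y.append(1 if rng.random() < inv_logit(eta) else 0)
    return beta, alpha, X, y


beta_true, alpha_true, X, y = simulate()
print("largest |beta|:", max(abs(b) for b in beta_true))
print("share of y == 1:", sum(y) / len(y))
```

You would then fit the Stan model to `X` and `y` and check whether the posterior intervals cover `beta_true` and `alpha_true`. Starting with small `n` and `k` keeps each check fast.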

Also there are a few things that seem odd or “non-Stan” in your model. Those might be OK, but might also hinder sampling:

The prior on alfa (implied by delta) - do you have a good reason to use this specific formula? Why would something like alfa ~ normal(0,2) (or student) not be acceptable? Uniform distributions are also generally frowned upon in the Stan world :-)

The sqrt(inv_gamma) priors on standard deviations - the Stan wiki usually suggests normal or student_t priors for those.
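One way to see how the two kinds of prior differ is to draw from both and compare how much mass each puts on small standard deviations. The sketch below is only an illustration: the hyperparameters inv_gamma(1, 1) and half-normal(0, 1) are assumptions (I don't know the ones in your model), and the function names are hypothetical.

```python
import math
import random


def draw_sd_sqrt_inv_gamma(rng, shape=1.0, rate=1.0):
    """Draw sigma where sigma^2 ~ inv_gamma(shape, rate)."""
    # If G ~ gamma(shape, rate), then 1 / G ~ inv_gamma(shape, rate).
    g = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
    return math.sqrt(1.0 / g)


def draw_sd_half_normal(rng, scale=1.0):
    """Draw sigma ~ half-normal(0, scale)."""
    return abs(rng.gauss(0.0, scale))


def share_below(draws, threshold):
    """Fraction of draws that fall below the threshold."""
    return sum(d < threshold for d in draws) / len(draws)


rng = random.Random(42)
n_draws = 20000
ig = [draw_sd_sqrt_inv_gamma(rng) for _ in range(n_draws)]
hn = [draw_sd_half_normal(rng) for _ in range(n_draws)]

print("share of draws below 0.3:")
print("  sqrt(inv_gamma(1, 1)):", share_below(ig, 0.3))
print("  half-normal(0, 1):    ", share_below(hn, 0.3))
```

Under these assumed hyperparameters the first share should be close to zero and the second roughly 0.24, i.e. the inverse-gamma form puts almost no prior mass on small standard deviations while the half-normal does.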

Note that divergences are only rarely false positives, so I wouldn’t bet on that being the case here.