# Horseshoe regularization on matrix factorization not working

Aim
I am attempting to decompose a 2D air-quality matrix (1500 time points × 80 compounds) into a small number of factors (4-5) with autocorrelation constraints on the time series. We call this model BAMF (Bayesian Auto-correlated Matrix Factorisation; Rusanen et al., 2024), and we use G to denote the time-series matrix and F the compound-fingerprints matrix (see Figure 1). We want to apply the regularized horseshoe to the compound fingerprints so that an individual compound's contribution can be shrunk to zero for some factors while staying large for the others.
I would appreciate advice on what might be wrong with our horseshoe implementation, possible solutions, and, if needed, ideas for tests to diagnose the issue. The model and input data are attached.

Method
We want to apply the regularized horseshoe prior, as described in Piironen and Vehtari (2017), to our matrix F so that compounds have zero contributions on certain factors. For this purpose, we set a prior on F:
F_{kj} \sim Normal(0,\sigma_{HS}\cdot \tau \cdot \tilde{\lambda}_{kj})
where the indices k and j run over factors and compounds, respectively, \sigma_{HS} is a positive scale parameter, and \tau and \tilde{\lambda} are defined as:
\tau \sim Cauchy^+ (0, \sigma_{HS} \cdot scale_{global}),

\tilde{\lambda} = \sqrt{\frac{c^2 \cdot \lambda ^2}{c^2 + \tau^2 \cdot \lambda^2}}
where
\lambda \sim Cauchy^+(0,1),

scale_{global} = \frac{p_0}{(p-p_0)\cdot \sqrt{m}},

c = slab_scale \cdot \sqrt{c_{aux}},

c_{aux} \sim \Gamma^{-1}(0.5\cdot slab_{df},\ 0.5\cdot slab_{df}),

and slab_scale and slab_df are 2.5 and 4, respectively, as indicated in the article; m is the number of compounds (the components of a fingerprint), p is the number of factors, and p_0 is the desired number of nonzero compounds per factor.
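As a quick sanity check of these definitions, here is a minimal NumPy sketch (my own illustration, not part of the model) that evaluates the global scale and draws \tau, \lambda, and c from the priors above, with \sigma_{HS} fixed at 1 and the values p_0 = 3, p = 5, m = 80 from this setup. Note that \tau \cdot \tilde{\lambda}, the effective prior SD of an element of F, is bounded by the slab scale c:

```python
import numpy as np

p0, p, m = 3, 5, 80              # desired nonzeros, factors, compounds
slab_scale, slab_df = 2.5, 4.0

# scale_global = p0 / ((p - p0) * sqrt(m)), per the definition above
scale_global = p0 / ((p - p0) * np.sqrt(m))
print(scale_global)              # ~0.168 for p0=3, p=5, m=80

rng = np.random.default_rng(0)
tau = np.abs(scale_global * rng.standard_cauchy())  # tau ~ Cauchy+(0, scale_global), sigma_HS = 1
lam = np.abs(rng.standard_cauchy(size=m))           # lambda_j ~ Cauchy+(0, 1)
# c_aux ~ Inv-Gamma(0.5*slab_df, 0.5*slab_df): invert a Gamma(shape, scale) draw
caux = 1.0 / rng.gamma(0.5 * slab_df, 2.0 / slab_df)
c = slab_scale * np.sqrt(caux)

# lambda_tilde interpolates between lambda (tau*lambda << c) and c/tau (tau*lambda >> c)
lam_tilde = np.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))
```

This is the mechanism the regularized horseshoe relies on: tau * lam_tilde never exceeds c, so large coefficients are regularized by the slab rather than left with a Cauchy tail.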

Problem
As you can see in Figure 2, applying the horseshoe to the fingerprints of the 5 factors (x-axis) is not working as expected over the range of compounds shown (y-axis). The regular BAMF is in blue and BAMF + regularized horseshoe (BAMF+HS) is in orange. The orange distributions are not being regularized across factors: I would expect the horseshoe to shrink some compounds to zero while maintaining the large contributions in certain factors, but this is not happening, or not enough. Moreover, even for plain BAMF we do not get the expected clean Gaussians, which suggests there might be an issue there as well.
I tried to strengthen the horseshoe by reducing p_0, even down to one, but found no improvement. I believe the issues could be:
a) our parameters and the F matrix are not stable enough after the 2000 samples (1000 of which are warm-up);
b) the horseshoe priors are set up incorrectly and/or another prior would suit our problem better.
I lean towards a), since I found huge variability in tau, lambda_tilde, and c across samples (Figure 3). R-hat is low (close to 1) in both cases, BAMF and BAMF+HS, but a bit lower for BAMF+HS (Figure 4). My guess is that since our parameters are not stable, they are not shrinking properly. If that's the case, how could I solve this?

Model

```stan
//
// BAMF + horseshoe prior on all m/zs
//   7/5/2024
// Student-t distributions changed to Cauchys.
// Lambda is m/z dependent.
//
functions {
  // Stack an array of vectors into a matrix, one vector per row.
  matrix vector_array_to_matrix(vector[] x) {
    matrix[size(x), rows(x[1])] y;
    for (i in 1:size(x))
      y[i] = x[i]';
    return y;
  }
}
data {
  int<lower=1> n;                 // number of data points (time points)
  int<lower=1> m;                 // number of dimensions (compounds)
  int<lower=1> p;                 // number of components (factors)
  matrix[n, m] X;                 // data matrix
  matrix<lower=0>[n, m] sigma;    // separate sigma for each observation (!)
  vector<lower=0>[n-1] timesteps; // time elapsed between rows of X (arbitrary unit; the unit choice also affects other priors)
  real<lower=0> minerr;           // minimal error
}
transformed data {
  vector[n*m] Xv = to_vector(X);
  vector[n*m] sigmav = to_vector(sigma);
  real<lower=0> p_zero = 3;       // expected number of nonzero elements per factor (p0)
  real<lower=0> scale_global;
  real<lower=0> slab_scale;       // slab scale for the regularized horseshoe
  real<lower=0> slab_df;          // slab degrees of freedom for the regularized horseshoe
  scale_global = p_zero / ((p - p_zero) * sqrt(n));
  slab_scale = 2.5;
  slab_df = 4;
}
parameters {
  matrix<lower=0>[n, p] G;         // score matrix
  simplex[m] F[p];                 // profiles (one simplex per factor)
  vector<lower=0>[p] alpha_a;      // autocorrelation scale, time-dependent part
  vector<lower=minerr>[p] alpha_b; // autocorrelation scale, constant part
  real<lower=0> tau;               // global shrinkage parameter
  matrix<lower=0>[p, m] lambda;    // local shrinkage parameters
  real<lower=0> caux;              // regularisation parameter
  real<lower=0> logsigma;          // log of sigma_hs (note: the bound implies sigma_hs >= 1)
}
transformed parameters {
  real<lower=0> sigma_hs;             // noise std
  matrix<lower=0>[p, m] lambda_tilde; // 'truncated' local shrinkage parameter
  real<lower=0> c;                    // slab scale
  matrix[n, m] Z;                     // latent function values

  sigma_hs = exp(logsigma);
  c = slab_scale * sqrt(caux);
  lambda_tilde = sqrt(c^2 * square(lambda) ./ (c^2 + tau^2 * square(lambda)));
  Z = G * vector_array_to_matrix(F);
}
model {
  to_vector(lambda) ~ cauchy(0, 1);
  tau ~ cauchy(0, scale_global * sigma_hs);
  caux ~ inv_gamma(0.5 * slab_df, 0.5 * slab_df);

  for (k in 1:p) {
    for (j in 1:m) {
      F[k, j] ~ normal(0, sigma_hs * tau * lambda_tilde[k, j]); // horseshoe prior
    }
  }
  for (j in 1:p) {
    G[2:, j] ~ cauchy(G[:(n-1), j], alpha_a[j] * timesteps + alpha_b[j]);
  }
  Xv ~ normal(to_vector(Z), sigmav);
}
```



You can find the .stan file attached to this post and the input data file here. Since we use a synthetic dataset for model testing, the "truth" for F and G is also attached. Let me know if you need me to provide anything else.

Thanks.

References

Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.

Rusanen, A., Björklund, A., Manousakas, M., Jiang, J., Kulmala, M. T., Puolamäki, K., & Daellenbach, K. R. (2023). A novel probabilistic source apportionment approach: Bayesian auto-correlated matrix factorization. Atmospheric Measurement Techniques Discussions, 2023, 1-28.
bamf_HS_lambda_jk.stan (2.7 KB)
F_truth.txt (5.0 KB)
G_truth.txt (108.5 KB)


Hi, @martavia and welcome to the Stan forums. Also, thanks for the well-formulated question.

I’m sorry this question hasn’t been answered. I think you’ll need @avehtari for this one, but unfortunately he’s on vacation at the moment.

If I read the data and model description correctly, you have many more time points than elements in F, and if the horseshoe prior does not have a significant effect, then it seems the likelihood is informative and the prior doesn't matter much. In this case, the usual non-centered parameterization may also be computationally worse than the centered parameterization.

Hi,
Could you please re-explain your last sentence? I don't understand what you are referring to with the centered/non-centered parametrization.
Also, I would like to ask a few more questions:

• The horseshoe generates a more precise solution, although not necessarily a more accurate one (now confirmed with more stable chains). Adding the HS prior constrains the model, so it does have an influence on the output. However, why is the HS not shrinking to zero?
• Related to the previous question: could it be that the global scale should be computed not from m points but from n·m, or maybe just n? In both cases tau_0 would be smaller and hence the HS would shrink more strongly. Could you confirm whether this should be the case?
• Would you advise any other prior, or any other way to implement the HS given the n/m ratio we work with?
Thanks again,
Marta

There is a short explanation of centered/non-centered parametrization in the Stan User's Guide chapter on Efficiency Tuning. If you google "centered non-centered parameterization Stan" you'll find several examples, e.g. Chapter 11 (Hierarchical models and reparameterization) of An Introduction to Bayesian Data Analysis for Cognitive Science.

Non-centered parametrization is good if the likelihood contributions for group-specific parameters are weak; centered parameterization is good if they are strong. Since the HS is not shrinking close to zero, I inferred that your likelihood is strong, and thus the centered parameterization is likely to be better.

If you are uncertain whether to use m, n*m, or n to scale the global scale, I recommend simulating from the prior to see the effect of each choice. This would be useful anyway, to check whether the HS prior matches your assumptions about what is close to zero.
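To make the prior-simulation suggestion concrete, here is a rough NumPy sketch (my own illustration, using the thread's dimensions n = 1500, m = 80, p = 5, p0 = 3, with sigma_hs fixed at 1 and an arbitrary 0.01 "essentially zero" cutoff). It compares, under the three candidate scale_global denominators, how often the effective prior SD tau * lambda_tilde of an F element lands near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, p0 = 1500, 80, 5, 3       # dimensions from the thread
slab_scale, slab_df = 2.5, 4.0

def prior_effective_sd(denom, draws=4000):
    """Draw tau * lambda_tilde (the prior SD of each F element, with
    sigma_hs = 1) for a given scale_global denominator."""
    scale_global = p0 / ((p - p0) * np.sqrt(denom))
    tau = np.abs(scale_global * rng.standard_cauchy(draws))[:, None]
    lam = np.abs(rng.standard_cauchy((draws, m)))
    caux = 1.0 / rng.gamma(0.5 * slab_df, 2.0 / slab_df, size=(draws, 1))
    c = slab_scale * np.sqrt(caux)
    return tau * np.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))

# Fraction of F elements whose prior SD falls below the (arbitrary) 0.01 cutoff:
for denom in (m, n, n * m):
    print(denom, np.mean(prior_effective_sd(denom) < 0.01))
```

A larger denominator gives a smaller tau_0 and hence a larger fraction of near-zero elements; comparing histograms of these draws against your assumption of how many compounds should be active per factor is the check being suggested here.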

Unfortunately, I don't have strong advice on how to use the HS in matrix factorization. The issue is complicated, as the elements are not fully exchangeable and you would like to take the structure into account. I remember seeing alternative sparsity priors for matrix factorization in the literature, but can't remember exact references at the moment. Based on the results you reported here, if the likelihood is actually informative then different sparsifying priors should not really matter. To move forward, I think the next useful steps are prior simulation and testing what happens as you make the global scale smaller and smaller. You could also use priorsense (see Priorsense 1.0 is now on CRAN) to check the sensitivity of the results to your prior choices.