# Bayes Sparse Regression - reg. horseshoe for multinomial model

#1

I read with interest the article by @betanalpha ,

https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html

I will try to apply it to some research questions on mine. I work in genetics where I have the same model structure for ~ 20k genes; and where the coefficient of change for the majority of genes (>60-80%) are likely to be 0.

1. I am interested in knowing whether some tests have been performed on the robustness on the inference given imprecise

real m0 = 10; // Expected number of large slopes

For example whether that number was actually the double in the data

1. Furthermore, I am interested if in principle with enough data, the “Expected number of large slopes” can be inferred rather than given.

Thanks

#2

“Expected number of large slopes” is a prior guess, but data will update that and the posterior will have more information and with enough data the posterior will concentrate. Usually the data doesn’t have much information and the posterior has lot of uncertainty, but the predictions can be good even if we are uncertain how many large slopes there are.

It’s likely that computation will be slow and there can be mixing problems, but it’s not impossible.

#3

Thanks for the paper,

I coding for testing your prior on my data. Just one little question:

I would have thought the data was organised in genes/samples, however “y” is

vector [ n] y; # outputs

where n is

int < lower =0 > n ; # number of observations

Is this analysis done for each gene separately? Am I missing something?

Here the model I am referring to, in your article

data {
int < lower =0 > n ; # number of observations
int < lower =0 > d ; # number of predictors
vector [ n] y; # outputs
matrix [n ,d] x; # inputs
real < lower =0 > scale_icept ; # prior std for the intercept
real < lower =0 > scale_global ; # scale for the half -t prior for tau
real < lower =1 > nu_global ; # degrees of freedom for the half -t prior
# for tau
real < lower =1 > nu_local ; # degrees of freedom for the half - t priors
# for lambdas
real < lower =0 > slab_scale ; # slab scale for the regularized horseshoe
real < lower =0 > slab_df ; # slab degrees of freedom for the regularized
# horseshoe
}
parameters {
real logsigma ;
real beta0 ;
vector [ d] z;
real < lower =0 > tau ; # global shrinkage parameter
vector < lower =0 >[ d] lambda ; # local shrinkage parameter
real < lower =0 > caux ;
}
transformed parameters {
real < lower =0 > sigma ; # noise std
vector < lower =0 >[ d] lambda_tilde ; # ’ truncated ’ local shrinkage parameter
real < lower =0 > c; # slab scale
vector [ d] beta ; # regression coefficients
vector [ n] f; # latent function values
sigma = exp ( logsigma );
c = slab_scale * sqrt ( caux );
lambda_tilde = sqrt ( c ^2 * square ( lambda ) ./ (c ^2 + tau ^2* square ( lambda )) );
beta = z .* lambda_tilde * tau ;
f = beta0 + x* beta ;
}


#4

So how is your data organized? In your first post you mention 20k genes, so I assumed each gene is a predictor and you would have some outcome to predict, but it seems it’s something different?

#5

I see,

I have a different perspective (maybe numerically equivalent). In my case M = number of genes. My data is organised by counts of sequenced molecules (genes). Mine is not organised as classification problem, but a series (n~20K) of (multiple or single) regressions, in my case with a continuous covariate.

For example, 20000 of these

Now, ignore this spline regression (old attempt), if I use a linear function (my final goal is to use a generalised sigmoid, but it changes little for my question)

gene1 -> [y1_1,…,y1_T] = alpha1 + beta1 * [x,…,x]
gene2 -> [y2_1,…,y2_T] = alpha2 + beta2 * [x,…,x]
gene3 -> [y3_1,…,y3_T] = alpha3 + beta3 * [x,…,x]
gene4 -> [y4_1,…,y4_T] = alpha4 + beta4 * [x,…,x]
gene5 -> [y5_1,…,y5_T] = alpha5 + beta5 * [x,…,x]

I would like to have a sparsity assumption on beta, and maybe using your horseshoe I could achieve that

So:

beta ~ finnish_horseshoe(.., .., .., ..)


rather than using

beta ~ normal(0, ..)


and:

• improve my false positive rate. With little replication (e.g., 13 patients rather than 500) I will obtain a lot of noisy results, that I don’t want to be distracted from and cannot easily filtered out, choosing an arbitrary threshold
• identify a tight cluster of fixed genes that will provide normalisation for my patients (see below, I use multinomial(softmax( rates with non changing genes anchors )) framework )

Do you think this would in principle invalidate the use of your prior, because the paradigm is not anymore LASSO/classification/feature selection?

I hope the following model of mine is not distracting from the first question. In the case of your

real tau0 = (m0 / (G - m0)) * (**sigma** / sqrt(1.0 * N));


I don’t have sigma as I use multinomial with softmax (with a biological motivation being gene expression is not a real simplex as there are not limited resources, so genes do not compete in that sense, and the changes in transcriptions can be asymmetric, meaning there could be just increases for a set of genes while the others remain unchanged)

data {
int<lower = 0> G;                   // all genes
int<lower = 0> T;                   // tube/patients/sample/replicates synonimous
int<lower = 0> y[T, G];             // RNA-seq counts
matrix[T,2] X;                      // design matrix, intercept and one slope
}
// Horseshoe
transformed data {
real m0 = 30;           // Expected number of large slopes
real slab_scale = 5;    // Scale for large slopes
real slab_scale2 = square(slab_scale);
real slab_df = 25;      // Effective degrees of freedom for large slopes
real half_slab_df = 0.5 * slab_df;
}

parameters {

// Linear model
row_vector[G] beta0;

// Horseshoe
vector[G] beta_tilde;
vector<lower=0>[G] lambda;
real<lower=0> c2_tilde;
real<lower=0> tau_tilde;

}
transformed parameters{
matrix[2, G] beta;      // 2 coefficients, intercept and slope
row_vector[G] beta1;    // one covariate

{
real tau0 = (m0 / (G - m0)) * (sigma / sqrt(1.0 * N));  // !!! I do NOT have sigma
real tau = tau0 * tau_tilde; // tau ~ cauchy(0, tau0)
real c2 = slab_scale2 * c2_tilde;
vector[G] lambda_tilde = sqrt( c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)) );
beta1[1] = tau * lambda_tilde .* beta_tilde;
}

beta = append_row(beta0, beta1);

}
model {

// Horseshoe
beta_tilde ~ normal(0, 1);
lambda ~ cauchy(0, 1);
tau_tilde ~ cauchy(0, 1);
c2_tilde ~ inv_gamma(half_slab_df, half_slab_df);

// Lnear system
beta[1] ~ normal(0,1);

for (t in 1:T)
y[t] ~ multinomial( softmax( X[t] * beta ) );

}


Finally, I cannot easily match the Michael implementation (Which I like because it treats explicitely the two key variables

real m0 = 10;           // Expected number of large slopes
real slab_scale = 3;    // Scale for large slopes


)

https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html

With the implementation on your article, for which sigma is not part of the transformed parameter formulas

https://projecteuclid.org/euclid.ejs/1513306866

#6
• You can use regularize horseshoe (or Finnish horseshoe as Mike calls it) prior for beta. No need to think about “LASSO/classification/feature selection”, just think it as a prior.
• Horseshoe doesn’t model clustering explicitly. It may be enough for your purpose, but for even better identification of clusters you might need something more complex, but I don’t have ready made solution for that.
• Non-Gaussian observation models are discussed in Section 3.5 with formulas for generalized linear models from which you could derive pseudo variance \tilde{\sigma}^2 for multinomial

#7

My dataset is so big (in terms of genes) and noisy (in terms of low replicates/patients) that a more simple approach would be a good starting point.

I admit that section 3.5 is a bit too involved for my knowledge at the moment, unfortunately I could not implement it myself.

I was hoping that your implementation at C.2. Alternative parametrization (since you use this for your genetics studies) was directly applicable to my case, after all my beta is also an unbounded real number. Can you confirm me is not suitable?

If it was, I would have liked to understand, in

data {
int < lower =0 > n; # number of observations
int < lower =0 > d; # number of predictors
vector [ n] y; # outputs
matrix [n ,d] x; # inputs
real < lower =0 > scale_icept ; # prior std for the intercept
real < lower =0 > scale_global ; # scale for the half -t prior for tau
real < lower =1 > nu_global ; # degrees of freedom for the half -t prior
# for tau
real < lower =1 > nu_local ; # degrees of freedom for the half - t priors
# for lambdas
real < lower =0 > slab_scale ; # slab scale for the regularized horseshoe
real < lower =0 > slab_df ; # slab degrees of freedom for the regularized
# horseshoe
}


Where is the correspondence with the Michael

// Expected number of large slopes
// Scale for large slopes
// Effective degrees of freedom for large slopes


In your implementation i see 5 variables to set up for the reg. horseshoe

real < lower =0 > scale_global
real < lower =1 > nu_global
real < lower =1 > nu_local
real < lower =0 > slab_scale
real < lower =0 > slab_df


While only 3 for Michael’s

real m0
real slab_scale
real slab_df = 25


If you could propose an implementation for the multinomial(softmax( alpha + beta * X )) framework, even a starting point to test would be awesome

Also would be helpful to know what do you think is the next best “more standard” prior to do this job? (e.g., double_exponential ?)

Thanks!

#8

@paul.buerkner I was thinking that reverse engineering one of your models could be a smarter route

I found a really good explanation on the regularised horseshoe parameters [https://rdrr.io/cran/brms/man/horseshoe.html]

The horseshoe prior is a special shrinkage prior initially proposed by Carvalho et al. (2009). It is symmetric around zero with fat tails and an infinitely large spike at zero. This makes it ideal for sparse models that have many regression coefficients, although only a minority of them is non-zero. The horseshoe prior can be applied on all population-level effects at once (excluding the intercept) by using set_prior(“horseshoe(1)”). The 1 implies that the student-t prior of the local shrinkage parameters has 1 degrees of freedom. This may, however, lead to an increased number of divergent transition in Stan. Accordingly, increasing the degrees of freedom to slightly higher values (e.g., 3) may often be a better option, although the prior no longer resembles a horseshoe in this case. Further, the scale of the global shrinkage parameter plays an important role in amount of shrinkage applied. It defaults to 1, but this may result in too few shrinkage (Piironen & Vehtari, 2016). It is thus possible to change the scale using argument scale_global of the horseshoe prior, for instance horseshoe(1, scale_global = 0.5). In linear models, scale_global will internally be multiplied by the residual standard deviation parameter sigma. See Piironen and Vehtari (2016) for recommendations how to properly set the global scale. The degrees of freedom of the global shrinkage prior may also be adjusted via argument df_global. Piironen and Vehtari (2017) recommend to specifying the ratio of the expected number of non-zero coefficients to the expected number of zero coefficients par_ratio rather than scale_global directly. As proposed by Piironen and Vehtari (2017), an additional regularization is applied that only affects non-zero coefficients. The amount of regularization can be controlled via scale_slab and df_slab. To make sure that shrinkage can equally affect all coefficients, predictors should be one the same scale. Generally, models with horseshoe priors a more likely than other models to have divergent transitions so that increasing adapt_delta from 0.8 to values closer to 1 will often be necessary. See the documentation of brm for instructions on how to increase adapt_delta.

I was looking for using something like

brm(

family=“multinomial”,
prior = “horseshoe(…)”

)

however I cannot find multinomial implementation, appears to not be included [https://cran.r-project.org/web/packages/brms/vignettes/brms_overview.pdf]

However I am thinking whether in principle your implementation of horseshoe for

poisson(alpha + beta * X)

model can be then converted to

multinomial( log( alpha + beta * X ) )

Or do you have any other better idea?

#9

If by multinomial you mean “categorical” that is one category per observation (generalization of bernoulli), you can use the categorical family of brms. If you mean multinomial as generalization of binomial, I indeed recommend going for the poisson transformation. See https://github.com/paul-buerkner/brms/issues/338 for some discussion about that.

#10

This is indeed the case. Just to clarify: with

you mean, create a model brms with Horseshoe + poisson, and then adapt the horseshoe to my custom model? (as I am using a model that definitely needs multinomial and no need to go through brms route for this one, just takee advantage of your nice implementations ;) )

Thanks!

#11

I ment that you can use the Poisson reparamterization of the multinomial distribution but you might as well just take the Stan code and adapt it to your needs ;-)

#12

I am trying to proceed and I report results while asking for some technical suggestions.

Mine could be a simple multinomial regression where categories are genes. However, the overall changes in genes can be asymmetric, where some genes increase in rate while other remain stable.

• gene0…70 : slope == 0
• gene71…100 : slope == 5

With a naive modelling we obtain slopes for the non changing genes that appear negative

A normalization that works is to assume there are some NON-changing genes, and force the slopes == 0 for them.

However I would like to exploit the sparsity assumption instead for “normalizing” the data: using something like Horseshoe prior of the slope

With that I improve my normalization, and the slope == 0 is actually centered on 0

I have two questions related to my model (attached at the end)

1. with the tuning of these parameters I could not decrease my false positives to 0, although intuitively I should be able to make the attraction to 0 arbitrarily stronger
• data_sigma$hs_df = 3 • data_sigma$hs_df_global = 1
• data_sigma$hs_df_slab = 2 • data_sigma$hs_par_ratio = 0.1
• data_sigma\$hs_scale_slab = 1
1. Maybe I am converting hs_par_ratio to hs_scale_global in a wrong way

in the formula

real hs_scale_global = hs_par_ratio / sqrt( AMOUNT_OF_DATA_POINTS);


I am setting AMOUNT_OF_DATA_POINTS as number of genes (e.g, 100) * number of replcates (e.g., 13 per gene regression), because I have my data organised as 100 regression with 13 point each

e.g.,

But I’m thinking that if I am out in such calculation of many order of magnitude then the % of non 0 matters little.

My model


functions{
/* Efficient computation of the horseshoe prior
* Args:
*   zb: standardized population-level coefficients
*   global: global horseshoe parameters
*   local: local horseshoe parameters
*   scale_global: global scale of the horseshoe prior
*   c2: positive real number for regularization
* Returns:
*   population-level coefficients following the horseshoe prior
*/
vector horseshoe(vector zb, vector[] local, real[] global,
real scale_global, real c2) {
int K = rows(zb);
vector[K] lambda = local[1] .* sqrt(local[2]);
vector[K] lambda2 = square(lambda);
real tau = global[1] * sqrt(global[2]) * scale_global;
vector[K] lambda_tilde = sqrt(c2 * lambda2 ./ (c2 + tau^2 * lambda2));
return zb .* lambda_tilde * tau;
}
}
data {
int<lower = 0> G;                   // all genes
int<lower = 0> T;                   // tube
int<lower=0> R;
int<lower = 0> y[T, G];             // RNA-seq counts
matrix[T,R] X;

// Horseshoe
real<lower=0> hs_df;
real<lower=0> hs_df_global;
real<lower=0> hs_df_slab;
real<lower=0> hs_par_ratio;
real<lower=0> hs_scale_slab;
}
transformed data{
real hs_scale_global = hs_par_ratio / sqrt(G*(T-R));
}
parameters {

// Linear model
row_vector[G] beta0;

// Horseshoe
vector[G] zb;
vector<lower=0>[G] hs_local[2];
real<lower=0> hs_global[2];
real<lower=0> hs_c2;

}

transformed parameters{
matrix[R, G] beta;
matrix[R-1, G] beta1;

// population-level effects
beta1[1] = to_row_vector( horseshoe(zb, hs_local, hs_global, hs_scale_global, hs_scale_slab^2 * hs_c2) );

beta = append_row(beta0, beta1);

}
model {

// Horseshoe
target += normal_lpdf(zb | 0, 1);
target += normal_lpdf(hs_local[1] | 0, 1) - 3 * log(0.5);
target += inv_gamma_lpdf(hs_local[2] | 0.5 * hs_df, 0.5 * hs_df);
target += normal_lpdf(hs_global[1] | 0, 1) - 1 * log(0.5);
target += inv_gamma_lpdf(hs_global[2] | 0.5 * hs_df_global, 0.5 * hs_df_global);
target += inv_gamma_lpdf(hs_c2 | 0.5 * hs_df_slab, 0.5 * hs_df_slab);
target += student_t_lpdf(beta0 | 3, 6, 10);

// Lnear system
beta0 ~ normal(0,5);

for (t in 1:T)
y[t] ~ multinomial( softmax(to_vector(X[t] * beta) ) );

}

`

Many regressions: unique (stable) function for exponential and s-shape behaviour
#13

How do you define false positive? If you define it as. e.g. 90% credible interval, then what is shown in your figure 9/70 false positives is reasonable.

I think this should be also 1, but I’m not certain of the specific parameterization of hs you are using

#14

I am taking the credible interval based on my total number of regressions. In my case 1 - (0.05 / G) = 0.9995 (which I guess is really limited by the number samples/iterations I have 250 * 4 chains, )

I will do more experiment with 1. The number 3 was suggested in brms documentation to avoid divergence

I should point out that further experiments with higher non-zero slopes (> than 1.5) result in a better FP rate

Intuitively I would like to pull the 0 slopes toward 0 with more strenght, but I guess is what the parameter hs_par_ratio does.

Just to confirm: in the classification framework (as you modelled in your article) I would have 20K predictors and 13 observations. It seems a desperate situation, but is what often people have.

then I should use real hs_scale_global = hs_par_ratio / sqrt( 13) ?

My goal is not classification by the way, it is ultimately multiple “hypothesis testing”

#15

Ah, that is an old recommendation. The regularized horseshoe in https://projecteuclid.org/euclid.ejs/1513306866 produces less divergences and you should also consider to use option adapt_delta=0.99 (or higher). With the number 3, the prior will have less shrinkage near zero than with the number 1.

Sorry, it is desperate situation even with 20 predictors. The effects have to be so clear that you would probably know the answer before fitting the model. I think people should spend the time to collect more data (or join data collect in different places). I think with these numbers there is almost probability of 1 to see nothing or see something falsely.

#16

From that study do you recommend the C1 or C2 model (I will start at the moment with the C2 since have been used in your study)?

Agreed, however this is a quite unique study, took me and some collaborators 2.5 years to collect data from key rare cell types from tissue that constitute 1-10% of the tissue normally, which signal would disappear within the mix.

I have 13 patients with 4 cell types collected each = 52 data points per gene (~ 20K genes). However at the moment I am analysing the 4 cell type distinctly and non hierarchically.

I know that this becomes almost a ranking exercise rather than real inference. When this pilot study will be out and I will get a decent grant, the first thing I will do is to increase the cohort ;)