Feature selection, can Automatic Relevance Determination help? (CORRECTED = weighting observations, not features)


(Sorry for the possible lack of detail about the model, but this is at the moment a pretty general question.)

I implemented a multiple linear regression model where the features are genes and their values are gene expression. The majority of the 20,000 genes are not informing the regression but rather adding noise (as expected).

I would like to find a feature selection method that I can apply to simulated data (where I therefore know all components of the linear equation).

I have been suggested to look into Automatic Relevance Determination. Is this suitable for this goal?

If so, can I ask for clarification on what some of the parameters in the Stan model in the manual mean?

parameters {
  vector<lower=0>[D] rho;
  real<lower=0> alpha;  // (value of the coefficients?)
  vector[N] eta;
}

Without knowing how much data you have, it sounds like the classic problem the horseshoe prior is designed for. Basically, if the true effect is zero, the horseshoe will tend to shrink the linear regression coefficient to zero, but if the value truly isn’t zero it won’t shrink it at all. @avehtari just gave a great talk on this at StanCon that is probably already up on the StanCon website.
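
For intuition, here is a minimal numpy sketch of the horseshoe prior itself (prior draws only, not a Stan fit; the function name is mine): each coefficient gets its own half-Cauchy local scale, so draws pile up near zero while keeping very heavy tails for the few truly non-zero effects.

```python
import numpy as np

rng = np.random.default_rng(0)

def horseshoe_prior_draws(n_draws, tau=1.0):
    """Draw coefficients from a horseshoe prior:
    lambda_j ~ half-Cauchy(0, 1)   (local scale, one per coefficient)
    beta_j   ~ Normal(0, tau * lambda_j)   (tau is the global scale)."""
    lam = np.abs(rng.standard_cauchy(n_draws))  # half-Cauchy local scales
    return rng.normal(0.0, tau * lam)

draws = horseshoe_prior_draws(100_000)
# Lots of mass concentrates near zero (strong shrinkage of null effects),
# while the Cauchy tails leave room for genuinely large coefficients.
print(np.mean(np.abs(draws) < 0.1), np.max(np.abs(draws)))
```

This is why it suits the "most genes are noise, a few matter" setting: shrinkage is aggressive by default but escapes for strong signals.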

Automatic relevance determination in GPs could, I guess, be used if you think there are higher-order effects of the genes, which there very well may be if the value is gene expression. For example, two genes by themselves may not increase the expression of a third gene, but together they may. I would start with a linear model though, since GP hyper-parameters can be hard to sample.

The parameters you mentioned are just hyper-parameters for the squared exponential kernel. rho is a vector of length scales (one per input dimension) that controls how smooth the GP is in each direction, and alpha controls the “amplitude” of the GP.
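
Written out, that kernel is just a formula in rho and alpha; a small numpy sketch (the function name is mine, the formula follows the Stan manual’s ARD example):

```python
import numpy as np

def se_kernel_ard(x1, x2, rho, alpha):
    """Squared exponential kernel with one length scale per dimension:
    k(x, x') = alpha^2 * exp(-0.5 * sum_d ((x_d - x'_d) / rho_d)^2).
    A very large rho_d makes the GP nearly flat in dimension d, i.e. that
    input is effectively irrelevant -- this is what ARD exploits."""
    sq = np.sum(((np.asarray(x1) - np.asarray(x2)) / np.asarray(rho)) ** 2)
    return alpha ** 2 * np.exp(-0.5 * sq)

# alpha sets the amplitude: k(x, x) = alpha^2 regardless of rho.
print(se_kernel_ard([1.0, 2.0], [1.0, 2.0], rho=[1.0, 5.0], alpha=2.0))  # 4.0
```

After fitting, dimensions whose rho_d is estimated to be large can be read as irrelevant inputs.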


Thanks Arya,

I will search for @avehtari talk.

I don’t have an incredible amount of independent test samples. My plan was to simulate data with my training data first, N = 600 (to be sampled for 20 different cell types, to be mixed in a linear fashion to form a tissue), and then simulate data from a few new independent test samples, N = 30.

The feature selection I had in mind was quite simple (without assuming high-level dependencies): the perfect feature for me is one that is uncorrelated with everything else and has a high value in one specific cell type = uncorrelated and exclusively high. (Obviously gene values are highly noisy; that’s why careful feature selection is necessary.)

I would like a final list of weights from which I can exclude, “once and for all”, irrelevant and/or too-noisy features.

In that case I would definitely start with a linear model with a horseshoe prior over the weights. They are really convenient in the “high p, low n” regime.

Thanks Arya,

I have realized that I might have used the word “feature” improperly.

My linear system looks like:

Gene_1_tissue = gene_1_cell_type_1 * w_cell_type_1 + ... +  gene_1_cell_type_N * w_cell_type_N
Gene_2_tissue = gene_2_cell_type_1 * w_cell_type_1 + ... +  gene_2_cell_type_N * w_cell_type_N
...
Gene_M_tissue = gene_M_cell_type_1 * w_cell_type_1 + ... +  gene_M_cell_type_N * w_cell_type_N
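
The system above is a single matrix-vector product; a toy numpy illustration (the sizes and values here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 3                      # M genes, N cell types (toy sizes)
G = rng.random((M, N))           # gene_i_cell_type_j: known expression profiles
w = np.array([0.5, 0.3, 0.2])    # w_cell_type_j: the mixing proportions

gene_tissue = G @ w              # Gene_i_tissue: one value per gene
print(gene_tissue.shape)         # (5,)
```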

Basically I want to exclude genes (dots around the regression line) rather than cell_types.

In this case the weight I would like to get is:

Gene_1 * weight_1
Gene_M * weight_M

Does your proposed solution still apply?

Sorry for the confusion.

Yes, if I understand correctly, I think it still applies, but you would fit each Gene_1_tissue in a separate regression. By the way, I think this is one of the default rstanarm models, so you won’t even have to write up your own model.

Thanks Arya,

I’m a bit confused, Gene_1_tissue is just one observation in the model. Am I missing something?

Did you mean fitting each cell_type separately in a simple (non-multiple) linear regression?

Thinking more clearly about it, indeed I want to weight my observations rather than the “features”.

Thanks for your time; it was my bad, I got confused about “features”. I will open a new topic for clarity.

Hm, can you explain your experimental set up a bit more? The output is gene expression in M different cancer tissue? And you want to model how predictive the expression levels of various genes are on this output?

Thanks for your interest.

The goal is to determine the proportion of different cell types present in a tissue. Every cell type has some preferentially/exclusively activated genes (high expression); those are N ~ 20 genes out of a total of 20,000.

Using gene values as observations, we want to deconvolute the cell types within. Every gene in a tissue is a weighted average of that gene’s value in each cell type multiplied by the cell type’s proportion within the tissue. This is a multiple regression problem of the form I wrote above.

While I want to infer w_cell_type, everything else is known. M = 20,000 and I want to reduce it dramatically.

My problem is to ignore all genes that are not “specific” for any cell type in particular. So far I did it in a pretty naive and handcrafted way, selecting from online resources the genes that have higher expression in one cell type compared to the others. I would like to do gene selection in a more automated way.

For example, I would simulate many mixes from cell type profiles and: if I had infinite computing power, I would use a genetic algorithm that runs my Stan prediction model. I also thought about a neural network that learns from simulated data which genes do not help the prediction. Using a probabilistic network is another possibility, something like:


 for (n in 1:N)
    target += w[n] * normal_lpdf(y[n] | x[n] * beta, sigma);  // weight each observation’s log-likelihood contribution

But this approach has been debated as problematic, and after implementing it, it didn’t select any particular genes with w ~ dirichlet().
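
The observation-weighting idea in that snippet has a direct least-squares analogue, which may help build intuition; a toy numpy sketch (here the weights are hand-picked to downweight known-noisy observations, not inferred as in the Stan snippet):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)
y[:20] += rng.normal(scale=5.0, size=20)   # first 20 observations are very noisy

w = np.ones(n)
w[:20] = 0.01                              # downweight the noisy observations

# Weighted least squares: minimize sum_n w_n * (y_n - x_n . beta)^2,
# the optimization-level analogue of weighting each log-likelihood term.
sw = np.sqrt(w)
beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_ols)  # the weighted estimate sits closer to beta_true
```

The hard part, of course, is learning which observations to downweight, which is exactly where putting a prior like the Dirichlet on w gets delicate.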

Any idea?
Thanks a lot.

So you have N cell types? And even if some gene is active in Gene_m_tissue with a high value, you don’t want to see it if it’s not specific for some cell type?


A gene can of course be “specific” for more than one cell type, and that would be informative (I simplified a bit in my previous post).

The training data (gene values for cell types) is incredibly noisy, so one of the main challenges is simply to select observations (i.e., genes) that are “solid”.

The genes that I want to discard are:

  1. genes that have the same activation/expression in all cell types (~95% of them)
  2. genes that, even if specific for some cell types, are incredibly noisy; unless I use a full hierarchical model (which I have tried without success, and will try again in the future), the point estimates for those genes in each cell type add an incredible amount of noise that drives a wrong inference of w_cell_type in the multiple regression.

The genes I ideally want are those with low variance across all cell types that are specific for some cell types while ~0 for the others.
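
That ideal can be turned into a simple automated filter, assuming replicate expression profiles per cell type are available; the function name and thresholds below are placeholders of mine, just to sketch the idea:

```python
import numpy as np

def select_genes(expr, var_thresh=1.0, fold_thresh=5.0):
    """expr has shape (n_cell_types, n_replicates, n_genes).
    Keep genes whose within-cell-type variance is low for every cell type
    AND whose highest mean expression clearly dominates the second highest
    (specific for some cell type, ~0 for the others)."""
    within_var = expr.var(axis=1)          # (n_cell_types, n_genes)
    means = expr.mean(axis=1)              # (n_cell_types, n_genes)
    sorted_means = np.sort(means, axis=0)  # ascending per gene
    top, second = sorted_means[-1], sorted_means[-2]
    low_noise = (within_var < var_thresh).all(axis=0)
    specific = top > fold_thresh * (second + 1e-8)
    return low_noise & specific            # boolean mask over genes

# Toy check: 3 cell types, 4 replicates, 3 genes.
expr = np.zeros((3, 4, 3))
expr[0, :, 0] = 10.0                     # gene 0: high only in cell type 0, no noise
expr[:, :, 1] = 5.0                      # gene 1: identical everywhere (uninformative)
expr[0, :, 2] = [10.0, 0.0, 20.0, 10.0]  # gene 2: specific but very noisy
print(select_genes(expr))                # only gene 0 survives
```

A horseshoe-type model would do this shrinkage jointly instead of as a hard pre-filter, but a deterministic pass like this is easy to validate on simulated mixes.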

The question is: what is the best machine learning approach to do that, given that I can produce a lot of artificial mixes/tissues from single cell types from online resources?

Thanks a lot.

I’m still not sure I’m understanding. The gene expressions in the tissue are the data. When you say you want to ignore them do you mean you want to not use that data?

If M is roughly 20,000, how many different cell types are there (N)?

This almost seems like you have a matrix (gene_i_cell_type_j) and you want to multiply it by the vector w_cell_type_i to get the vector Gene_i_tissue. Is that true?


Correct. I know it is a bit outside the generative-model logic, but in biology it makes a lot more sense.

N is small, from 3 to 9.

Yes, the objects

  • gene_i_cell_type_j and Gene_i_tissue are known
  • w_cell_type_i is unknown

Normally w_cell_type_i and Gene_i_tissue are matrices made of multiple column vectors (because I have many samples).

Therefore I have the matrix multiplication:

gene_i_cell_type_j x w_cell_type_j_sample_s = Gene_i_tissue_sample_s
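
Stacking the samples, that is G (M×N, genes by cell types) times W (N×S) = T (M×S). Since G and T are known, a quick numpy sketch recovering W by unconstrained least squares (toy sizes; a real model would constrain each column of W to the simplex, e.g. a simplex[N] parameter in Stan):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, S = 1000, 4, 10            # genes, cell types, tissue samples (toy sizes)
G = rng.random((M, N))           # gene_i_cell_type_j: known profiles

W_true = rng.dirichlet(np.ones(N), size=S).T       # (N, S), columns sum to 1
T = G @ W_true + rng.normal(scale=0.01, size=(M, S))  # noisy tissue expression

W_hat, *_ = np.linalg.lstsq(G, T, rcond=None)      # solve G @ W = T for W
print(np.max(np.abs(W_hat - W_true)))              # small: proportions recovered
```

With M far larger than N the system is heavily overdetermined, which is why even noisy gene values can pin down a handful of cell type proportions, and why dropping uninformative genes mostly removes noise rather than information.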