(Sorry for the possible lack of detail about the model, but this is at the moment a pretty general question.)
I implemented a multiple linear regression model, where the features are genes and their values are gene expression levels. The majority of the 20,000 genes are not informing the regression but rather adding noise (as expected).
I would like to find a feature selection method that, starting from simulated data (where I therefore know all components of the linear equation), lets me select the informative features.
It has been suggested that I look into Automatic Relevance Determination. Is this suitable for this goal?
If so, can I ask for clarification on some of the parameters in the Stan model from the manual?
parameters {
  vector<lower=0>[D] rho;
  real<lower=0> alpha; // (value of the coefficients?)
  ...
  vector[N] eta;
}
Without knowing how much data you have, it sounds like the classic problem the horseshoe prior is designed for. Basically, if the true effect is zero, the horseshoe will tend to shrink the linear regression coefficient to zero, but if the value truly isn't zero it won't shrink it at all. @avehtari just gave a great talk on this at StanCon that is probably already up on the StanCon website.
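For concreteness, here is a minimal sketch of a linear regression with a (plain, non-regularized) horseshoe prior in Stan; the data names and hyperprior scales are illustrative choices, not something from the talk:

data {
  int<lower=0> N;             // number of samples
  int<lower=0> M;             // number of features (genes)
  matrix[N, M] X;             // gene expression
  vector[N] y;                // outcome
}
parameters {
  real intercept;
  real<lower=0> sigma;        // residual scale
  real<lower=0> tau;          // global shrinkage
  vector<lower=0>[M] lambda;  // local shrinkage, one per gene
  vector[M] beta_raw;
}
transformed parameters {
  // implies beta[m] ~ normal(0, tau * lambda[m])
  vector[M] beta = tau * (lambda .* beta_raw);
}
model {
  tau ~ cauchy(0, 1);         // half-Cauchy via the lower=0 bound
  lambda ~ cauchy(0, 1);      // half-Cauchy via the lower=0 bound
  beta_raw ~ normal(0, 1);    // non-centered parameterization
  intercept ~ normal(0, 5);
  sigma ~ normal(0, 1);
  y ~ normal(intercept + X * beta, sigma);
}

The combination of the heavy-tailed local scales lambda and the global scale tau is what lets truly non-zero coefficients escape shrinkage while the rest collapse toward zero.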
Automatic relevance determination in GPs could, I guess, be used if you think there are higher-order effects of the genes, which there very well may be since the value is gene expression. For example, two genes on their own may not increase the expression of a third gene, but together they may. I would start with a linear model though, since GP hyperparameters can be hard to sample.
The parameters you mentioned are just hyperparameters for the squared exponential kernel. rho is a length scale and controls how smooth the GP is, and alpha controls the "amplitude" of the GP.
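Concretely, the ARD version of the squared exponential kernel gives each input dimension its own length scale, which is why rho is a vector of size D:

k(x_i, x_j) = \alpha^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_{i,d} - x_{j,d})^2}{\rho_d^2}\right)

A large rho_d flattens the kernel along dimension d, so that input barely affects the covariance, i.e. it is "irrelevant", which is where the name comes from.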
I don't have a huge number of independent test samples; my plan was first to simulate data from my training data, N = 600 (to be sampled from 20 different cell types and mixed in a linear fashion to form a tissue), and then to simulate data for a small independent new test set, N = 30.
The feature selection I had in mind was quite simple (without assuming higher-level dependencies): the perfect feature for me is one that is uncorrelated with everything else and has a high value in a specific cell type, i.e. uncorrelated and exclusively high (obviously gene values are highly noisy, which is why careful feature selection is necessary).
I would like a final list of weights from which I can exclude, once and for all, the irrelevant and/or too noisy features.
In that case I would definitely start with a linear model with a horseshoe prior over the weights. It is really convenient in the "high p, low n" regime.
Yes, if I understand correctly, I think it still applies, but you would do each Gene_m_tissue in a separate regression. By the way, I think this is one of the default RStanarm models, so you won't even have to write up your own model.
Hm, can you explain your experimental setup a bit more? The output is gene expression in M different cancer tissues? And you want to model how predictive the expression levels of various genes are of this output?
The goal is to determine the proportions of different cell types present in a tissue. Every cell type has some preferentially/exclusively activated genes (high expression); those genes are N ~ 20 out of a total of 20,000.
Using the gene values as observations, we want to deconvolute the cell types within the tissue. Every gene in a tissue is a weighted average of that gene's value in each cell type multiplied by the cell type's proportion within the tissue. This is a multiple regression problem of the form

Gene_m_tissue = Σ_j gene_m_cell_type_j × w_cell_type_j

where I want to infer w_cell_type; everything else is known. M = 20,000 and I want to reduce it dramatically.
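Not part of the original exchange, but a minimal Stan sketch of that deconvolution regression, assuming the cell-type profile matrix is known and the proportions sum to one (all names here are illustrative):

data {
  int<lower=1> M;          // number of genes
  int<lower=1> K;          // number of cell types
  matrix[M, K] G;          // gene_m_cell_type_k, known profiles
  vector[M] y;             // Gene_m_tissue, observed mixture
}
parameters {
  simplex[K] w;            // w_cell_type: proportions summing to 1
  real<lower=0> sigma;     // observation noise
}
model {
  sigma ~ normal(0, 1);
  y ~ normal(G * w, sigma);  // each gene is a weighted average of the profiles
}

The simplex type encodes the constraint that the cell-type proportions are non-negative and sum to one.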
My problem is to ignore all genes that are not "specific" to any cell type in particular. So far I have done this in a pretty naive, handcrafted way, selecting the genes that have higher expression in one cell type compared to the others, based on online resources. I would like to do gene selection in a more automated way.
For example, I would simulate many mixes from cell type profiles and: if I had infinite computing power, I would use a genetic algorithm that runs my Stan prediction model. I also thought about a neural network that learns from simulated data which genes do not help the prediction. Using a probabilistic network is another possibility.
So you have N cell types? And even if some gene is active in Gene_m_tissue with a high value, you don't want to use it if it's not specific to some cell type?
A gene can be "specific" to more than one cell type, of course, and that would be informative (I simplified a bit in my previous post).
The training data is incredibly noisy (gene values for cell types), so one of the main challenges is to just select observations (i.e., genes) that are “solid”.
The genes that I want to discard are:

- genes that have the same activation/expression in all cell types (~95% of them)
- genes that, even if specific to some cell types, are incredibly noisy; unless I use a full hierarchical model (something I have attempted without success and will try again in the future), the point estimates for those genes in each cell type add an incredible amount of noise that drives a wrong inference of w_cell_type in the multiple regression
The genes I ideally want are those with low variance in every cell type that are specifically high for some cell types and ~0 for the others.
The question is: what is the best machine learning approach to do that, given that I can produce a lot of artificial mixes/tissues from single cell types from online resources?
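Since the plan relies on producing artificial mixes, here is one way that simulation step could be sketched in Stan itself, as a generated-quantities-only program run with the fixed_param algorithm; the Dirichlet draw for the proportions and all names are assumptions, not something from the thread:

data {
  int<lower=1> M;            // number of genes
  int<lower=1> K;            // number of cell types
  matrix[M, K] G;            // known cell-type expression profiles
  real<lower=0> sigma;       // noise level for the simulated tissue
}
generated quantities {
  // random cell-type proportions on the simplex
  simplex[K] w_sim = dirichlet_rng(rep_vector(1.0, K));
  vector[M] y_sim;           // simulated tissue expression
  for (m in 1:M)
    y_sim[m] = normal_rng(dot_product(G[m], w_sim), sigma);
}

Each draw gives one artificial tissue with known ground-truth proportions w_sim, so any feature selection procedure can be scored against the truth.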
I’m still not sure I’m understanding. The gene expressions in the tissue are the data. When you say you want to ignore them do you mean you want to not use that data?
If M is roughly 20,000, how many different cell types are there (N)?
This almost seems like you have a matrix (gene_i_cell_type_j) and you want to multiply it by the vector w_cell_type_j to get the vector Gene_i_tissue. Is that true?