Speed up a multiple linear regression (log scale)

Hello,

I have a model that is working well; however, it takes a really long time to run. Given the number of parameters and the simplicity of the model, I would expect much faster execution.

I have 25 dimensions, 500 points, 100 samples

The log data has mean 6 and SD 2, so the original data has a big range. Would standardisation make sense in this case?

Thus the formula is something like:

S = 100
P = 500
y[p, s] ~ normal(a[s,1] * b[p,1] + … + a[s,25] * b[p,25], sigma);

data{
  int G;                          // Number of marker genes
  int P;                          // Number of cell types
  int S;                          // Number of mix samples
  int R;                          // Number of covariates (e.g., treatments)
  int K;                          // (not used below)
  int group[S];                   // Covariate group of each sample

  vector[G] y[S];                 // Mix samples matrix (log scale)
  vector[P] expr[G];              // Reference expression of each marker gene across cell types (log scale)
}
parameters{
  simplex[P] pi[S];               // Cell type proportions per sample
  real<lower=0> sigma;            // Error scale
  vector<lower=0>[P] alpha[R];    // Dirichlet prior parameters for pi, per group
}
transformed parameters{
  vector[P] log_pi[S];
  vector[G] expr_conv[S];         // Expected log expression of each mix sample

  for(s in 1:S) log_pi[s] = log(pi[s]);
  for(s in 1:S) for(g in 1:G) {
    expr_conv[s,g] = log_sum_exp(expr[g] + log_pi[s]);   // log of sum_p exp(expr[g,p]) * pi[s,p]
  }
}
model{
  sigma ~ normal(0, 0.05);                                  // Prior on sigma

  for(s in 1:S) alpha[group[s]] ~ gamma(1.05, 0.05);        // Prior on the Dirichlet parameters
  for(s in 1:S) pi[s] ~ dirichlet(alpha[group[s]]);         // Proportions per sample
  for(s in 1:S) y[s] ~ student_t(2, expr_conv[s], sigma);   // Likelihood of the mix samples
}

Any suggestions? (Some simple regression tools in R take seconds, while my model takes several hours; I would like to stay competitive.) Thanks!

P.S. I am logging y outside Stan; should I use lognormal inside Stan instead?
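A minimal sketch of the question in the P.S. (not the deconvolution model above; all names here are placeholders): logging the data and using a normal likelihood gives the same posterior as a lognormal likelihood on the raw data, because the extra Jacobian term depends only on the data.

data {
  int<lower=1> N;
  vector<lower=0>[N] y;          // raw (positive) observations
}
transformed data {
  vector[N] log_y = log(y);      // logging inside Stan, once
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  log_y ~ normal(mu, sigma);     // same posterior as: y ~ lognormal(mu, sigma);
  // The two differ only by the constant term -sum(log(y)),
  // which does not depend on the parameters.
}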

Does it mix poorly (low n_eff / iterations), require too many log density evals per iteration, or just take a long time per log density eval?
Use better priors—that gamma is going to cause you pain because of the long tails. Also, Student-t with 2 degrees of freedom is very overdispersed. You might want to try something with more d.o.f. here.

That’s not a simple regression model with the log-sum-exps. I couldn’t tell what it’s doing.

Also, vectorize for efficiency, as in

alpha[group] ~ normal(0, 2);

And it looks like you’re providing effects per group for pi, which is probably causing a lot of pain, too.
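For concreteness, a sketch of a model block with those suggestions applied (the specific numbers are placeholders, not recommendations from this thread): a lighter-tailed half-normal prior on the positive-constrained alpha, stated once per group rather than once per sample, and a Student-t likelihood with more degrees of freedom.

model{
  sigma ~ normal(0, 0.05);

  // Prior stated once per group (looping over R), not once per sample:
  // the original loop over s repeats the prior for every sample in a group.
  for(r in 1:R)
    alpha[r] ~ normal(0, 2);                   // half-normal, given the <lower=0> constraint

  for(s in 1:S)
    pi[s] ~ dirichlet(alpha[group[s]]);

  for(s in 1:S)
    y[s] ~ student_t(7, expr_conv[s], sigma);  // more d.o.f. than 2; 7 is just a placeholder
}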

Thanks @Bob_Carpenter,

I will apply the edits. I have some questions/notes:

  1. I quite liked the gamma distribution for >0 parameters because:
    – it is stable around 0
    – it can take on a roughly normal shape as well
    – I thought that by acting on the variance I could control the long tail

It is sad that it is not recommended (it is true that it often leads to poor mixing), as it is also the conjugate distribution of many others (e.g., the NB center).

  2. I use the multiple regression in log form because the data is actually gamma distributed and the difference between predicted and observed is heteroscedastic (does this make sense?)

For real data this can be quite bad.

  3. I could cut the execution time by ~10x by replacing the for(...) log_sum_exp loop with a matrix multiplication on the natural scale, followed by taking the log of the whole matrix (see the sketch after this list). I am just checking that they produce the same result.

  4. [quote=“Bob_Carpenter, post:2, topic:919”]
    effects per group for pi, which is probably causing a lot of pain
    [/quote]

Pain in what sense?
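A sketch of what point 3 might look like, assuming expr in the data block is on the log scale (so exp(expr) is the natural-scale expression). Note that log_sum_exp exists for numerical stability; the matrix form is only safe if exp(expr) neither overflows nor underflows.

transformed data {
  // Natural-scale reference expression, computed once.
  matrix[G, P] expr_nat;
  for(g in 1:G)
    expr_nat[g] = exp(expr[g])';
}
transformed parameters {
  vector[G] expr_conv[S];
  for(s in 1:S)
    // One matrix-vector product per sample replaces G calls to log_sum_exp:
    //   log(expr_nat[g] * pi[s]) == log_sum_exp(expr[g] + log(pi[s]))
    expr_conv[s] = log(expr_nat * pi[s]);
}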

Stan doesn’t exploit conjugacy. And the gamma can be problematic both statistically and computationally (these often go together, as Andrew’s codified in his “folk theorem”). They push values away from zero and toward the tails. Andrew’s written about this specific case extensively and even the BUGS manuals no longer recommend diffuse gamma priors (though we’re not fond of the interval priors they recommend, which introduce bias problems if the intervals are too wide or if the data is consistent with values near the boundaries).

If there are too many degrees of freedom in some subset of parameters, the posterior becomes very broad and identifiability becomes very weak. If you wind up having a parameter per observation, unless they get controlled hierarchically, they can cause a lot of posterior uncertainty. At the extreme, they can introduce true non-identifiability and thus impropriety in the posterior.
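As an illustration of the "controlled hierarchically" point, a minimal hedged sketch that plugs into the program above (alpha_scale is a placeholder name, not something from this thread): the per-group alpha vectors share a common scale, so the groups partially pool information instead of each being free.

parameters {
  vector<lower=0>[P] alpha[R];         // per-group Dirichlet parameters, as before
  real<lower=0> alpha_scale;           // shared scale that ties the groups together
}
model {
  alpha_scale ~ normal(0, 1);          // half-normal on the shared scale
  for(r in 1:R)
    alpha[r] ~ normal(0, alpha_scale); // half-normal per group, partially pooled via alpha_scale
}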