Speed up a multiple linear regression (log scale)


I have a model that is working well, however it takes really long. Given the amount of parameters and the simplicity of the model I would expect a much faster execution

I have 25 dimensions, 500 points, 100 samples

the log data has mean 6 sd 2, so the original data has big range. Would standardisation make sense in this case?

Thus the formula is like:

S = 100
P = 500
y[p, s] ~ normal(a[s,1] * b[p,1] + … + a[s,25] * b[p,25]);

  int G;                                                           // Number of marker genes
  int P;                                                           // Number of cell types
  int S;                                                           // Number of mix samples 
  int R;                                                           // Number of covariates (e.g., treatments)
  int K;
  int group[S];                                                    // Array of covariates

  vector[G] y[S];                                                  // Mix samples matrix
vector[P] expr[G];    

  simplex[P] pi[S];                                                // Matrix of cell type proportions
  real<lower=0> sigma;                           // error
  vector<lower=0>[P] alpha[R];    // Prior to pi
transformed parameters{
	vector[P] log_pi[S];
	vector[G] expr_conv[S];    

	for(s in 1:S) log_pi[s] = log(pi[s]);
	for(s in 1:S) for(g in 1:G) {
		expr_conv[s,g] = log_sum_exp(expr[g] + log_pi[s]);
  sigma ~ normal(0,0.05);                                           // Prior to sigma
  for(s in 1:S) alpha[group[s]] ~ gamma(1.05,0.05);            
  for(s in 1:S) pi[s] ~ dirichlet(alpha[group[s]]);                //proportions
  for(s in 1:S) y[s] ~ student_t(2, expr_conv[s], sigma);    // Calculating probability of mix

Some suggestions? (Some simples regression tools in R take seconds, my model several hours, I would like to stay competitive) Thanks!

P.S. I am logging y outside stan, should I use log_normal inside stan?

Does it mix poorly (low n_eff / iterations), require too many log density evals per
Use better priors—that gamma is going to cause you pain because of the long tails. Also, Student-t with 2 degrees of freedom is very overdispersed. You might want to try something with more d.o.f. here.

that’s not a simple regression model with the log-sum-exps. I couldn’t tell what it’s doing.

Also, vectorize for efficiency, as in

alpha[group] ~ normal(0, 2);

And it looks like you’re providing effects per group for pi, which is probably causing a lot of pain, too.

Thanks @Bob_Carpenter ,

I will apply the edits. I have some questions/notes:

  1. I quite liked gamma distribution for >0 parameters because
    – stability around 0
    – can encompass normal shape as well
    – I though acting on the variance I could have controlled for long tail

It is sad that is not recommended (actually is true that leads to non mixing often), as is also conjugate distribution of many others (e.g., NB center)

  1. I use multiple regression in log form because the data is actually gamma distributed and the difference between predicted and observed is heteroschedastic (does this make sense?)

That for real data can be as bad as this

  1. I could cut the execution time of 10x by replacing for(...) log_sum_exp with matrix multiplication of natural values and the log the whole matrix. I am just checking that they produce the same result

  2. [quote=“Bob_Carpenter, post:2, topic:919”]
    effects per group for pi, which is probably causing a lot of pain

Pain in what sense?

Stan doesn’t exploit conjugacy. And the gamma can be problematic both statistically and computationally (these often go together, as Andrew’s codified in his “folk theorem”). They push values away from zero and toward the tails. Andrew’s written about this specific case extensively and even the BUGS manuals no longer recommend diffuse gamma priors (though we’re not fond of the interval priors they recommend, which introduce bias problems if the intervals are too wide or if the data is consistent with values near the boundaries).

  1. If there are too many degrees of freedom in some subset of parameters, the posterior becomes very broad and identifiability becomes very weak. If you wind up having a parameter per observation, unless they get controlled hierarchically, they can cause a lot of posterior uncertainty. At the extreme, they can introduce true non-identifiability and thus impropriety in the posterior.
1 Like