Help optimizing performance for genetic model

Hello all,

I am very interested in using this model for some enquiries of my own.

The model is very simple: we have f, a vector of uniformly distributed frequencies between 0 and 0.5; Fst, a scalar parameter relating populations; and beta, the genetic effects for each f.
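Concretely, for each SNP $j$ (this is just the code below written in symbols):

$$f_j \sim \operatorname{Beta}\!\left(\frac{1-F_{st}}{F_{st}}\,p_j,\ \frac{1-F_{st}}{F_{st}}\,(1-p_j)\right), \qquad \beta_j \sim \operatorname{Normal}\!\left(0,\ \bigl(p_j(1-p_j)\bigr)^{S/2}\sigma_b\right)$$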

The authors state that for 10,000 SNPs their model needs one day to run; on my machine it takes less than an hour. Still, I think their original code can be improved.

In my case I plan to use it for hundreds of thousands of betas, so any performance gain would be welcome. I would be grateful for any thoughts on how to improve performance.

This is the original authors' code:

data {
  int<lower=0> J; // number of snps
  real beta[J]; // effect sizes
  real<lower=0> f[J]; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
parameters {
  real S; 
  real<lower=0> sigma_b;
  real<lower=0,upper=0.5> p[J]; // true minor allele frequencies
}
transformed parameters {
}
model {
    S ~ uniform(-2,2);
    sigma_b ~ uniform(0,2);
    for(j in 1:J){
        f[j] ~ beta( ((1-Fst)/Fst)*p[j], ((1-Fst)/Fst)*(1-p[j]) );
        beta[j] ~ normal(0,pow(p[j]*(1-p[j]),S/2)*sigma_b);
    }
}

which I have vectorized as follows:

data {
  int<lower=0> J; // number of snps
  vector[J] beta; // effect sizes
  vector<lower=0,upper=1>[J] f; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
parameters {
  real<lower=-2,upper=2> S; 
  real<lower=0, upper=2> sigma_b;
  vector<lower=0,upper=0.5>[J] p; // true minor allele frequencies
}
transformed parameters {
  vector[J] snpsd;
  for(j in 1:J)
      snpsd[j] = pow(p[j]*(1-p[j]),S/2)*sigma_b;
}
model {
  // vectorised likelihood
  f ~ beta(((1-Fst)/Fst)*p, ((1-Fst)/Fst)*(1-p));
  beta ~ normal(0, snpsd);
}

This really does not make much of a difference performance-wise, from what I tested.

I’m short on time, so I can only offer partial advice. If you can write the snpsd computation in log space, you might be able to use one of the dot_product tricks lying around. You’d have to transform back in the model block, but it might just be worth it. For loops are reasonably fast in Stan, so you don’t always gain much by getting rid of them outside the model block.
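Something like this, maybe (a rough, untested sketch of the normal part only; the additive constant of the normal density is dropped, which is fine for sampling):

model {
  // log-scale sds via vectorized log ops instead of pow() in a loop
  vector[J] l_snpsd = 0.5 * S * (log(p) + log1m(p)) + log(sigma_b);
  // normal(0, sd) log density written out, up to a constant:
  //   sum_j [ -log(sd_j) - 0.5 * (beta_j / sd_j)^2 ]
  target += -sum(l_snpsd) - 0.5 * dot_self(beta .* exp(-l_snpsd));
  // the f ~ beta(...) statement would stay as before
}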

The other observation, made for numerical stability rather than performance, is that you might want to do computation in log space whenever possible. So a line like pow(p[j]*(1-p[j]),S/2)*sigma_b can be improved. It’s not obvious to me how this can be dramatically improved, though. @bbbales2?

Parameter constraints should match the support of the distributions you’re using, so use:

real<lower=-2.0, upper=2.0> S; 
real<lower=0, upper=2.0> sigma_b;

for

S ~ uniform(-2,2);
sigma_b ~ uniform(0,2);

to avoid trouble. The current recommendation is to avoid uniform priors and hard constraints unless you really need them. If your model runs, this probably isn’t such a big deal though.

Maybe do reduce_sum with this model: https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html ?

Hi, thank you for replying.

By working in log space you mean something like:

l_snpsd[j] = (0.5*S) * (log(p[j]) + log(1 - p[j])) + log(sigma_b);
...
beta ~ normal(0, exp(l_snpsd));

or is there something else about it?

Thank you for replying. Indeed, I got rid of the uniforms in the vectorised form of the code I posted; it really made no difference in my case.

Thank you for the pointer to reduce_sum; maybe this, combined with working in log space as @maxbiostat suggested, can do the trick for both speed and numerical stability.

That’s pretty much it. One thing you can do is to use log1m(p[j]). Also, I recommend you write the f ~ beta(...) statement as something like


transformed data {
  real log_Fst = log(Fst);
}
transformed parameters {
  ...
  vector[J] log_Fp = log1m_exp(log_Fst) - log_Fst + log(p);
  vector[J] log_F1mp = log1m_exp(log_Fst) - log_Fst + log1m(p);
}

model {
  // vectorised likelihood
  f ~ beta(exp(log_Fp), exp(log_F1mp));
  ...
}
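(For reference: log1m_exp(log_Fst) is just log(1 - Fst), so log_Fp and log_F1mp are log(((1-Fst)/Fst)*p) and log(((1-Fst)/Fst)*(1-p)) computed entirely on the log scale.)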

Here is my model so far:

functions {
  real partial_sum(vector slice_beta,
                   int start, int end,
                   vector l_snpsd,
                   vector f,
                   vector l_fp,
                   vector l_mfp) {
    return normal_lpdf(slice_beta | 0, exp(l_snpsd[start:end]))
           + beta_lpdf(f[start:end] | exp(l_fp[start:end]), exp(l_mfp[start:end]));
  }
}
data {
  int<lower=0> J; // number of snps
  vector[J] beta; // effect sizes
  vector<lower=0,upper=1>[J] f; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
transformed data {
  real log_Fst = log(Fst);
  real lm1_Fst = log1m_exp(log_Fst); // log(1 - Fst)
}
parameters {
  real<lower=-2,upper=2> S; 
  real<lower=0, upper=2> sigma_b;
  vector<lower=0,upper=0.5>[J] p; // true minor allele frequencies
}
transformed parameters {
  vector[J] l_snpsd;
  vector[J] log_Fp;
  vector[J] log_F1mp;
  vector[J] l_p;
  vector[J] l_mp;
  l_p = log(p);
  l_mp = log1m(p);
  log_F1mp = lm1_Fst - log_Fst + l_mp;
  log_Fp = lm1_Fst - log_Fst + l_p;
  l_snpsd = (0.5 * S) * (l_p + l_mp) + log(sigma_b);
}
model {
  // vectorised likelihood
  // f ~ beta( exp(log_Fp), exp(log_F1mp));
  // beta ~ normal(0,exp(l_snpsd));
  target += reduce_sum(partial_sum, beta, 1,
                       l_snpsd, f, log_Fp, log_F1mp);
}

When I try to compile it using rstan, partial_sum is not recognised and throws an error:

    SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable "partial_sum" does not exist.
 error in 'model4b902f1d0df5_drift' at line 47, column 36

Do I have to install cmdstan to make it work or did I make a silly mistake?

Hi Daniel,

Yep, the reduce_sum approach requires CmdStan or one of its interfaces (cmdstanr or cmdstanpy), since they’re at a more recent Stan version than rstan.

Hi everyone, thank you very much for your help. I ran into the following problem:

     Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: normal_lpdf: Scale parameter[1] is 0, but must be > 0! (in '/var/folders/5k/602cw74516l4k3tt9j9mx6r00000gn/T/RtmpIajo9P/model-8ffa71e9f5d1.stan', line 48, column 3 to column 33)

I only have this problem when trying to run with cmdstan (even without using reduce_sum), but not while running with rstan. I attach my source code: both the R script used to simulate the data and run the models, and the two Stan models. I am only having this problem with drift.stan. To run, use either

Rscript sim_bayesinf.R

or as follows to use cmdstan

Rscript sim_bayesinf.R cmdstan

Any help would be greatly appreciated.
drift.stan (1.3 KB) sim_bayesinf.R (3.0 KB) simple.stan (432 Bytes)

Does the sampling stop after you get that error, or does it continue? I thought that was more like a warning; the sampler should keep going afterwards.

Sorry, my previous response turned out to be inaccurate: the sampler does in fact continue, it just did so without showing the gradient evaluation estimate. I will look at the output and compare with rstan. Thank you very much.

Quick update: I changed sigma from being a parameter to being data (using the estimate from another method), and that solved the warnings at the beginning.

I am only interested in the S parameter. I have managed to run the model on my data (8e6 data points); using reduce_sum I can run the MLE in under 12 seconds and NUTS in around 10 hours for 4000 samples.

Now comes the problem: my output files are humongous and I am having problems summarising them (in fact, I have not managed to do so). So here comes the question: is there a way to instruct cmdstan to only save the parameter “S” without saving all the transformed parameters (which are the same dimension as the data)?

Can you move these calculations to the model block? If you define the variables in the model block then they won’t be saved in the output.

The transformed parameters block is specifically for quantities you want to save in the output, and it sounds like you don’t want to save these things.
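For example, something along these lines (a sketch of that idea applied to your model above, untested):

model {
  // local variables: recomputed while sampling but never written to the output
  vector[J] l_p = log(p);
  vector[J] l_mp = log1m(p);
  vector[J] l_snpsd = (0.5 * S) * (l_p + l_mp) + log(sigma_b);
  vector[J] log_Fp = lm1_Fst - log_Fst + l_p;
  vector[J] log_F1mp = lm1_Fst - log_Fst + l_mp;
  target += reduce_sum(partial_sum, beta, 1,
                       l_snpsd, f, log_Fp, log_F1mp);
}

The transformed parameters block can then be deleted entirely. (Note that p is still a J-dimensional parameter, so it will still be written to the output.)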
