Help optimizing performance for genetic model

Hello all,

I am very interested in using this model for some enquiries of my own.

The model is very simple: we have f, a vector of uniformly distributed frequencies between 0 and 0.5; Fst, a scalar parameter relating populations; and beta, the genetic effects for each f.
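Concretely, for each SNP $j$ (this is just the code below written in symbols):

$$f_j \sim \operatorname{Beta}\!\left(\frac{1-F_{st}}{F_{st}}\,p_j,\ \frac{1-F_{st}}{F_{st}}\,(1-p_j)\right), \qquad \beta_j \sim \operatorname{Normal}\!\left(0,\ \bigl(p_j(1-p_j)\bigr)^{S/2}\sigma_b\right)$$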

The authors state that for 10,000 SNPs their model needs one day to run; on my machine it takes less than an hour. Still, I think their original code can be improved.

In my case I plan to use it for hundreds of thousands of betas, so any performance gain would be welcome. I would be grateful for any thoughts on how to improve performance.

This is the original authors' code:

data {
  int<lower=0> J; // number of snps
  real beta[J]; // effect sizes
  real<lower=0> f[J]; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
parameters {
  real S; 
  real<lower=0> sigma_b;
  real<lower=0,upper=0.5> p[J]; // true minor allele frequencies
}
transformed parameters {
}
model {
    S ~ uniform(-2,2);
    sigma_b ~ uniform(0,2);
    for(j in 1:J){
        f[j] ~ beta( ((1-Fst)/Fst)*p[j], ((1-Fst)/Fst)*(1-p[j]) );
        beta[j] ~ normal(0,pow(p[j]*(1-p[j]),S/2)*sigma_b);
    }
}

which I have vectorized as follows:

data {
  int<lower=0> J; // number of snps
  vector[J] beta; // effect sizes
  vector<lower=0,upper=1>[J] f; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
parameters {
  real<lower=-2,upper=2> S; 
  real<lower=0, upper=2> sigma_b;
  vector<lower=0,upper=0.5>[J] p; // true minor allele frequencies
}
transformed parameters {
  vector[J] snpsd;
  for(j in 1:J)
      snpsd[j] = pow(p[j]*(1-p[j]),S/2)*sigma_b;
}
model {
  // vectorised likelihood
  f ~ beta(((1-Fst)/Fst)*p, ((1-Fst)/Fst)*(1-p));
  beta ~ normal(0, snpsd);
}

This really does not make much of a difference performance-wise, from what I tested.

I’m short on time, so I can only offer partial advice. If you can write the snpsd computation in log space, you might be able to use one of the dot_product tricks lying around. You’d have to transform back in the model block, but it might just be worth it. For loops are reasonably fast in Stan, so you don’t always gain much by getting rid of them outside the model block.
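Something like this, maybe (a rough, untested sketch of the normal part only; the additive constant of the normal density is dropped, which is fine for sampling):

model {
  // log-scale sds via vectorized log ops instead of pow() in a loop
  vector[J] l_snpsd = 0.5 * S * (log(p) + log1m(p)) + log(sigma_b);
  // normal(0, sd) log density written out, up to a constant:
  //   sum_j [ -log(sd_j) - 0.5 * (beta_j / sd_j)^2 ]
  target += -sum(l_snpsd) - 0.5 * dot_self(beta .* exp(-l_snpsd));
  // the f ~ beta(...) statement would stay as before
}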

The other observation, made for numerical stability rather than performance, is that you might want to do computation in log space whenever possible. So a line like pow(p[j]*(1-p[j]),S/2)*sigma_b can be improved. It’s not obvious to me how this can be dramatically improved, though. @bbbales2?

Parameter constraints should match the support of the distributions you’re using, so use:

real<lower=-2.0, upper=2.0> S; 
real<lower=0, upper=2.0> sigma_b;

for

S ~ uniform(-2,2);
sigma_b ~ uniform(0,2);

to avoid trouble. The current recommendation is to avoid uniform priors and hard constraints unless you really need them. If your model runs, this probably isn’t such a big deal though.

Maybe do reduce_sum with this model: https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html ?

Hi, thank you for replying.

By working in log space you mean something like:

l_snpsd[j] = (0.5*S) * (log(p[j]) + log(1 - p[j])) + log(sigma_b);
...
beta ~ normal(0, exp(l_snpsd));

or is there something else about it?

Thank you for replying. Indeed, I got rid of the uniforms in the vectorised form of the code I posted; it really made no difference in my case.

Thank you for the pointer to reduce_sum; maybe this, combined with working in log space as @maxbiostat suggested, can do the trick for both speed and numerical stability.

That’s pretty much it. One thing you can do is to use log1m(p[j]). Also, I recommend you write the f ~ beta(...) statement as something like


transformed data {
  real log_Fst = log(Fst);
}
transformed parameters {
  ...
  vector[J] log_Fp = log1m_exp(log_Fst) - log_Fst + log(p);
  vector[J] log_F1mp = log1m_exp(log_Fst) - log_Fst + log1m(p);
}

model {
  // vectorised likelihood
  f ~ beta(exp(log_Fp), exp(log_F1mp));
  ...
}
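(For reference: log1m_exp(log_Fst) is just log(1 - Fst), so log_Fp and log_F1mp are log(((1-Fst)/Fst)*p) and log(((1-Fst)/Fst)*(1-p)) computed entirely on the log scale.)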

Here is my model so far:

functions {
  real partial_sum(vector slice_beta,
                   int start, int end,
                   vector l_snpsd,
                   vector f,
                   vector l_fp,
                   vector l_mfp) {
    return normal_lpdf(slice_beta | 0, exp(l_snpsd[start:end]))
           + beta_lpdf(f[start:end] | exp(l_fp[start:end]), exp(l_mfp[start:end]));
  }
}
data {
  int<lower=0> J; // number of snps
  vector[J] beta; // effect sizes
  vector<lower=0,upper=1>[J] f; // minor allele frequencies
  real<lower=0> Fst; // Known value of Fst
}
transformed data {
  real log_Fst = log(Fst);
  real lm1_Fst = log1m_exp(log_Fst); // log(1 - Fst)
}
parameters {
  real<lower=-2,upper=2> S; 
  real<lower=0, upper=2> sigma_b;
  vector<lower=0,upper=0.5>[J] p; // true minor allele frequencies
}
transformed parameters {
  vector[J] l_snpsd;
  vector[J] log_Fp;
  vector[J] log_F1mp;
  vector[J] l_p;
  vector[J] l_mp;
  l_p = log(p);
  l_mp = log1m(p);
  log_F1mp = lm1_Fst - log_Fst + l_mp;
  log_Fp = lm1_Fst - log_Fst + l_p;
  l_snpsd = (0.5 * S) * (l_p + l_mp) + log(sigma_b);
}
model {
  // vectorised likelihood
  // f ~ beta( exp(log_Fp), exp(log_F1mp));
  // beta ~ normal(0,exp(l_snpsd));
  target += reduce_sum(partial_sum, beta, 1,
                       l_snpsd, f, log_Fp, log_F1mp);
}

When I try to compile it using rstan, partial_sum is not recognised and throws an error:

    SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable "partial_sum" does not exist.
 error in 'model4b902f1d0df5_drift' at line 47, column 36

Do I have to install cmdstan to make it work or did I make a silly mistake?

Hi Daniel,

Yep, the reduce_sum approach requires CmdStan or one of its interfaces (cmdstanr or cmdstanpy), since they’re at a more recent Stan version than rstan.

Hi everyone, thank you very much for your help. I ran into the following problem:

     Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: normal_lpdf: Scale parameter[1] is 0, but must be > 0! (in '/var/folders/5k/602cw74516l4k3tt9j9mx6r00000gn/T/RtmpIajo9P/model-8ffa71e9f5d1.stan', line 48, column 3 to column 33)

I only have this problem when trying to run with cmdstan (even without using reduce_sum), but not while running with rstan. I attach my source code: both the R script used to simulate the data and run the models, and the two Stan models. I am only having this problem with drift.stan. To run, use either

Rscript sim_bayesinf.R

or as follows to use cmdstan

Rscript sim_bayesinf.R cmdstan

Any help would be greatly appreciated.
drift.stan (1.3 KB) sim_bayesinf.R (3.0 KB) simple.stan (432 Bytes)

Does the sampling stop after you get that error, or does it continue? I thought that was more like a warning; the sampler should keep going afterwards.

Sorry, my previous response turned out to be inaccurate: the sampler does in fact continue, it just did so without showing the gradient evaluation estimate. I will look at the output and compare with rstan. Thank you very much.

Quick update: I changed sigma from being a parameter to being data (using the estimate from another method), and that solved the warnings at the beginning.

I am only interested in the S parameter. I have managed to run the model on my data (8e6 data points); using reduce_sum I can run the MLE in under 12 seconds and NUTS in around 10 hours for 4000 samples.

Now comes the problem: my output files are humongous and I am having problems summarising them (in fact, I have not managed to do so). So here comes the question: is there a way to instruct cmdstan to only save the parameter “S” without saving all the transformed parameters (which are the same dimension as the data)?

Can you move these calculations to the model block? If you define the variables in the model block then they won’t be saved in the output.

The transformed parameters block is specifically for quantities you want to save in the output, and it sounds like you don’t want to save these things.
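For example, something along these lines (a sketch of that idea applied to your model above, untested):

model {
  // local variables: recomputed while sampling but never written to the output
  vector[J] l_p = log(p);
  vector[J] l_mp = log1m(p);
  vector[J] l_snpsd = (0.5 * S) * (l_p + l_mp) + log(sigma_b);
  vector[J] log_Fp = lm1_Fst - log_Fst + l_p;
  vector[J] log_F1mp = lm1_Fst - log_Fst + l_mp;
  target += reduce_sum(partial_sum, beta, 1,
                       l_snpsd, f, log_Fp, log_F1mp);
}

The transformed parameters block can then be deleted entirely. (Note that p is still a J-dimensional parameter, so it will still be written to the output.)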
