Adapt a probability mass function for increased speed

Dear Stan community. I am providing brms with a custom power-law pmf, and I'd like to know whether I have missed any obvious ways to speed up computation, as fitting is currently very time-consuming. Any advice is much appreciated!

real power_lpmf(int y, real mu) {
  vector[200] support;
  for (i in 1:200) {
    support[i] = 1 / (i^mu);
  }
  return log((1 / (y^mu)) / sum(support));
}

Here’s the power-law distribution. N = 200 is fine for my purposes, but later on I may want to increase it to ~1000:

f(y;\mu,N)={\frac {1/y^{\mu}}{\sum \limits _{i=1}^{N}(1/i^{\mu})}}.
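For concreteness, here is a direct Python transcription of this pmf (values of N and mu are made up for illustration); it confirms that the pmf normalizes to 1 over the support 1..N:

```python
import math

# Hypothetical values, not from the actual model:
N, mu = 200, 1.5

def power_pmf(y, mu, N):
    # f(y; mu, N) = (1 / y^mu) / sum_{i=1}^{N} (1 / i^mu)
    denom = sum(1.0 / i**mu for i in range(1, N + 1))
    return (1.0 / y**mu) / denom

# Sums to 1 over the truncated support:
assert abs(sum(power_pmf(y, mu, N) for y in range(1, N + 1)) - 1.0) < 1e-9
```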

You can optimise this by moving to the log scale and by using vectorised functions.

Firstly, rather than looping over N (200), you can create a container of the sequence 1:N:

linspaced_vector(N, 1, N)

The construction of support then becomes:

support = 1 / pow(linspaced_vector(N, 1, N), mu)

Given that \log(1/i^\mu) = \log(1) - \mu \log(i) = -\mu \log(i), we can represent this in Stan as:

  vector[N] log_support = -mu * log(linspaced_vector(N, 1, N));

Then we do the same for the numerator, where \log(1/y^\mu) = \log(1) - \mu \log(y) = -\mu \log(y).

Putting this together, the function is then defined as:

real power_lpmf(int N, int y, real mu) {
  vector[N] log_support = -mu * log(linspaced_vector(N, 1, N));
  return -lmultiply(mu, y) - log_sum_exp(log_support);
}
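As a quick numerical sanity check (in Python rather than Stan, with made-up values for N, mu, and y), the log-scale formulation agrees with the original direct computation:

```python
import math

# Hypothetical values for illustration:
N, mu, y = 200, 1.5, 7

# Direct computation, as in the original power_lpmf:
support = [1.0 / i**mu for i in range(1, N + 1)]
direct = math.log((1.0 / y**mu) / sum(support))

# Log-scale computation, as in the rewritten power_lpmf:
log_support = [-mu * math.log(i) for i in range(1, N + 1)]
m = max(log_support)
log_sum_exp = m + math.log(sum(math.exp(v - m) for v in log_support))
log_scale = -mu * math.log(y) - log_sum_exp

assert abs(direct - log_scale) < 1e-9
```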

Note that I’ve also added a first argument N, which lets you specify the number of terms rather than hardcoding it.


Note that a further optimisation would be to compute log(linspaced_vector(N, 1, N)) in the transformed data block and then pass it as an argument to the function, rather than re-computing at every iteration.
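The idea can be sketched in Python (in the Stan version, the log sequence would be computed in the transformed data block and passed to the function as a data argument; all names and values here are illustrative):

```python
import math

N = 200  # hypothetical

# Computed once up front, analogous to Stan's transformed data block:
log_seq = [math.log(i) for i in range(1, N + 1)]

def power_lpmf(y, mu, log_seq):
    # log_seq is passed in, so it is not rebuilt on every evaluation
    log_support = [-mu * l for l in log_seq]
    m = max(log_support)
    lse = m + math.log(sum(math.exp(v - m) for v in log_support))
    return -mu * math.log(y) - lse

# The pmf still normalizes over its support:
total = sum(math.exp(power_lpmf(y, 1.5, log_seq)) for y in range(1, N + 1))
assert abs(total - 1.0) < 1e-9
```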


There is something about linspaced_vector that I can’t quite make sense of at the moment. The documentation says (to the best of my knowledge) that its bounds are data reals: linspaced_vector(int n, data real lower, data real upper)

Trying to sample it in rstan with

"// generated with brms 2.16.1
functions {
  real power_lpmf(int y, int N, real mu) {
    vector[N] log_support = -mu * log(linspaced_vector(N, 1, N));
    return -lmultiply(mu, y) - log_sum_exp(log_support);
  }
}
data {
  int<lower=1> N;  // total number of observations
  int Y[N];  // response variable
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
}
transformed parameters {
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    for (n in 1:N) {
      // apply the inverse link function
      mu[n] = exp(mu[n]);
    }
    for (n in 1:N) {
      target += power_lpmf(Y[n] | 200, mu[n]);
    }
  }
  // priors including constants
  target += normal_lpdf(Intercept | 1, 0.01);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}"

has me running into errors, saying

No matches for:

linspaced_vector(int, int, int)

Function linspaced_vector not found.

I’m not sure which of the two (the documentation or the error message) I should believe. I tried changing its input types in several ways, so far not successfully.

Ah right, the linspaced_vector function isn’t currently available in the version of rstan on CRAN. Can you try installing the preview of the next version through:

remove.packages(c("StanHeaders", "rstan"))
install.packages("StanHeaders", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
install.packages("rstan", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

And then trying again?


Thanks, I’ve now managed to run a model with some random effects, 3000 observations, and N = 100; however, it didn’t converge. I fear it will still be too slow for larger data anyway.
Taking this as a small hint that the distribution might not fit the data that well after all, I’m trying the logarithmic distribution below, which fits better than a number of other discrete distributions and outperforms those models in terms of speed. I could still use some hints on how to improve it, though; it would save hours of computation time.

the logarithmic distribution:

f(y)={\frac {-1}{\ln(1-\mu)}}\;{\frac {\mu^{y}}{y}}

where \mu \in (0,1). I stick to

stan_funs <- "
real log_series_lpmf(int y, real mu) {
  return log((-1 / log(1 - mu)) * ((mu^y) / y));
}"

so far (I might move this to a new topic).
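In case it helps, the same log-scale trick from earlier in the thread applies to this distribution too, since \log f(y) = -\log(-\ln(1-\mu)) + y\log\mu - \log y. A quick Python check (with \mu = 0.6 chosen arbitrarily) that the log-scale form matches the direct one and that the pmf normalizes:

```python
import math

mu = 0.6  # hypothetical value in (0, 1)

def log_series_lpmf(y, mu):
    # Direct translation of the Stan function above
    return math.log((-1.0 / math.log(1.0 - mu)) * (mu**y / y))

def log_series_lpmf_stable(y, mu):
    # Equivalent log-scale form, avoiding underflow of mu^y for large y
    return -math.log(-math.log(1.0 - mu)) + y * math.log(mu) - math.log(y)

assert abs(log_series_lpmf(50, mu) - log_series_lpmf_stable(50, mu)) < 1e-9

# The pmf sums to ~1 (infinite support, but the tail beyond y = 500 is negligible):
total = sum(math.exp(log_series_lpmf_stable(y, mu)) for y in range(1, 500))
assert abs(total - 1.0) < 1e-9
```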