Minimiser in Stan?

I have a problem which involves the hypoexponential distribution: the distribution of a sum of n independent exponential random variables, each with a different rate parameter.

It is a pretty horrific distribution to deal with (see my related Stackexchange query about this, here) because it’s extremely computationally intensive to evaluate and numerically unstable.
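For reference, when the rates are distinct the density is the signed mixture

$$f(x) = \sum_{i=1}^{n} \alpha_i \lambda_i e^{-\lambda_i x}, \qquad \alpha_i = \prod_{j \ne i} \frac{\lambda_j}{\lambda_j - \lambda_i},$$

and the $\alpha_i$ alternate in sign and blow up in magnitude as rates approach one another – which is where the catastrophic cancellation comes from.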

Because of these issues, I don’t want to use this distribution, even though it is the most correct thing to do. I can approximate it with a gamma distribution by analytically matching the theoretical mean and standard deviation, but this approximation is really poor for certain hypoexponential parameter values (see here).
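To make the moment matching explicit: the hypoexponential has mean $\sum_i 1/\lambda_i$ and variance $\sum_i 1/\lambda_i^2$, so the matched gamma parameters take a couple of lines (a minimal sketch; the helper name is made up):

functions {
  // moment-matched gamma(shape, rate) approximation to hypoexponential(lambda)
  vector gamma_moment_match(vector lambda) {
    real m = sum(inv(lambda));          // mean: sum_i 1 / lambda_i
    real v = sum(inv(square(lambda)));  // variance: sum_i 1 / lambda_i^2
    return [square(m) / v, m / v]';     // {shape, rate}
  }
}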

An alternative would be to use a more flexible distributional approximation, e.g. a mixture of gamma distributions (sketched below) or a lower-order hypoexponential (i.e. a hypoexponential with n' << n distinct exponential random variables).
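For the mixture route, the lpdf itself is simple – the hard part is choosing its parameters, which is what the rest of this post is about. A two-component sketch (hypothetical function name) via log_mix:

functions {
  // hypothetical two-component gamma mixture approximation
  real gamma_mix2_lpdf(real x, real w, vector shape, vector rate) {
    return log_mix(w, gamma_lpdf(x | shape[1], rate[1]),
                      gamma_lpdf(x | shape[2], rate[2]));
  }
}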

The more flexible I want this approximation to be, the less able I am to find analytical results giving me the ‘best’ parameters with which to approximate the hypoexponential distribution. So I would prefer a way to determine appropriate parameter values for this distributional approximation via, say, a form of gradient descent.

So my Stan code would look something like this (assuming n=3 for brevity):

functions {
  real hypoexponential_approximation_lpdf(real X, real lambda_1, real lambda_2, real lambda_3) {
    // 'minimiser', 'objective_function' and 'approximate_lpdf' are hypothetical
    vector theta = minimiser(objective_function, lambda_1, lambda_2, lambda_3);
    real log_prob = approximate_lpdf(X | theta);
    return log_prob;
  }
}

model {
  X ~ hypoexponential_approximation(lambda_1, lambda_2, lambda_3);
}

I don’t think such a minimiser exists in Stan, but I wanted to double-check?

A second question, perhaps cheekily, is does anyone have any bright ideas for approximating this distribution in Stan?

Note 1: I realise that Stan has an algebraic solver, but I don’t think this will work except in special circumstances. Imagine I want to minimise the sum of squared differences between, e.g., moments of the hypoexponential distribution and those of the approximating distribution – this won’t generally be zero, but I would like it to be as close to zero as possible.

Note 2: the above is very different to, say, using rstan’s optimisation routine – I am saying that I would like to use minimisation within a sampling routine.


Correct, such a function does not currently exist.

It was first proposed ~9 years ago, but has not had a lot of push to be implemented: Enable optimization within .stan programs · Issue #1319 · stan-dev/stan · GitHub


Having this function in Stan could be very useful, at least for some of my models. Do you know of other probabilistic modeling software that implements it?


The sum won’t, but its derivative will be zero at the minimum.
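To make that concrete: if the objective were, say, the squared moment mismatch of a gamma(a, b) approximation against target moments (m1, m2), you can derive its gradient by hand and pass that system to algebra_solver (all names below are illustrative, not from the original post):

functions {
  // gradient of g(a, b) = (a/b - m1)^2 + (a/b^2 - m2)^2,
  // with theta = (m1, m2) the target moments
  vector moment_grad(vector y, vector theta, array[] real x_r, array[] int x_i) {
    real a = y[1];
    real b = y[2];
    real r1 = a / b - theta[1];
    real r2 = a / square(b) - theta[2];
    return [2 * r1 / b + 2 * r2 / square(b),                     // dg/da
            -2 * r1 * a / square(b) - 4 * r2 * a / pow(b, 3)]';  // dg/db
  }
}

and then somewhere in the model: theta_hat = algebra_solver(moment_grad, y_guess, theta, x_r, x_i);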


Hi @Ben_Lambert, this is more of an answer about coding the hypoexponential so that it’s usable in Stan. I tested it and it’s as stable as the 128-bit version in the Stack Exchange post.

There are some additional optimizations that I think you can do, such as reusing computations from log_ratetoalpha. I was able to do this for the non-log version (let me know if you want to see that), but it was getting too complicated for me in log space, and we know the non-log version is unstable for that calculation.

There are probably some other clever optimizations to vectorize over the data too.

A few notes about parameter recovery. I tested putting priors on alpha on both the log and non-log scale; the log scale works much better. Also, since this is a kind of mixture model, ordering the vector will help a lot. In fact, you can partially order and then use that as priors for the unordered alphas, which seems to work fairly well. I tested with up to 50 alphas (values between 0 and 100) and 5000 data points.
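For example, the full ordering is a one-liner with Stan’s built-in ordered type (the partial-ordering variant described above would need a custom transform):

parameters {
  ordered[K] log_lambda;  // sorted log rates: removes the label-switching symmetry
}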

functions {
  // Computes log(sum_n sign[n] * exp(log_val[n])), assuming the positive
  // terms dominate (as they must for a valid density).
  real log_sum_exp_signed(vector log_val, array[] int sign) {
    int N = num_elements(log_val);
    vector[N] sorted;
    int neg = 0;
    int pos = N + 1;
    // partition: negative-sign terms at the front, positive-sign at the back
    for (n in 1:N) {
      if (sign[n] < 0) {
        neg += 1;
        sorted[neg] = log_val[n];
      } else {
        pos -= 1;
        sorted[pos] = log_val[n];
      }
    }
    // an empty partition must contribute exp(-inf) = 0, not exp(0) = 1
    real lse_pos = pos == N + 1 ? negative_infinity() : log_sum_exp(sorted[neg + 1:N]);
    real lse_neg = neg == 0 ? negative_infinity() : log_sum_exp(sorted[1:neg]);
    return lse_pos + log1m_exp(lse_neg - lse_pos);
  }
  
  // Returns log|alpha| and sign(alpha) for alpha = prod_i a_i / (a_i - b),
  // computed entirely in log space.
  tuple(real, int) log_prod_diff(vector log_a, real log_b) {
    int N = num_elements(log_a);
    real out = sum(log_a);
    int sign = 1;
    
    for (i in 1:N) {
      if (log_a[i] > log_b) {
        // Case: a > b
        out += -log_diff_exp(log_a[i], log_b);
      } else {
        // Case: a <= b
        out += -log_diff_exp(log_b, log_a[i]);
        sign = -sign;
      }
    } 
    return (out, sign);
  }

  // Log-scale mixture weights alpha_i = prod_{j != i} lambda_j / (lambda_j - lambda_i)
  // of the hypoexponential density, with their signs.
  tuple(vector, array[] int) log_ratetoalpha(vector log_rate) {
    int n = rows(log_rate);
    vector[n] log_alpha;
    array[n] int signs;

    for (i in 1:n) {
     (log_alpha[i], signs[i]) = log_prod_diff(append_row(head(log_rate, i - 1), tail(log_rate, n - i)), log_rate[i]);
    }

    return (log_alpha, signs);
  }
  
  // log density of x under f(x) = sum_i alpha_i * lambda_i * exp(-lambda_i * x)
  real hypoexp_lpdf(vector x, vector log_rate) {
    int K = rows(log_rate);
    int N = num_elements(x);
    tuple(vector[K], array[K] int) log_alpha = log_ratetoalpha(log_rate);
    vector[K] rate = exp(log_rate);
    vector[K] log_S = log_alpha.1 + log_rate;
    real out = 0;

    for (n in 1:N) {
      out += log_sum_exp_signed(log_S - x[n] * rate, log_alpha.2);
    }

    return out;
  }
}
data {
  int<lower=1> N;  // Number of observations
  vector[N] t;     // Observed times
  int<lower=1> K;  // Number of rates
}

parameters {
  vector[K] log_lambda;  // Log rates
}
model {
  log_lambda ~ normal(1, 1);
  
  // Likelihood
  target += hypoexp_lpdf(t | log_lambda);
}
generated quantities {
  vector[K] lambda = exp(log_lambda);
}

@andrjohns I think we should expose the signed log_sum_exp version and some other functions that work with negative numbers on the log scale. This is a pain to code up.


Another option for approximation is the saddlepoint approximation – since the hypoexponential is a sum of exponential variables, and the saddlepoint method works very neatly with sums. See Approximate Densities for Sums of Variables: Negative Binomials and Saddlepoint for an example implementation in Stan. That implementation uses just the simplest approximation, but note that you can add more terms to the series should you need more precision (e.g. https://arxiv.org/pdf/1712.01410 shows the extra terms).
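For reference, the cumulant generating function of the hypoexponential is available in closed form (each exponential term contributes $-\log(1 - t/\lambda_i)$), so the first-order saddlepoint density is

$$K(t) = -\sum_{i=1}^{n} \log(1 - t/\lambda_i), \qquad K'(\hat s) = \sum_{i=1}^{n} \frac{1}{\lambda_i - \hat s} = x, \qquad \hat f(x) = \frac{\exp\big(K(\hat s) - \hat s x\big)}{\sqrt{2\pi K''(\hat s)}},$$

and the saddlepoint equation $K'(\hat s) = x$ has a unique root $\hat s < \min_i \lambda_i$ because $K'$ is increasing, which makes it well suited to algebra_solver.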

Also, as mentioned by @nhuurre, you can use algebra_solver to find a point with zero gradient. I’ll add that this will uniquely and reliably be a minimum if the function is strictly convex (which may be the case for some classes of approximating distributions) – it is typically not so hard to check for convexity analytically.


I went ahead and made the log_ratetoalpha more efficient (only going through the upper triangle instead of the entire matrix).

functions {
  // (log_sum_exp_signed from the previous post is still required alongside this)
  tuple(vector, array[] int) log_ratetoalpha(vector log_rate) {
    int n = rows(log_rate);
    vector[n] log_alpha = zeros_vector(n);
    array[n] int signs = ones_int_array(n);

    // each unordered pair (i, j) contributes lambda_j to alpha_i, lambda_i to
    // alpha_j, and a shared 1/|lambda_i - lambda_j| factor (with sign) to both
    for (i in 1:n) {
      for (j in i + 1:n) {
        log_alpha[i] += log_rate[j];
        log_alpha[j] += log_rate[i];
        real log_diff_rate;
        if (log_rate[j] > log_rate[i]) {
          log_diff_rate = -log_diff_exp(log_rate[j], log_rate[i]);
          signs[j] *= -1;
        } else {
          log_diff_rate = -log_diff_exp(log_rate[i], log_rate[j]);
          signs[i] *= -1;
        }
        log_alpha[i] += log_diff_rate;
        log_alpha[j] += log_diff_rate;
      }
    }

    return (log_alpha, signs);
  }

  real hypoexp_lpdf(vector x, vector log_rate) {
    int K = rows(log_rate);
    int N = num_elements(x);
    tuple(vector[K], array[K] int) log_alpha = log_ratetoalpha(log_rate);
    vector[K] rate = exp(log_rate);
    vector[K] log_S = log_alpha.1 + log_rate;
    real out = 0;

    for (n in 1:N) {
      out += log_sum_exp_signed(log_S - x[n] * rate, log_alpha.2);
    }

    return out;
  }
}
data {
  int<lower=1> N;  // Number of observations
  vector[N] t;     // Observed times
  int<lower=1> K;  // Number of rates
}
parameters {
  vector[K] log_lambda;  // Log rates
}
model {
  log_lambda ~ normal(1, 1);
  
  // Likelihood
  target += hypoexp_lpdf(t | log_lambda);
}
generated quantities {
  vector[K] lambda = exp(log_lambda);
}

Another thing I tested, which seemed to be worse (I’m not sure why), is sorting only once and performing the signed log_sum_exp using the indices of the positive/negative values. The relevant changes are below. I believe you could pack all of that into the log_ratetoalpha loop.

  // Partitions indices by sign (negatives first, positives at the back)
  // and returns the permuted indices plus the count of negatives.
  tuple(array[] int, int) get_index(array[] int sign) {
    int N = num_elements(sign);
    int neg = 0;
    int pos = N + 1;
    array[N] int index;
    for (n in 1:N) {
      if (sign[n] < 0) {
        neg += 1;
        index[neg] = n;
      } else {
        pos -= 1;
        index[pos] = n;
      }
    }
    return (index, neg);
  }
  
  real hypoexp_lpdf(vector x, vector log_rate) {
    int K = rows(log_rate);
    int N = num_elements(x);
    tuple(vector[K], array[K] int) log_alpha = log_ratetoalpha(log_rate);
    vector[K] rate = exp(log_rate);
    vector[K] log_S = log_alpha.1 + log_rate;
    real out = 0;
    array[K] int index;
    int num_neg;

    (index, num_neg) = get_index(log_alpha.2);
    array[K - num_neg] int index_pos = index[num_neg + 1:K];
    array[num_neg] int index_neg = index[1:num_neg];
    
    for (n in 1:N) {
      vector[K] vals = log_S - x[n] * rate;
      // assumes both partitions are non-empty (true for K >= 2,
      // since the alpha signs alternate)
      real lse_pos = log_sum_exp(vals[index_pos]);
      real lse_neg = log_sum_exp(vals[index_neg]);
      out += lse_pos + log1m_exp(lse_neg - lse_pos);
    }

    return out;
  }