Hierarchical Lotka-Volterra

I’ve done a very mild generalisation of @Bob_Carpenter’s Lotka-Volterra model in order to demonstrate the power of Stan to my colleagues. There are now several colonies of hares and lynxes with a common hyper-parameter.

This is taking a lot longer to run - the first chain took 460 seconds, whereas in the original model it was about 30 seconds, i.e. roughly 15 times slower.

  • Is this to be expected?

  • Is there anything I can do to speed things up?

Our real models have e.g. 200 variables and 50 parameters. I am worried that estimation for such models will take even longer.

Are there any case studies that I might have missed? I know of the Friberg-Karlsson model for neutrophil production (https://www.metrumrg.com/wp-content/uploads/2017/10/Gillespie_MixedSolver_ACOP2017.pdf), but they seem to use a modified version of Stan called Torsten.

The data for the model below is the same as the original data i.e. there is only one colony. I’ve attached the data after the code.

functions {
  real[] dz_dt(real t,       // time
               real[] z,     // system state {prey, predator}
               real[] theta, // parameters
               real[] x_r,   // unused data
               int[] x_i) {
    real u = z[1];
    real v = z[2];

    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];

    real du_dt = (alpha - beta * v) * u;
    real dv_dt = (-gamma + delta * u) * v;
    return { du_dt, dv_dt };
  }
}
data {
  int<lower = 0> N;           // number of measurement times
  int<lower = 0> M;           // number of colonies
  real ts[N];                 // measurement times > 0
  real y_init[2];             // initial measured populations
  real<lower = 0> p[N, 2, M]; // measured populations
}
parameters {
  real<lower = 0> mu[4];      // hyper-means for { alpha, beta, gamma, delta }
  real<lower = 0> phi[4, M];  // per-colony { alpha, beta, gamma, delta }
  real<lower = 0> z_init[2];  // initial population
  real<lower = 0> sigma[2];   // measurement errors
}
transformed parameters {
  real z[N, 2];               // scratch ODE solution for the current colony
  real w[N, 2, M];            // ODE solutions for all colonies
  real theta[4];              // scratch rate parameters for the current colony
  for (j in 1:M) {
    for (l in 1:4)
      theta[l] = phi[l, j];
    z = integrate_ode_rk45(dz_dt, z_init, 0, ts, theta,
                           rep_array(0.0, 0), rep_array(0, 0),
                           1e-5, 1e-3, 5e2);
    w[ , , j] = z;
  }
}
model {
  phi[1,] ~ normal(mu[1], 0.5);
  phi[2,] ~ normal(mu[2], 0.05);
  phi[3,] ~ normal(mu[3], 0.5);
  phi[4,] ~ normal(mu[4], 0.05);
  sigma ~ lognormal(-1, 1);
  z_init ~ lognormal(log(10), 1);
  for (k in 1:2) {
    y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
    for (j in 1:M)
      p[ , k, j] ~ lognormal(log(w[, k, j]), sigma[k]);
  }
}
generated quantities {
  real y_init_rep[2];
  real p_rep[N, 2, M];
  for (k in 1:2) {
    y_init_rep[k] = lognormal_rng(log(z_init[k]), sigma[k]);
    for (n in 1:N) {
      for (j in 1:M)
        p_rep[n, k, j] = lognormal_rng(log(w[n, k, j]), sigma[k]);
    }
  }
}

The data

# Data from http://www.math.tamu.edu/~phoward/m442/modbasics.pdf
# Downloaded 15 October 2017, 4:59 PM EDT
Year, Lynx, Hare
1900, 4.0, 30.0
1901, 6.1, 47.2
1902, 9.8, 70.2
1903, 35.2, 77.4
1904, 59.4, 36.3
1905, 41.7, 20.6
1906, 19.0, 18.1
1907, 13.0, 21.4
1908, 8.3, 22.0
1909, 9.1, 25.4
1910, 7.4, 27.1
1911, 8.0, 40.3
1912, 12.3, 57.0
1913, 19.5, 76.6
1914, 45.7, 52.3
1915, 51.1, 19.5
1916, 29.7, 11.2
1917, 15.8, 7.6
1918, 9.7, 14.6
1919, 10.1, 16.2
1920, 8.6, 24.7
 Elapsed Time: 5533.26 seconds (Warm-up)
               5500.68 seconds (Sampling)
               11033.9 seconds (Total)

Warning messages:
1: There were 89 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 3476 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
4: Examine the pairs() plot to diagnose sampling problems

This does not look good :(

> print(fitH, pars=c("mu", "phi", "sigma", "z_init"),
+       probs=c(0.1, 0.5, 0.9), digits = 3)
Inference for Stan model: lotka-volterra-hierarchical.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean     sd   10%    50%     90% n_eff    Rhat
mu[1]      2.368   1.585  2.249 0.223  2.321   5.954     2  12.806
mu[2]      0.209   0.134  0.191 0.001  0.270   0.445     2  11.881
mu[3]     28.523  31.346 49.667 0.344  5.039 100.234     3   2.580
mu[4]      1.273   1.224  1.899 0.098  0.379   4.434     2   2.810
phi[1,1]   1.801   1.737  2.459 0.232  0.478   6.052     2  29.626
phi[2,1]   0.752   0.775  1.097 0.020  0.193   2.641     2 239.445
phi[3,1]  28.537  31.352 49.686 0.304  5.082 100.551     3   2.578
phi[4,1]   1.180   1.262  1.948 0.025  0.229   4.428     2   2.880
sigma[1]   0.310   0.243  0.348 0.000  0.204   0.838     2   6.344
sigma[2]   0.211   0.185  0.263 0.000  0.063   0.642     2  16.539
z_init[1] 21.090   8.580 12.271 0.620 29.985  30.002     2   6.476
z_init[2]  3.971   0.229  0.348 3.493  4.000   4.491     2   2.657

You don’t have any priors on your hierarchical parameters, mu.

Check pairs(fitH, pars = c("mu", "phi")). Do some of those plots look like diagonal lines?

It’s really less about the number of parameters and more about the shape of the posterior. A model with a nice, multivariate normal posterior will sample very quickly, whereas a simpler one with a pair of non-identified parameters will drive the sampler mad.
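As a minimal, hypothetical illustration (not from this model), consider a likelihood in which only the sum of two parameters is informed by the data:

data {
  int<lower = 0> N;
  real y[N];
}
parameters {
  real a;
  real b;
}
model {
  // only a + b is constrained by the data, so a and b are
  // individually non-identified and the posterior is a long ridge
  y ~ normal(a + b, 1);
}

A pairs() plot of a against b for this toy model shows exactly the kind of diagonal band you'd be looking for.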

Yes, they do. We are working with Metrum. @yizhang will know more.

I’d suspect funnels given the centered parameterizations:

phi[1,] ~ normal(mu[1], 0.5);

Are these prior scales reasonable for your data? The 0.05 is the one to check. But the way to parameterize this is as:

parameters {
  matrix<lower = 0>[4, M] phi_z;
  ...
}
transformed parameters {
  matrix<lower = 0>[4, M] phi;
  phi[1] = mu[1] + 0.5 * phi_z[1];
  phi[2] = mu[2] + 0.05 * phi_z[2];
  phi[3] = mu[3] + 0.5 * phi_z[3];
  phi[4] = mu[4] + 0.05 * phi_z[4];
  ...
}
model {
  to_vector(phi_z) ~ normal(0, 1);
  ...
}

You can also vectorize the sampling statements more:

y_init ~ lognormal(log(z_init), sigma);
for (k in 1:2)
  to_vector(to_matrix(p[ , k, ])) ~ lognormal(log(to_vector(to_matrix(w[ , k, ]))), sigma[k]);

And make sure again that the priors make sense. If they don’t, the posterior winds up being very challenging to sample from as @bbbales2 was suggesting.

And make sure you have the predator and prey in the right columns! (Note that the data file above lists Lynx before Hare, while the ODE state is {prey, predator}.) I made that mistake and it was a couple of hours before I’d debugged it.

Bother! (about forgetting the prior) - thanks very much for spotting that. Yes, the plots are diagonal (with red spots).

[Image: pairs plot (pairs1)]

That looks like a pretty bad non-identifiability among the phi. I don’t know where the multimodality is coming from, if that’s what the blue circles are.

Does it (mu) get a default prior in that case? How is it treated?

@bbbales2 With

  mu[{1, 3}] ~ normal(1.0, 0.5);
  mu[{2, 4}] ~ normal(0.05, 0.05);

as the priors, everything goes a lot quicker:

 Elapsed Time: 503.41 seconds (Warm-up)
               521.089 seconds (Sampling)
               1024.5 seconds (Total)

but the diagnostics do not look good

> print(fitH, pars=c("mu", "phi", "sigma", "z_init"),
+       probs=c(0.1, 0.5, 0.9), digits = 3)
Inference for Stan model: lotka-volterra-hierarchical.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean     sd   10%    50%    90% n_eff       Rhat
mu[1]      1.117   0.580  0.821 0.623  0.652  2.539     2    272.859
mu[2]      0.138   0.125  0.177 0.025  0.041  0.445     2    785.332
mu[3]      0.743   0.252  0.358 0.344  0.724  1.127     2     26.598
mu[4]      0.136   0.092  0.130 0.023  0.085  0.355     2    232.891
phi[1,1]   0.416   0.075  0.106 0.232  0.478  0.478     2  10749.576
phi[2,1]   0.679   0.801  1.133 0.025  0.025  2.641     2   8443.048
phi[3,1]   0.634   0.135  0.191 0.304  0.744  0.744     2   5566.840
phi[4,1]   0.053   0.034  0.049 0.025  0.025  0.138     2   4421.656
sigma[1]   0.102   0.125  0.177 0.000  0.000  0.409     2    767.788
sigma[2]   0.161   0.197  0.278 0.000  0.000  0.642     2   5710.104
z_init[1] 22.655   8.995 12.723 0.620 30.000 30.000     2 128303.703
z_init[2]  3.873   0.155  0.219 3.493  4.000  4.000     2    954.052

Samples were drawn using NUTS(diag_e) at Tue Jun  5 09:08:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

[Image: pairs plot (pairs2)]

I have yet to try @Bob_Carpenter’s suggestions - one step at a time :)

A quick comment: even without looking at the plots, n_eff <= number of chains indicates multimodality. If each chain is stuck in its own mode, the chains never mix, so the split-chain effective sample size collapses to roughly the number of chains, as in the output above where n_eff is 2 or 3 everywhere.


It looks like you are fitting a model with M identical populations. If so, then I think the symmetry between the M phi parameters at least contributes to the multimodality. You might try generating different data for each colony.


@groceryheist - I suspected that might be a problem. How many colonies would be enough? 4?

Sorry, it looks like you are having multimodality problems even when M=1. So my idea can’t be right.

I am generating the samples using Haskell; hopefully the code below is readable.

-- Imports and the Rate type are inferred from the usage below;
-- `sample`, `rvar` and `Normal` are assumed to come from random-fu.
import Data.Random (sample, rvar)
import Data.Random.Distribution.Normal (Normal(..))

data Rate = Rate { theta1 :: Double, theta2 :: Double
                 , theta3 :: Double, theta4 :: Double }

meanRate :: Rate
meanRate = Rate { theta1 = 0.5, theta2 = 0.025, theta3 = 0.8, theta4 = 0.025 }

sampledRate :: Rate -> IO Rate
sampledRate r = do
  t1 <- exp <$> (sample $ rvar $ Normal (log (theta1 r)) 0.1)
  t2 <- exp <$> (sample $ rvar $ Normal (log (theta2 r)) 0.015)
  t3 <- exp <$> (sample $ rvar $ Normal (log (theta3 r)) 0.1)
  t4 <- exp <$> (sample $ rvar $ Normal (log (theta4 r)) 0.015)
  return $ Rate { theta1 = t1, theta2 = t2, theta3 = t3, theta4 = t4 }

dzdt :: Rate -> Double -> [Double] -> [Double]
dzdt r _t [u, v] = [ (alpha - beta * v) * u
                   , (-gamma + delta * u) * v
                   ]
  where
    alpha = theta1 r
    beta = theta2 r
    delta = theta4 r
    gamma = theta3 r
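
In Stan notation the same sampling scheme would look roughly like this (a hypothetical sketch, since exp of a normal draw is a lognormal draw, so the perturbed rates are lognormal around the means):

generated quantities {
  // hypothetical analogue of sampledRate above: one colony's rates
  // drawn as lognormal jitter around the meanRate values
  real theta_sim[4];
  theta_sim[1] = lognormal_rng(log(0.5), 0.1);     // theta1 (alpha)
  theta_sim[2] = lognormal_rng(log(0.025), 0.015); // theta2 (beta)
  theta_sim[3] = lognormal_rng(log(0.8), 0.1);     // theta3 (gamma)
  theta_sim[4] = lognormal_rng(log(0.025), 0.015); // theta4 (delta)
}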

When I use two colonies with a small number of iterations, things look a bit more reasonable, although not great.

Inference for Stan model: lotka-volterra-hierarchical.
4 chains, each with iter=400; warmup=200; thin=1; 
post-warmup draws per chain=200, total post-warmup draws=800.

            mean se_mean    sd    10%    50%    90% n_eff  Rhat
mu[1]      0.943   0.072 0.298  0.541  1.011  1.252    17 1.113
mu[2]      0.057   0.001 0.025  0.024  0.060  0.087   472 1.033
mu[3]      0.925   0.032 0.271  0.659  0.861  1.277    74 1.048
mu[4]      0.048   0.006 0.023  0.014  0.053  0.070    13 1.104
phi[1,1]   0.665   0.098 0.213  0.455  0.645  0.944     5 1.345
phi[1,2]   0.692   0.113 0.241  0.460  0.656  0.992     5 1.338
phi[2,1]   0.044   0.013 0.023  0.025  0.036  0.076     3 1.584
phi[2,2]   0.050   0.015 0.028  0.025  0.042  0.088     3 1.522
phi[3,1]   0.819   0.203 0.339  0.467  0.814  1.335     3 1.842
phi[3,2]   0.999   0.188 0.391  0.565  0.856  1.570     4 1.472
phi[4,1]   0.027   0.008 0.013  0.014  0.025  0.047     3 1.834
phi[4,2]   0.031   0.007 0.015  0.015  0.024  0.053     4 1.512
sigma[1]   0.357   0.147 0.217  0.000  0.459  0.552     2 3.837
sigma[2]   0.511   0.211 0.318  0.000  0.655  0.802     2 3.213
z_init[1] 25.216   2.081 3.735 20.481 24.961 30.000     3 1.608
z_init[2]  6.987   1.272 2.215  4.000  7.239  9.764     3 1.714

NB I still haven’t tried @Bob_Carpenter’s suggestions (next on my list). I am doubling the number of iterations while I type this.

BTW 3 of the chains complete within 30 seconds but one takes about 10 times as long - is this indicative of anything?

I’ve put the code and data in a repo: https://github.com/idontgetoutmuch/LotkaVolterra

One data set isn’t working, at least in part, because of the non-identifiability between mu and phi. The mu hyperparameters are just extra degrees of freedom that make it difficult for Stan to explore the posterior.

So like:

phi[1,] ~ normal(mu[1], 0.5);

You could rewrite it:

psi ~ normal(0.0, 0.5);
phi_transformed[1,] = mu[1] + psi;

where phi_transformed is a transformed parameter, not a sampled parameter, and psi is a sampled parameter.

Now just think: if there’s a “correct” answer for phi_transformed[1,] (which would be informed by the rest of your log density), then you can solve for that with either mu[1] being large and psi small, or mu[1] being small and psi large. That’s what I mean by the parameters being non-identified: changing either one can explain whatever needs to be explained. (It’s not correct to think about cutting your model into pieces and doing solves like this, but it can be helpful for identifying non-identifiabilities; that’s why I put quotes on “correct” in the first sentence of this paragraph.)

So if you only had one set of hare/lynx measurements, it’s just a questionable model, because you’re solving for many more degrees of freedom than you have data for.

But! If you have a bunch of different sets of measurements, then you could think of each of them as requiring different phi_transformed[1,j] values. So if psi became a vector:

psi ~ normal(0.0, 0.5);
phi_transformed[1,1] = mu[1] + psi[1];
phi_transformed[1,2] = mu[1] + psi[2];
phi_transformed[1,3] = mu[1] + psi[3];
// etc ...

And now mu[1] is starting to be constrained, because it’ll (hopefully) handle the average effect, and the psi[j] variables will handle the deviations.
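
As a complete Stan fragment, that might look like the following (hypothetical, reusing the names from the sketch above and ignoring the positivity constraints for clarity):

parameters {
  real mu[4];
  vector[M] psi;                 // per-colony deviations
}
transformed parameters {
  vector[M] phi_transformed_1;   // phi_transformed[1, ] from the sketch
  for (j in 1:M)
    phi_transformed_1[j] = mu[1] + psi[j];  // shared mean plus deviation
}
model {
  psi ~ normal(0, 0.5);          // deviations shrink toward the shared mean
}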

Does that make sense at all? tl;dr: try more (and more varied) groups.


Yeah, unfortunately :D. This is why we run multiple chains. If one does weird stuff, you should investigate why. Might not always find a satisfactory solution but you should look.

So non-identifiability is the answer. I was wondering how Stan did its magic here, and the answer is that I should not have believed in magic. With 8 colonies the chains for the hyper-parameters appear to converge, and the actual values are not too far from the inferred values.

[Image: chains]

[Images: posteriors, posteriors2]

On the other hand, the inferred values for the parameters do not look so good.

I have remaining issues but since this thread is already getting a bit messy I won’t raise them here. I may start separate threads.

Many thanks to everyone who has chipped in :-)


Code here: https://github.com/idontgetoutmuch/LotkaVolterra if anyone wants to play with it.


I think it’s been pretty widely studied that these systems become very chaotic with more than two populations and thus very hard to fit.

This is way beyond my understanding of diff eqs, though, so I’ll have to leave the answers to others.