1-step-ahead forecasting model fitting

Hey @RJTK, your explanation of why I’m getting this error makes a lot of sense to me. I tried including a fixed set of initialising parameters, but ran into some syntax trouble… I combed through the rstan and Stan documentation, but how to actually set initial values remains pretty unclear to me… Any suggestions?

# init must return ONE named list per chain (not a list of lists);
# initial values for lower-bounded parameters such as sigma must be strictly positive
init_theta <- function(chain_id = 1){
  list(r = 2.5, O = 0.012, h = 0.075, b = 35, c = 0.3, u = 0.41, sigma = 0.1)
}

Fit <- sampling(OSA_Log,
                data = Stan_Data,
                warmup = 250,
                iter = 500,
                chains = 2,
                cores = 2,
                thin = 1,
                control = list(max_treedepth = 15,
                               adapt_delta = 0.99),
                init = init_theta,
                seed = 123,
                check_data = TRUE,
                show_messages = TRUE,
                verbose = TRUE); beep(3)

Additionally, I suspect that the loop may not be reading as intended. I've gotten the NaN error with similar models before, but the model usually finds its way, while here the NaN error occurs continuously until the model fails. This makes me think the model may be trying to set initial parameter values at each time step, instead of doing so only once. Thoughts?

Thank you so much for your continued help! :)

No worries, we all have weird schedules :-)

The pairs plot is IMHO very informative. Was this generated by fitting your actual dataset or some sort of simulated data? I.e. is the 0.075 value for h the "true" simulated value, or some sort of guess for your actual dataset? If it is simulated, what were the "true" values of the other parameters? If it is not simulated, I would strongly recommend writing a simulator and fitting simulated data first.

Anyway, a lot of the parameters are not independent; only the pairs plots involving b (theta[3]) look mostly healthy. So here are some guesses based on the pairs plot:

  • r (theta[1]) and O (theta[2]) are strongly positively correlated, i.e. the model "can't decide" whether the parasite grows slowly and is recognized slowly OR it grows quickly and is recognized quickly (as both presumably result in a somewhat similar overall increase). Would the difference in O between 0.015 and 0.024 be considered big from a biological standpoint? I guess that the range of r is quite wide and the difference between 2.5 and 4.0 would be biologically quite significant.
  • There is a negative correlation between c (theta[4]) and both r and O (theta[1] and theta[2]). I.e. the model "can't decide" whether the immune system proliferates quickly but the parasite is only slowly identified, or the immune system proliferates slowly and the parasite is quickly identified. This looks a bit like a multiplicative non-identifiability, i.e. only O * c is identified, but not each of them individually (this is made more plausible by the fact that O * c literally is in the formula). Another possibility is that the data only let us say that O and c can't both be large (the sharp boundary in the pairs plots) while not providing much information about their individual values. The same could be said for c and r, but since O and r are tightly correlated, it is hard to say which is the culprit. How would the pairs plots look if you treated r as given? How would they look if you treated O as given?
  • Similarly, there is a negative correlation between u (theta[5]) and both r and O (theta[1] and theta[2]). I.e. the model "can't decide" whether the immune cells die quickly but are quick to identify (and hence kill) the parasite, or the immune cells die slowly but are slow to identify it. Here it looks a bit less like multiplicative non-identifiability and more like some sharp boundary on one side and diffuse spread elsewhere. It is also more likely this actually involves O rather than r, because…

The picture could be made clearer if you showed the pairs plot of the logarithms (for parameters lower-bounded, Stan actually samples the logarithm of the value).

My hunch is that the potential sharp boundaries are the main issue - Stan can handle a decent amount of linear correlations quite OK, only with decreased sampling efficiency.

To further diagnose, you could also try using shinystan and inspect the trivariate plots for a bunch of three-way combinations (and also with the log density). Including log density as the third dimension could let you distinguish between "multiplicative non-identifiability" (max density along a ridge somewhere in the middle of the "banana", density falls roughly the same on both sides) and "sharp boundary" (steep density cliff at one side, slow decrease elsewhere).

I think I said it earlier, but I want to reiterate: this model might be a tough beast to tame. I can see the problems, but I don't see any easy solutions.

A general strategy that might work is to start with some quantities that are certainly identifiable from the data and should be roughly independent. For example, you may choose suitable parasite counts j,k and immune cell counts p,q such that (j,p), (k,q) and (j,q) are all represented in the data. For simplicity, let us introduce:

\zeta = \frac{\theta j}{1 + \theta j h} \\ \eta = \frac{\theta k}{1 + \theta k h}

Then we can get a new set of parameters \tau_{1,2,3},\rho_{1,2} representing derivatives at those points (I’d use b directly as it looks not problematic):

\tau_1 = \frac{dP}{dt} (j,p) = jr - p \zeta \\
\tau_2 = \frac{dP}{dt} (k,q) = kr - q \eta \\
\tau_3 = \frac{dP}{dt} (j,q) = jr - q \zeta \\
\rho_1 = \frac{dH}{dt} (j,p) = b + p(c \zeta - \mu) \\
\rho_2 = \frac{dH}{dt} (k,q) = b + q(c \eta - \mu)

Given b, this is a system of five equations for five unknowns (r, \zeta, \eta, c, \mu) which has an analytical solution: \tau_1 - \tau_3 = (q - p)\zeta gives \zeta, then r and \eta follow, and the two \rho equations are then linear in c and \mu (alternatively, the whole system could be handed to Stan's algebraic solver).
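To convince myself the system really does pin down the original quantities, here is a quick numeric round-trip check. All values are made up for illustration, nothing from the actual dataset:

```python
import numpy as np

# Hypothetical "true" parameter values, just to exercise the algebra
r, theta, h, b, c, mu = 2.5, 0.012, 0.075, 35.0, 0.3, 0.41
j, k, p, q = 50.0, 400.0, 10.0, 80.0  # chosen parasite / immune cell counts

zeta = theta * j / (1 + theta * j * h)
eta = theta * k / (1 + theta * k * h)

# The new parameters: derivatives at the three points
tau1 = j * r - p * zeta          # dP/dt at (j, p)
tau2 = k * r - q * eta           # dP/dt at (k, q)
tau3 = j * r - q * zeta          # dP/dt at (j, q)
rho1 = b + p * (c * zeta - mu)   # dH/dt at (j, p)
rho2 = b + q * (c * eta - mu)    # dH/dt at (k, q)

# Solve back sequentially, given b:
zeta_hat = (tau1 - tau3) / (q - p)   # from tau1 - tau3 = (q - p) * zeta
r_hat = (tau1 + p * zeta_hat) / j    # from the tau1 equation
eta_hat = (k * r_hat - tau2) / q     # from the tau2 equation
# The rho equations are linear in (c, mu) once zeta, eta are known:
A = np.array([[p * zeta_hat, -p],
              [q * eta_hat, -q]])
c_hat, mu_hat = np.linalg.solve(A, [rho1 - b, rho2 - b])

assert np.allclose([zeta_hat, eta_hat, r_hat, c_hat, mu_hat],
                   [zeta, eta, r, c, mu])
```

So given b and the five derivative parameters, the original (r, \zeta, \eta, c, \mu) are recovered exactly.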

I then used Wolfram Alpha to solve for \theta, h from \zeta, \eta (query: "solve for theta,h: zeta = (theta j) / (1 + theta j h), eta = (theta k) / (1 + theta k h)"):

\theta = \frac{\zeta \eta (j - k)}{j k (\zeta - \eta)} \\ h = \frac{\eta j - \zeta k}{\zeta \eta (j - k)}

(provided all variables are > 0)
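A quick sanity check of those two inversion formulas (again with arbitrary positive values chosen purely for illustration):

```python
# "True" values of theta and h, and two distinct parasite counts
theta, h = 0.012, 0.075
j, k = 50.0, 400.0

# Forward: the functional responses at j and k
zeta = theta * j / (1 + theta * j * h)
eta = theta * k / (1 + theta * k * h)

# Backward: the Wolfram Alpha formulas
theta_hat = zeta * eta * (j - k) / (j * k * (zeta - eta))
h_hat = (eta * j - zeta * k) / (zeta * eta * (j - k))

assert abs(theta_hat - theta) < 1e-9
assert abs(h_hat - h) < 1e-9
```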

Note that you should be able to find reasonable priors for the new parameters as you should have some intuition on how fast the populations can change.

The tricky part is finding suitable j,k,p,q… Obviously there are other ways - we could, for example, use c or \mu as a parameter directly and then only look at derivatives at (j,p) and (k,q), which could be chosen to represent the valleys/peaks…

Oh, but there is a problem that currently the reparametrization could lead to negative values of some original parameters (because not all combinations of derivatives are sensible)… Will have to think about this, maybe just enforcing some orderings might be enough… OK, so maybe this wouldn’t work, but I’ve just spent a lot of time writing this, so I won’t delete it - maybe it is possible to still salvage the idea :-/.

EDIT: I think parametrizing in terms of \zeta, \eta instead of \theta, h for j,k roughly at the extremes of observed data could be useful anyway.
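As a rough sketch of that EDIT (all names, bounds and the j < k convention are my assumptions, not tested code), the model could sample \zeta, \eta directly and recover \theta, h in transformed parameters. The ordering constraints are what keeps \theta and h positive, echoing the "enforcing some orderings" remark above:

```
data {
  real<lower=0> j;   // smaller reference parasite count (e.g. near observed minimum)
  real<lower=j> k;   // larger reference parasite count (e.g. near observed maximum)
  // ... rest of the data ...
}
parameters {
  // functional response theta*P / (1 + theta*P*h) evaluated at P = j and P = k
  real<lower=0> zeta;
  real<lower=zeta, upper=zeta * k / j> eta;  // zeta < eta < zeta*k/j  <=>  theta > 0, h > 0
  // ... remaining parameters ...
}
transformed parameters {
  real<lower=0> theta = zeta * eta * (j - k) / (j * k * (zeta - eta));
  real<lower=0> h = (eta * j - zeta * k) / (zeta * eta * (j - k));
}
```

With these bounds every unconstrained draw maps to valid \theta, h, so the sampler never proposes the nonsensical derivative combinations mentioned above.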

I’ll probably wait for you to report the additional experiments/diagnostics before digging much deeper.

Best of luck!


Woah, thanks so much for such a robust reply!

I’ll put some work into this, and see where it takes me!

Thank you! :)
