Model specification for hierarchical AR(1) and Ornstein-Uhlenbeck process

I am working my way up in understanding how to use Stan to model longitudinal data on multiple subjects with irregular measurement times. My context is that I want to have a general approach that works the same for linear models as for binary and ordinal logistic models, so for the latter I want to specify random effects on the logit scale and need to specify the per-observation contribution to the log-likelihood. My ordinal logistic model has a lot of nuances such as interval censoring and departures from the proportional odds assumption. So I need an approach that does not require latent variables but models directly on the logit scale random effects.

Starting with Autoregressive Models in Stan I see a very simple specification for a non-hierarchical (one subject time series) Gaussian situation. To build towards my ultimate goal, how would you re-write the concise code there to get the same result by specifying the per-observation log-likelihood? Then how would you generalize that to the hierarchical multi-subject case where there is a random normal intercept for each subject that kicks off the process at the first time?

Once I understand that regular equal time-spacing AR(1) approach I’d like to generalize to the unequal measurement time case using an Ornstein-Uhlenbeck lag-1 process as described here. That doesn’t look too hard once I gain an understanding of how to bring in the AR(1) random components.

@bgoodri developed this code but I had some sampling problems and could not get a wide range of autocorrelations using it. And I can’t see how it doesn’t have a large number of parameters for the white noise eps (n \times (T-1) parameters where T is the maximum number of time points per subject and n is the number of subjects).

There is a long related discussion here but it’s hard to dive into such a lengthy discussion.

Thanks any advance for any help.

3 Likes

Hi Frank! Long been a fan of yours, so it’s exciting to see that you’ve joined the Stan community :)

By concise code, do you mean the slicing version? Did you see the regular version above that?

Here’s how I’d do it (and one can maintain the more efficient slicing while doing so):

data {
  int<lower=0> num_obs ; //number of observations per subject
  int<lower=0> num_subjects ; //number of subjects
  matrix[num_obs,num_subjects] y ; //observation matrix
  int<lower=0,upper=1> centered_intercept ; //binary toggle for intercept centered/non-centered parameterization
  int<lower=0,upper=1> centered_ar ; //binary toggle for ar centered/non-centered parameterization
  int<lower=0,upper=1> centered_log_noise ; //binary toggle for log_noise centered/non-centered parameterization
}
parameters {

  //intercept parameters
  real intercept_mean ;
  real<lower=0> intercept_sd ;
  vector<
    offset = (centered_intercept ? 0 : intercept_mean)
    , multiplier = (centered_intercept ? 1 : intercept_sd)
  >[num_subjects] intercept ;

  //AR parameters
  real ar_mean ;
  real<lower=0> ar_sd ;
  vector<
    offset = (centered_ar ? 0 : ar_mean)
    , multiplier = (centered_ar ? 1 : ar_sd)
  >[num_subjects] ar ;

  //noise parameters
  real log_noise_mean ;
  real<lower=0> log_noise_sd ;
  vector<
    offset = (centered_log_noise ? 0 : log_noise_mean)
    , multiplier = (centered_log_noise ? 1 : log_noise_sd)
  >[num_subjects] log_noise ;

}
model {
  vector[num_subjects] noise = exp(log_noise) ;

  intercept_mean ~ normal(0,1) ;
  intercept_sd ~ weibull(2,1) ;
  intercept ~ normal( intercept_mean , intercept_sd ) ;

  ar_mean ~ normal(0,1) ;
  ar_sd ~ weibull(2,1) ;
  ar ~ normal( ar_mean , ar_sd ) ;

  log_noise_mean ~ normal(0,1) ;
  log_noise_sd ~ weibull(2,1) ;
  log_noise ~ normal( log_noise_mean , log_noise_sd ) ;

  for (this_subject in 1:num_subjects){
    y[2:num_obs,this_subject] ~ normal(
      intercept[this_subject]
      + ar[this_subject] * y[1:(num_obs - 1),this_subject]
      , noise[this_subject]
    );
  }
}

That implements hierarchy on each of the three parameters in the original model, with a log-link for the noise term. I’ve also added data toggles for each to switch between centered and non-centered parameterizations; generally non-centered tends to work best in the data-sparse regimes folks are usually working in, but I thought it might be useful to show how to include both so you can explore the consequences for your data. Note that I put priors that would be fairly weakly-informed in the context of data that have been scaled to mean=0/sd=1, so you should tweak those if you prefer stronger/weaker priors for your context.

The above works when there are equal numbers of observations across subjects. If you have unequal counts, it should be fairly straight-forward to use the indexing tricks from the SUG to accommodate such ragged data.

I’ll post this now while I look at your later Q regarding unequally spaced observations and OU processes, but my immediate thought in that context is that it should be possible to extend the above code to unevenly spaced data by imagining that there could be data at equal spacings, but it’s merely missing, in which case the missing data points can be treated as parameters as in the SUG. That feels pretty expensive compute-wise, so I’ll take a look at the OU code you linked to see if that looks any more efficient.

1 Like

Oh, I forgot to address this:

If I understand correctly (and it’s possible I don’t! I’m not expert in AR models), the way model is parameterized here and in the SUG, the intercept parameter serves the same purpose as conditioning on the first data point, no? Or maybe it makes sense to add:

  y[1,] ~ normal(intercept,noise) ;

to the model block. But in that case does one still include the intercept in the likelihood for later points? Or should it be removed, as:

  for (this_subject in 1:num_subjects){
    y[2:num_obs,this_subject] ~ normal(
      ar[this_subject] * y[1:(num_obs - 1),this_subject]
      , noise[this_subject]
    );
  }

?
@Charles_Driver Since I see you’ve explored this area with cstem, could you comment on my confusion on this point?

OK, had a chance to check out the OU stuff, at least sufficiently to determine it’s beyond my expertise! But take a look at @aaronjg’s StanCon talk on OU processes with student-t measurement model.

You usually need some kind of initialization intercept for the first observation, and some kind of steady state / continuous input intercept for later observations – depending on the exact specification / data they may be highly correlated.

Avoiding having anything on the latent level can be quite a burden unless you have perfect measurements. In order to do this properly, it means that you need to compute the covariance matrix between all observations of each subject. When propagating the lag 1 regression through the latents, or ‘true states’, the expectation for each observation is conditionally independent of all except the previous, given the parameters. When operating on the measurement layer, the lag 1 regression between noisy measurements does not capture all the dependence – predictions of measurement 10 based on measurement 9 can be further improved by including measurement 8, 7, 6, so on. In the latent case predictions of measurement 10 only depend on latent state 9. This is all manageable, but requires quite a different sort of approach. The first version of ctsem used such an approach, it’s typical of SEM type longitudinal models but for higher numbers of time points and a variable timing structure (thereby different covariance matrices for each subject) can require a lot of big matrix inversions.

2 Likes

Mike and Charles thanks so much for the thoughts. Mike what I was referring to was the question of whether, for the general case (for me the longitudinal proportional odds model) I need to avoid the slicing version so that I can avoid ~ normal( ) so that I can specify the white noise residuals as having a normal distribution, not the response variable y having a normal distribution. For the proportional odds model and other categorical models it seems necessary to have random effect distributions on the logit scale, which is still consistent with y being discrete while avoiding latent variables per se.

Mike I’ll keep studying your code. A lot for a person who is still new to Stan to digest. But I’m very glad to be part of this community.

I’m not clear on why noise is being anti-logged.

I’m also still not clear on how we’re avoiding having a very large effective number of parameters.

I learning something new about offsets in vector. Why have two versions of the vector, where the centered one specifies offset=0, multiplier=1?

Charles is this statement consistent with having the usual per-subject random effect be the “kickoff”?

I clearly had a lot to learn.

My goal is to have all the likelihood components for a “tall and thin” dataset with arbitrary number of rows and observation times per subject, with an O-E AR(1) generalization.

I’ve read @aaronjg’s paper that goes with that talk. If I understand it correctly, it is for the single time series (i.e., single subject) case and not for the hierarchical model I need for multiple subjects.

@Charles_Driver I’ve also read much of your 2018 paper. There is so much to digest there I could easily spend a year on it.

@harrelfe I have a paper about the OU process where I wrote code in Stan. You can find the code in github. Maybe the code is useful for use in some ways. If yes, and if something is not clear, please let me know.

1 Like

I don’t understand this sorry. So many different terminologies and approaches from related fields to handling this stuff. For a stationary ar1 process of 0.25ar, the steady state / continuous intercept will be .75 times the stationary level, and the initial intercept will be, of course, 1 times the stationary level.

Thanks for this reference - an important one I had missed. It’s too bad that such a promising model has such computational challenges. If you narrow the focus to the single longitudinal response variable case do you still get excessive execution times? And staying focused on the single response case, am I correct that it is an overkill to speak of this as a latent variable model? For that case the model seems to involve random effects and doesn’t require a latent variable interpretation, right? This would reduce to your answer at https://stats.stackexchange.com/questions/498872 .

In extending your answer to the multiple-subject hierarchical model case, the dimension L of the white noise is equal to the total number of rows in the tall and thin dataset, i.e., L, which is the number of subjects times the average number of measurements per subject. Since we assume that the white noise is independent for every row, we have effectively L parameters for white noise. Why assuming one variance for all L elements of white noise, and by having a prior for this variance that easily allows it to be small, the effective number of parameters in the model is much smaller than L but still may be large enough to cause instability / wide posterior distributions for the main parameters, right?

That makes sense. Since the ordinary random effect hierarchical model can be so directly implemented in Stan, I can understand these autoregressive models if they kick in after the first measurement for a given subject, and for the first measurement they use the ordinary random effect.

I do not remember by hart but this model takes a lot of time because (i) it has two levels, one for observed response and one for latent variables (here OU process is used); (ii) matrix exponential operator; and (iii) a large number of random intercepts.

Now, if you

then you do not have (i), and (ii), I think the calculation will be much faster.

You are right, now you only have random effects and OU process for residuals (if I am correct as in your post you mention residuals).

I developed another model for this (at the latent level). You can find here with Stan codes here.

You can remove the IRT model part, just keep the model at level 2.

For the last part of your reply,

I am not sure I understand your idea. If you could show the model you are trying to fit, I may compare it to mine and see where they share similarities.

1 Like

Just a heads up that I am belatedly realizing that the model I wrote (but in poor-form didn’t actually check for syntax errors) needed some tweaks to actually compile. I just edited my post so it works now.

1 Like

Charles this will take me a while to digest as I’ve avoid latent variables in my career, but this is immensely helpful.

2 Likes

In discrete time approaches you can at least partially get around the problem by including higher lags, and this is not uncommon, but does mean you may need more parameters for the same underlying ar1 system. (Of course there is the benefit that if not a simple exponential decay, the extra parameters help to capture it) You could also take such an approach in continuous time, but to me at least it feels pretty awkward to have the expectation for a certain time point change depending on the timing schedule of previous measurements.

2 Likes

Taking a fresh look at the problem and seeing the complexities involved in modeling an AR(1) structure on the linear predictor scale, and seeing the limitations this places on the magnitude of autocorrelation on the original ordinal Y scale, I am leaning more towards a simpler solution of using a first-order Markov model assuming conditional independence, where ordinal Y(t) at time t for a given subject is modeled as a proportional odds model conditional on Y(t-1) and the baseline covariates. I’ll attempt to do the marginalization (e.g., to get the distribution of Y(t_{max}) conditional only on baseline covariates) using the posterior predictive distribution. I’m about to start a new topic on that.

@harrelfe I realize that you are posting this on a Stan forum and also are trying to avoid latent variables, but this problem otherwise sounds like something in INLA’s wheelhouse. With INLA, functions are (presumptively) treated as Markov processes, proportional odds / survival options appear to be available, ARx / OU and various other Markov processes are supported, and you can lag observations within groups of interest, all in a traditional formula-style interface without the need to program Stan code. Plus, the INLA result may end up being more accurate due to the lack of simulation error.

I’m interested in seeing the Stan solution too, and note that you opened another topic to this effect, but if the ultimate goal in the meantime is to solve the problem and solve it well, I thought this alternative might be worth considering. Good luck!

2 Likes