Improving efficiency when modeling autocorrelation

I’m trying to model the residuals with an AR(1) structure, but I’m running into a runtime issue. Without the autocorrelation structure, the computation takes about 15 minutes, but once I add the AR(1) structure, it seems to take forever. Is there a way to make the following code more efficient?

functions {
  matrix cov_matrix_ar1(real ar, real sigma, int nrows) {
    matrix[nrows, nrows] mat;
    vector[nrows - 1] gamma;
    mat = diag_matrix(rep_vector(1, nrows));
    for (i in 2:nrows) {
      gamma[i - 1] = pow(ar, i - 1);
      for (j in 1:(i - 1)) {
        mat[i, j] = gamma[i - j];
        mat[j, i] = gamma[i - j];
      }
    }
    return sigma^2 / (1 - ar^2) * mat;
  }
}
data { 
  int<lower=1> N;  
  vector[N] y;  
  int<lower=1> nX;
  matrix[N,nX] X;  
  int<lower=1> nZ;
  matrix[N,nZ] Z; 
} 
transformed data {
vector[N] se2 = rep_vector(0, N); 
} 
parameters { 
  vector[nX] beta;
  vector[nZ] u;
  real<lower=0> sigma;  
  real<lower=0> tau;    
  real<lower=-1,upper=1> phi;   
} 
model {
  matrix[N, N] res_cov_matrix;
  matrix[N, N] Sigma; 
  vector[N] mu = X*beta + Z*u;
  res_cov_matrix = cov_matrix_ar1(phi, sigma, N);
  Sigma = res_cov_matrix + diag_matrix(se2);
  Sigma = cholesky_decompose(Sigma); 
  sigma ~ cauchy(0,5);
  tau ~ cauchy(0,5);
  u ~ normal(0,tau);
  y ~ multi_normal_cholesky(mu,Sigma);
} 
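For context, the matrix that `cov_matrix_ar1` builds has the closed form \Sigma_{ij} = \sigma^2/(1-\phi^2)\,\phi^{|i-j|}. A minimal NumPy sketch of the same construction (the function name mirrors the Stan code; the parameter values are arbitrary):

```python
import numpy as np

def cov_matrix_ar1(ar, sigma, nrows):
    # Same matrix as the Stan function above:
    # entry (i, j) is sigma^2 / (1 - ar^2) * ar^|i - j|
    idx = np.arange(nrows)
    return sigma**2 / (1 - ar**2) * ar ** np.abs(idx[:, None] - idx[None, :])

Sigma = cov_matrix_ar1(0.6, 2.0, 5)
```

Building (and then factorizing) this N-by-N matrix at every evaluation of the model block is what dominates the runtime.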

Hi! You might want to check out this implementation by @rtrangucci. It will likely be faster, since you don’t need to build the whole covariance matrix. Here is another example (for a non-Gaussian likelihood) which uses the trick laid out in the StanCon Helsinki Beginners Tutorial (near the end of the notebook).


I really appreciate the suggestion, @Max_Mantei!

The dynamic panel model you referred to is

y_{i,t}=\alpha+u_i+\delta y_{i,t-1}+X_{i,t}\beta+\epsilon_{i,t}\\ u_i\sim N(0,\sigma_u)

while my model is

Y_{N\times 1} \sim N(\mu_{N\times 1}, \Sigma_{N\times N}) \\ \mu_{N\times 1} = X_{N\times nX}\beta_{nX\times 1}+Z_{N\times nZ}u_{nZ\times 1} \\ u_{nZ\times 1}\sim N(0_{nZ\times 1}, \tau^2 I_{nZ\times nZ})

where \Sigma_{N\times N} is supposed to capture the serial correlation structure of AR(1) in the residuals.

Since this is actually my first piece of Stan code, I’m still struggling with how to make the connection between the two models and borrow the idea from @rtrangucci that you mentioned. Could you possibly help me out with a more specific chunk of code? Thanks a lot!

Hey! I think I’ll be able to put something together (code/example) next week (I’m on vacation and don’t really have time to post at length right now). :)

@Max_Mantei Enjoy your vacation! Looking forward to your code.

Hey!

Some more general notes: I briefly skimmed your code, and I think ... + diag_matrix(se2) won’t do anything (except being costly), since se2 is all zeros, so diag_matrix(se2) is a zero matrix. Also, you build the covariance matrix and then Cholesky-decompose it – it’s likely much more efficient to build the Cholesky factor directly in the first place.
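To illustrate that second point: for an AR(1) covariance the Cholesky factor is known in closed form, so it can be written down directly instead of calling cholesky_decompose on the full matrix. A NumPy sketch (values arbitrary; it follows the recursion y_1 = \sigma/\sqrt{1-\phi^2}\, z_1, \; y_i = \phi y_{i-1} + \sigma z_i):

```python
import numpy as np

def ar1_chol(phi, sigma, n):
    # Lower-triangular L with L @ L.T equal to the AR(1) covariance
    # sigma^2 / (1 - phi^2) * phi^|i-j|, built directly from the
    # AR(1) recursion instead of decomposing the full matrix.
    L = np.zeros((n, n))
    L[:, 0] = phi ** np.arange(n) * sigma / np.sqrt(1 - phi**2)
    for j in range(1, n):
        L[j:, j] = phi ** np.arange(n - j) * sigma
    return L

phi, sigma, n = 0.7, 1.5, 6
L = ar1_chol(phi, sigma, n)
```

The same construction could be done in a Stan function and passed to multi_normal_cholesky, skipping the O(N^3) decomposition.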

The “conventional” way of coding an AR(1) process is laid out in the Stan users guide, and for your model it would look something like this:

data { 
  int<lower=1> N;  
  vector[N] y;  
  int<lower=1> nX;
  matrix[N,nX] X;  
  int<lower=1> nZ;
  matrix[N,nZ] Z; 
}
parameters { 
  vector[nX] beta;
  vector[nZ] u;
  real<lower=0> sigma;  
  real<lower=0> tau;    
  real<lower=-1,upper=1> phi;   
} 
model {
  vector[N] mu = X*beta + Z*u;

  sigma ~ cauchy(0,5);
  tau ~ cauchy(0,5);

  ... // priors for beta?
  u ~ normal(0,tau); // maybe use non-centered parameterization?

  y[2:N] ~ normal(mu[2:N] + phi*y[1:(N-1)], sigma);

} 
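If it helps to see what that likelihood statement does outside of Stan: the vectorized sampling statement is just a product of univariate normal densities, one per time step given its predecessor. A small NumPy sketch with simulated data (arbitrary values; mu stands in for X*beta + Z*u):

```python
import numpy as np

rng = np.random.default_rng(0)
N, phi, sigma = 200, 0.5, 1.0
mu = rng.normal(size=N)            # stand-in for X*beta + Z*u
y = np.empty(N)
y[0] = mu[0]
for t in range(1, N):
    y[t] = mu[t] + phi * y[t - 1] + rng.normal(scale=sigma)

def normal_lpdf(x, loc, scale):
    # log density of a univariate normal
    return -0.5 * np.log(2 * np.pi * scale**2) - (x - loc) ** 2 / (2 * scale**2)

# Mirrors y[2:N] ~ normal(mu[2:N] + phi*y[1:(N-1)], sigma):
# O(N) work, no N-by-N covariance matrix needed.
ll = normal_lpdf(y[1:], mu[1:] + phi * y[:-1], sigma).sum()
```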

I’m not an expert on time-series stuff, but I guess the basic idea is to not model the joint distribution but a bunch of conditionals (conditional on the previous time step in the case of AR(1)), so y_t | y_(t-1) ~ normal(...). You can also derive the marginal distribution of y[1] and add it to the target density.

That’s the basic idea to get the speed-up, because there is no need for the large N by N covariance matrix.


As @betanalpha once wrote: “Cauchys are almost always a bad idea, especially in priors” (Large model with skewed distributions), so perhaps thinking about something more informative could help improve efficiency.


@Max_Mantei I really appreciate that you spent time coming up with a solution.

The original model I had in mind is

Y_{N\times 1} \sim N(\mu_{N\times 1}, \Sigma_{N\times N}) \\ \mu_{N\times 1}=X_{N\times nX}\beta_{nX\times 1}+Z_{N\times nZ}u_{nZ\times 1}\\ u_{nZ\times 1}\sim N(0_{nZ\times 1}, \tau^2 I_{nZ\times nZ})

If I understand it correctly, the underlying model for your code is

y_t\ |\ y_{t-1} \sim N(\mu_t + \phi y_{t-1}, \sigma) \\ \mu_t = \boldsymbol{x}_t\beta_{nX\times 1}+\boldsymbol{z}_t u_{nZ\times 1}\\ u_j\sim N(0, \tau)

In other words, the autocorrelation structure \Sigma in my original model is specified in the residuals, while in your case the autocorrelation structure is explicitly modeled through the first-order lagged response time series as an explanatory variable.

I’m going to try out your code soon. In the meantime, I have two more questions:

  1. What exactly are the differences between the two models in terms of assumptions and results?
  2. How to derive the marginal distribution of y[1] and add it to the target density?

Hey! Thanks for your patience, I didn’t have much time to write a fully comprehensive answer. I think I also confused some things in my earlier answers, sorry for that! Also, again, I’m no time series expert, so I hope you bear with me. If you feel we’re stuck, we can also try to ping some of the people who definitely know a lot more about this than I do. :)

But let’s take a step back. Your model is (dropping the subscripts and taking \mu as you have defined it above):

Y \sim MVN(\mu,\Sigma),

where \mu is a vector as specified in your post and \Sigma is a variance-covariance matrix with an auto-regressive structure. Now – just to clarify – you indexed everything by N or n; do you have a single time series, or multiple units that you observe over time?

Ok, I hope I get this right:

If the series of Y you observe is a single time series (coming from one unit), an auto-regressive AR(1) process for this would be

y_{n} \sim N(\mu_n + \rho y_{n-1}, \sigma),

so this is no longer a multivariate normal (MVN) but a univariate normal, and the AR component enters through the mean vector – this is just a lagged dependent variable. That is basically the model I described above.

It’s also possible to specify a moving average MA(1) process for the data, which is given by

\begin{align} y_{n} &\sim N(\mu_n + \rho (y_{n-1} - \mu_{n-1}), \sigma) \\ \leftrightarrow y_{n} &\sim N(\mu_n + \rho \epsilon_{n-1}, \sigma), \end{align}

where \epsilon_{n-1} is the lagged residual. This seems to be more like what you wanted to do, right? And because

\begin{align} \begin{bmatrix} y_{n-2} \\ y_{n-1} \\ y_{n} \end{bmatrix} \sim MVN \left( \begin{bmatrix} \mu_{n-2} \\ \mu_{n-1} \\ \mu_{n} \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix} \right) \end{align}

implies

\begin{align} y_{n-1} | y_{n-2} &\sim N \left(\mu_{n-1} + \rho(y_{n-2} - \mu_{n-2}), \sqrt{\sigma^2(1 - \rho^2)} \right) \\ y_{n} | y_{n-1} &\sim N \left(\mu_{n} + \rho(y_{n-1} - \mu_{n-1}), \sqrt{\sigma^2(1 - \rho^2)} \right) \\ \end{align}

and letting y_{n-1} - \mu_{n-1} = \epsilon_{n-1} as above and \omega = \sqrt{\sigma^2(1 - \rho^2)}, we can just write:

y_{n} \sim N(\mu_n + \rho \epsilon_{n-1}, \omega)

so it has the same structure as an MA(1) process. The huge benefit is that you don’t have to use the multivariate normal distribution; you can use the univariate normal instead, which should be much faster.
And if you take another look at the other thread, you can see that a similar structure has been discussed there, but for multiple units (multiple time series --> a panel); that’s why I thought that thread could be helpful.
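The conditioning step above follows from the standard formulas for a partitioned multivariate normal, and it can be checked numerically. A quick NumPy sketch with arbitrary numbers (\rho and \sigma are the symbols from the derivation):

```python
import numpy as np

sigma, rho = 1.3, 0.4
mu = np.array([0.5, -0.2])
# Marginal variance sigma^2 on the diagonal, covariance sigma^2 * rho off it:
Sigma = sigma**2 * np.array([[1.0, rho],
                             [rho, 1.0]])

y_prev = 0.9
# Conditional of the second component given the first:
# mean = mu2 + S21 / S11 * (y1 - mu1), var = S22 - S21^2 / S11
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y_prev - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
```

The results match \mu_n + \rho(y_{n-1} - \mu_{n-1}) and \sigma^2(1-\rho^2) from the derivation.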

Regarding the assumptions of the models… well, there is a ton of literature on AR(1) and MA(1) models, but I like to “look” at the model. For example, if you come from the multivariate normal distribution, it’s pretty clear that |\rho|<1 must hold… I cannot tell you whether one of these models makes sense for your specific application, at least not without knowing a bit more about the context.

How to derive the marginal distribution of y[1] and add it to the target density?

—see edit below!

I guess this is usually not done. If you check the users guide for the AR(1) and MA(1) models, it is not done there. If you have another look at the StanCon tutorial that I linked above, you should find the gist of it – but generally, you are probably safe if you don’t do it (at least if you have more than just a handful of data points).
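For completeness: with \sigma denoting the innovation standard deviation (as in the AR(1) code earlier in the thread), stationarity gives the marginal of the first residual as

\epsilon_1 \sim N\left(0,\ \sigma/\sqrt{1-\rho^2}\right), \quad \text{i.e.} \quad y_1 \sim N\left(\mu_1,\ \sigma/\sqrt{1-\rho^2}\right)

which in Stan would just be y[1] ~ normal(mu[1], sigma / sqrt(1 - rho^2)); — this is the standard stationary-AR(1) result, not something specific to this thread.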

EDIT: I got around doing this over in another thread, in case you want to check that out, @Gang.

I hope that was not too confusing…
