Spatial Durbin model

I’m trying to find efficient Stan code for a spatial Durbin model. This model is typically specified as

y = \rho W y + \beta_0 + X \beta + WX \theta + \varepsilon
or
y = (I - \rho W)^{-1} (\beta_0 + X \beta + WX \theta + \varepsilon)

where y is a vector of observed response values at n locations, X is an n \times p matrix of covariates, W is an n \times n spatial weights matrix, and the Greek letters are all parameters with the obvious form. See, e.g., eqn (7) in http://thescipub.com/PDF/jmssp.2013.169.174.pdf.

Does anyone have or know of Stan code to fit this model efficiently? (If not, I’ll post my inefficient attempt and a dataset, but I figured I’d see if there’s already an answer out there before I went to that trouble.)


My attempt is attached. This is a couple of orders of magnitude faster than my previous attempts, but I still have some concerns.

In my previous attempts, I used the second specification, which introduces both a matrix inversion and autocorrelation in the error structure. These are the main reasons why my previous versions were too slow to be useful.

The current attempt essentially uses the first specification, defining y_hat, the non-stochastic part, in terms of y. So no matrix inversion necessary. However, I’m not 100% confident that I’m handling the error structure correctly. The \varepsilon are supposed to be i.i.d. But the Stan implementations of an AR(p) time series model use autocorrelated errors, even though the standard specification of AR(p) treats them as i.i.d.

The other trick was to use sparse matrices. I’m 95% sure I’ve coded this correctly, but if anyone can confirm that would be helpful.

Currently this takes ~5 seconds for a single chain of 100 transitions. That’s fast enough to be useable, but I’d appreciate any advice on optimizing further, since eventually I have to run this six times on much larger datasets.
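
Roughly, the heart of the current attempt looks like the sketch below (simplified, with illustrative names; the attached mwe.stan is the real thing, and it also handles W sparsely rather than as a dense matrix):

data {
    int<lower = 1> n;                     // number of locations
    int<lower = 1> p;                     // number of covariates
    matrix[n, p] X;
    matrix[n, n] W;                       // row-normalized spatial weights
    vector[n] y;
}
transformed data {
    matrix[n, p] WX = W * X;              // spatially lagged covariates
}
parameters {
    real alpha;                           // intercept (\beta_0 in the equations above)
    vector[p] beta;
    vector[p] theta;
    real<lower = -1, upper = 1> rho;
    real<lower = 0> sigma;
}
model {
    // first specification: the non-stochastic part is defined in terms of y,
    // so no matrix inversion is needed
    vector[n] y_hat = alpha + rho * (W * y) + X * beta + WX * theta;
    y ~ normal(y_hat, sigma);
    // priors omitted; note there is no log|I - rho * W| term here, which may
    // be where my error-structure worry comes in
}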

mwe.stan (1.4 KB)
mwe.R (997 Bytes)

mwe.Rdata


I did a few comparisons of my code to the frequentist estimates produced by spdep::lagsarlm. The code is slow, but usable, and doesn’t appear to dramatically diverge from the frequentist methods. A version with priors and a comparison to spdep is available in this gist: https://gist.github.com/dhicks/0abc2e27f9171f74319735c257892c7c

I guess any time you have linear predictors, you could throw in a QR reparameterization and see if that gets you anywhere (page 125 of the 2.17.0 manual).
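
For a plain regression, the manual’s recipe is roughly the following sketch (illustrative names; for your model the same substitution would apply to the X \beta part of the linear predictor):

data {
    int<lower = 1> n;
    int<lower = 1> p;
    matrix[n, p] X;
    vector[n] y;
}
transformed data {
    // thin, scaled QR decomposition of X, as in the manual
    matrix[n, p] Q_ast = qr_Q(X)[, 1:p] * sqrt(n - 1);
    matrix[p, p] R_ast = qr_R(X)[1:p, ] / sqrt(n - 1);
    matrix[p, p] R_ast_inverse = inverse(R_ast);
}
parameters {
    real alpha;
    vector[p] theta_q;                    // coefficients on Q_ast
    real<lower = 0> sigma;
}
model {
    y ~ normal(alpha + Q_ast * theta_q, sigma);
}
generated quantities {
    vector[p] beta = R_ast_inverse * theta_q;   // back on the original X scale
}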

I don’t think you can get away from the solve if that’s what the model says. It’s hard for me to check the sparse implementation off the top of my head but let me write a couple eqs. and you can check if we’re on the same page.

We have:

y = (I - \rho W)^{-1} (\beta_0 + X \beta + W X \theta + \epsilon)

Assuming each element of \epsilon is i.i.d. normal noise with standard deviation \sigma, this is the same as saying:

z \sim N(\beta_0 + X \beta + W X \theta, \sigma^2 I)
y = (I - \rho W)^{-1} z

where N is a multivariate normal and I is the identity matrix.

We want a distribution on y so we can write our likelihood, so just multiplying the (I - \rho W)^{-1} transform through:

y \sim N((I - \rho W)^{-1}(\beta_0 + X \beta + W X \theta), \sigma^2 (I - \rho W)^{-1} (I - \rho W)^{-T})

The inverse comes out in the covariance so that you can use multi_normal_prec in Stan:

y \sim N((I - \rho W)^{-1}(\beta_0 + X \beta + W X \theta), \sigma^2 ((I - \rho W)^T (I - \rho W))^{-1})

multi_normal densities formulated with precision matrices can be way easier to work with because you already have the inverse of the covariance (it’s just (I - \rho W)^T (I - \rho W) / \sigma^2). You still have to worry about the determinant, but maybe that’s easy. W is fixed, right? What is \rho?
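
In Stan that would be something like the following (dense, untested sketch; it reuses the names from the sketch earlier in the thread, with alpha standing in for \beta_0):

model {
    matrix[n, n] A = diag_matrix(rep_vector(1.0, n)) - rho * W;
    // mean: solve (I - rho W) mu = beta_0 + X beta + W X theta, no explicit inverse
    vector[n] mu = mdivide_left(A, alpha + X * beta + WX * theta);
    // precision is (A' A) / sigma^2, and multi_normal_prec handles the determinant
    y ~ multi_normal_prec(mu, crossprod(A) / square(sigma));
}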

Now the trouble is the term on the inside.

Just substituting in placeholder vars A and b, this looks like we’re saying the mean of the normal is:

(I - A)^{-1} b

or

\mu = (I - A)^{-1} b

which corresponds to:

\mu = A \mu + b

which we could solve with fixed-point iteration, depending on the properties of A. I suspect that by doing the yhat = p * W * y + B0 + ... thing you’re basically doing one fixed-point iteration. If the problem is easy enough, maybe that’s good enough.
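
To spell that out: the iteration is \mu^{(k+1)} = A \mu^{(k)} + b. Starting from \mu^{(0)} = b, after k steps you get \mu^{(k)} = (I + A + A^2 + \dots + A^k) b, which converges to (I - A)^{-1} b as long as every eigenvalue of A has absolute value less than 1. One iteration just keeps b + A b and drops the higher-order terms.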

Any of that make sense? I’ve probably screwed something up, or missed the point of something, so don’t take this stuff as totally true haha. I’ve just seen this thread a couple times and keep being too lazy to finish my response. Thanks for keeping us posted on your progress!


The QR decomposition of X is new to me, thanks!

The rest of your math all makes sense to me. \mu = A\mu + b is basically the non-stochastic part of the first specification. I’m not sure what properties A needs to have in order for a single fixed point iteration to yield a good approximation for the solution, though. (My linear algebra is weaker than I would like.)

Same. Step 1 is just to solve the eqs. like this outside of the model and see if it actually works. If it does, then you can dig through the eqs. and find an explanation if you want.

Check that the largest absolute value of the eigenvalues of \rho W is less than 1.

Surprisingly, while the QR decomposition did speed up gradient estimation, it also caused a dramatic increase in divergent transitions. Like, n_eff values of 3 and R-hat values of 28 in 2 chains of 100 samples each.

Stan’s new sparse matrix implementation (which I found browsing through the 2.17 manual — I was using the 2.10 manual) was more useful. About a 2x speedup, with no apparent drop in n_eff.
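
Concretely, the change from the dense sketch earlier in the thread is to pass W in compressed row storage and replace W * y with csr_matrix_times_vector (illustrative names; on the R side, rstan::extract_sparse_parts(W) produces the w/v/u pieces):

data {
    ...                                   // as in the dense sketch, minus the dense W
    int<lower = 0> W_nnz;                 // number of nonzero entries of W
    vector[W_nnz] W_w;                    // nonzero values
    int<lower = 1> W_v[W_nnz];            // column indices
    int<lower = 1> W_u[n + 1];            // row start pointers
}
model {
    vector[n] Wy = csr_matrix_times_vector(n, n, W_w, W_v, W_u, y);
    vector[n] y_hat = alpha + rho * Wy + X * beta + WX * theta;
    y ~ normal(y_hat, sigma);
    ...
}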

It looks like one major problem is with \rho (the spatial autoregression coefficient, in \rho Wy) and \alpha (the intercept term). There’s a very strong correlation between the samples for these parameters (across all of the Stan implementations so far), but they’re modeled as independent. I tried replacing them with a 2-vector representation and a multivariate normal prior. But this led to initialization errors, I think because I couldn’t figure out how to set the prior on the covariance matrix. I got as far as something like this:

parameters {
    vector[2] ra; 
    vector[2] mu_ra; 
    cov_matrix[2] Sigma_ra;
    ...
}
transformed parameters {
    real<lower = -1, upper = 1> rho = ra[1];
    real alpha = ra[2];
    ...
}
model {
    ra ~ multi_normal(mu_ra, Sigma_ra);
    mu_ra ~ normal(0, 3);
}
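
If I’m reading the manual right, the usual recipe for the prior would be to split the covariance into a vector of scales and a Cholesky-factored correlation with an LKJ prior, something like this (untested sketch):

parameters {
    vector[2] ra;
    vector[2] mu_ra;
    vector<lower = 0>[2] tau_ra;          // scales
    cholesky_factor_corr[2] L_ra;         // Cholesky factor of the correlation
    ...
}
model {
    ra ~ multi_normal_cholesky(mu_ra, diag_pre_multiply(tau_ra, L_ra));
    mu_ra ~ normal(0, 3);
    tau_ra ~ normal(0, 1);                // half-normal via the lower bound
    L_ra ~ lkj_corr_cholesky(2);
    ...
}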

Is the intercept correlated much with the other parameters? If that were the case I’d say center your covariates.


No; the problem is just with \alpha and \rho. The covariate parameters (\beta and \theta) all seem to be well-behaved.

Could centering the response help? At the very least it would remove \alpha. I’ll try that when I get a chance.

It might not be that surprising, actually. If the QR (re)parametrisation speeds up computation, it might be facilitating exploration of the parameter space and thereby uncovering pathologies that were otherwise hidden. Anyway, I might be way off here.

I wonder if this has something to do with how W is constructed. It’s an adjacency matrix, right? So the elements are 1s if two points are connected? Is the diagonal nonzero?

It’s row-normalized, and in the case I’ve been testing it’s based on K=3 nearest neighbors because the locations generally aren’t adjacent. (They’re census-designated places, which are designed to correspond to cities, towns, and other places with relatively high population density in rural areas.)

So each row has exactly 3 nonzero values, each equal to 1/3.