Spatial error model with Stan and comparison with maximum likelihood estimation

Hi all, I have a panel dataset of county-level corn yields. The dependent variable y is log(corn yield), and the independent variables X are climate variables, including precipitation and growing degree days, plus a time trend.
In the test dataset, I have 102 counties and 3 years, so n = 102 and t = 3.

The model is y = Xβ + u. Assuming u is spatially autocorrelated, I have u = λWu + ϵ, where λ is a scalar parameter indicating the degree of spatial dependence, W is the row-standardized neighbor matrix, and ϵ is assumed to be i.i.d. normal.

Since u = (I_n - λW)^(-1) ϵ (where I_n is the n×n identity matrix), we have ϵ = (I_n - λW)(y - Xβ). So my Stan code for the model is as below:

data {
  int<lower = 1> n;            // number of regions
  int<lower = 1> t;            // number of years
  int<lower = 1> p;            // number of covariates

  matrix[n, n] W;              // row-standardized neighbor matrix
  matrix[n, n] In;             // n x n identity matrix

  matrix[n, p] X[t];           // covariates, one matrix per year
  vector[n] y[t];              // outcomes, one vector per year
}

parameters {
  vector[p] beta;
  real<lower=-1, upper=1> lamda;
  real<lower=0> sigma;
}

model {
  for (i in 1:t) {
    vector[n] X2;
    X2 = (In - lamda * W) * (y[i] - X[i] * beta);
    X2 ~ normal(0, sigma);
  }
}

spatialBayes2.stan (399 Bytes)

where n stands for the number of regions (102), p for the number of control variables (6), and t for the number of years (3).
The data is as below:

Data: the 1st column is the fips ID for the 102 regions, the 2nd column is the year, the 3rd column is y, and the remaining columns are X.
testdata.csv (43.6 KB)

Row-standardized neighbor matrix:
w.csv (26.8 KB)

R code to read the data and Stan code:

library(rstan)

testdata <- read.csv("testdata.csv", header = TRUE)

n = length(unique(testdata$fips))
t = length(unique(testdata$year))

X = testdata[, c(4:9)]
y = testdata[, 3]
p = dim(X)[2]

X2 = array(unlist(X), c(n, t, p))  ## this is in (n, t, p)
X2 = aperm(X2, c(2, 1, 3))         ## change into (t, n, p)
y2 = array(unlist(y), c(t, n))

W <- read.csv("w.csv", header = TRUE)
W <- as.matrix(W[, 2:103])         ## drop the ID column; Stan expects a numeric matrix
In <- diag(1, dim(W)[1])

FE_data <- list(n = n,
                t = t,
                p = p,
                W = W,
                In = In,
                X = X2,
                y = y2)

fit <- stan(file = "spatialBayes2.stan",
            data = FE_data,
            iter = 2000,
            chains = 4)

The estimation results, however, are significantly different from those of some existing R packages that use maximum likelihood: they do not even agree in sign. And I am getting a lambda (the spatial parameter) close to 1.

The maximum likelihood estimation gives the following:

fit2 <- spml(logcornyield ~ dday0C + dday29C + prec + precsq + t + t2,
             data = allPanel_DM, listw = W2,
             model = "within", effect = "individual", spatial.error = "b", Hess = FALSE)

> summary(fit2)
Spatial panel fixed effects error model
 

Call:
spml(formula = logcornyield ~ dday0C + dday29C + prec + precsq + 
    t + t2, data = allPanel_DM, listw = W2, model = "within", 
    effect = "individual", spatial.error = "b", Hess = FALSE)

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.2498821 -0.0332716  0.0013296  0.0366884  0.2737303 

Spatial error parameter:
    Estimate Std. Error t-value              Pr(>|t|)    
rho 0.646942   0.051952  12.453 < 0.00000000000000022 ***

Coefficients:
            Estimate   Std. Error t-value        Pr(>|t|)    
dday0C      0.907453     0.260309  3.4861       0.0004902 ***
dday29C   -13.585637     2.198045 -6.1808 0.0000000006378 ***
prec      -10.021127     5.937466 -1.6878       0.0914538 .  
precsq      0.096969     0.045563  2.1282       0.0333182 *  
t       -1756.566054   884.931103 -1.9850       0.0471473 *  
t2         66.003375    32.456722  2.0336       0.0419939 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

whereas Stan gives:

> print(fit)
Inference for Stan model: spatialBayes2.
8 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=8000.

           mean se_mean     sd     2.5%     25%     50%     75%  97.5% n_eff Rhat
beta[1]   -0.17    0.00   0.12    -0.41   -0.25   -0.17   -0.09   0.07  4395    1
beta[2]   -0.96    0.02   1.32    -3.55   -1.85   -0.96   -0.07   1.65  3066    1
beta[3]    2.46    0.05   3.10    -3.52    0.38    2.38    4.58   8.60  3662    1
beta[4]   -0.02    0.00   0.02    -0.07   -0.04   -0.02    0.00   0.03  3631    1
beta[5] -554.59    8.69 481.66 -1499.97 -876.73 -562.52 -228.79 376.36  3070    1
beta[6]   19.46    0.32  17.62   -14.62    7.50   19.80   31.28  54.09  3063    1
lamda      0.98    0.00   0.02     0.94    0.97    0.98    0.99   1.00  6734    1
sigma      0.05    0.00   0.00     0.05    0.05    0.05    0.05   0.05  4999    1
lp__     764.63    0.04   2.00   759.89  763.50  764.92  766.11 767.57  3013    1

Could anyone help me with this? The key parameter of interest is dday0C, or beta[1], which should be positive since warm temperatures benefit crop growth. From this perspective, I think the ML estimate might be more sound, but I would still like to see how I can improve the Stan estimates. I am relatively new to Stan. Thank you in advance for the help!


Hi, I have a suggestion to pressure test what you are doing. Can you re-run everything but simply use one year’s data?

Hi, thank you so much for your reply. I just ran the one-year model with both maximum likelihood and Stan.

Results from maximum likelihood:
Call:errorsarlm(formula = logcornyield ~ dday0C + dday29C + prec +
precsq, data = allPanel, listw = W2, tol.solve = 0.000000000000000000000000000001)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.218425 -0.031173  0.010333  0.045016  0.115917 

Type: error 
Coefficients: (asymptotic standard errors) 
            Estimate Std. Error z value     Pr(>|z|)
(Intercept)  5.03760    0.98272  5.1262 0.0000002957
dday0C      -0.22723    0.26378 -0.8615      0.38898
dday29C      0.92122    2.75541  0.3343      0.73813
prec        26.66457   12.62478  2.1121      0.03468
precsq      -0.22125    0.10162 -2.1772      0.02946

Lambda: 0.857, LR test value: 61.815, p-value: 0.0000000000000037748
Asymptotic standard error: 0.049614
    z-value: 17.273, p-value: < 0.000000000000000222
Wald statistic: 298.37, p-value: < 0.000000000000000222

Log likelihood: 122.7876 for error model
ML residual variance (sigma squared): 0.0041721, (sigma: 0.064592)
Number of observations: 102 
Number of parameters estimated: 7 
AIC: -231.58, (AIC for lm: -171.76)

Results from Stan:

Inference for Stan model: spatialBayes2.
8 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=8000.

               mean     se_mean           sd          2.5%         25%      50%       75%        97.5% n_eff Rhat
beta[1] 25901605.47 55858296.35 612373185.30 -679554191.93 -1157583.10 -5578.85 238489.69 947213471.66   120 1.08
beta[2]       -0.11        0.01         0.36         -0.80       -0.35    -0.12      0.13         0.59  1351 1.01
beta[3]        2.12        0.09         3.01         -3.91        0.15     2.11      4.15         7.92  1196 1.01
beta[4]       27.78        0.38        14.31         -1.35       18.35    28.20     37.52        55.58  1406 1.00
beta[5]       -0.23        0.00         0.11         -0.45       -0.31    -0.23     -0.16         0.00  1536 1.00
lamda          1.00        0.00         0.00          1.00        1.00     1.00      1.00         1.00    71 1.09
sigma          0.06        0.00         0.00          0.06        0.06     0.06      0.07         0.07  2378 1.00
lp__         208.20        1.34         4.29        199.59      205.10   208.49    211.41       215.63    10 1.31

Samples were drawn using NUTS(diag_e) at Thu Dec 14 11:51:38 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The new data and the R code to read them are below.

R code

testdata <- read.csv("testdata.csv", header = TRUE)[, -1]; dim(testdata)

n = length(unique(testdata$fips))
t = length(unique(testdata$year))

X = testdata[, c(6:10)]
y = testdata[, 5]
p = dim(X)[2]

X2 = array(unlist(X), c(n, t, p))  ## this is in (n, t, p)
X2 = aperm(X2, c(2, 1, 3))         ## change into (t, n, p)
y2 = array(unlist(y), c(t, n))

W <- read.csv("w.csv", header = TRUE)
W <- as.matrix(W[, 2:103])         ## drop the ID column; Stan expects a numeric matrix
In <- diag(1, dim(W)[1])

FE_data <- list(n = n,
                t = t,
                p = p,
                W = W,
                In = In,
                X = X2,
                y = y2)


fit <- stan(file = "spatialBayes2.stan",
            data = FE_data,
            iter = 2000,
            chains = 4)

Test dataset and W matrix:
w.csv (26.8 KB)
testdata.csv (14.4 KB)

Thank you so much for your help!

Two computational suggestions:

The bounds for lambda are (-1, 1). Right now yours are [-1, 1], which can lead to problems in the computation.

Similarly, I always set the lower bound of sigma to be 0.0001, not 0.


Let us know if these help.
If these suggestions still do not solve your problem, you may need to spell out the matrix inversion more explicitly.

Note that, theoretically, the transformation used by Stan is such that the bounds are open. Technically, floating-point representation doesn't allow open bounds, and under- and overflow may lead to observing boundary values.

This will avoid division-by-zero or log(0) errors, but in practice using a lower bound of 0.0001 can be worse: if you have too much density near zero, the errors would give you a hint about that, whereas a non-zero lower bound can hide the problem. If you notice that you need to change the lower bound from 0 to 0.0001, you should reconsider the priors you use. Of course, there can be situations where a non-zero lower bound is also justified by physical constraints.
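For example (just a sketch of what I mean; the scale is arbitrary and should be chosen for your problem), keep the declaration as

real<lower=0> sigma;

and state the prior information in the model block instead of moving the bound:

sigma ~ normal(0, 1);  // weakly informative; a half-normal, given the lower bound of 0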

My understanding is that, for spatial error models, the theoretically sound bounds for lambda are strictly between -1 and 1. Therefore, it is incorrect to even consider -1 or 1.

The bit that looks funny to me is:

u = lambda * W * u + epsilon

I’m not experienced with this stuff, but if there was no noise:

u = lambda * W * u

Doesn’t seem like it’d be very interesting? u = 0 if (I - lambda * W) is full rank? This feels like a solution to another problem that is hiding in here. I’m very likely just being a goof though :P. Please correct me.

Have you had a look at: http://mc-stan.org/documentation/case-studies/IAR_Stan.html ? Does this look related to your problem at all?

The linear algebra that Mark laid out in the initial post is correct for Spatial Error Models. But I am only familiar with these models using MLE. The CAR formulation in your link is more convenient for Bayesian analysis. SEM and CAR are certainly related. I hope this helps.

Yes, I did check Mitzi's post. The CAR model has an analytical solution, a joint normal distribution, that can be implemented in Stan. I am not sure whether a similar analytical solution can be achieved with SEM. I will try to spend some time on it. Thanks for your suggestion!

Thank you so much for your comments. I will try to implement them in the model and see if they improve the estimation.

One follow-up question: my initial incentive to try the Bayesian approach is that, since I have a panel dataset, I would like to relax the assumption that the spatial coefficient, lambda, is constant over time. This would certainly bring more complexity to the maximum likelihood function, which is why I was thinking about doing it with hierarchical modeling. Have you seen any related examples? Thanks.

No, I have not seen anything like what you described. But I have not really kept up to date with the most current literature.

Setting aside the issue with the Stan estimation: even within MLE, I would have tried to fit separate yearly models to assess how much the lambda parameter changes over time before building a big panel model. Spatio-temporal models are very complex, and it is really hard to identify all the parameters, regardless of the estimation method.

Thank you for the reply. I think I will take a step back and re-think this. I will definitely try separate yearly models first and see if I can find anything interesting!

u = lambda * W * u

Maybe another thing about this… if this is an eigenvalue problem for u and 1/lambda, could you solve for these things explicitly before you run your model?

I see what you’ve done with the algebra, but I’m trying to think of how the data gets generated from the parameters so I can understand what’s going on. I’m having trouble with the lambda.

It just feels like it should be possible to write this model like:

y ~ multi_normal(X * beta, some_spatially_correlated_noise)

If you consider -1+epsilon and 1-epsilon to be valid for any epsilon, then it is correct to set
real<lower=-1, upper=1>
What I tried to say is that this notation in Stan excludes -1 and 1, and thus it's not [-1, 1].
Unfortunately, computers can't handle arbitrarily small epsilon, so it's not exactly (-1, 1) either.
You may also get underflows and overflows, in which case you might ask whether you really need values so close to -1 or 1. I was suggesting that it would be easier to use smooth priors stating that values extremely close to -1 or 1 are unlikely than to choose some fixed limit further away from -1 and 1.
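For example (purely illustrative; the scale is arbitrary, not a recommendation for this dataset):

lamda ~ normal(0, 0.5);  // smooth prior: combined with <lower=-1, upper=1>, it downweights values near the bounds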

It should be: y ~ multi_normal(X * beta, σ^2 * [(I - λW)'(I - λW)]^(-1))

I hope Mark can check my work…

Ohhh, if you have the inverse of the covariance:

y ~ multi_normal_prec(X * beta, (I - λW)' * (I - λW)); // Not sure how to add in the iid normal noise on the diagonal here -- don't think it's so trivial any more

Or whatever should be fine. It's apparently the precision parameterization? Maybe this is a little inefficient, but is it possible to just try the model using this? Or the regular multi_normal, if it's better that way?
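For concreteness, here's a minimal full-program sketch of that idea (untested against your data; it reuses the data and parameters blocks from the original post, and assumes the iid ε enters by scaling the precision by 1/sigma^2):

data {
  int<lower = 1> n;
  int<lower = 1> t;
  int<lower = 1> p;
  matrix[n, n] W;
  matrix[n, n] In;
  matrix[n, p] X[t];
  vector[n] y[t];
}
parameters {
  vector[p] beta;
  real<lower=-1, upper=1> lamda;
  real<lower=0> sigma;
}
model {
  // precision of y implied by the SEM: (1/sigma^2) * (I - lambda*W)' * (I - lambda*W)
  matrix[n, n] M = In - lamda * W;
  matrix[n, n] P = crossprod(M) / square(sigma);
  for (i in 1:t)
    y[i] ~ multi_normal_prec(X[i] * beta, P);
}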

I’m wondering if there’s a Jacobian missing if we do:

X2 = (In-lamda*W)*(y[i]-X[i]*beta);
X2 ~ normal(0, sigma);

Right? Like, we're taking a nonlinear transformation of parameters (there are beta * lambda terms) and then attaching a distribution to it.
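If that's the issue, my guess at a fix (a sketch, untested) would be to keep the transformed-error formulation but add the log absolute determinant of the transformation to the target, since epsilon = (In - lamda*W) * (y - X*beta) is linear in y with a determinant that depends on lamda:

model {
  // Jacobian adjustment: each year contributes log|det(In - lamda*W)| to the log density
  target += t * log_determinant(In - lamda * W);
  for (i in 1:t) {
    vector[n] eps = (In - lamda * W) * (y[i] - X[i] * beta);
    eps ~ normal(0, sigma);
  }
}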

Thanks guys. @sonicking I think you are correct. Let me just make sure I get the math right here:

So if ε ~ N(0, σ^2 * I), then since y = X*beta + (I - λW)^(-1) * ε, and taking A = (I - λW)^(-1) for convenience, the conditional distribution of y given X and beta should be N(X * beta, A * σ^2 * I * t(A)), or N(X * beta, σ^2 * A * t(A)).

So maybe I can use multi_normal directly instead of multi_normal_prec. And I think @bbbales2 is correct that I might have missed the Jacobian in the previous setup.
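So something like the following is what I have in mind (hypothetical and untested; note the explicit inverse, which is an O(n^3) operation at every evaluation):

model {
  matrix[n, n] A = inverse(In - lamda * W);            // A = (I - lambda*W)^(-1)
  matrix[n, n] Sigma = square(sigma) * tcrossprod(A);  // sigma^2 * A * A'
  for (i in 1:t)
    y[i] ~ multi_normal(X[i] * beta, Sigma);
}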

I suspect the precision formulation of the multi_normal is faster. Not sure, but since you supply the inverse of the covariance, the solve needed to evaluate the log density becomes just a matrix multiply. And supplying the inverse yourself requires no actual matrix inversions, which are really nice to avoid.

I might have missed the Jacobian in the previous setup.

I think I meant Jacobian as a placeholder for some sort of density-of-space-transformation thingy; it just happens to be a Jacobian in some cases. There's gotta be a better word for it :P. But try things out and see if the estimates change. These sorts of volume-changing terms are important for sampling but might not matter for MLE, so let's see what the actual experiments say. You might also try running Stan's optimizer on your model (?optimizing in rstan).