Bym2

hi everyone. I hope you are well, I am a student of biostatistics. i have data including age group and gender group and the prevalence of disease as well as the shape file of my country.
but i can’t give the bym2 model for it. my question is, should I change the model written by mitzi moriss based on the data I have?

i mean, in the data part or other parts of the model, should I make changes in writing the model???

I have seen two examples that are on Git Hub. but I am very confused. I don’t know how to define my variable in the model…please guide me…thanks

README.Rmd (16.4 KB)

BYM2.stan (1.3 KB)
icar-functions.stan (6.5 KB)

@mitzimorris @cmcd please guide me…

Hi @shayeste! I think you will get helpful answers faster if you can provide a bit more detail about your problem. What do your data look like? Where do you get stuck?

Note also that brms provides the option to estimate a bym2 term on the intercept in regressions specified using R’s formula syntax. Perhaps this could provide a shortcut for you. See Spatial conditional autoregressive (CAR) structures — car • brms

2 Likes

hi @jsocolar .
my data include variables of age, gender, and province.
i managed to calculate the relative risk (RR) of the disease in each province by the bym model in open bugs according to the observed value and the expected value.
now according to the topic of my thesis, i have to estimate bym2 model as well.
but i don’t know how to define my variable in this model…it is really so difficult and confusing.

@mitzimorris 's implementation given on page 15 here http://www.stat.columbia.edu/~gelman/research/published/bym_article_SSTEproof.pdf already has built in the opportunity to pass covariates (i.e. the design matrix) and a sparse representation of the adjacency matrix (i.e. node1 and node2). Is there a way to be more specific about your question? Are you confused about how to construct the design matrix? How to construct the sparse representation of the adjacency matrix? What sort of likelihood term to use?

1 Like

@jsocolar I have read mitzi morris article several times.

I think you hit the target.
I implemented the model that mitzi morris gave in the article. but i failed to implement it.
i don’t know how to define the adjacency matrix. i have to write it according to the data that i have available .but how should define it??
i have had studies in this field. but I could not reach the goal. studies such as the links below…

https://mc-stan.org/users/documentation/case-studies/icar_stan.html

https://ourcodingclub.github.io/tutorials/stan-intro/

The adjacency matrix will in general be a square matrix with a row for each spatial unit (province in your case). It will contain ones if two regions are adjacent and zeros otherwise. In most applications, most of the regions don’t touch each other, so the matrix will be sparse (mostly zeros). The sparse representation that @mitzimorris uses is: for every pair of touching regions that exists in your data (that is, for every 1 in the lower triangle of your adjacency matrix), make an ordered pair giving the row index of the first region and the column index of the second region. Store these ordered pairs as two column vectors: one called node1 gives the indices of the first region, the other called node2 gives the indices of the second region. This is the sparse matrix encoding described at the bottom of page 8 of the paper.

The main pitfalls to be aware of here are:

  • I said “ordered pair” but actually the order doesn’t really matter. Make sure that each pair is represented only once (i.e. only index the 1s from the upper triangle or the lower triangle of the adjacency matrix, but not both!)
  • To get the indexing right, the rows and columns of the adjacency matrix must have regions in the same order as each other and also in the same order as the rows of the design matrix.
  • The code given in the paper works out-of-the-box only if you have just one observation per region. If you have multiple observations per region, then you’ll need to make a slight modification to the way that the bym2 component gets indexed back into the linear predictor.
2 Likes

@jsocolar thank you for your response and patience.
That’s right , on page 8 of the article , it is explained about the construction of the matrix.
I have 32 provinces, should my matrix be a 32*K matrix?

after I make the matrix, should I insert the nude1 and nude2 numbers into the model?
there is no need to put the matrix made in the model?
sorry, maybe my question seem a bit stupid. but although i have tried to have studies in this field , i have not understood much…

With 32 provinces, your adjacency matrix will be 32x32. Each cell will correspond to one pair of provinces, and will contain a 1 if the provinces are adjacent. An alternative way to encode the exact same matrix is to pass a list of all the cells that contain a 1. Since the adjacency matrix is symmetric, it is sufficient to pass a list of all cells in the lower triangle that contain a 1. node1 and node2 are this list, so when you pass node1 and node2 as data you are passing the adjacency matrix, just encoded differently.

2 Likes

This might help; you can create the list of nodes using geostan::prep_icar_data, then pass these nodes in to the Stan model as node1 and node2

Here’s an example (minus the Stan model):

library(geostan)
data(georgia)

C <- shape2mat(georgia, "B")
nodes <- prep_icar_data(C)
head(nodes$node1)
head(nodes$node2)

https://connordonegan.github.io/geostan/reference/prep_icar_data.html

3 Likes

thank you dear @cmcd and @jsocolar .
is it possible to introduce me to an example other than Scotland lip cancer data that explains the complete steps of implementing the bym2 model?
and I have another question …
can I use this helper function( mungeCARdata4stan)?
because I have implemented the BYM model in openbugs . now in stan, I want to run and this helper function in Scotland data example, data$adj, and data$num had turned it into a structure with nude1, nude2, N, N-edges field.
I mean, with the help of this helper function, I don’t need to write nude1 and nude2 anymore. this may be?

dear @cmcd and dear @jsocolar .
i am trying to run the model . at this stage , i have a problem , i dont know the meaning of this error . i searched but did not find the result .
library(“rstan”)
library(“rstudioapi”)
library(“parallel”)
library(“brms”)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectgores())

library(pkgbuild) # load packAge
find_rtools() # should be TRUE, assuming you have Rtools 3.5

#fit model icar.stan to NYC census tracts neighborhood map
install.packages(‘tidyverse’, dependencies = TRUE)
install.packages(‘rstanarm’, dependencies = TRUE)
library(rstan);
library(tidyverse)
library(rstanarm)

"data {
int<lower=0> N;
int<lower=0> N_edges;
int<lower=1, upper=N> node1[N_edges]; // node1[i] adjacent to node2[i]
int<lower=1, upper=N> node2[N_edges]; // and node1[i] < node2[i]

int<lower=0> y[N]; // count outcomes
vector<lower=0>[N] E; // exposure
int<lower=1> K; // num covariates
matrix[N, K] x; // design matrix

real<lower=0> scaling_factor; // scales the variance of the spatial effects
}
transformed data {
vector[N] log_E = log(E);
}
parameters {
real beta0; // intercept
vector[K] betas; // covariates

real<lower=0> sigma; // overall standard deviation
real<lower=0, upper=1> rho; // proportion unstructured vs. spatially structured variance

vector[N] theta; // heterogeneous effects
vector[N] phi; // spatial effects
}
transformed parameters {
vector[N] convolved_re;
// variance of each component should be approximately equal to 1
convolved_re = sqrt(1 - rho) * theta + sqrt(rho / scaling_factor) * phi;
}
model {
y ~ poisson_log(log_E + beta0 + x * betas + convolved_re * sigma); // co-variates

// This is the prior for phi! (up to proportionality)
target += -0.5 * dot_self(phi[node1] - phi[node2]);

beta0 ~ normal(0.0, 1.0);
betas ~ normal(0.0, 1.0);
theta ~ normal(0.0, 1.0);
sigma ~ normal(0, 1.0);
rho ~ beta(0.5, 0.5);
// soft sum-to-zero constraint on phi)
sum(phi) ~ normal(0, 0.001 * N); // equivalent to mean(phi) ~ normal(0,0.001)
}
generated quantities {
real logit_rho = log(rho / (1.0 - rho));
vector[N] eta = log_E + beta0 + x * betas + convolved_re * sigma; // co-variates
vector[N] mu = exp(eta);
}"
options(mc.cores = parallel::detectCores())

library(INLA)

source(“mungecardata4stan.R”)
source(“iran_data.R”)
y = data$y;
E = data$E;
K = 1;
x = 0.1 * data$x;

nbs = mungeCARdata4stan(data$adj, data$num);
N = nbs$N;
node1 = nbs$node1;
node2 = nbs$node2;
N_edges = nbs$N_edges;
adj.matrix = sparseMatrix(i=nbs$node1,j=nbs$node2,x=1,symmetric=TRUE)
Q= Diagonal(nbs$N, rowSums(adj.matrix)) - adj.matrix
Q_pert = Q + Diagonal(nbs$N) * max(diag(Q)) * sqrt(.Machine$double.eps)
Q_inv = inla.qinv(Q_pert, constr=list(A = matrix(1,1,nbs$N),e=0))
scaling_factor = exp(mean(log(diag(Q_inv))))
scot_stanfit = stan(“bym2_predictor_plus_offset.stan”, data=list(N,N_edges,node1,node2,y,x,E,scaling_factor), warmup=5000, iter=6000);

Error in new_CppObject_xp(fields$.module, fields$.pointer, …) : **
** Exception: variable does not exist; processing stage=data initialization; variable name=N; base type=int (in ‘string’, line 3, column 2 to column 17)

In addition: Warning message:
In readLines(file, warn = TRUE) :
** incomplete final line found on ‘C:\Users\Uaer\Downloads\bym2_predictor_plus_offset.stan’**
failed to create the sampler; sampling not done

The data must be passed as a named list, e.g. list(N = N)

1 Like

thanks @jsocolar .

dear @jsocolar and @cmcd And @mitzimorris
finally, I failed to implement the model that I told you about above…
I will face the same error that I shared with you …
please guide me…
where is my work, it is wrong?
@jsocolar as you said, I tried to define the data correctly
I took the code from this address: Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data

my code:

pkgs <- c("sf", "spdep", "INLA", "rstan")
lapply(pkgs, require, character.only = TRUE)
rstan_options(javascript=FALSE)


'data {
  int<lower=0> N;
  int<lower=0> N_edges;
  array[N_edges] int<lower=1, upper=N> node1; // node1[i] adjacent to node2[i]
  array[N_edges] int<lower=1, upper=N> node2; // and node1[i] < node2[i]
  
  array[N] int<lower=0> y; // count outcomes
  vector<lower=0>[N] E; // exposure
  int<lower=1> K; // num covariates
  matrix[N, K] x; // design matrix
  
  real<lower=0> scaling_factor; // scales the variance of the spatial effects
}
transformed data {
  vector[N] log_E = log(E);
}
parameters {
  real beta0; // intercept
  vector[K] betas; // covariates
  
  real<lower=0> sigma; // overall standard deviation
  real<lower=0, upper=1> rho; // proportion unstructured vs. spatially structured variance
  
  vector[N] theta; // heterogeneous effects
  vector[N] phi; // spatial effects
}
transformed parameters {
  vector[N] convolved_re;
  // variance of each component should be approximately equal to 1
  convolved_re = sqrt(1 - rho) * theta + sqrt(rho / scaling_factor) * phi;
}
model {
  y ~ poisson_log(log_E + beta0 + x * betas + convolved_re * sigma); // co-variates
  
  // This is the prior for phi! (up to proportionality)
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
  
  beta0 ~ normal(0.0, 1.0);
  betas ~ normal(0.0, 1.0);
  theta ~ normal(0.0, 1.0);
  sigma ~ normal(0, 1.0);
  rho ~ beta(0.5, 0.5);
  // soft sum-to-zero constraint on phi)
  sum(phi) ~ normal(0, 0.001 * N); // equivalent to mean(phi) ~ normal(0,0.001)
}
generated quantities {
  real logit_rho = log(rho / (1.0 - rho));
  vector[N] eta = log_E + beta0 + x * betas + convolved_re * sigma; // co-variates
  vector[N] mu = exp(eta);
  "bym2_predictor_plus_offset.stan"
}'

library(rstan)   
options(mc.cores = parallel::detectCores())  

library(INLA)

source("mungeCARdata4stan.R")  
source("iran_data.R")
y = data$y;
E = data$E;
K = 1;
x = 0.1 * data$x;

nbs = mungeCARdata4stan(data$adj, data$num);
N = nbs$N;
node1 = nbs$node1;
node2 = nbs$node2;
N_edges = nbs$N_edges;


library(SpatialEpi)
icar.data$inv_sqrt_scale_factor

TWN <- readOGR(dsn = "C:/Users/Uaer/Desktop/New folder (2)/irn_adm_unhcr_20190514_shp (2)", 
               layer = "irn_admbnda_adm1_unhcr_20190514")
A <- spdep::nb2mat(spdep::poly2nb(TWN,queen = TRUE), style = "B", zero.policy = TRUE)

icar.data <- prep_icar_data(A)
icar.data$inv_sqrt_scale_factor

## calculate the scale factor for each of k connected group of nodes, using scale_c function
k <- icar.data$k
scale_factor <- vector(mode = "numeric", length = k)
'for (j in 1:k) {
  g.idx <- which(icar.data$comp_id == j) 
  if (length(g.idx) == k) {
    scale_factor[j] <- 1
    next
  }    
  Cg <- C[g.idx] 
  scale_factor[j] <- scale_c(Cg) 
}'
icar.data$inv_sqrt_scale_factor <- 1 / sqrt( scale_factor )
## see the new values
print(icar.data$inv_sqrt_scale_factor)

#Build the adjacency matrix using INLA library functions
adj.matrix = sparseMatrix(i=nbs$node1,j=nbs$node2,x=1,symmetric=TRUE)
#The ICAR precision matrix (note! This is singular)
Q=  Diagonal(nbs$N, rowSums(adj.matrix)) - adj.matrix
#Add a small jitter to the diagonal for numerical stability (optional but recommended)
Q_pert = Q + Diagonal(nbs$N) * max(diag(Q)) * sqrt(.Machine$double.eps)

# Compute the diagonal elements of the covariance matrix subject to the 
# constraint that the entries of the ICAR sum to zero.
#See the inla.qinv function help for further details.
Q_inv = inla.qinv(Q_pert, constr=list(A = matrix(1,1,nbs$N),e=0))

#Compute the geometric mean of the variances, which are on the diagonal of Q.inv
scaling_factor = exp(mean(log(diag(Q_inv))))
dl <- list('    
  N = nbs$N;
  N_edges = nbs$N_edges;
  node1 = nbs$node1;
  node2 = nbs$node2;
  y = data$y;
  E = data$E;
  K = 1;
  x = 0.1 * data$x;
  scaling_factor
 
')
dl <- c(dl, icar.data)

BYM2 <- "bym2_predictor_plus_offset.stan"
stanfit = stan(BYM2, data=dl, warmup=5000, iter=6000);

and error :

> **Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : **
> **  Exception: variable does not exist; processing stage=data initialization; variable name=N; base type=int (in 'string', line 3, column 2 to column 17)**
> **In addition: Warning message:**
> **In readLines(file, warn = TRUE) :**
> **  incomplete final line found on 'C:\Users\Uaer\Desktop\Data hyperglysemia_files\bym\bym2_predictor_plus_offset.stan'**
> **failed to create the sampler; sampling not done**
1 Like

This is not the correct syntax for generating a named list in R. You need:

dl <- list(
  N = nbs$N,
  N_edges = nbs$N_edges,
  ...
)

@jsocolar
sorry , i didn’t understand what you meant.
i don’t see a difference between what you said and what I have in my model …
my code above is not copied correctly…
my code is :

bym2 .R (4.2 KB)

oh ! yeah.
i understand.

@jsocolar
now i face this error… :(

**> **
**> Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : **
**>   Exception: int variable contained non-int values; processing stage=data initialization; variable name=y; base type=int (in 'string', line 8, column 2 to column 26)**
**> In addition: Warning message:**
**> In readLines(file, warn = TRUE) :**
**>   incomplete final line found on 'C:\Users\Uaer\Desktop\Data hyperglysemia_files\bym\bym2_predictor_plus_offset.stan'**
**> failed to create the sampler; sampling not done**

This error is telling you that in your Stan program you declared y to be of integer type, but in your data object you are passing values for y that are not integers.