# Multivariate analysis

Hi everyone,
I’m new to working with stan and I am trying to fit this multivariate model. I saw one article online on how to analyse “Bayesian Varying Effects Models in R and Stan”. Bayesian Varying Effects Models in R and Stan - Will Hipson.

The data was simulated this way:

theme_set(theme_bw(base_size = 14)) # setting ggplot2 theme

library(mvtnorm)

set.seed(12407) # for reproducibility

# unstandardized means and sds
real_mean_y <- 4.25
real_mean_x <- 3.25
real_sd_y <- 0.45
real_sd_x <- 0.54

N <- 30 # number of people
n_days <- 7 # number of days
total_obs <- N * n_days

sigma <- 1 # population sd
beta <- c(0, 0.15) # average intercept and slope
sigma_p <- c(1, 1) # intercept and slope sds
rho <- -0.36 # covariance between intercepts and slopes

cov_mat <- matrix(c(sigma_p[1]^2, sigma_p[1] * sigma_p[2] * rho, sigma_p[1] * sigma_p[2] * rho, sigma_p[2]^2), nrow = 2)
beta_p <- rmvnorm(N, mean = beta, sigma = cov_mat) # participant intercepts and slopes

x <- matrix(c(rep(1, N * n_days), rnorm(N * n_days, 0, 1)), ncol = 2) # model matrix
pid <- rep(1:N, each = n_days) # participant id

sim_dat <- map_dfr(.x = 1:(N * n_days), ~data.frame(
mu = x[.x, 1] * beta_p[pid[.x], 1] + x[.x, 2] * beta_p[pid[.x], 2],
pid = pid[.x],
x = x[.x, 2]
))

sim_dat$y <- rnorm(210, sim_dat$mu, sigma) # creating observed y from mu and sigma

dat <- sim_dat %>%
select(-mu) %>% # removing mu
mutate(y = real_mean_y + (y * real_sd_y), # unstandardize
x = real_mean_x + (x * real_sd_x)) # unstandardize

stan_dat3 <- list(
N_obs = nrow(dat),
N_pts = max(as.numeric(dat$pid)), K = 2, # intercept + slope pid = as.numeric(dat$pid),
x = matrix(c(rep(1, nrow(dat)), (dat$x - mean(dat$x)) / sd(dat$x)), ncol = 2), # z-score for x y = (dat$y - mean(dat$y)) / sd(dat$y) # z-score for y
)



and here is the stan model saved as mod_c.stan:

data {
int N_obs; // number of observations
int N_pts; // number of participants
int K; // number of predictors + intercept
int pid[N_obs]; // participant id vector
matrix[N_obs, K] x; // matrix of predictors
real y[N_obs]; // y vector
}

parameters {
matrix[K, N_pts] z_p; // matrix of intercepts and slope
vector<lower=0>[K] sigma_p; // sd for intercept and slope
vector[K] beta; // intercept and slope hyper-priors
cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
real<lower=0> sigma; // population sigma
}

transformed parameters {
matrix[K, N_pts] z; // non-centered version of beta_p
z = diag_pre_multiply(sigma_p, L_p) * z_p;
}

model {
vector[N_obs] mu;

// priors
beta ~ normal(0, 1);
sigma ~ exponential(1);
sigma_p ~ exponential(1);
L_p ~ lkj_corr_cholesky(2);
to_vector(z_p) ~ normal(0, 1);

// likelihood
for(i in 1:N_obs) {
mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
}
y ~ normal(mu, sigma);
}

generated quantities {
matrix[2, 2] Omega;
Omega = multiply_lower_tri_self_transpose(L_p);
}


I used this to run the model

stan_fit3 <- stan(file = "mod_c.stan",
data = stan_dat3,
chains = 4, cores = 4)

stan_fit3


and I got the same result.

Now here is my application: I divided the dataset into two parts (105 each)[this was done by plotting the dataset and using the plot to divide] and I used this model to run the first part (first data) (where my pid = rep(1:N/2, each = n_days=7) (I used N/2 =15 since I divided the data into two) ) and also run the second part. In the model, K=2, N_obs = nrow(dat)=105, N_pts = max(as.numeric(dat$pid))=15. For the first part, I got my random effect correlation to be \bigg(\begin{matrix} 1 & -0.40\\ -0.40& 1 \end{matrix}\bigg) For the second part also, I got my random effect correlation to be \bigg(\begin{matrix} 1 & -0.36\\ -0.36& 1 \end{matrix}\bigg) Now using the result from the two dataset, I simulated another dataset by using correlation 0.6 and -0.6 to join the random effect correlation from the two dataset together and this gives a 4 by 4 matrix: Like this \begin{pmatrix} 1 & 0.6&-0.40&-0.6\\ 0.6& 1&-0.6&-0.36\\ -0.4& -0.6&1&0.6\\ -0.6& -0.36&0.6&1 \end{pmatrix} In this case, my random effect follows MVNormal(c(\beta_0[1],\beta_0[2],\beta_1[1],\beta_1[2]), \Sigma), where \beta_0[1] and \beta_1[1] are from data 1, \beta_0[2] and \beta_1[2] are from data 2 and \Sigma is the sigma of the random effect after applying the correlation above. Here is my stan code using the simulated dataset: data { int N_obs; // number of observations int N_pts; // number of participants int<lower=1>J; int<lower=1,upper=J>dam[N_obs]; int K; // number of predictors + intercept int pid[N_obs]; // participant id vector matrix[N_obs, K] x; // matrix of predictors real y[N_obs]; // y vector } parameters { matrix[K, N_pts] z_p; // matrix of intercepts and slope vector<lower=0>[K] sigma_p; // sd for intercept and slope vector[K] mu1; // intercept and slope hyper-priors cholesky_factor_corr[K] L_p; // Cholesky correlation matrix vector[2] sigma; // population sigma } transformed parameters { matrix[K, N_pts] z; // non-centered version of beta_p z = diag_pre_multiply(sigma_p, L_p) * z_p; } model { vector[2] beta1; vector[2] beta2; vector[N_obs] mu; // priors mu1~ normal(0, 1); sigma ~ exponential(1); sigma_p ~ exponential(1); L_p ~ lkj_corr_cholesky(2); to_vector(z_p) ~ normal(0, 1); // likelihood for(i in 1:N_obs) { beta1[1] = mu1[1]+z[1,pid[i]]; beta1[2] = mu1[2]+z[2,pid[i]]; beta2[1] =mu1[3]+z[3,pid[i]]; beta2[2]=mu2[4]+z[4,pid[i]]; mu[i] = beta1[dam[i]]+ (beta2[dam[i]]) * x[i, 2]; } y[i] ~ normal(mu, sigma[dam[i]]); } generated quantities { matrix[4,4] Omega; Omega = multiply_lower_tri_self_transpose(L_p); }  where my Dam is a factor of (1,1,1,…,2,2,2) for “1” represent data 1 and “2” represent data 2, dam=as.integer(Dam) and J=2. Also pid = rep(1:N"=30", each =7) , K=4, N_obs = nrow(dat)=210, N_pts = max(as.numeric(dat$pid))=30.

Using my simulated dataset, I expect my code to capture the 0.6 and -0.6 correlation but I got 0.002 and -0.04 instead. Is there anything I’m not doing right? Please help.

Thank you

1 Like

Hi,
sorry for not getting to you earlier. It appears that the second model is only ever using a single datapoint:

y[i] ~ normal(mu, sigma[dam[i]]);


should probably have been inside the for loop. It is actually surprising that the model compiles at all (actually, if ti compiles it might be a bug in Stan compiler).

Since you seem to not be using almost any data to fit the model, the correlations would be expected stay almost at their prior location, which is close to 0.

Does that make sense? And does the problem disappear after fixing the likelihood?

Thank you so much for replying. I corrected the loop so that y[i] will be inside the loop. That makes sense to me now. I noticed that after correcting the error, my correlation was reduced. For example, where I’m expecting the correlation to be 0.9, I got 0.7.

Why are you expecting it to be 0.9? It is because you simulated the data with correlation of 0.9? It is also usually better to look at some wider posterior intervals than just at point estimates. Note that the LKJ prior actually puts quite a lot of prior probability on low correlations, so maybe this is to be an expected amount of shrinkage? The best way to find out is to test with data simulated exactly from the prior, so instead of working with a fixed correlation matrix, you would draw the matrix from the LKJ(2) distribution and see if your posterior intervals cover the true values well…

Does that make sense?

1 Like

Yes it does. Thank you so much for your reply. I’m so grateful.