Learning how to write multivariate outcomes models

I’m trying to learn how to fit a multivariate outcomes model. I read section 9.15 from the stan manual, and I have a question about how to include different covariates for different outcomes. For example, suppose my two outcomes are y1 and y2 and the underlying data generating process is the following:

Generate some fake data

set.seed(212018)
rbvn <- function (n, m1, s1, m2, s2, rho) {
   X1 <- rnorm(n, mu1, s1)
   X2 <- rnorm(n, mu2 + (s2/s1) * rho * 
                 (X1 - mu1), sqrt((1 - rho^2)*s2^2))
   cbind(X1, X2)
}

N <- 100
mu1 <- 0
s1 <- 1
mu2 <- 0
s2 <- 1
rho <- 0.5

e <- rbvn(N,mu1,s1,mu2,s2,rho)

x1 <- rnorm(N,0,1)
x2 <- rnorm(N,0,1)
x3 <- rnorm(N,0,1)
t <- sample(x = c(0,1), size=N, replace = T)

y1 <- 0.5 + 2.12*x1 + 5.78*x3 + 9.15*t + e[,1]
y2 <- 2 + 7.82*x1 + 4.21*x2 + 3.57*t + e[,2]

Note that y2 is not a function of x3 and y1 is not a function of x2. If I ignore this fact, I can fit this model using the example code from the manual:


data {
  int<lower=1> K;
  int<lower=1> J;
  int<lower=0> N;
  vector[J] x[N];
  vector[K] y[N];
}

parameters{
  matrix[K,J] beta;
  cov_matrix[K] Sigma;
}

model {
  vector[K] mu[N];
  for (n in 1:N)
    mu[n] = beta * x[n];
  y ~ multi_normal(mu,Sigma);
}

I’m trying to change the code to incorporate the fact that the two outcomes should have a different set of covariates. I tried this:


data {
  int<lower=1> J;
  int<lower=0> N;
  vector[J] x[N];
  vector[1] xtwo[N];
  vector[1] xtrhee[N];
  vector[2] y[N];
}

parameters{
  vector[J] beta_one;
  vector[J] beta_two;
  vector[1] gamma_one;
  vector[1] gamma_two;
  cov_matrix[2] Sigma;
}

model {
  vector[2] mu[N];
  for(i in 1:N){
    mu[1,i] = dot_product(x[i], beta_one)+ dot_product(xtrhee[i], gamma_one);
    mu[2,i] = dot_product(x[i], beta_two)+ dot_product(xtwo[i], gamma_two);
  }
  y ~ multi_normal(mu,Sigma);
}

But when I try to fit it I get an error:

library(rstan)

df <- data.frame(x1=x1,x2=x2,x3=x3,t=t,y1=y1,y2=y2)

fml <- as.formula(y1 ~ x1+x2+x3+t)

model_data <- model.matrix(data = df, fml)

outcomes <- data.frame(y1=y1,y2=y2)

stan_dat <- list(J = ncol(model_data[,c(1,2,5)]),
                 N = 100,
                 x = as.matrix(model_data[,c(1,2,5)]),
                 xtwo = as.matrix(df$x2),
                 xtrhee = as.matrix(df$x3),
                 y = as.matrix(outcomes))

fit2 <- sampling(model2, data = stan_dat) 

print(fit2)

SAMPLING FOR MODEL 'stan-5f8a22861036' NOW (CHAIN 1).
Unrecoverable error evaluating the log probability at the initial value.
Exception: []: accessing element out of range. index 3 out of range; expecting index to be between 1 and 2; index position = 2mu  (in 'model5f8a1432a18b_stan_5f8a22861036' at line 22)

[1] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                                                         
[2] "  Exception: []: accessing element out of range. index 3 out of range; expecting index to be between 1 and 2; index position = 2mu  (in 'model5f8a1432a18b_stan_5f8a22861036' at line 22)"
error occurred during calling the sampler; sampling not done

My guess is that my syntax for mu is wrong, but i tried a few different things that did not work either.

What is the right way of doing this?

Thanks!

To answer my own question, this code works if I change mu[1,i] to mu[i,1].

That said, my guess is that I’m not following best practices. Any and all comments about how to improve the code would be highly appreciated

Whenever you see counting numbers in variable names, you know there should be arrays instead, as in,

vector[2, J] beta;

There are much better parameterizations than a covariance matrix. The manual chapter on regression explains how to factor into a Cholesky factor of a correlation matrix and a scale vector. That’s much more efficient than having a direct covariance matrix.

In two dimensions, it all works out very easily as the entries are just sigma[1] * sigma[2] * rho, sigma[1]^2 and sigma[2]^2

That should just be a scalar.

You want to just write each regression out separately using a matrix of predictors. Select that matrix out of a single data matrix x with something like x[ , { 2, 3, 5 }] where that collection of indexes picks out the columns to save. Doing that once and for all in the transformed data block and assigning to a new matrix of appropriate dimension is easiest. Then the parameter vectors probably won’t be the same length.

You can then write the multivariate normal by appending the two vectors of predictions to form mu.

1 Like