# Linear multivariate model

Hi all, I am trying to specify a linear multivariate model and I’m having some difficulties: 1) I’m not recovering parameters correctly from a simulated data set, and 2) I need to speed it up for my real data.

First, here is the simulated data script: gen_test_data_y2_x3.R (530 Bytes)
And here is the script to call Stan from R: run_test_model.R (311 Bytes)

My model is based on section 9.15 of the Stan manual (saved as `linear_multivariate2.stan`):

```
data {
  int<lower=1> NvarsY;
  int<lower=1> NvarsX;
  int<lower=0> N;
  vector[NvarsX] x[N];
  vector[NvarsY] y[N];
}

parameters {
  matrix[NvarsY, NvarsX] beta;
  cholesky_factor_corr[NvarsY] L_Omega;
  vector<lower=0>[NvarsY] L_sigma;
}

model {
  vector[NvarsY] mu[N];
  matrix[NvarsY, NvarsY] L_Sigma;

  // Priors
  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  to_vector(beta) ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(4);
  L_sigma ~ cauchy(0, 2.5);

  // Likelihood: one mean vector per row, then a vectorized sampling statement
  for (n in 1:N)
    mu[n] = beta * x[n];

  y ~ multi_normal_cholesky(mu, L_Sigma);
}
```

My recovered parameters from running this were:

```
> summary(params$beta[,1,])
       V1               V2                 V3
 Min.   :-3.953   Min.   :-18.506   Min.   :-13.2165
 1st Qu.: 3.760   1st Qu.: -4.397   1st Qu.: -0.1258
 Median : 5.483   Median : -1.051   Median :  2.8229
 Mean   : 5.465   Mean   : -1.053   Mean   :  2.8344
 3rd Qu.: 7.095   3rd Qu.:  2.287   3rd Qu.:  5.7141
 Max.   :14.481   Max.   : 18.401   Max.   : 20.0149
> summary(params$beta[,2,])
       V1                V2                V3
 Min.   : 0.7665   Min.   :-19.278   Min.   :-11.595
 1st Qu.: 9.0719   1st Qu.: -5.228   1st Qu.:  2.816
 Median :10.7551   Median : -2.049   Median :  5.885
 Mean   :10.7435   Mean   : -2.046   Mean   :  5.835
 3rd Qu.:12.4121   3rd Qu.:  1.115   3rd Qu.:  8.754
 Max.   :20.2864   Max.   : 15.609   Max.   : 24.415
```

… which don’t match the correlations I specified in the generated data.

My second problem is that for my real dataset (~2000 rows, 23 Y variables, 40 X variables) it takes about 24 hours to run. Perhaps I’m simply expecting too much given the dimensionality of the data and the modest number of rows. I could conceivably reduce the number of X variables based on prior knowledge, but there would still be perhaps 20 X variables, and the number of Y variables is fixed.

This might be because you are not simulating directly from your model: in general, you should simulate by first drawing parameters from the priors (e.g. `beta`, `L_Omega` and `L_sigma`) and then follow your model exactly (e.g. calculate `mu` from `beta` and `x`). It is quite possible that the `sigma` or `mu` you hardcoded in your simulation script have very low prior mass and thus cannot be recovered. Simulating from the prior also helps you understand the implications of your prior and check that it makes sense (as in the Visualisation paper).
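To illustrate, here is a minimal sketch of that prior-predictive workflow (in Python rather than R, with hypothetical dimensions; for brevity the correlation matrix is fixed to the identity, where a full check would also draw it from the LKJ(4) prior):

```python
import numpy as np

rng = np.random.default_rng(1)
N, NvarsX, NvarsY = 50, 3, 2

# Hypothetical design matrix; in practice use your real x.
x = rng.normal(size=(N, NvarsX))

# Draw parameters from the priors used in the Stan model.
beta = rng.normal(0.0, 5.0, size=(NvarsY, NvarsX))       # normal(0, 5)
sigma = np.abs(rng.standard_cauchy(NvarsY) * 2.5)        # half-Cauchy(0, 2.5)
# Simplification: identity correlation instead of an LKJ(4) draw.
Sigma = np.diag(sigma) @ np.eye(NvarsY) @ np.diag(sigma)

# Simulate y exactly as the model's likelihood does: mu = beta * x[n].
mu = x @ beta.T                                          # shape (N, NvarsY)
y = np.stack([rng.multivariate_normal(mu[n], Sigma) for n in range(N)])
```

Fitting the Stan model to `y` generated this way should recover `beta` and `sigma` (up to posterior uncertainty), which is the check that was failing with the hand-picked values.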


Hi @martinmodrak - thanks for your answer. I actually figured this out yesterday by using the model from this thread, with the subject-specific terms removed: Divergent transitions in a multivariate multilevel regression model

Also, in the simulation R script I had set `empirical = TRUE` in the `mvrnorm` call.
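For reference, `MASS::mvrnorm(..., empirical = TRUE)` rescales the draws so that the *sample* mean and covariance match the requested ones exactly, rather than only in expectation. A sketch of the same idea in Python (with made-up `mu` and `Sigma`):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# Raw draws: sample moments only approximate mu and Sigma.
X = rng.multivariate_normal(mu, Sigma, size=n)

# Rescale so the sample mean and covariance match exactly
# (the idea behind mvrnorm's empirical = TRUE).
Xc = X - X.mean(axis=0)                 # exact zero sample mean
S = np.cov(Xc, rowvar=False)            # current sample covariance
L_s = np.linalg.cholesky(S)             # whitening factor
L_t = np.linalg.cholesky(Sigma)         # target factor
X_emp = Xc @ np.linalg.inv(L_s).T @ L_t.T + mu
```

With `empirical = TRUE` the generated data set has exactly the specified moments, which is convenient when checking whether the fitted coefficients land near the values used in the simulation.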

The new model I’m using is below:

```
data {
  int<lower=0> NvarsX;  // number of independent variables
  int<lower=0> NvarsY;  // number of dependent variables
  int<lower=0> N;       // total number of rows
  matrix[N, NvarsX] x;
  vector[NvarsY] y[N];
}

parameters {
  cholesky_factor_corr[NvarsY] L_Omega; // correlations between outcome variables
  vector<lower=0>[NvarsY] L_sigma;      // residual SDs of outcome variables

  vector[NvarsY] Beta0;        // intercepts
  vector[NvarsX] Beta[NvarsY]; // slopes
}

transformed parameters {
  vector[NvarsY] mu[N];
  matrix[NvarsY, NvarsY] L_Sigma;

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

  for (i in 1:N)
    for (dv in 1:NvarsY)
      mu[i, dv] = Beta0[dv] + dot_product(Beta[dv], x[i]);
}

model {
  // Priors
  L_Omega ~ lkj_corr_cholesky(1);
  L_sigma ~ cauchy(0, 5);

  // Likelihood
  for (i in 1:N)
    y[i] ~ multi_normal_cholesky(mu[i], L_Sigma);
}

generated quantities {
  matrix[NvarsY, NvarsY] Omega;
  matrix[NvarsY, NvarsY] Sigma;
  Omega = multiply_lower_tri_self_transpose(L_Omega);