Hi everyone,

I’m working on a mixture regression model, but I don’t know how to handle a sample with repeated measures. We have data for I subjects and T measures per subject, resulting in a total sample of N = I·T observations.

I can easily run the model for the individual-observation case, but how can I account for the repeated measures? That is, I want to take into account that the whole sequence of T measures for subject i comes from mixture component g, rather than treating each single observation separately.
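If I write it down, I believe the per-subject marginal likelihood I’m after is (hoping my notation is right)

$$
\log p(\mathbf{y}_i) = \log \sum_{g=1}^{G} \theta_g \prod_{t=1}^{T} \mathcal{N}\!\left(y_{it} \mid \alpha_g + \mathbf{x}_{it}^{\top} \boldsymbol{\beta}_g,\; \sigma_g\right),
$$

i.e. the component membership is shared by all T measures of subject $i$, instead of the per-observation version $\log p(y_{it}) = \log \sum_{g} \theta_g\, \mathcal{N}(y_{it} \mid \alpha_g + \mathbf{x}_{it}^{\top} \boldsymbol{\beta}_g, \sigma_g)$.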

Some simulated data would be:

```
i <- 20 #number of subjects
t <- 10 #number of measures (time points) per subject
k <- 10 #number of coefficients
# True coefficients of class 1
alpha1 <- -2 #true intercept
beta1 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma1 <- 0.1 #true sigma
# True coefficients of class 2
alpha2 <- 2 #true intercept
beta2 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma2 <- 0.2 #true sigma
# This generates some values for the predictors
X <- matrix(rnorm(i*t*k), nrow = i*t)
# This generates the index for each subject and the index for each measure per subject
id <- rep(1:i, each=t)
time <- rep(1:t, i)
# This generates our outcomes for the 2 classes based on 50% of the observations
Y1 <- alpha1 + X[which(id<(i/2+1)),]%*%beta1 + rnorm(i*t/2, 0, sigma1)
Y2 <- alpha2 + X[which(id>(i/2)),]%*%beta2 + rnorm(i*t/2, 0, sigma2)
#This combines the outcomes for the 2 classes into one vector
Y <- c(Y1,Y2)
```
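One thing I double-checked about this setup: stacking `c(Y1, Y2)` only lines up with the row order of `X` because `id` is block-sorted (all of subject 1’s rows first, then subject 2’s, and so on), so the rows with `id <= i/2` are exactly the first `i*t/2` rows. In Python terms (just a translation of the index construction above for checking, not part of the model):

```python
# Mirror the R index construction: id <- rep(1:i, each = t)
i, t = 20, 10
id_ = [s for s in range(1, i + 1) for _ in range(t)]

# Rows belonging to the first i/2 subjects (class 1 in the simulation)
first_half = [n for n, s in enumerate(id_) if s <= i // 2]

# Because id_ is block-sorted, these are exactly the first i*t/2 row indices,
# so c(Y1, Y2) matches the row order of X.
assert first_half == list(range(i * t // 2))
```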

This runs the model looping over each observation without considering the repeated measures:

```
stan_code_regression = "data {
  int<lower=1> N;        // number of observations
  int<lower=1> I_id;     // number of subjects
  int<lower=1> T_id;     // number of measures per subject
  int<lower=1> id[N];    // subject index
  int<lower=1> time[N];  // time index
  int<lower=1> K;        // number of coefficients
  vector[N] Y;           // response variable
  matrix[N, K] X;        // predictor matrix
  int<lower=1> G;        // number of groups
}
parameters {
  // coefficients
  vector[K] beta[G];     // slopes
  ordered[G] alpha;      // intercepts
  // scale
  vector<lower=0>[G] sigma;
  // mixture proportion
  simplex[G] theta;
}
model {
  vector[G] ps;
  for (n in 1:N) {
    ps[1] = log(theta[1]) + normal_lpdf(Y[n] | alpha[1] + X[n] * beta[1], sigma[1]);
    ps[2] = log(theta[2]) + normal_lpdf(Y[n] | alpha[2] + X[n] * beta[2], sigma[2]);
    target += log_sum_exp(ps);
  }
  for (g in 1:G) {
    beta[g] ~ normal(0, 1);
    alpha[g] ~ normal(0, 5);
    sigma[g] ~ inv_gamma(3, 1);
  }
  theta ~ dirichlet(rep_vector(10, G));
}
"
```

```
standata = list(N = NROW(X),
I_id = i,
T_id = t,
id = id,
time = time,
Y = as.numeric(Y),
K = k,
X = X,
G = 2)
fit_regression = stan(model_code = stan_code_regression,
data = standata,
chains = 1,
cores = 4,
iter = 2000,
control=list(adapt_delta=0.95, max_treedepth=15))
```

```
Inference for Stan model: 805be154d6c727012fec5299c4409018.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1,1] 0.45 0.00 0.01 0.43 0.44 0.45 0.46 0.47 1929 1
beta[1,2] -0.84 0.00 0.01 -0.87 -0.85 -0.84 -0.84 -0.82 1659 1
beta[1,3] -0.76 0.00 0.01 -0.78 -0.77 -0.76 -0.75 -0.74 1372 1
beta[1,4] -0.19 0.00 0.01 -0.21 -0.20 -0.19 -0.18 -0.17 1635 1
beta[1,5] -0.91 0.00 0.01 -0.93 -0.92 -0.91 -0.90 -0.89 1516 1
beta[1,6] -1.15 0.00 0.01 -1.17 -1.16 -1.15 -1.14 -1.13 1099 1
beta[1,7] -0.04 0.00 0.01 -0.06 -0.05 -0.04 -0.03 -0.02 1398 1
beta[1,8] 0.63 0.00 0.01 0.61 0.63 0.63 0.64 0.65 1678 1
beta[1,9] 0.42 0.00 0.01 0.40 0.41 0.42 0.43 0.44 1682 1
beta[1,10] 0.53 0.00 0.01 0.52 0.53 0.53 0.54 0.55 1774 1
beta[2,1] 0.71 0.00 0.02 0.67 0.69 0.71 0.72 0.75 1461 1
beta[2,2] 0.18 0.00 0.02 0.13 0.16 0.18 0.19 0.22 1413 1
beta[2,3] -0.03 0.00 0.02 -0.07 -0.04 -0.03 -0.01 0.02 1720 1
beta[2,4] -0.27 0.00 0.02 -0.32 -0.29 -0.27 -0.25 -0.22 1512 1
beta[2,5] -0.05 0.00 0.02 -0.10 -0.07 -0.05 -0.04 -0.01 2206 1
beta[2,6] 0.79 0.00 0.02 0.74 0.78 0.79 0.81 0.84 1688 1
beta[2,7] -0.02 0.00 0.02 -0.07 -0.03 -0.02 0.00 0.03 1646 1
beta[2,8] -0.01 0.00 0.02 -0.06 -0.03 -0.01 0.01 0.04 2354 1
beta[2,9] 0.07 0.00 0.02 0.02 0.05 0.07 0.08 0.11 1975 1
beta[2,10] -1.39 0.00 0.02 -1.44 -1.41 -1.39 -1.37 -1.34 1420 1
alpha[1] -1.99 0.00 0.01 -2.01 -2.00 -1.99 -1.99 -1.97 1901 1
alpha[2] 2.03 0.00 0.02 1.98 2.01 2.03 2.04 2.07 1415 1
sigma[1] 0.10 0.00 0.01 0.08 0.09 0.09 0.10 0.11 1100 1
sigma[2] 0.21 0.00 0.02 0.18 0.20 0.21 0.22 0.25 1349 1
theta[1] 0.50 0.00 0.03 0.44 0.48 0.50 0.52 0.56 1758 1
theta[2] 0.50 0.00 0.03 0.44 0.48 0.50 0.52 0.56 1758 1
lp__ -44.68 0.19 3.87 -53.47 -46.98 -44.29 -41.84 -38.55 426 1
Samples were drawn using NUTS(diag_e) at Sat Aug 1 01:49:42 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

But how do I loop over the subjects and ensure that each subject, with all of its measures, is assigned as a whole to one of the components?

I already tried something like the following, which results in wrong estimates:

```
...
model {
  vector[G] ps;
  for (n in 1:N) {
    ps[1] = log(theta[1]) + normal_lpdf(Y[id[n]] | alpha[1] + X[id[n]] * beta[1], sigma[1]);
    ps[2] = log(theta[2]) + normal_lpdf(Y[id[n]] | alpha[2] + X[id[n]] * beta[2], sigma[2]);
    target += log_sum_exp(ps);
  }
}
```
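To make sure I’m even asking the right thing: here is how I picture the difference between the two marginalizations in plain Python, with hypothetical toy numbers (`normal_lpdf` and `log_sum_exp` just re-implement the Stan functions; `mu` stands in for `alpha[g] + X[n]*beta[g]`):

```python
import math

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x) for x in xs))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) at y
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

theta = [0.5, 0.5]      # mixing proportions
mu = [-2.0, 2.0]        # component means (toy stand-in for alpha_g + x*beta_g)
sigma = [0.1, 0.2]

# two subjects, three measures each (toy values)
subjects = {1: [-2.1, -1.9, -2.0], 2: [2.2, 1.8, 2.1]}

# Per-observation mixture (my current model): every single measure can
# come from either component, so log_sum_exp is applied per observation.
ll_obs = sum(
    log_sum_exp([math.log(theta[g]) + normal_lpdf(y, mu[g], sigma[g])
                 for g in range(2)])
    for ys in subjects.values() for y in ys
)

# Per-subject mixture (what I want): the whole sequence of a subject shares
# one component, so the component log-densities are summed over t *inside*
# the log_sum_exp, and log_sum_exp is applied once per subject.
ll_subj = sum(
    log_sum_exp([math.log(theta[g]) + sum(normal_lpdf(y, mu[g], sigma[g]) for y in ys)
                 for g in range(2)])
    for ys in subjects.values()
)
```

With these toy numbers the two log-likelihoods differ (the per-subject version pays `log(theta[g])` once per subject instead of once per observation), which is exactly the behavior I can’t figure out how to express in the Stan loop.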

Can someone help me with this? Thanks in advance! :)