How to consider repeated measures in mixture regression model?

Hi everyone,

I’m working on a mixture regression model but I really don’t know how to consider a sample with repeated measures. We have data for I subjects and T measures per subject , resulting in a total sample of N observations.

I can easily run the model for the individual observation case, but how can I consider the repeated measures? So, of course, I want to take into account, that the whole sequence of the T measures for subject i comes from mixture component g instead of each single observation.

Some simulated data would be:

i <- 20 #number of subjects
t <- 10 #number of measures (time points) per subject
k <- 10 #number of coefficients

# True coefficients of class 1 

alpha1 <- -2 #true intercept
beta1 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma1  <- 0.1 #true sigma

# True coefficients of class 2

alpha2 <- 2 #true intercept
beta2 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma2 <- 0.2 #true sigma

# This generates some values for the predictors

X <- matrix(rnorm(i*t*k), nrow = i*t)

# This generates the index for each subject and the index for each measure per subject 
id <- rep(1:i, each=t)
time <- rep(1:t, i)

# This generates our outcomes for the 2 classes based on 50% of the observations
Y1 <- alpha1 + X[which(id<(i/2+1)),]%*%beta1 + rnorm(i*t/2, 0, sigma1)

Y2 <- alpha2 + X[which(id>(i/2)),]%*%beta2 + rnorm(i*t/2, 0, sigma2)

#This combines the outcomes for the 2 classes into one vector
Y <- c(Y1,Y2)

This runs the model looping over each observation without considering the repeated measures:

stan_code_regression="data {
  int<lower=1> N;  // number of observations
  int<lower=1> I_id;  // number of observations
  int<lower=1> T_id;  // number of observations
  int<lower=1> id[N];  // id
  int<lower=1> time[N];  // time
  int<lower=1> K;  // number of coefficients
  vector[N] Y;  // response variable
  matrix[N,K] X;  // predictor matrix
  int<lower=1> G; // number of groups
}
parameters {
  //coefficients
  vector[K] beta [G];  // slopes
  ordered[G] alpha;  // intercept
  //scale
  vector <lower=0> [G] sigma;
  //mixture proportion
  simplex[G] theta;
}
model {
  vector[G] ps;
  
  for (n in 1:N) {
    ps[1] = log(theta[1]) + normal_lpdf(Y[n] | alpha[1] + X[n]*beta[1],sigma[1]);
    ps[2] = log(theta[2]) + normal_lpdf(Y[n] | alpha[2] + X[n]*beta[2],sigma[2]);


    target += log_sum_exp(ps);
  }
  for (g in 1:G) {
  beta[g] ~ normal(0, 1);
  alpha[g] ~ normal(0, 5);
  sigma[g] ~ inv_gamma(3, 1);
  }
  theta ~ dirichlet(rep_vector(10,G));
}
"
standata = list(N = NROW(X),
                I_id = i,
                T_id = t,
                id = id,
                time = time,
                Y = as.numeric(Y),
                K = k,
                X = X,
                G = 2)
fit_regression = stan(model_code = stan_code_regression,
                      data = standata,
                      chains = 1,
                      cores = 4,
                      iter = 2000,
                      control=list(adapt_delta=0.95, max_treedepth=15))
Inference for Stan model: 805be154d6c727012fec5299c4409018.
1 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

             mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1,1]    0.45    0.00 0.01   0.43   0.44   0.45   0.46   0.47  1929    1
beta[1,2]   -0.84    0.00 0.01  -0.87  -0.85  -0.84  -0.84  -0.82  1659    1
beta[1,3]   -0.76    0.00 0.01  -0.78  -0.77  -0.76  -0.75  -0.74  1372    1
beta[1,4]   -0.19    0.00 0.01  -0.21  -0.20  -0.19  -0.18  -0.17  1635    1
beta[1,5]   -0.91    0.00 0.01  -0.93  -0.92  -0.91  -0.90  -0.89  1516    1
beta[1,6]   -1.15    0.00 0.01  -1.17  -1.16  -1.15  -1.14  -1.13  1099    1
beta[1,7]   -0.04    0.00 0.01  -0.06  -0.05  -0.04  -0.03  -0.02  1398    1
beta[1,8]    0.63    0.00 0.01   0.61   0.63   0.63   0.64   0.65  1678    1
beta[1,9]    0.42    0.00 0.01   0.40   0.41   0.42   0.43   0.44  1682    1
beta[1,10]   0.53    0.00 0.01   0.52   0.53   0.53   0.54   0.55  1774    1
beta[2,1]    0.71    0.00 0.02   0.67   0.69   0.71   0.72   0.75  1461    1
beta[2,2]    0.18    0.00 0.02   0.13   0.16   0.18   0.19   0.22  1413    1
beta[2,3]   -0.03    0.00 0.02  -0.07  -0.04  -0.03  -0.01   0.02  1720    1
beta[2,4]   -0.27    0.00 0.02  -0.32  -0.29  -0.27  -0.25  -0.22  1512    1
beta[2,5]   -0.05    0.00 0.02  -0.10  -0.07  -0.05  -0.04  -0.01  2206    1
beta[2,6]    0.79    0.00 0.02   0.74   0.78   0.79   0.81   0.84  1688    1
beta[2,7]   -0.02    0.00 0.02  -0.07  -0.03  -0.02   0.00   0.03  1646    1
beta[2,8]   -0.01    0.00 0.02  -0.06  -0.03  -0.01   0.01   0.04  2354    1
beta[2,9]    0.07    0.00 0.02   0.02   0.05   0.07   0.08   0.11  1975    1
beta[2,10]  -1.39    0.00 0.02  -1.44  -1.41  -1.39  -1.37  -1.34  1420    1
alpha[1]    -1.99    0.00 0.01  -2.01  -2.00  -1.99  -1.99  -1.97  1901    1
alpha[2]     2.03    0.00 0.02   1.98   2.01   2.03   2.04   2.07  1415    1
sigma[1]     0.10    0.00 0.01   0.08   0.09   0.09   0.10   0.11  1100    1
sigma[2]     0.21    0.00 0.02   0.18   0.20   0.21   0.22   0.25  1349    1
theta[1]     0.50    0.00 0.03   0.44   0.48   0.50   0.52   0.56  1758    1
theta[2]     0.50    0.00 0.03   0.44   0.48   0.50   0.52   0.56  1758    1
lp__       -44.68    0.19 3.87 -53.47 -46.98 -44.29 -41.84 -38.55   426    1

Samples were drawn using NUTS(diag_e) at Sat Aug  1 01:49:42 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

But how do I loop over the subjects and ensure that each subject including all its measures is assigned to one of the components?

I already tried something like the following, which results in wrong estimates:

...
model {
  vector[G] ps;
  
  for (n in 1:N) {
    ps[1] = log(theta[1]) + normal_lpdf(Y[id[n]] | alpha[1] + X[id[n]]*beta[1],sigma[1]);
    ps[2] = log(theta[2]) + normal_lpdf(Y[id[n]] | alpha[2] + X[id[n]]*beta[2],sigma[2]);
    target += log_sum_exp(ps);
  }

Can someone help me with this? Thanks in advance! :)

Is there an assumption of ordering or change in what is measured by those repeated measures? Or are they several measurements of the same quantity? Are you able to say something about what kind of data you are modelling?

Thanks for your reply and help! We are observing 20 brands and 200 weeks for each brand. For each data point we observe the same variables. So, our data structure looks like the following

 brand week       y          X1           X2
    1    1   0.36483365 -0.47932128 -0.418754663
    1    2   6.94543117 -0.44043776  1.084886116
    1    3  -1.39903123  0.04591838 -1.139041751
    1    4   4.42756239 -0.27425924  0.613478589
    1    ... 0.50578554 -1.86046577  0.335184074
    1    197 8.41049729  1.77183586  0.431087334 
    1    198 1.26711632 -0.41893649 -0.029954152
    1    199 1.55176065 -1.72004195  0.775841769
    1    200 3.22550655 -1.07989198  0.298175355
    2    1   7.88291164  0.01570205  0.714675222
    2    2   5.32624061  0.24111697 -0.085391985
    2    3   5.83939980  0.80518412 -0.296355431
    2    4   3.10174506  1.29220384 -0.818626214
    2    ... 12.88680907 2.00759163  0.602590042
    2    197 19.45547685 2.64466520  1.979154409
    2    198 -2.30610403 0.44491581 -0.976861594
    2    199 4.00729995  0.88807782 -0.673091182
    2    200 7.26248355  0.32420348  0.407071464

The ordering of the weeks is indeed important, because we are using stock variables as independent variables that cumulate over time. So it is important for us to (1) keep the ordering of the time series for each brand and (2) to ensure that the whole sequence of 200 observations for brand i is assigned to a certain mixture component instead of each single observation.

Thanks again!

Don’t know how much I can actually be of help, but I can at least ask some questions that might make the problem more clear to others (and perhaps even to you as well)? Feel free to say so if it’s not helpful!

So you are observing three variables over time for 20 brands. Looking at your code, am I right to assume that what you want to model is the assumption that these 20 brands belong to two distinct groups, where the relationship between those variables differ between those groups? And you would like to estimate the parameters of those relationships? Or are you primarily interested in the categorisation of the brands?

What is the assumed relationship between time and the other variables? As your model is coded now, the observations are assumed to be independent (as you point out in your initial post). If all you want to model is the interdependence of the measurements, could you perhaps add a varying intercept, something like:

parameters{

   vector[N] alpha_N;

model {
    ps[1] = log(theta[1]) + normal_lpdf(Y[n] | alpha[1] + alpha_N[n] + X[n]*beta[1],sigma[1]);
    ps[2] = log(theta[2]) + normal_lpdf(Y[n] | alpha[2] + alpha_N[n] + X[n]*beta[2],sigma[2]);
}

But as it stands, your model doesn’t use the information in the time variable, does it? Is that intentional? Or am I missing something?

Exactly. We are interested in the relationship between an outcome and a set of predictors. We assume that there are certain categories of brands (in the example above, two different categories) with category specific regression parameters \alpha and \beta . But of course, we are also interested in the categorisation, i.e. what is the probability that brand i comes from component g. I know how to calculate the probabilities and class indexes in the generated quantities block, but again only for the independent observation case.

Maybe using the flexmix package in R can help to make it more clear what I want to achieve.

First, let’s create some data with the toy data code above:

i <- 20 #number of subjects
t <- 10 #number of measures (time points) per subject
k <- 10 #number of coefficients

# True coefficients of class 1 

alpha1 <- -2 #true intercept
beta1 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma1  <- 0.1 #true sigma

# True coefficients of class 2

alpha2 <- 2 #true intercept
beta2 <- round(rnorm(k,0,0.5),digits=2) #true slopes
sigma2 <- 0.2 #true sigma

# This generates some values for the predictors

X <- matrix(rnorm(i*t*k), nrow = i*t)

# This generates the index for each subject and the index for each measure per subject 
id <- rep(1:i, each=t)
time <- rep(1:t, i)

# This generates our outcomes for the 2 classes based on 50% of the observations
Y1 <- alpha1 + X[which(id<(i/2+1)),]%*%beta1 + rnorm(i*t/2, 0, sigma1)

Y2 <- alpha2 + X[which(id>(i/2)),]%*%beta2 + rnorm(i*t/2, 0, sigma2)

#This combines the outcomes for the 2 classes into one vector
Y <- c(Y1,Y2)

data <- data.frame(Y,X,id)

Now I can run the model using flexmix:

library(flexmix)

model.independent.observations <- flexmix(Y~X1+
                                            X2+
                                            X3+
                                            X4+
                                            X5+
                                            X6+
                                            X7+
                                            X8+
                                            X9+
                                            X10, data = data, k = 2)

summary(refit(model.independent.observations))

$Comp.1
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)  2.0076580  0.0234393  85.6533   <2e-16 ***
X1           0.5303711  0.0232263  22.8349   <2e-16 ***
X2           0.2716315  0.0239054  11.3628   <2e-16 ***
X3           0.3645439  0.0222966  16.3498   <2e-16 ***
X4          -0.0057822  0.0231147  -0.2502   0.8025    
X5          -0.2327022  0.0227441 -10.2313   <2e-16 ***
X6          -0.3674968  0.0221378 -16.6004   <2e-16 ***
X7          -0.4727714  0.0227476 -20.7834   <2e-16 ***
X8           0.3532774  0.0237987  14.8444   <2e-16 ***
X9           1.1991413  0.0249789  48.0061   <2e-16 ***
X10          0.8836317  0.0234765  37.6389   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$Comp.2
              Estimate Std. Error   z value  Pr(>|z|)    
(Intercept) -2.0213783  0.0082375 -245.3861 < 2.2e-16 ***
X1          -0.1697561  0.0076671  -22.1409 < 2.2e-16 ***
X2           0.3996527  0.0083589   47.8117 < 2.2e-16 ***
X3          -0.4719784  0.0083913  -56.2463 < 2.2e-16 ***
X4          -0.0192435  0.0090212   -2.1331   0.03291 *  
X5          -0.9426009  0.0088545 -106.4550 < 2.2e-16 ***
X6           0.0163882  0.0091558    1.7899   0.07346 .  
X7          -0.7414705  0.0089829  -82.5420 < 2.2e-16 ***
X8           0.0638086  0.0079456    8.0307 9.695e-16 ***
X9          -1.0905444  0.0074149 -147.0748 < 2.2e-16 ***
X10         -1.4072041  0.0094151 -149.4619 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now we can check the assignment to the different components for each observation:

model.independent.observations@cluster

  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [47] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [93] 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[185] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2

Due to the way how I created the data of course most of the observations have be assigned to the components correctly anyway but we see that there are some observations that were assigned to one component although all other observations of the same subject (i.e. brand) have been assigned to the other component, e.g., the last observation.

In flexmix, we can easily force the assignment to the clusters to be subject-wise instead of observation-wise by using:

model.nested.observations <- flexmix(Y~X1+
                                            X2+
                                            X3+
                                            X4+
                                            X5+
                                            X6+
                                            X7+
                                            X8+
                                            X9+
                                            X10|id, data = data, k = 2)
#adding |id forces all T observations for subject i to be assigned to the same component

summary(refit(model.nested.observations))

$Comp.1
              Estimate Std. Error   z value  Pr(>|z|)    
(Intercept) -2.0222416  0.0082498 -245.1260 < 2.2e-16 ***
X1          -0.1690278  0.0076626  -22.0588 < 2.2e-16 ***
X2           0.4014024  0.0083343   48.1627 < 2.2e-16 ***
X3          -0.4728288  0.0083994  -56.2934 < 2.2e-16 ***
X4          -0.0192878  0.0089666   -2.1511   0.03147 *  
X5          -0.9432458  0.0087659 -107.6035 < 2.2e-16 ***
X6           0.0153278  0.0090034    1.7024   0.08867 .  
X7          -0.7413475  0.0090338  -82.0634 < 2.2e-16 ***
X8           0.0639670  0.0079413    8.0550 7.948e-16 ***
X9          -1.0898351  0.0074074 -147.1271 < 2.2e-16 ***
X10         -1.4069547  0.0093554 -150.3901 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$Comp.2
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)  2.0072776  0.0231342  86.7665   <2e-16 ***
X1           0.5314049  0.0228681  23.2378   <2e-16 ***
X2           0.2727254  0.0231249  11.7936   <2e-16 ***
X3           0.3642449  0.0220335  16.5314   <2e-16 ***
X4          -0.0054475  0.0227547  -0.2394   0.8108    
X5          -0.2326400  0.0223491 -10.4094   <2e-16 ***
X6          -0.3678728  0.0217820 -16.8889   <2e-16 ***
X7          -0.4725342  0.0224242 -21.0725   <2e-16 ***
X8           0.3526268  0.0234742  15.0219   <2e-16 ***
X9           1.2004681  0.0244864  49.0259   <2e-16 ***
X10          0.8833220  0.0232137  38.0517   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model.nested.observations@cluster

  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [47] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [93] 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[185] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

This might be completely off, so take it as a suggestion more than an attempted answer. As you want to model the brand-wise probability of category membership, could a solution be to have a N-length vector of probabilities for belonging to category 1 (taking the role of the theta simplex in your current code), and then use overall proportion of category 1 to inform a hyperparameter for the priors on those probabilities? And then just use the inverse in the mixture proportion for category 2?

It just seems to me that as it stands now, you are only modelling the overall probability of category membership per observation, but in order to model a hierarchical structure, you’d need a hierarchical parameter. But then again, I’m only making a suggestion, I’m not sure this is viable.

Also, I’m still not getting what the time variable is doing in your model? It only features in the data block?