Hierarchical model with multivariate outcome that sums to 1

Hi,

I am trying to fit a hierarchical model in Stan with a multivariate outcome.

The data has a panel structure. There are N individuals, denoted i. For each i, I observe T periods, denoted t. So, observations are indexed by it. In each period t, I observe three univariate outcomes for i that sum to 1: \{y_{it1}, y_{it2}, y_{it3}\}.

The data generating process is:
y_{it1} = \mathbf{x_{it}'\beta_i} + \epsilon_{it1}
y_{it2} = \mathbf{x_{it}'\beta_i} + \epsilon_{it2}
y_{it3} = \mathbf{x_{it}'\beta_i} + \epsilon_{it3}
\epsilon_{itj} \sim_{iid} N(0, \sigma^2)
y_{it1} + y_{it2} + y_{it3} = 1

x_{it} (observed data) is 1 \times K and \beta_i (parameter to be estimated) is K \times 1. I assume \beta_i \sim MVN(\beta, \Sigma), with hyperpriors on \beta and \Sigma.

I have pasted my current Stan code below, which uses a non-centered parameterization. My question is: how can I modify it to include the y_{it1} + y_{it2} + y_{it3} = 1 constraint?
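To be concrete, the non-centered parameterization builds each individual's coefficient vector from standard-normal draws,

\beta_i = \beta + \text{diag}(\sigma_p) \, L_p \, z_{p,i}, \qquad z_{p,i} \sim N(0, I_K),

so that marginally \beta_i \sim MVN(\beta, \Sigma) with \Sigma = \text{diag}(\sigma_p) \, L_p L_p' \, \text{diag}(\sigma_p).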

data {
  int N_obs; // total number of observations
  int N_pts; // total number of i's
  int K; // number of predictors
  array[N_obs] int pid; // vector of i id's
  matrix[N_obs, K] x; // matrix of predictors
  vector[N_obs] y; // outcome vector
}

parameters {
  matrix[K, N_pts] z_p;
  vector<lower=0>[K] sigma_p; 
  vector[K] beta;
  cholesky_factor_corr[K] L_p; 
  real<lower=0> sigma;
}

transformed parameters {
  matrix[K, N_pts] z; 
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // prior
  beta ~ normal(1, 1); 
  sigma ~ normal(0, 5); 
  sigma_p ~ normal(0, 5); 
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = x[i,] * (beta + z[,pid[i]]);
  }
  y ~ normal(mu, sigma);
}

Thanks,
Kim

Your stated data generating process is not actually the data generating process, because the errors cannot be iid while respecting the sum-to-one constraint: if y_{it1} + y_{it2} + y_{it3} = 1 holds exactly, then \epsilon_{it1} + \epsilon_{it2} + \epsilon_{it3} = 1 - 3\,\mathbf{x_{it}'\beta_i} is deterministic, so the three errors cannot be independent draws. Figuring out how you actually want to specify the generating process is the key here. A first question: can the y’s take on any value, including negative values and values greater than one? Or are they a simplex?

The y's are a simplex.

Each individual i observes some information x_{it} and then reports the probabilities of three mutually exclusive and exhaustive events j \in \{1,2,3\}. These reported probabilities are y_{it1}, y_{it2}, y_{it3}. I force a person to make their reports add to 1.

The error terms are there because people’s reports are not a deterministic function of the information x_{it} that they receive. This is all I really care about capturing, and I am not sure of the right way to do it.

Cool. A second question: in your data generating process, all three of the y’s have identical expectations. Is this what you want, or do you want \beta_i to also be indexed by the third index that you apply to y and \epsilon? Note that if you want all three y_{itj} to have identical expectations, then their expectation will always be one third, regardless of the covariates, since three equal expectations that must sum to one are each 1/3.

Apologies, I should have started with more specificity than I did, because I realize I mischaracterized something.

Each period t, individual i receives information and then reports a posterior belief about each event. The three probabilities do need to sum to 1, but in the model I don’t use the raw probabilities directly (this was my oversight).

The goal is to see whether people are approximately Bayesian in how they update their beliefs. In log odds form, Bayes’ rule becomes linear: log posterior odds = log prior odds + log likelihood ratio. So, I use log posterior odds (rather than the raw probability) as the outcome, and I use the log-linear form of Bayes’ rule as the DGP. For example, y_{it1} = \mathbf{x_{it}}\beta_{i} + \epsilon_{it1} is:

log\left (\frac{P(event \: 1 | info)_{it}}{P(event \: 2 | info)_{it}} \right) = \delta_i log\left (\frac{P(event \: 1)_{i,t-1}}{P(event \: 2)_{i,t-1}} \right) + \gamma_i log\left (\frac{P(info | event \: 1)_{it}}{P(info | event \: 2)_{it}} \right) + \epsilon_{it1}

The first term is the log prior odds and the second term is the log likelihood ratio, both of which are observed data. I permit each person to incorrectly weight the prior (\delta_i) and the signal (\gamma_i). But I assume these weights are fixed within a person and come from a common MVNormal distribution.
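To spell out where the linear form comes from: for any two events, Bayes’ rule in odds form is

\frac{P(event \: 1 | info)_{it}}{P(event \: 2 | info)_{it}} = \frac{P(event \: 1)_{i,t-1}}{P(event \: 2)_{i,t-1}} \cdot \frac{P(info | event \: 1)_{it}}{P(info | event \: 2)_{it}},

and taking logs gives exactly the linear relationship above. So a perfectly Bayesian updater corresponds to \delta_i = \gamma_i = 1, and estimates of \delta_i or \gamma_i away from 1 measure over- or under-weighting of the prior and the signal.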

I believe that this code generates the above regression:

  for(i in 1:N_obs) {
    mu[i] = x[i,] * (beta + z[,pid[i]]);
  }
  y ~ normal(mu, sigma);

Again, I’m not really sure what to do about the error terms, since they cannot be fully independent across equations for a given i and t.

So, just to summarize: y_{it1}, y_{it2}, and y_{it3} are log odds ratios, and the underlying probabilities that go into calculating them must sum to 1.

I still don’t quite see all the way through what the intended data-generating process is. I suppose that at least some of the time (i.e., for some i, t), y_{it2} will differ in expectation from y_{it1}. But you write

y_{itj} = \mathbf{x_{it}'\beta_i} + \epsilon_{itj}

wherein the expectation \mathbf{x_{it}'\beta_i} has no dependence on the third index j. I don’t think this is what you intend? \mathbf{x_{it}'\beta_i} also has no dependence on the prior belief state y_{i,t-1,\{1,2,3\}}, which seems different from what you’re describing verbally.

If we can figure out the actual desired data generating process, then I’m confident we’ll be able to write down the relevant Stan model.

Yes, this is my mistake. There should be a j on x:
y_{itj} = \mathbf{x_{itj}}\beta_i + \epsilon_{itj} for j \in \{1,2,3\}.

Let’s say j=1 corresponds to the odds ratio of event 1 and event 2. Then:
y_{it1} = log\left(\frac{P(event 1)_{it}}{P(event 2)_{it}}\right)
x_{it1} = \left\{ log\left(\frac{P(event 1)_{i,t-1}}{P(event 2)_{i,t-1}}\right), log\left(\frac{P(info|event 1)_{it}}{P(info | event 2)_{it}}\right)\right\}.
So, I do mean for x_{itj}\beta_i to depend on the prior; it’s included as part of x_{itj}.

If we define j=2 and j=3 analogously, we get the following three regressions:
log\left(\frac{P(event \textcolor{red}{1})_{it}}{P(event \textcolor{red}{2})_{it}}\right) = \delta_i log\left(\frac{P(event \textcolor{red}{1})_{i,t-1}}{P(event \textcolor{red}{2})_{i,t-1}}\right) + \gamma_i log\left(\frac{P(info|event \textcolor{red}{1})_{it}}{P(info | event \textcolor{red}{2})_{it}}\right) + \epsilon_{it\textcolor{blue}{1}}

log\left(\frac{P(event \textcolor{red}{2})_{it}}{P(event \textcolor{red}{3})_{it}}\right) = \delta_i log\left(\frac{P(event \textcolor{red}{2})_{i,t-1}}{P(event \textcolor{red}{3})_{i,t-1}}\right) + \gamma_i log\left(\frac{P(info|event \textcolor{red}{2})_{it}}{P(info | event \textcolor{red}{3})_{it}}\right) + \epsilon_{it\textcolor{blue}{2}}

log\left(\frac{P(event \textcolor{red}{1})_{it}}{P(event \textcolor{red}{3})_{it}}\right) = \delta_i log\left(\frac{P(event \textcolor{red}{1})_{i,t-1}}{P(event \textcolor{red}{3})_{i,t-1}}\right) + \gamma_i log\left(\frac{P(info|event \textcolor{red}{1})_{it}}{P(info | event \textcolor{red}{3})_{it}}\right) + \epsilon_{it\textcolor{blue}{3}}

The fact that the underlying probabilities must sum to 1 means:
log\left(\frac{P(event \textcolor{red}{2})_{it}}{P(event \textcolor{red}{3})_{it}}\right) = log\left(\frac{P(event \textcolor{red}{1})_{it}}{P(event \textcolor{red}{3})_{it}}\right) - log\left(\frac{P(event \textcolor{red}{1})_{it}}{P(event \textcolor{red}{2})_{it}}\right)

I hope this clears everything up, save for the issue of the error terms.

Great! What you need is a way to parameterize the error terms such that they sum to zero. There are a few potential strategies for this, discussed in the Stan User’s Guide section 1.7, “Parameterizing centered vectors.”
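For concreteness, here is a minimal sketch of the first strategy in that section, written for a generic length-Y vector that must sum to zero exactly (the names Y, eps_raw, and eps are illustrative, not from your model):

data {
  int<lower=2> Y; // length of the constrained vector (3 in your case)
}
parameters {
  vector[Y - 1] eps_raw; // first Y-1 components are free
}
transformed parameters {
  // the last component is determined so that the whole vector sums to zero
  vector[Y] eps = append_row(eps_raw, -sum(eps_raw));
}
model {
  eps_raw ~ normal(0, 1); // placeholder prior; replace with your actual error model
}

One caveat, if I am following your setup correctly: because the covariates obey the same telescoping algebra as the outcomes, your identity implies \epsilon_{it3} = \epsilon_{it1} + \epsilon_{it2} rather than a plain sum to zero, so the determined component would be sum(eps_raw) rather than -sum(eps_raw).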

Thanks for that link! I modified the first example on that page, and am getting results that are quite similar to those I get when I let each \epsilon_{itj} be drawn iid.

First, I just wanted to check whether I specified something wrong. I pasted the new code at the end for reference, but will highlight the changes I made:

Each “observation” is now at the it level and is a triplet:
\begin{align} \underbrace{\begin{bmatrix} y_{it1} \\ y_{it2} \\ y_{it3} \end{bmatrix}}_{Y \times 1} =&\underbrace{\begin{bmatrix} x_{it11} & ... & x_{it1K} \\ x_{it21} & ... & x_{it2K} \\ x_{it31} & ... & x_{it3K} \\ \end{bmatrix}}_{Y \times K} \underbrace{\begin{bmatrix} \beta_{i1} \\ \vdots \\ \beta_{iK} \end{bmatrix}}_{K \times 1} +\underbrace{\begin{bmatrix} \epsilon_{it1} \\ \epsilon_{it2} \\ \epsilon_{it3} \end{bmatrix}}_{Y \times 1} \end{align}

For the \epsilon_{itj}'s, I added:

parameters {
   vector<lower=0>[Y-1] sigma_raw;
}
transformed parameters {
   vector<lower=0>[Y] sigma = append_row(sigma_raw, (sigma_raw[2]-sigma_raw[1]));
}

Then, in the model section, I specified each component of the outcome vector y_{it} individually:

model {
   array[N] vector[Y] mu;
  ....
  // likelihood
  for(i in 1:N) {
    mu[i] = x[i] * (beta + z[,pid[i]]);  
    y[i,1] ~ normal(mu[i,1], sigma[1]);
    y[i,2] ~ normal(mu[i,2], sigma[2]);
    y[i,3] ~ normal(mu[i,3], sigma[3]);
  }
}

Is that a permissible way to do it?

Second, I wanted to ask about constraints on the parameters. \beta_i should always be positive. If I put this constraint in the parameters section, e.g.:

parameters {
  matrix<lower=0>[K, N_pts] z_p; // matrix of individual coefficients
  vector<lower=0>[K] beta; // coefficient hyperprior
}

can I continue to code the non-centered parameterization the way I have below? My understanding is that the non-centered parameterization relies on the coefficients being multivariate normal (so imposing the positivity constraint would truncate that distribution).


Here’s all the code for reference:


data {
  int N_pts; // number of individuals
  int N; // number of reported triplets: y_{it1}, y_{it2}, y_{it3}
  int Y; // 3 (the number of outcomes)
  int K; // number of covariates
  array[N] int pid; // vector of individual ids
  array[N] vector[Y] y; // outcomes
  array[N] matrix[Y, K] x; // covariates
} 

parameters {
  matrix[K, N_pts] z_p; // standardized individual coefficient deviations (non-centered parameterization)
  vector[K] beta; // hyperprior: mean of coefficients
  // hyperprior: covariance of coefficients (decomposed into a scale vector and a Cholesky-factored correlation matrix)
  vector<lower=0>[K] sigma_p; // coefficient scales (standard deviations)
  cholesky_factor_corr[K] L_p; // Cholesky factor of the correlation matrix
  vector<lower=0>[Y-1] sigma_raw; // first Y-1 error scales in the likelihood
}

transformed parameters {
  matrix[K, N_pts] z; // individual coefficient deviations (scaled and correlated)
  z = diag_pre_multiply(sigma_p, L_p) * z_p;
  vector<lower=0>[Y] sigma = append_row(sigma_raw, (sigma_raw[2]-sigma_raw[1])); // add third component of error term subject to additive constraint
}

model {
  array[N] vector[Y] mu;
  
  // priors
  beta ~ normal(1, 1);
  sigma ~ normal(0, 5); 
  sigma_p ~ normal(0, 5); 
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N) {
    mu[i] = x[i] * (beta + z[,pid[i]]);  
    y[i,1] ~ normal(mu[i,1], sigma[1]);
    y[i,2] ~ normal(mu[i,2], sigma[2]);
    y[i,3] ~ normal(mu[i,3], sigma[3]);
  }
}