Hierarchical model with only two groups

I am modelling the number of successes in two consecutive trials of size N_1 and N_2, respectively, using the following model:

n_i \mid \lambda_i, N_i \sim Bin(N_i, \lambda_i), for i=1,2.

This treats the numbers of successes as independent, but I expect the success proportions to be strongly positively correlated between the trials. One way to introduce correlation is to use a hierarchical model, i.e., by letting

\lambda_i \mid a,b \sim Beta(a,b)

and then specifying some hyperpriors. I specify

a ~ gamma(0.1,0.1);
b ~ gamma(5,5);

I ran some simulations, sampling (a,b) and then sampling (\lambda_1, \lambda_2) for each (a,b) and found that

cor(\lambda_1, \lambda_2) \approx 0.8 using the above hyperpriors.
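
For reference, the prior simulation was essentially the following (the seed and number of draws are just illustrative; R's gamma(shape, rate) matches Stan's parameterization):

set.seed(1)
N = 1e5
# draw the hyperparameters
a = rgamma(N, shape = 0.1, rate = 0.1)
b = rgamma(N, shape = 5, rate = 5)
# for each (a, b), draw one pair (lambda_1, lambda_2)
lambda1 = rbeta(N, a, b)
lambda2 = rbeta(N, a, b)
cor(lambda1, lambda2)   # approximately 0.8 with these hyperpriors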

I ran the above model and found that cor(\lambda_1, \lambda_2) = 0.0380693 using the posterior estimates of (\lambda_1, \lambda_2). The Stan code is included below. Should I expect the correlation from the posterior estimates of (\lambda_1, \lambda_2) to be around 0.8?

data {
  int N_1;
  int N_2;
  int n_1;
  int n_2;
  int M; 
}
parameters{
  real<lower=0, upper=1> lambda[M];
  real<lower=0> a;
  real<lower=0> b;
}
model{
  target+= binomial_lpmf(n_1| N_1, lambda[1]) + binomial_lpmf(n_2| N_2, lambda[2]);
  //priors
  lambda ~ beta(a,b);
  a ~ gamma(0.1,0.1);
  b ~ gamma(5,5);
}

Since this model did not appear to work, I specified the following priors instead:

\bigg(log(\lambda_1/(1-\lambda_1)), log(\lambda_2/(1-\lambda_2))\bigg) \sim N((\mu,\mu), \Sigma),

where \Sigma = \begin{pmatrix} \sigma^2 & \sigma^2 \rho \\ \sigma^2 \rho & \sigma^2\end{pmatrix}

with suitable hyperpriors

mu ~ normal(2,1);
sigma ~ normal(0.75,0.25);
rho ~ normal(0.85,0.05);

This model captures everything I want in the context of the problem.

This model gives divergent transitions on the real data and simulated data. I then tried inputting \mu and \sigma as known and only estimating \rho but this also gave divergent transitions.

I wonder if there is a problem with my code or if it is because I have a hierarchical model with only two groups (one proportion for each group) and the model is struggling to identify anything. I have included the code below for the second model.

data {
  int N_1;
  int N_2;
  int n_1;
  int n_2;
  int M; 
  //real mu; 
  //real<lower = 0> sigma;
}
parameters{
  vector[M] lambda_raw;
  real mu;
  real<lower=0> sigma;
  real<lower=0, upper=1> rho;
}
transformed parameters{
  vector[M] lambda;
  lambda[1] = 1/(1+exp(-lambda_raw[1]));
  lambda[2] = 1/(1+exp(-lambda_raw[2]));
}
model{
  matrix[2,2] Sigma;
  Sigma[1,1] = sigma^2;
  Sigma[2,2] = sigma^2;
  Sigma[1,2] = rho*sigma^2;
  Sigma[2,1] = rho*sigma^2;
  row_vector[2] Mu;
  Mu[1] = mu;
  Mu[2] = mu;
  target+= binomial_lpmf(n_1| N_1, lambda[1]) + binomial_lpmf(n_2| N_2, lambda[2]);
  //priors
  lambda_raw ~ multi_normal(Mu,Sigma);
  mu ~ normal(2,1);
  sigma ~ normal(0.75,0.25);
  rho ~ normal(0.85,0.05);
}

Does anyone have any ideas on:
(1) if I should expect the correlation from the posterior estimates of (\lambda_1, \lambda_2) of the first model to be around 0.8?
(2) why the second model gives divergent transitions and if I can hope for it to work?
(3) Another way to model correlated proportions?

I may be misunderstanding, but isn’t a defining feature of hierarchical models that the group-specific parameters \lambda_i are independent draws from the population distribution? The hierarchical structure introduces within-group correlation for the observations.

I think a simpler way to introduce some prior “expectation” of correlation could be to work on the logit scale and give the two parameters something like a multivariate normal prior with an appropriate covariance (correlation) matrix. So like the second model, but without trying to estimate the correlation.

This was just a quick response off the top of my head so others may wish to join in.


I might also be misunderstanding something, but it sounds like you’re comparing two different things. Your first simulation sounds like you’re sampling a and b many times (say, N) and from each pair you’re sampling \lambda_1 and \lambda_2. So you have an N \times 2 matrix of \lambda with a correlation of 0.8.

In the second case, it sounds like you are taking one pair of \lambda values, simulating a dataset, and estimating them from the data. In that case, I don’t see why the posteriors should be correlated, as described by @js592. This would be closer to simulating a and b once and then simulating N pairs of \lambda values from them.


When simulating from the priors I did simulate N pairs of (a,b) and then for each (a,b) I sampled a pair (\lambda_1,\lambda_2). The correlation between (\lambda_{1,1}, \dots, \lambda_{1,N}) and (\lambda_{2,1}, \dots, \lambda_{2,N}) is approximately 0.8.

Using the law of total covariance we find:

Cov(\lambda_1,\lambda_2) = \underbrace{E[Cov(\lambda_1,\lambda_2 \mid a,b)]}_{0} + \underbrace{Cov(E[\lambda_1 \mid a,b], E[\lambda_2 \mid a,b])}_{\neq 0}.

Dividing by \sqrt{Var(\lambda_1)Var(\lambda_2)} then gives approximately 0.8. When calculating the correlation using the posterior samples of \lambda_1 and \lambda_2 I must be calculating something else.

Could you elaborate on how you calculated the correlation of the posterior samples? What data did you pass to the model?

library(rstan)

run_model = stan("hierarchical_model.stan", data = stan_list, iter = 10000, seed = 1, chains = 1)

lambda1 = run_model@sim$samples[[1]]$`lambda[1]`[5001:10000]
lambda2 = run_model@sim$samples[[1]]$`lambda[2]`[5001:10000]

I called the script “hierarchical_model.stan” and passed in the number of trials (N_1, N_2) and number of successes (n_1,n_2) in each stage and set M=2 (all contained in stan_list) and obtained the posterior samples for \lambda_1 and \lambda_2 as shown above and then calculated

cor(\lambda_1, \lambda_2) in R.
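
For what it’s worth, the same quantity can be computed a bit more directly with rstan’s extract(), assuming the same fitted object run_model:

post_lambda = rstan::extract(run_model, pars = "lambda")$lambda  # post-warmup draws, one column per lambda
cor(post_lambda[, 1], post_lambda[, 2])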

Does this mean you reran the model once for each pair of lambdas as described below? Then took the posterior means for each model and correlated those?


I only ran the Stan code once, using one set of observations (n_1, N_1, n_2, N_2). These four values were used to estimate \lambda_1 and \lambda_2.

I then checked the correlation of the posterior samples of \lambda_1 and \lambda_2 and compared that to the repeated simulations from the prior distributions.

I think this is comparing two different things. I don’t think you should expect to find the same correlation in this case, at least if you’re only passing one pair of observations. Someone else may have a more formal mathematical description of why this is the case, but the correlation between the vector of known parameters (as in the initial simulation) may be completely unrelated to the correlation of the posterior draws of any given pair. So, even if you were to pass multiple pairs of observations, you would need to look at the per-draw correlation between the two vectors of \lambda rather than the correlation of each pair across all draws.

Assuming you only have one pair of observations, I think you could either a) ignore the correlation or b) impose some informative prior. I don’t think freely estimating the correlation with a wide or uninformative prior will do much.

Edit: just to clarify my point, there are two issues with comparing the correlation from the simulation to the posterior samples in Stan. First, you’re only passing one pair of observations and so have no leverage to estimate the correlation from the data. If you passed multiple pairs or used an informative prior, then you could estimate that correlation. Second, the within-draw correlation between two vectors of parameters is different from the across-draw correlation between two parameters. So the baseline comparison you’re making isn’t aligned.


You might find it easier to specify the priors using the mean/concentration parameterization (which our doc calls the “proportion distribution”): Continuous Distributions on [0, 1]

It’s hard to lay down a sensible prior on the standard parameters of a beta due to their correlation. There’s a discussion of why this is problematic in the fifth chapter of Gelman et al.'s Bayesian Data Analysis, where they suggest the alternative parameterization.
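
Concretely, the first model in that parameterization would read something like (writing \mu for the prior mean and \kappa for the concentration, which is just my notation)

\lambda_i \mid \mu, \kappa \sim Beta(\mu\kappa, (1-\mu)\kappa), for i=1,2,

which is what beta_proportion(mu, kappa) implements in Stan. It’s usually much easier to reason about priors on \mu \in (0,1) and \kappa > 0 than on (a, b) directly.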

This isn’t technically a hierarchical model in the Bayesian sense. It’s just fitting both of the binomials independently.

No. The prior induces a strong correlation, but you’re then fitting the two values independently and then measuring posterior correlation. If there were a much larger group of trials, you’d find stronger effects.

Another example that might be simpler is a linear regression with a slope and intercept and only positive covariates. In that case, even if the prior correlation between slope and intercept is zero, they will be strongly correlated in the posterior.
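
A quick way to see that effect in R (purely illustrative; the numbers here are made up):

set.seed(1)
x = runif(100, 1, 2)              # strictly positive covariate
y = 1 + 2 * x + rnorm(100)
fit = lm(y ~ x)
cov2cor(vcov(fit))                # intercept/slope correlation is strongly negative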

In traditional stats lingo, you have moved from treating the correlation as a random effect to treating it as a fixed effect.

For transformed parameters, this can be a one-liner:

vector[M] lambda = inv_logit(lambda_raw);

You also don’t want to compute values multiple times. It’s better to set Sigma[2,2] = Sigma[1,1] so as not to have to compute the square again, and you can define Mu in a one-liner using row_vector[2] Mu = [mu, mu];.

You probably want to declare rho to have a lower-bound of -1 in case the data is consistent with negative posterior correlation.

(1) No.

(2) You have to look at the geometry you’re inducing in the posterior. Look at the pairs plot of the parameters. There’s a chapter in the User’s Guide on problematic posteriors that’s a good place to start.

(3) Trying to make a hierarchical model for two parameters is pretty much going to have to lean strongly on the prior since there won’t be enough data to robustly estimate the population parameters. I’d suggest just skipping the hierarchical component altogether unless there’s some reason you want to model the correlation. In the likelihood, the two parameters are not correlated at all. You can force them to be correlated with a correlated prior and that can work, but to what end? Is there some kind of inference you want to do downstream with the correlations?


Thank you for the suggestions @Bob_Carpenter.

I don’t want to model the correlation but I am uncertain about the value of the correlation between \lambda_1 and \lambda_2. That is the only reason I have a prior on \rho. I don’t expect to be able to use the posterior estimates of \rho for any sort of inference.

For now, I am using the Stan code below:

data {
  int N_1;
  int N_2;
  int n_1;
  int n_2;
  int M; 
  real mu; 
  real<lower = 0> sigma;
}
parameters{
  vector[M] lambda_raw;
  real<lower=0, upper=1> rho;
}
transformed parameters{
  vector[M] lambda = inv_logit(lambda_raw);
}
model{
  matrix[2,2] Sigma;
  Sigma[1,1] = sigma^2;
  Sigma[2,2] = Sigma[1,1]; 
  Sigma[1,2] = rho*sigma^2;
  Sigma[2,1] = Sigma[1,2];
  row_vector[2] Mu = [mu,mu];
  target+= binomial_lpmf(n_1| N_1, lambda[1]) + binomial_lpmf(n_2| N_2, lambda[2]);
  //priors
  lambda_raw ~ multi_normal(Mu,Sigma);
  rho ~ normal(0.85,0.05);
}

I ran the model with 10,000 iterations and this gave 4 divergent transitions, so I just increased adapt_delta to 0.9. I thought this would be fine given the small number of divergent transitions.
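
(In rstan this is passed through the control list, so the call becomes something like the following.)

run_model = stan("hierarchical_model.stan", data = stan_list, iter = 10000,
                 seed = 1, chains = 1, control = list(adapt_delta = 0.9))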

What I am most interested in is the marginal posterior for \lambda_2. In stage 1, I observe n_1 successes out of N_1 trials, but I only move to stage 2 conditional on \lambda_1 meeting a minimum level of acceptance. E.g., I only observe the stage 2 data (n_2, N_2) if \Pr(\lambda_1 > 0.95) \geq 0.9. Is it possible to model this in Stan?

I initially extracted \lambda_1 and \lambda_2

lambda1 = run_model@sim$samples[[1]]$`lambda[1]`[5001:10000]
lambda2 = run_model@sim$samples[[1]]$`lambda[2]`[5001:10000]

where run_model is the Stan model and plotted the histogram of

lambda2[lambda1>0.95]

This is not quite what I was looking for, but it discards the \lambda_2 samples when \lambda_1 \leq 0.95. I think this gives the marginal posterior for \lambda_2 conditional on \lambda_1 > 0.95, i.e., with \lambda_1 truncated below at 0.95.

Your model is imposing a prior correlation on \lambda based on a hierarchical prior involving mu and sigma and rho. Even though it’s coded to look like a hierarchical model, it’s all just going to integrate out to give you a fixed prior over lambda. You can’t estimate Sigma because there’s only a single observation—nothing from which to gain any information about covariance at all!

Usually we fit a multi-normal with a sequence of more than N vectors for an N-dimensional multi-normal. Here you only have one pair (\lambda_1, \lambda_2). What’s going to happen is that all of this prior apparatus is just going to pull the \lambda variables gently toward the prior mean. You also really need a prior on sigma here to make it more computationally stable.

The model you wrote down doesn’t have the conditioning of stage 2 data on stage 1. I’m unclear on what you mean that stage 2 only happens based on \lambda_1. There’s no \Pr[\lambda_1 > 0.95] in the data-generating process because \lambda_1 will have a single fixed value.


Is it unstable to have the following prior:

\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} \sim N \Bigg( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma^2 & \sigma^2 \rho\\ \sigma^2 \rho & \sigma^2 \end{bmatrix}\Bigg),

where I specify \mu_1, \mu_2, \sigma, \rho in the data block? I thought this would just be a standard joint prior and hence stable. I only specified a prior on \rho (and entered the rest as data) to represent some uncertainty in \rho. Or perhaps it would be okay if everything were entered as data, but if I specify a prior on \rho I must also specify one on \sigma, and vice versa?

At the moment it does not, but I would like to implement some conditioning. In practice, trial 1 is run until N_1 patients (with a disease) have been tested and we calculate n_1 (the number of times a diagnostic test is positive given the patient has the disease), so n_1/N_1 is the maximum likelihood estimate of the diagnostic test sensitivity. We then only collect the stage 2 data if n_1/N_1 > s_{min}, a minimum allowed value. So the joint distribution (\lambda_1, \lambda_2) is not observed when n_1/N_1 \leq s_{min}. I would like to include this requirement in the model. When n_1/N_1 > s_{min}, the stage 2 data is collected in a population that is not exchangeable with the first stage population, but we expect \lambda_1 and \lambda_2 to be positively correlated (hence the correlation in the joint prior).

Only moving to stage 2 if n_1/N_1 > s_{min} is saying we only move to stage 2 if the maximum likelihood estimate for \lambda_1 is greater than s_{min}. Instead of this we could estimate \lambda_1 (a parameter) using (n_1, N_1) and calculate quantities like \Pr(\lambda_1 > 0.95) and propose that we only move to stage 2 if \Pr(\lambda_1 > 0.95) >p. I wonder if this could be included in the model as well?

No, that should be fine if everything on the right-hand side is data. It doesn’t have to be data—I was just trying to point out that it won’t make a hierarchical prior if you only have one pair of \lambda values.

It really comes down to how the data was collected. You want to model the collection process directly so the model matches the data-generating process. But given that n_1, N_1 are observed, this probably won’t matter if s_{min} is also known.

I didn’t understand the reasoning that \lambda_1, \lambda_2 are positively correlated as you’re only giving a situation in which \lambda_2 is or is not observed. For instance, if I draw x_1, x_2 independently from a standard normal, but only observe x_2 if x_1 is above a threshold, that doesn’t make the variables correlated—it makes whether x_2 is observed conditional on the value of x_1.

You can’t calculate \Pr[\lambda_1 > 0.95] within a model; that’s a posterior calculation involving averages over posterior draws. Given that the condition is on n_1, N_1, it is detached from \lambda_1 in the sense that \lambda_1 < n_1 / N_1 in some draws, because such values can still be consistent with the data.
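
For example, from the draws you already extracted, the Monte Carlo estimate is just

mean(lambda1 > 0.95)   # estimates Pr(lambda_1 > 0.95 | data)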


Okay, this makes sense. I will have to rethink the problem and come up with the right model.

We are assuming that \lambda_1 and \lambda_2 are positively correlated. The same diagnostic test is applied in population 1 in stage 1 and population 2 in stage 2. If the sensitivity in stage 1 is “high”, we expect it to be high in stage 2 as well. I.e., if the following model is assumed

\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} \sim N \Bigg( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma^2 & \sigma^2 \rho\\ \sigma^2 \rho & \sigma^2 \end{bmatrix}\Bigg)

and if \lambda_1 is sampled above its mean \mu_1, I would expect \lambda_2 to also be sampled above its mean (\mu_2) and not just randomly sampled ignoring how well the test performed in stage 1.

It actually sounds like adding the correlation structure is over-complicating things. Correct me if I have any of these details wrong:

You’re applying the same test to two different groups. You’re assuming that there is some “true” sensitivity to the test (say, \lambda_0) which is estimated by the two independent trials with proportions \lambda_1 and \lambda_2. You only have data for one test in two populations. In a hypothetical world of M different tests, you would expect that the correlation between the pairs of \lambda across all M tests would be about 0.8. So you are trying to represent that correlation in the model.

If all of that is accurate, then your original model with hyperparameters a and b should be fine, though, as @Bob_Carpenter mentioned, you don’t have much data to estimate those hyperparameters. You might also find it easier, following his advice, to use the mean/concentration parameterization. You could then specify priors for \lambda_0 and the spread of \lambda_1 and \lambda_2. If you did have M different tests, then you would have more leverage to estimate the hyperparameters.

With only one pair of observations and no additional information about the two different trials, you could even combine the two trials into one (n = n_1 + n_2 and N = N_1 + N_2; n ~ binomial(N, \lambda_0)), but that will depend on your other assumptions about the trials.


Yes, this is one approach. You can assume that \lambda has a prior, p(\lambda), and construct a posterior distribution using data from stage 1, p(\lambda \mid n_1, N_1), and then use this posterior as the prior for stage 2. I.e., p(\lambda \mid n_1, N_1, n_2, N_2) \propto L(\lambda \mid n_2, N_2)p(\lambda \mid n_1, N_1). This is equivalent to combining the trials into one.
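
For example, with a conjugate prior \lambda \sim Beta(a, b), the stage 1 posterior is Beta(a + n_1, b + N_1 - n_1); using it as the stage 2 prior gives Beta(a + n_1 + n_2, b + N_1 + N_2 - n_1 - n_2), which is exactly the posterior from the pooled trial with n_1 + n_2 successes in N_1 + N_2 trials.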

I was hoping to relax the assumption that the trials could be combined into one, and hence assumed we could estimate a \lambda for each population, where we can learn something about \lambda_2 from \lambda_1. But it does look like this will be difficult given that I only have a small amount of data.

I am not thinking about correlation between tests. I was assuming that the sensitivity of the diagnostic test would be correlated between the stages. For example, in stage 1 you might apply the diagnostic test to lab samples taken from patients; if the test has a high sensitivity, you might then want to try out the test on patients directly (and not just on samples in a lab). In stage 2, you might apply the test to patients in a real-world setting. I shouldn’t really just assume the test was used on N_1 + N_2 patients from the same population. So I tried to estimate a \lambda for each stage and assume we learn something about \lambda_2 from \lambda_1 (i.e., not completely discarding the information from stage 1 but also not completely pooling it), which sounds like exactly what a hierarchical model provides.
