Finding clusters within random effects across individuals

I have a problem where I measure repeated responses in condition A and in condition B for a set of individuals i = 1, ..., n. I am interested in learning about the effect of the condition in the population, for which fitting a hierarchical model with a random effect for individual would seem reasonable.

However, the population can be expected to be composed of sub-populations, although we don't know which sub-population each individual belongs to. For example, it is likely that there is a population 1 of individuals with zero effect of condition, and a population 2 where individuals show an effect of condition, with some variability across individuals. But we don't know which individual belongs to which.

I am interested in learning about this sub-population structure, and about which sub-population each individual belongs to. Let's take the example above. It seems to me that we can think of the problem as a mixture of distributions that the individuals come from: one has all its mass at 0 (population 1), and the other has an effect that we have to estimate, together with its variability (population 2). So we could set up the problem as estimating the parameters of the mixture distribution, where I would be interested in the mixing probabilities and the parameters of population 2. I would also be interested in the probability that each individual belongs to one population or the other.
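To make this concrete, here is one way to write it down (my notation: $y_{ij}$ is the $j$-th response of individual $i$, $x_{ij}$ the condition indicator, $b_i$ the individual's baseline, and $\theta_i$ the individual's condition effect):

$$
y_{ij} \sim \mathrm{Normal}(b_i + \theta_i x_{ij},\, \sigma), \qquad
\theta_i \sim p\,\delta_0 + (1 - p)\,\mathrm{Normal}(\mu, \tau),
$$

where $\delta_0$ is a point mass at zero, $p$ is the mixing probability of population 1, and $\mu$ and $\tau$ are the mean and standard deviation of the effect in population 2.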

So this seems like a kind of Gaussian Mixture Model (GMM) problem, but applying the GMM to the random effects instead of to the responses, as I've seen done elsewhere. Is there some way of setting up this kind of problem in Stan? Is this a known type of problem with a name I can look into?

Just the problem above would be a good starting point, but I will later want to expand it to more complicated mixture distributions, such as 2 populations with different (not necessarily zero) means and variabilities, or even more populations.

To get you started: for each individual construct two linear predictors, lp_1 and lp_2, one without the effect and one with it. Include a fitted parameter p that gives the probability that an arbitrary individual is in subpopulation 1 (the no-effect one). Then the likelihood for an individual will look like

target += log_mix(p, normal_lpdf(y[i] | lp_1[i], sigma), normal_lpdf(y[i] | lp_2[i], sigma));
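Here log_mix marginalizes the discrete membership indicator out of the likelihood; it is Stan's built-in for

$$
\texttt{log\_mix}(p, \lambda_1, \lambda_2) = \log\!\left(p\, e^{\lambda_1} + (1 - p)\, e^{\lambda_2}\right),
$$

so p weights the first component (here the no-effect one, given the argument order).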

The posterior probability that individual i is in the first population is

exp(log(p) + normal_lpdf(y[i] | lp_1[i], sigma)) / (exp(log(p) + normal_lpdf(y[i] | lp_1[i], sigma)) + exp(log1m(p) + normal_lpdf(y[i] | lp_2[i], sigma)))
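Equivalently, and more stably, you can do the same computation on the log scale with log_sum_exp. A minimal sketch for the generated quantities block, assuming lp_1, lp_2, and sigma are accessible there (e.g. declared in transformed parameters) and n is the number of individuals:

generated quantities {
  // Posterior probability that each individual is in population 1 (no effect)
  vector[n] prob_pop1;
  for (i in 1:n) {
    // Log of (mixing weight times likelihood) for each component
    real l1 = log(p) + normal_lpdf(y[i] | lp_1[i], sigma);
    real l2 = log1m(p) + normal_lpdf(y[i] | lp_2[i], sigma);
    // Normalizing on the log scale avoids underflow in the exp() calls
    prob_pop1[i] = exp(l1 - log_sum_exp(l1, l2));
  }
}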

Note that in this case you might find some issues with label-switching, particularly once you model nonzero effect sizes/variabilities in both groups. You can ameliorate these by constraining the sign of the difference in the group means to identify which group is which.
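For example, once both components have free means, declaring them as an ordered vector is the standard way to impose such a constraint. A sketch of just the relevant declaration:

parameters {
  // ordered enforces effectMean[1] < effectMean[2], which pins down
  // which component is which and prevents label switching
  ordered[2] effectMean;
}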

Edit: note also that the no-effect population has a single mean across conditions, so you still need to choose which condition of the other population that mean should line up with, i.e. which condition serves as the intercept for the no-effect population. Perhaps you have domain knowledge that makes this obvious, but from first principles there's no clear choice.
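The same construction also generalizes to the more-than-two-population case you mention. A sketch with hypothetical names (K components, with lp[k, i] the k-th linear predictor for individual i):

parameters {
  simplex[K] theta;  // mixing weights over the K subpopulations
}

model {
  for (i in 1:n) {
    vector[K] lps = log(theta);  // start from the log mixing weights
    for (k in 1:K) {
      // add each component's log likelihood for this individual
      lps[k] += normal_lpdf(y[i] | lp[k, i], sigma);
    }
    target += log_sum_exp(lps);  // marginalize over the K memberships
  }
}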

Thanks @jsocolar. I’m trying to implement what you suggested. However, there’s something that’s not fully clear to me from your response.

An important aspect of my data is that for each individual I make repeated measurements. In your example code you use i to index both the individual and the observation, but that would not be appropriate for my problem. So the line
target += log_mix(p, normal_lpdf(y[i] | lp_1[i], sigma), normal_lpdf(y[i] | lp_2[i], sigma));
in your code above would need to be modified to account for this. For example, something like:

for (i in 1:N) {
    target += log_mix(p, normal_lpdf(y[i] | lp_1[indId[i]], sigma), normal_lpdf(y[i] | lp_2[indId[i]], sigma));
}

where i indexes the observation (and not the individual, as in your suggested code), indId is an array of length N giving the individual from which observation i was gathered, and lp_1 and lp_2 are arrays of length K, with K being the number of individuals.

Now, it seems like I would also have to modify the second line of code, which computes the posterior probability of belonging to the first population. For this I would need y to be the vector of observations associated with the linear predictors of individual j, as given by lp_1[j] and lp_2[j], and then the code would seem to work. Does all of this sound about right? Or would a different approach be better, given these comments?

Also, to get the probability that individual j belongs to population 1, I should, following your suggestion, add the second line of code to the generated quantities block?

Thanks in advance!

I implemented the model (to the best of my ability), and it doesn't seem to perform well on simulated data. I simulated data with 2/3 of the neurons belonging to population 1 (with an effect) and 1/3 belonging to population 2 (no effect), but the estimated mixture probability is consistently around 0.995. I don't get any warnings about ill behavior during fitting.

I’m using the following Stan code to define the model:

data {
  int<lower=1> nNeurons;  // Number of groups
  int<lower=1> N; // Number of total datapoints
  vector[N] Stimulation; // Stimulation indicator
  vector[N] Response; // Response data
  array[N] int<lower=1> neuronId; // Neuron indices for each observation
}

parameters {
  real baselineFR;      // Average baseline FR
  real<lower=0> meanStimEffect;      // Increment of FR with stimulation
  vector[nNeurons] neuronBaseFR;  // Neuron-specific baseline FR
  vector[nNeurons] neuronEffect;  // Neuron-specific effect of stimulation
  real<lower=0> sigmaBaselineFR;     // Standard deviation of baseline FR
  real<lower=0> sigmaEffect;     // Standard deviation of stimulation effect
  real<lower=0> sigmaResidual;  // Standard deviation of residuals
  real<lower=0,upper=1> p;  // Probability that an arbitrary neuron is in population 1 (with effect)
}

model {
  // Priors
  baselineFR ~ normal(0, 30);
  sigmaBaselineFR ~ cauchy(0, 5);
  neuronBaseFR ~ normal(baselineFR, sigmaBaselineFR);
  meanStimEffect ~ normal(0, 30);
  sigmaEffect ~ cauchy(0, 5);
  neuronEffect ~ normal(meanStimEffect, sigmaEffect);  // hierarchical prior on neuron effects
  sigmaResidual ~ cauchy(0, 5);
  p ~ beta(1, 1);  // Uniform prior on the mixture weight
  for (i in 1:N) {
    // linear predictor with the effect
    real lp_1 = neuronBaseFR[neuronId[i]] + neuronEffect[neuronId[i]] * Stimulation[i];  
    // linear predictor without the effect
    real lp_2 = neuronBaseFR[neuronId[i]];
    // Mixture model for the response
    target += log_mix(p,
                      normal_lpdf(Response[i] | lp_1, sigmaResidual),
                      normal_lpdf(Response[i] | lp_2, sigmaResidual));
  }
}


In the code above, neuronBaseFR and neuronEffect are the intercept and slope used to generate the linear predictors lp_1 and lp_2.

I use the following code to generate a dataset in Python (note, individuals in my case are neurons):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from cmdstanpy import CmdStanModel
import seaborn as sns
import os
import arviz as az

###############
# GENERATE THE DATA
###############

# Dataset parameters
nTrials = 20  # number of trials per condition
neurons1N = 20  # number of neurons with effect (population 1)
neurons2N = 10  # number of neurons without effect (population 2)
baselineFiring = 10  # mean baseline firing rate
neuronBaselineSd = 3  # variability in baseline firing rate
effect = 10  # mean effect of stimulation on population 1
residualSd = 3  # trial response variability

nNeurons = neurons1N + neurons2N
neuronType = ['1'] * neurons1N + ['2'] * neurons2N
typeEffects = {'1': effect, '2': 0}

# Generate the dataset
data = []
# Generate the baseline firing rates for each neuron
neuronBaselines = np.random.normal(0, neuronBaselineSd, nNeurons) + \
  baselineFiring
# Define the means for each condition and group
for neuron in range(nNeurons):
    for stimulation in [0, 1]:
        samples = np.random.normal(neuronBaselines[neuron] +
                                   typeEffects[neuronType[neuron]] * stimulation,
                                   residualSd, nTrials)
        data.extend(zip([neuron] * nTrials,
                        [neuronType[neuron]] * nTrials,
                        range(1, nTrials + 1),
                        [stimulation] * nTrials, samples))

# Create a DataFrame from the generated data
expData = pd.DataFrame(data, columns=['Neuron', 'Type', 'Trial',
                                      'Stim', 'Response'])
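
I then compile and fit the model roughly like this (a sketch; the Stan file name is just what I use locally):

# Assemble the data dictionary in the shape the Stan program expects
stanData = {
    'nNeurons': nNeurons,
    'N': len(expData),
    'Stimulation': expData['Stim'].to_numpy(),
    'Response': expData['Response'].to_numpy(),
    'neuronId': (expData['Neuron'] + 1).to_numpy(),  # Stan indexes from 1
}

# Compile the model and sample with CmdStanPy defaults
model = CmdStanModel(stan_file='mixtureModel.stan')
fit = model.sample(data=stanData)
print(az.summary(fit))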

I get the following summary of the results with ArviZ:

                     mean      sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
baselineFR        10.302   0.494   9.375   11.241      0.010    0.007    2605.0    1424.0    1.0
meanStimEffect    23.620  17.784   0.009   56.464      0.360    0.255    1771.0    1005.0    1.0
neuronBaseFR[0]   11.616   0.646  10.425   12.788      0.012    0.008    3015.0    1487.0    1.0
neuronBaseFR[1]   12.359   0.658  11.069   13.538      0.012    0.008    3052.0    1763.0    1.0
neuronBaseFR[2]    8.824   0.673   7.609   10.099      0.012    0.009    2959.0    1767.0    1.0
...                  ...     ...     ...      ...        ...      ...       ...       ...    ...
neuronEffect[29]   0.414   0.960  -1.331    2.192      0.017    0.018    3284.0    1596.0    1.0
sigmaBaselineFR    2.476   0.363   1.868    3.189      0.007    0.006    2571.0    1588.0    1.0
sigmaEffect       19.473  87.730   0.015   49.762      2.246    1.589    3465.0     905.0    1.0
sigmaResidual      3.036   0.065   2.910    3.155      0.001    0.001    3461.0    1321.0    1.0
p                  0.995   0.004   0.987    1.000      0.000    0.000    1805.0    1176.0    1.0


The log_mix statement still needs to be applied per individual, not per observation: mixing per observation lets every data point choose its own component, and since the with-effect predictor can mimic the no-effect one, the first component ends up absorbing nearly all observations and p gets pushed toward 1. So let i index the individuals, and let the likelihood of each mixture component be the sum of the log probabilities of all observations corresponding to the ith individual.
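In long format (so it also works when neurons have different numbers of observations), that could look something like the following sketch, reusing the variable names from your model (priors omitted):

model {
  // Accumulate each neuron's component log likelihoods over its observations
  vector[nNeurons] log_prob_1 = rep_vector(0, nNeurons);
  vector[nNeurons] log_prob_2 = rep_vector(0, nNeurons);
  for (n in 1:N) {
    int j = neuronId[n];
    log_prob_1[j] += normal_lpdf(Response[n] | neuronBaseFR[j]
                                 + neuronEffect[j] * Stimulation[n], sigmaResidual);
    log_prob_2[j] += normal_lpdf(Response[n] | neuronBaseFR[j], sigmaResidual);
  }
  // One mixture statement per neuron, not per observation
  for (j in 1:nNeurons) {
    target += log_mix(p, log_prob_1[j], log_prob_2[j]);
  }
}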


Thanks. For completeness, below is the script I ended up with. In this example all individuals have the same number of observations, so the hierarchical structure can be encoded as a matrix where each row is an individual and each column an observation.

data {
  int<lower=1> nNeurons;  // Number of neurons (groups)
  int<lower=1> nObsPerNeuron; // Number of observations per neuron
  array[nNeurons, nObsPerNeuron] real Response; // Responses for each neuron
  array[nNeurons, nObsPerNeuron] int<lower=0, upper=1> Stimulation; // Stimulation indicator
}

parameters {
  // Baseline firing rate parameters
  real<lower=0> interceptPop;
  real<lower=0> interceptSigma;
  vector[nNeurons] interceptNrn;
  // Stimulation effect parameters
  real<lower=0> effectPop;
  real<lower=0> effectSigma;
  vector[nNeurons] effectNrn;
  real<lower=0> residualSigma;
  // Mixture populations parameter
  real<lower=0,upper=1> pMix;
}

model {
  // Priors
  interceptPop ~ normal(10, 5);
  //interceptSigma ~ cauchy(0, 5);
  interceptSigma ~ gamma(1.4, 1);
  interceptNrn ~ normal(interceptPop, interceptSigma);
  effectPop ~ normal(0, 3);
  //effectSigma ~ cauchy(0, 5);
  effectSigma ~ gamma(1.4, 1);
  effectNrn ~ normal(effectPop, effectSigma);
  residualSigma ~ normal(0, 3);
  pMix ~ beta(1, 1);
  // Likelihood
  // Loop over neurons
  for (i in 1:nNeurons) {
    // Log likelihood of this neuron's responses under population 1
    real log_prob_1 = normal_lpdf(to_vector(Response[i,:]) | interceptNrn[i] +
      effectNrn[i] * to_vector(Stimulation[i,:]), residualSigma);
    // Log likelihood of this neuron's responses under population 2
    real log_prob_2 = normal_lpdf(to_vector(Response[i,:]) | interceptNrn[i], residualSigma);
    // Mixture model for the response
    target += log_mix(pMix, log_prob_1, log_prob_2);
  }
}

generated quantities {
  // Probability of belonging to population 1 for each neuron
  vector[nNeurons] pNrn; 
  for (i in 1:nNeurons) {
    // Evaluate likelihoods
    real log_prob_1 = normal_lpdf(to_vector(Response[i,:]) | interceptNrn[i] +
      effectNrn[i] * to_vector(Stimulation[i,:]), residualSigma);
    real log_prob_2 = normal_lpdf(to_vector(Response[i,:]) | interceptNrn[i], residualSigma);
    // Posterior probability of membership in population 1, computed on
    // the log scale so the exponentials cannot underflow
    pNrn[i] = exp(log(pMix) + log_prob_1 -
      log_sum_exp(log(pMix) + log_prob_1, log1m(pMix) + log_prob_2));
  }
}
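
In case it's useful, this is roughly how I reshape the simulated long-format data from the earlier script into the matrices this model expects (a sketch based on the expData frame above):

# Sort so every neuron's observations appear in the same column order,
# then pivot the long table into nNeurons x nObsPerNeuron matrices
expData = expData.sort_values(['Neuron', 'Stim', 'Trial'])
nObsPerNeuron = 2 * nTrials  # both stimulation conditions
responseMat = expData['Response'].to_numpy().reshape(nNeurons, nObsPerNeuron)
stimMat = expData['Stim'].to_numpy().astype(int).reshape(nNeurons, nObsPerNeuron)

stanData = {
    'nNeurons': nNeurons,
    'nObsPerNeuron': nObsPerNeuron,
    'Response': responseMat,
    'Stimulation': stimMat,
}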