Likelihood for mixture of two hurdles

Hi everyone,

Biologist by training here, so will be nice to have the experts see if what I’m doing is reasonable.

I am trying to model antibody sequencing count data of a protein. The log(N + 1) transformation of the count looks like a typical hurdle model.

I want to model the transformed data as a mixture of two hurdles. One represents the true expression of the protein and the other represents unspecific binding of the antibody.

In other words:

  • Count == 0 arises in one scenario, when the cell doesn’t express the protein and has no unspecific binding.
  • Count != 0 arises in three scenarios:
    (i) Cell expresses the protein, and there is unspecific binding
    (ii) Cell expresses the protein, and there is no unspecific binding
    (iii) Cell doesn’t express the protein, but has unspecific binding

My simulation looks like:


N <- 500 ##500 cells

prob0_true <- .2 ## 20% of cells have true 0 count
prob0_unspecific <- .7 ## 70% cells have no unspecific binding

mu_true <- 6 ## normal dist. of true expression
sigma_true <- 2
mu_unspecific <- 4 ## normal dist. of unspecific binding
sigma_unspecific <- 1

## 0 arise when cells are true negative (20%) and when there's no unspecific binding (70%)
count <- rep(NA, N)
count_true <- rbinom(N, 1, 1 - prob0_true) * rnorm(N, mu_true, sigma_true) 
count_unspecific <- rbinom(N, 1, 1 - prob0_unspecific) * rnorm(N, mu_unspecific, sigma_unspecific)
count <- count_true + count_unspecific

plot(density(count))

My stan code:

data{
  int N;
  array[N] real<lower=0> count;
}

parameters{
  real<lower=0, upper=1> theta1;
  real<lower=0, upper=1> theta2;
  
  real mu1;
  real mu2;
  
  real<lower=0> sigma1;
  real<lower=0> sigma2;
}

model{
  array[3] real lp;
  
  theta1 ~ beta(.5, .5);
  theta2 ~ beta(.5, .5);
  
  mu1 ~ normal(5, 2);
  mu2 ~ normal(5, 2);
  
  sigma1 ~ normal(0, 1);
  sigma2 ~ normal(0, 1);
  
  for(i in 1:N){
    if(count[i] == 0){
      target += bernoulli_lpmf(1 | theta1) + bernoulli_lpmf(1 | theta2);
    }
    else{

      lp[1] = bernoulli_lpmf(0 | theta1) + bernoulli_lpmf(1 | theta2) + normal_lpdf(count[i] | mu1, sigma1);
      lp[2] = bernoulli_lpmf(1 | theta1) + bernoulli_lpmf(0 | theta2) + normal_lpdf(count[i] | mu2, sigma2);
      lp[3] = bernoulli_lpmf(0 | theta1) + bernoulli_lpmf(0 | theta2) +
              normal_lpdf(count[i] | mu1, sigma1) + normal_lpdf(count[i] | mu2, sigma2);
              
      target += log_sum_exp(lp);
    }
  }
}

I am wondering if my approach to specifying the three scenarios where count != 0 in the for loop is correct. That is, calculating the log-likelihood of each scenario in lp and taking their log_sum_exp.

Thanks so much in advance,
A.L.

if you want to run a mixture model for those three, each of them should be contributed with some weight, right?

so, the parameters should include the weight, and then the lp should also contain log(weight). You can find such example in stan manual