Hi everyone,
Biologist by training here, so will be nice to have the experts see if what I’m doing is reasonable.
I am trying to model antibody sequencing count data of a protein. The log(N + 1) transformation of the count looks like a typical hurdle model.
I want to model the transformed data as a mixture of two hurdles. One represents the true expression of the protein and the other represents unspecific binding of the antibody.
In other words:
Count == 0
arises in one scenario, when the cell doesn’t express the protein and has no unspecific binding.Count != 0
arises in three scenarios:
(i) Cell expresses the protein, and there is unspecific binding
(ii) Cell expresses the protein, and there is no unspecific binding
(iii) Cell doesn’t express the protein, but has unspecific binding
My simulation looks like:
N <- 500 ##500 cells
prob0_true <- .2 ## 20% of cells have true 0 count
prob0_unspecific <- .7 ## 70% cells have no unspecific binding
mu_true <- 6 ## normal dist. of true expression
sigma_true <- 2
mu_unspecific <- 4 ## normal dist. of unspecific binding
sigma_unspecific <- 1
## 0 arise when cells are true negative (20%) and when there's no unspecific binding (70%)
count <- rep(NA, N)
count_true <- rbinom(N, 1, 1 - prob0_true) * rnorm(N, mu_true, sigma_true)
count_unspecific <- rbinom(N, 1, 1 - prob0_unspecific) * rnorm(N, mu_unspecific, sigma_unspecific)
count <- count_true + count_unspecific
plot(density(count))
My stan code:
data{
int N;
array[N] real<lower=0> count;
}
parameters{
real<lower=0, upper=1> theta1;
real<lower=0, upper=1> theta2;
real mu1;
real mu2;
real<lower=0> sigma1;
real<lower=0> sigma2;
}
model{
array[3] real lp;
theta1 ~ beta(.5, .5);
theta2 ~ beta(.5, .5);
mu1 ~ normal(5, 2);
mu2 ~ normal(5, 2);
sigma1 ~ normal(0, 1);
sigma2 ~ normal(0, 1);
for(i in 1:N){
if(count[i] == 0){
target += bernoulli_lpmf(1 | theta1) + bernoulli_lpmf(1 | theta2);
}
else{
lp[1] = bernoulli_lpmf(0 | theta1) + bernoulli_lpmf(1 | theta2) + normal_lpdf(count[i] | mu1, sigma1);
lp[2] = bernoulli_lpmf(1 | theta1) + bernoulli_lpmf(0 | theta2) + normal_lpdf(count[i] | mu2, sigma2);
lp[3] = bernoulli_lpmf(0 | theta1) + bernoulli_lpmf(0 | theta2) +
normal_lpdf(count[i] | mu1, sigma1) + normal_lpdf(count[i] | mu2, sigma2);
target += log_sum_exp(lp);
}
}
}
I am wondering if my approach to specifying the three scenarios where count != 0
in the for loop is correct. That is, calculating the log-likelihood of each scenario in lp
and taking their log_sum_exp
.
Thanks so much in advance,
A.L.