Modelling the indicator that a sum of Bernoulli random variables is positive

I (a fairly new Stan user) am trying to generate parameter estimates for the following scenario:

  • I have N observations and P predictors. Each predictor is binary, indicating whether a certain activity was undertaken
  • Each predictor has an associated, unknown risk \theta_j \in [0, 1] for j \in \{1, ..., P\}
  • The event that an incident occurs during activity j for observation i is I_{i, j} \sim \text{Ber}(X_{i,j} \cdot \theta_j)
  • The total number of incidents for observation i (which is not recorded) is Z_i = \sum_{j=1}^P I_{i, j}
  • Finally, the variable we have available is Y_i = \mathbb{1}\{Z_i > 0\}

Since Y_i = 0 exactly when every I_{i, j} is zero, and the I_{i, j} are independent, I’ve calculated the pmf of Y_i to be

p_{Y_i}(y) = \begin{cases} 1 - \prod_{j=1}^P (1-X_{i,j} \cdot \theta_j) & y = 1 \\ \prod_{j=1}^P (1-X_{i,j} \cdot \theta_j) & y = 0 \end{cases}

This is where I start to get stuck. If my understanding is correct, I want to increment target by the logarithm of this function. Taking the logarithm of the case y=0 is simple, but things get messy for y=1. Are there any tips or advice on how I could best implement this model in Stan? Moreover, as my desired dataset is quite large, an efficient solution would be appreciated.
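To make the difficulty concrete, the per-observation contribution to target is

\log p_{Y_i}(y_i) = \begin{cases} \log\left(1 - \prod_{j=1}^P (1 - X_{i,j} \cdot \theta_j)\right) & y_i = 1 \\ \sum_{j=1}^P \log(1 - X_{i,j} \cdot \theta_j) & y_i = 0 \end{cases}

so the y = 0 case reduces to a sum of logarithms, while the y = 1 case keeps the whole product trapped inside the logarithm.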

For reference, my Stan code is

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> P;                            // number of activities
  matrix[N, P] X;                            // activity occurrences
  vector<lower=0, upper=1>[N] y;             // at least one incident
}
parameters {
  vector<lower=0, upper=1>[P] theta;         // activity risks
}
model {
  theta ~ uniform(0, 1);                     // prior
  // How to modify target?
}

As for data, I am currently prototyping the model, so I am using random X and \theta.
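For concreteness, one way to simulate such data is a standalone Stan program run with algorithm=fixed_param, along the lines of the sketch below (the 50% activity rate is an arbitrary illustrative choice):

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> P;                            // number of activities
}
generated quantities {
  vector<lower=0, upper=1>[P] theta;         // true activity risks
  array[N, P] int<lower=0, upper=1> X;       // simulated activity occurrences
  array[N] int<lower=0, upper=1> y;          // at least one incident
  for (p in 1:P) {
    theta[p] = uniform_rng(0, 1);
  }
  for (n in 1:N) {
    int z = 0;                               // latent incident count Z_n
    for (p in 1:P) {
      X[n, p] = bernoulli_rng(0.5);          // arbitrary activity rate
      z += bernoulli_rng(X[n, p] * theta[p]);
    }
    y[n] = z > 0;                            // Y_n = 1{Z_n > 0}
  }
}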

EDIT

I should point out that I am aware you can do these computations rather directly as

model {
  real s;
  real t;
  theta ~ uniform(0, 1);                     // prior
  for (i in 1:N) {                           // likelihood
    s = 1 - 2 * y[i];                        // +1 when y[i] == 0, -1 when y[i] == 1
    t = s;                                   // carry that sign into the product
    for (j in 1:P) {
      t *= 1 - X[i, j] * theta[j];
    }
    t += y[i];                               // y[i] == 0: prod; y[i] == 1: 1 - prod
    target += log(t);
  }
}

I just worry that this approach will be too slow for large datasets and prone to numerical precision errors.

You can do the y = 0 calculation on the log scale and then use the log1m_exp() function to convert it to the y = 1 case. These composed functions are numerically stable.
For binary predictors, I’d use the int type instead of matrix (which holds continuous values), but that’s a matter of taste.

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> P;                            // number of activities
  array[N, P] int<lower=0, upper=1> X;       // activity occurrences
  array[N] int<lower=0, upper=1> y;          // at least one incident
}
parameters {
  vector<lower=0, upper=1>[P] theta;         // activity risks
}
model {
  theta ~ uniform(0, 1);                     // prior
  for (n in 1:N) {
    real s = 0.0;
    for (p in 1:P) {
      if (X[n,p] == 1) {
        s += log1m(theta[p]);
      }
    }
    if (y[n] == 1) {
      target += log1m_exp(s);
    } else {
      target += s;
    }
  }
}
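
If speed becomes a concern at scale, a further option (a sketch, untested, assuming X is kept as a matrix as in the original post) is to vectorize the inner loop as a matrix-vector product, since multiplying by the 0/1 entries of X selects exactly the log1m(theta) terms that the if statement above keeps:

model {
  vector[N] s = X * log1m(theta);            // s[n] = sum_p X[n, p] * log1m(theta[p])
  theta ~ uniform(0, 1);                     // prior
  for (n in 1:N) {
    if (y[n] == 1) {
      target += log1m_exp(s[n]);             // log P(at least one incident)
    } else {
      target += s[n];                        // log P(no incidents)
    }
  }
}

This wastes a little arithmetic on the zero entries of X, but it replaces the double loop with a single matrix-vector multiply, which Stan's math library handles much more efficiently than a scalar loop.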

Thank you @nhuurre. That was the piece that I was missing.