# Modelling the indicator that a sum of Bernoulli random variables is positive

I (a fairly new Stan user) am trying to generate parameter estimates for the following scenario:

- I have N observations and P predictors. Each predictor is binary, indicating whether a given activity was undertaken
- Each predictor has an associated, unknown risk \theta_j \in [0, 1] for j \in \{1, \ldots, P\}
- The event that an incident occurs during activity j for observation i is I_{i, j} \sim \text{Ber}(X_{i,j} \cdot \theta_j)
- The total number of incidents for observation i (which is not recorded) is Z_i = \sum_{j=1}^P I_{i, j}
- Finally, the only variable we observe is Y_i = \mathbb{1}\{Z_i > 0\}

I’ve calculated the pmf of Y_i to be

p_{Y_i}(y) = \begin{cases} 1 - \prod_{j=1}^P (1-X_{i,j} \cdot \theta_j) & y =1 \\ \prod_{j=1}^P (1-X_{i,j} \cdot \theta_j) & y = 0 \end{cases}
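(For completeness, this follows from independence of the I_{i, j} across j, which the setup above implies:

p_{Y_i}(0) = \Pr(Z_i = 0) = \prod_{j=1}^P \Pr(I_{i, j} = 0) = \prod_{j=1}^P (1 - X_{i,j} \cdot \theta_j),

and p_{Y_i}(1) = 1 - p_{Y_i}(0).)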

This is where I start to get stuck. If my understanding is correct, I want to increment target by the logarithm of this function. Taking the logarithm of the y = 0 case is simple, but things get messy for y = 1. Are there any tips on how best to implement this model in Stan? Moreover, since my dataset is quite large, an efficient solution would be appreciated.

For reference, my Stan code is

```stan
data {
  int<lower=0> N;                            // number of observations
  int<lower=0> P;                            // number of activities
  matrix[N, P] X;                            // activity occurrences
  vector<lower=0, upper=1>[N] y;             // at least one incident
}
parameters {
  vector<lower=0, upper=1>[P] theta;         // activity risks
}
model {
  theta ~ uniform(0, 1);                     // prior
  // How to modify target?
}
```


As for data, I am currently prototyping the model, so I am using random X and \theta.
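For concreteness, here is a sketch of how I simulate that data (illustrative Python, mirroring the generative process above; the names and dimensions are arbitrary):

```python
import numpy as np

# Hypothetical prototyping code for the generative process described above.
rng = np.random.default_rng(1)
N, P = 100, 4
theta = rng.uniform(0, 1, size=P)        # unknown activity risks
X = rng.integers(0, 2, size=(N, P))      # binary activity occurrences
I = rng.binomial(1, X * theta)           # incident indicators I_{i,j}
Z = I.sum(axis=1)                        # latent incident counts Z_i
y = (Z > 0).astype(int)                  # observed indicator Y_i
```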

EDIT

I should point out that I am aware you can do these computations rather directly as

```stan
model {
  real s;
  real t;
  theta ~ uniform(0, 1);                     // prior
  for (i in 1:N) {                           // likelihood
    s = 1 - 2 * y[i];                        // +1 if y[i] = 0, -1 if y[i] = 1
    t = 1;
    for (j in 1:P) {
      t *= 1 - X[i, j] * theta[j];
    }
    target += log(y[i] + s * t);             // log(t) if y[i] = 0, log(1 - t) if y[i] = 1
  }
}
```


I just worry that this approach will be too slow for large models and risk numerical precision errors.

You can do the y = 0 computation and then use the log1m_exp() function to convert it to the y = 1 case; these composed functions are numerically stable.
For binary predictors, I'd use the int type instead of matrix (which holds continuous values), but that's a matter of taste.

```stan
data {
  int<lower=0> N;                            // number of observations
  int<lower=0> P;                            // number of activities
  array[N, P] int<lower=0, upper=1> X;       // activity occurrences
  array[N] int<lower=0, upper=1> y;          // at least one incident
}
parameters {
  vector<lower=0, upper=1>[P] theta;         // activity risks
}
model {
  theta ~ uniform(0, 1);                     // prior
  for (n in 1:N) {
    real s = 0.0;
    for (p in 1:P) {
      if (X[n, p] == 1) {
        s += log1m(theta[p]);
      }
    }
    if (y[n] == 1) {
      target += log1m_exp(s);
    } else {
      target += s;
    }
  }
}
```
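As a quick sanity check (illustrative Python, not part of the Stan model), summing log1m(theta[p]) over the active predictors and applying log1m_exp agrees with the direct product formula; the example values below are arbitrary:

```python
import numpy as np

# Check that the stable route matches log(1 - prod(1 - x * theta)).
theta = np.array([0.05, 0.2, 0.15, 0.3])
x = np.array([1, 0, 1, 1])

direct = np.log(1.0 - np.prod(1.0 - x * theta))   # naive computation
s = np.log1p(-theta[x == 1]).sum()                # Stan's log1m(theta), summed over X = 1
stable = np.log1p(-np.exp(s))                     # Stan's log1m_exp(s)

print(np.isclose(direct, stable))  # True
```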


Thank you @nhuurre. That was the piece that I was missing.