Add an unknown discrete parameter following a Bernoulli distribution

Hello everyone,
I’d like to model the intra-host viral load of each individual in my population. However, my population contains both infected and uninfected individuals, and I can’t distinguish between them. So I want to estimate the infection status of each individual and the probability of being infected in the population. To do this, I draw a probability of being infected, Pinf ~ beta(2, 2), and then draw an infection status for each individual: to_vector(infected) ~ bernoulli(Pinf).

However, I’ve noticed that Stan can’t model discrete parameters and that you have to ‘marginalize discrete unknowns out of the posterior distribution’, but I don’t know how to do that for a process as simple as mine (all the posts I’ve been able to find deal with Poisson models, which are much more complex than my problem here).

Thanks in advance for the help!

Maxime

functions {
  // CtVinffun returns dCt, the delta Ct below Vinf:
  real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
    // Before infection:
    if (t <= tinf)
      return Vinf; // Ct = 0
    // Viral load rises before peak:
    else if (t >= tinf && t <= tp)
      return Vp*(t-tinf)/Tp; // Tp = tp-tinf
    // Viral load falls after peak:
    else // (t > tp)
      return Vp+(lod-Vp)*(t-tp)/Tc; // Tc = tc-tp
  }
}


data{
  int<lower=0> N; // Number of concatenated data points
  int<lower=0> n_id; // Number of individuals 
  int lod; // Limit of detection of Ct (Ct=10)
  int Vinf; // Ct delta from lod at infection (Ct=0) 
  int<lower=0> id[N]; // Vector marking which datum belongs to which id
  real t[N];  // Vector marking the time for each data point 
  real<lower=10, upper=50> y[N];  // Concatenated data vector 
  int censor[N]; // vector of censor indication
  real tSS[n_id]; // Vector of individual time of symptom onset 
  int nb_random; // Total number of parameters with a random effect
}

transformed data {
  
  real log_is_lod[N]; // log of "is this observation at the LOD value? 0 or 1"
  real log_is_obs[N]; // log of "is this observation below the LOD value? 0 or 1"
  
  for(i in 1:N){
    if(censor[i]==1){
      log_is_obs[i]=log(0);
      log_is_lod[i]=log(1);
    } 
    else{ 
      log_is_obs[i]=log(1);
      log_is_lod[i]=log(0);
    }
  }
}


parameters{
  
  // parameters of Vp 
  real<lower=10, upper=50> mu_Vp;
  
  // parameters of proliferation phase
  real<lower=0> mu_Tp;
  
  // parameters of clearance phase
  real<lower=0> mu_Tc;
  
  // parameters incubation period
  real<lower=0, upper=14> mu_Tincub;
  
  real<lower=0> sigma;
  
  matrix[n_id,nb_random] eta_tilde; // random effect for Vp, Tp, Tc, and Tincub
  
  real<lower=0> eta_omega[nb_random];
  
  // Proportion of infected individuals in the population
  real Pinf; 
  
  // Infection status for each individual
  int infected[n_id];
}


transformed parameters{
  
  real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
  
  matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
  
  real<lower=10, upper=50> Vp[n_id];
  real<lower=0> Tp[n_id];
  real<lower=0> Tc[n_id];
  real tp[n_id];
  real tinf[n_id];
  
  real<upper=50> Ct[N];
  
  real num_arg[N,2];
  
  for(j in 1:nb_random){
    eta[,j] = eta_tilde[,j]*eta_omega[j];
  }
  
  for(i in 1:n_id){
    
    Vp[i] = mu_Vp*exp(eta[i,1]);
    
    Tp[i] = mu_Tp*exp(eta[i,2]);
    
    Tc[i] = mu_Tc*exp(eta[i,3]);
    
    Tincub[i] = mu_Tincub*exp(eta[i,4]);
    
    tinf[i] = tSS[i] - Tincub[i];

    tp[i] = tinf[i] + Tp[i];
    
  }
  
  // Looping on each observation of the data set
  for(i in 1:N){
    
    // Prediction of the model
    Ct[i] = CtVinffun(t[i], tinf[id[i]], lod, Vinf, tp[id[i]], Tp[id[i]], Tc[id[i]], Vp[id[i]]);
    
    // LL for infected individuals //
    if (infected[id[i]] == 1){
      if (t[i] < tinf[id[i]]){ // Observations before the start of infection
        
        num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
        num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (true negative PCR test)
        
      } else{
        
        num_arg[i,1] = log_is_obs[i] + normal_lpdf(y[i] | Ct[i], sigma); // likelihood that Ctobs value < LOD (positive PCR test)
        num_arg[i,2] = log_is_lod[i] + normal_lcdf(10 | Ct[i], sigma); // likelihood that Ctobs value == LOD (negative PCR test)
      
      }
    // If the observation is not at the LOD value, num_arg[i,2] equals -inf and drops out of
    // the likelihood target += log_sum_exp(num_arg[i]),
    // because log(exp(log(0)) + exp(num_arg[i,1])) = num_arg[i,1].
    
    // LL for non-infected individuals //
    } else{ 
        num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
        num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (negative PCR test)

    }
  }
}


model{
  // Priors //
  
  mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
  mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
  mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
  mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
  
  Pinf ~ beta(2,2); // probability of being infected => proportion of infected individuals in the population
  to_vector(infected) ~ bernoulli(Pinf); // whether each individual is infected
  
  sigma ~ std_normal(); // measurement error
  
  to_vector(eta_tilde) ~ std_normal();
  
  to_vector(eta_omega) ~ normal(0.15, 0.05); // SD of random effects; 95% of density in [0.05, 0.25]
  
  // Likelihood (looped on each observation) //
  for(i in 1:N){
    target += log_sum_exp(num_arg[i]);
  }
}

Much like your current code, you can use log_sum_exp() to achieve this. The code below shows a very simplified version of it, where I assume you’ve calculated the person-specific log-likelihoods assuming they are infected and not infected.

for(i in 1:N){
  vector[2] lp;
  lp[1] = log(Pinf) + ll_if_infected[i];
  lp[2] = log(1 - Pinf) + ll_if_not_infected[i];
  
  target += log_sum_exp(lp);
}

And a few additional, unconnected thoughts…

You will likely want to use the updated array syntax. Rather than

real num_arg[N,2];

use

array[N,2] real num_arg;

I don’t have a perfect understanding of your model (so ignore this if I’m missing something crucial), but it looks like you can avoid a lot of unnecessary computation by calculating only the relevant likelihood for each observation in num_arg. The code below is equivalent to an if statement: if censor[i] == 1, it uses the log-likelihood associated with log_is_lod; otherwise, it uses the likelihood for log_is_obs. So you don’t need to calculate two values for each n of num_arg. You can also pre-calculate log(0.0002) and log(0.9998) in the transformed data block.

So it might look something like this. (note that I’ve removed the marginalization over lod/obs but added the marginalization over infected status).

transformed data{
    real log_0002 = log(0.0002);
    real log_9998 = log(0.9998);
    array[N] int use_obs_ll;  // 0 = lod; 1 = obs
    for(i in 1:N){
        if(censor[i] == 1){
            use_obs_ll[i] = 0;
        } else{
            use_obs_ll[i] = 1;
        }
    }
}
transformed parameters{
    // first row stores log-likelihoods if infected
    // second row stores log-likelihoods if not infected
    matrix[2,N] num_arg;

    num_arg[1, ] = rep_row_vector(log(Pinf), N);
    num_arg[2, ] = rep_row_vector(log(1 - Pinf), N);

    for(n in 1:N){
        if(t[n] < tinf[id[n]]){
            // Observations before the start of infection
            if(use_obs_ll[n] == 1){
                // Data are below the LOD value
                num_arg[1,n] += log_0002;
                num_arg[2,n] += log_0002;
            } else{
                // Data are at the LOD value
                num_arg[1,n] += log_9998;
                num_arg[2,n] += log_9998;
            }
        } else{
            // Observations after the start of infection
            if(use_obs_ll[n] == 1){
                // Data are below the LOD value
                num_arg[1,n] += normal_lpdf(y[n] | Ct[n], sigma);
                num_arg[2,n] += log_0002;
            } else{
                // Data are at the LOD value
                num_arg[1,n] += normal_lcdf(10 | Ct[n], sigma);
                num_arg[2,n] += log_9998;
            }
        }
    }
}
model{
    for(n in 1:N){
        target += log_sum_exp(num_arg[,n]);
    }
}

I’ll just add that you can simplify the code further and get a bit more performance using the log_mix built-in; i.e., @simonbrauer’s code is equivalent to:

for(i in 1:N){
  target += log_mix(Pinf, ll_if_infected[i], ll_if_not_infected[i]);
}

Always use log1m(Pinf) in these cases; it’s more numerically stable than log(1 - Pinf) for small values of Pinf.

You can simplify the increments to

num_arg[ , n] += log_0002;

And then vectorize all of this:

  for (i in 1:n_id) {
    Vp[i] = mu_Vp*exp(eta[i,1]);
    Tp[i] = mu_Tp*exp(eta[i,2]);
    Tc[i] = mu_Tc*exp(eta[i,3]);
    Tincub[i] = mu_Tincub*exp(eta[i,4]);
    tinf[i] = tSS[i] - Tincub[i];
    tp[i] = tinf[i] + Tp[i];
  }

to much more efficient code:

Vp = mu_Vp * exp(eta[ , 1]);
...
tp = tinf + Tp;

You can vectorize CtVinffun so it’s a one-liner, which will probably let you vectorize within the function. Anything that can get converted to vectorized operations will be a win if it involves parameters as we can usually speed up the autodiff.

If you precompute your tests, like (infected[id[i]] == 1) (equivalently just (infected[id[i]]) if it can only take values 0 and 1), you can vectorize all of the loops that branch on these conditions by just operating on the indices.

If you’re not doing custom inits and you have this:

mu_Vp ~ normal(25, 0.5);

then you want to declare mu_Vp as

real<offset=25, multiplier=0.5> mu_Vp;

which will standardize the scale and make our default inits and adaptation work much better in early iterations. This is even more important for the narrowly constrained ones like eta_omega.

You can move a bunch of things you declared as transformed parameters, like num_arg, into local variables in the model to save output space and make the posterior analysis faster.

P.S. I’d recommend more horizontal space around operators and braces, and less vertical space between lines. We have a style guide at the end of the User’s Guide.


Thank you all for your answers! I’ve now understood how to marginalise latent classes in the posterior with Stan! ;)
The model works very well even with a rather sparse dataset (very fast run, well-mixed chains, unbiased estimates). However, it doesn’t correspond exactly to what I’d like to model. I’d like a single probability of infection (Pinf) per individual, shared across all of that individual’s observations. With the model you proposed, the infection status is marginalized separately for each observation of the same individual, which isn’t logical: either the individual is really infected, or he isn’t. So I wrote the model below to take into account a single Pinf for all the tests of an individual. It runs, but very slowly on the same datasets with the same priors, and I get rather unstable results with more or less significant biases. Not all individuals have the same number of observations; this information is contained in the vector Ntest[n_id], so we only loop over the observed data for each individual. Could you have a look? It’s likely that I’ve written the model in a way that doesn’t conform to Stan, which could explain the current situation.

functions {
  // CtVinffun returns dCt, the delta Ct below Vinf:
  real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
    if (t <= tinf)
      return Vinf; // Ct = 0
    else if (t >= tinf && t <= tp)
      return Vp*(t-tinf)/Tp; // Viral load rises before peak: Tp = tp-tinf
    else // Viral load falls after peak: (t > tp)
      return Vp+(lod-Vp)*(t-tp)/Tc; // Tc = tc-tp
  }
}


data{
  int<lower=0> N; // Number of concatenated data points
  int<lower=0> n_id; // Number of individuals 
  int lod; // Limit of detection of Ct (Ct=10)
  int Vinf; // Ct delta from lod at infection (Ct=0) 
  int<lower=0> id[N]; // Vector marking which datum belongs to which id
  real tSS[n_id]; // Vector of individual time of symptom onset 
  int nb_random; // Total number of parameters with a random effect
  int Ntest[n_id]; // Number of observations for each individual
  int max_Ntest; // Number of maximum of repeated test in the population
  matrix[n_id,max_Ntest] time; // matrix of dim n_id x max_Ntest containing observation times
  matrix[n_id,max_Ntest] y;  // matrix of dim n_id x max_Ntest containing observed Ct values
  matrix[n_id,max_Ntest] censor; // matrix of dim n_id x max_Ntest containing censoring indicators
}

transformed data{
    real log_0002 = log(0.0002); // log proba of false positive
    real log_9998 = log(0.9998); // log proba of true negative
    matrix[n_id,max_Ntest] use_obs_ll;  // 0 = lod; 1 = obs
    for(i in 1:n_id){
      for (n in 1:Ntest[i]){
        if(censor[i,n] == 1){
            use_obs_ll[i,n] = 0;
        } else{
            use_obs_ll[i,n] = 1;
        }
      }
    }
}


parameters{
  real mu_Vp; // fixed effect of Vp 
  real<lower=0> mu_Tp; // fixed effect of Tp
  real<lower=0> mu_Tc; // fixed effect of Tc
  real<lower=0, upper=14> mu_Tincub; // fixed effect of Tincub
  real<lower=0> sigma; // additive measurement error
  matrix[n_id,nb_random] eta_tilde; // Gaussian to center random effect to 0
  real<lower=0> eta_omega[nb_random]; // Variance of random effect for Vp, Tp, Tc, and Tincub
  real<lower=0, upper=1> Pinf; // Proportion of infected individuals in the population
}


transformed parameters{
  real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
  matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
  real Vp[n_id]; // Vector of individual Vp
  real<lower=0> Tp[n_id]; // Vector of individual Tp
  real<lower=0> Tc[n_id]; // Vector of individual Tc
  row_vector[n_id] tp; // Vector of individual tp
  row_vector[n_id] tinf; // Vector of individual tinf
  matrix<upper=50>[n_id,max_Ntest] Ct; // matrix of dim n_id x max_Ntest containing the Ct prediction for each observation
  real num_arg[n_id, max_Ntest, 2]; // array containing individual LL contributions
  num_arg = rep_array(0.0, n_id, max_Ntest, 2); // Initializing num_arg to 0 (default was NaN)
  Ct = rep_matrix(0, n_id, max_Ntest); // Initializing Ct to 0 (default was NaN)
  matrix[n_id,2] ll_Pinf; // matrix of two columns: log(Pinf) and log1m(Pinf) for each individual
  ll_Pinf[,1] = rep_vector(log(Pinf), n_id); // filling the matrix
  ll_Pinf[,2] = rep_vector(log1m(Pinf), n_id); // filling the matrix
  for(j in 1:nb_random){
    eta[,j] = eta_tilde[,j]*eta_omega[j];
  }
  Vp = to_array_1d(mu_Vp * exp(eta[ ,1]));
  Tp = to_array_1d(mu_Tp * exp(eta[ ,2]));
  Tc = to_array_1d(mu_Tc * exp(eta[ ,3]));
  Tincub = to_array_1d(mu_Tincub * exp(eta[ ,4]));
  tinf = to_row_vector(tSS) + to_row_vector(Tincub);
  tp = to_row_vector(tinf) + to_row_vector(Tp);
    for(i in 1:n_id){  // Looping on each individual
      for (n in 1:Ntest[i]){  // Each observation of the individual
        Ct[i,n] = CtVinffun(time[i,n], tinf[i], lod, Vinf, tp[i], Tp[i], Tc[i], Vp[i]); // Prediction of the model
        if(time[i,n] < tinf[i]){ // Observations before the start of infection
            if(use_obs_ll[i,n] == 1){  // Data are uncensored
                num_arg[i,n,1] += log_0002 + ll_Pinf[i,1]; // proba of false positive
                num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // proba of false positive
            } else{// Data are censored
                num_arg[i,n,1] += log_9998 + ll_Pinf[i,1]; // proba of true negative
                num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // proba of true negative
            }
        } else{ // Observations after the start of infection
            if(use_obs_ll[i,n] == 1){ // Data are uncensored
                num_arg[i,n,1] += normal_lpdf(y[i,n] | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected 
                num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // If not infected, proba of false positive 
            } else{ // Data are censored
                num_arg[i,n,1] += normal_lcdf(10 | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected 
                num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // If not infected, proba of true negative 
            }
        }
    }
  }
}

model{
  // Priors //
  mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
  mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
  mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
  mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
  Pinf ~ beta(2,2); // probability to be infected => proportion of infected individuals in the population
  sigma ~ std_normal(); // measurement error
  to_vector(eta_tilde) ~ std_normal();
  to_vector(eta_omega) ~ normal(0.15, 0.05); // SD of random effects; 95% of density in [0.05, 0.25]
  for(i in 1:n_id){ // Likelihood (looped on each observation of each individual)
    for (n in 1:Ntest[i]){
      target += log_sum_exp(num_arg[i,n,1], num_arg[i,n,2]);
    }
  }
}

PS : @Bob_Carpenter I’d like to thank you for all your advice and tips on optimising my code. I’ll look into it, but only once I’ve got a model that works the way I want it to.

Thank you in advance,
Maxime

Piecewise functions like your CtVinffun can be problematic for sampling unless you can guarantee the result is continuously differentiable.

I would try to write censoring so it doesn’t use dummy values. You don’t need use_obs_ll as it’s just 1 - censor.

That’s the right thing to do. There’s a chapter in the User’s Guide on ragged data—we keep meaning to add it as a built-in, but it’s hard to design, code, and plumb through all of our I/O.


Finally, it seems that the problem stemmed from the definition of the infection start time: tinf = tSS - Tincub (and not with a +); the infection comes before the symptoms, not after.

After correcting this typo in the script, the model runs very well. I get unbiased results, with well-mixed chains, on simulated datasets of 100 infected and 100 uninfected individuals. The data are very sparse, with 75% of individuals having 1 observation, 20% having 2 observations, and 5% having 3 or 4 observations. Despite this sparsity, the inference is satisfactory.

My final concern is the feasibility of using the model in my context, particularly in terms of computation time. I would like to use this model on a dataset of 1 million individuals (~1.3 million data points), as sparse as the one simulated previously. For 200 individuals, the current script takes between 15 and 25 minutes (see histogram) to run on a computing cluster (200 warm-up iterations, 200 sampling iterations, adapt_delta=0.95 and max_treedepth=15). The 200 warm-up iterations take 80% of the computation time and the remaining 20% goes to sampling; is this normal? I’m now running on a dataset of 10,000 infected and 10,000 uninfected individuals and the model seems to run forever (in 15 hours, it still hasn’t completed 40 warm-up iterations). So I can’t imagine how long it would take for 1 million individuals.

Apart from vectorising the code and intra-chain parallelisation (I only have a maximum of 16 cores available for 4 chains, so 4 cores per chain), is my model still feasible on a dataset of 1 million individuals in Stan?

I am adding @charlesm93 to the discussion, with whom I had already chatted a few months ago on this subject. Thanks again @Bob_Carpenter for your valuable advice.

This all depends on your model. I would say, though, that max_treedepth=15 is going to be problematic. That allows up to 2^15 = 32,768 leapfrog steps per iteration, each requiring a log density and gradient evaluation. In most cases, the root cause is some bad geometry in hierarchical models, which a non-centered parameterization can fix. In other cases, it’s much more complicated.

I would recommend trying the PyMC sampler Nutpie; it’s specifically faster than Stan in warmup. The other thing to try would be to run Pathfinder, our variational inference method, and use draws from that to initialize MCMC. Pathfinder can fail during optimization if the geometry’s too tricky.

At the very least, I would parameterize things like mu_Vp, which has a normal(25, 0.5) distribution as

real<offset=25, multiplier=0.5> mu_Vp;

This will cause initialization to be in the range (24, 26), i.e. 25 + 0.5 * (-2, 2), instead of (-2, 2). This can help immensely with initialization if the prior’s reasonable.

How did you derive the 0.0002 and 0.9998 numbers?

The other thing to do with 1M individuals is to try to group them somehow and reduce everything to sufficient statistics. This may or may not be possible depending on the size of your model.

Sounds like that’s a hard no unless you can reparameterize and do some data reduction. For example, you can do some data reduction this way:

to_vector(infected) ~ bernoulli(Pinf);

is equivalent to

sum(infected) ~ binomial(n_id, Pinf);

But I’m really confused how you have integer parameters!