Add an unknown discrete parameter following a Bernoulli distribution

Hello everyone,
I’d like to model the intra-host viral load of each individual in my population. However, my population contains both infected and uninfected individuals, and I can’t distinguish between them. So I want to estimate the infection status of each individual and the probability of being infected in the population. To do this, I draw a probability of being infected, Pinf ~ beta(2, 2), and then draw an infection status for each individual: to_vector(infected) ~ bernoulli(Pinf).

However, I’ve noticed that Stan can’t model discrete parameters and that you have to ‘marginalize discrete unknowns out of the posterior distribution’, but I don’t know how to do that for a process as simple as mine (all the posts I’ve been able to find deal with Poisson models, which are much more complex than my problem here).

Thanks in advance for the help!

Maxime

functions {
  // CtVinffun returns dCt, the delta Ct below Vinf:
  real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
    // Before infection:
    if (t <= tinf)
      return Vinf; // Ct = 0
    // Viral load rises before peak:
    else if (t >= tinf && t <= tp)
      return Vp*(t-tinf)/Tp; // Tp = tp-tinf
    // Viral load falls after peak:
    else // (t > tp)
      return Vp+(lod-Vp)*(t-tp)/Tc; // Tc = tc-tp
  }
}


data{
  int<lower=0> N; // Number of concatenated data points
  int<lower=0> n_id; // Number of individuals 
  int lod; // Limit of detection of Ct (Ct=10)
  int Vinf; // Ct delta from lod at infection (Ct=0) 
  int<lower=0> id[N]; // Vector marking which datum belongs to which id
  real t[N];  // Vector marking the time for each data point 
  real<lower=10, upper=50> y[N];  // Concatenated data vector 
  int censor[N]; // vector of censor indication
  real tSS[n_id]; // Vector of individual time of symptom onset 
  int nb_random; // Total number of parameters with a random effect
}

transformed data {
  
  real log_is_lod[N]; // log of "is this observation at the LOD value? 0 or 1"
  real log_is_obs[N]; // log of "is this observation below the LOD value? 0 or 1"
  
  for(i in 1:N){
    if(censor[i]==1){
      log_is_obs[i]=log(0);
      log_is_lod[i]=log(1);
    } 
    else{ 
      log_is_obs[i]=log(1);
      log_is_lod[i]=log(0);
    }
  }
}


parameters{
  
  // parameters of Vp 
  real<lower=10, upper=50> mu_Vp;
  
  // parameters of proliferation phase
  real<lower=0> mu_Tp;
  
  // parameters of clearance phase
  real<lower=0> mu_Tc;
  
  // parameters incubation period
  real<lower=0, upper=14> mu_Tincub;
  
  real<lower=0> sigma;
  
  matrix[n_id,nb_random] eta_tilde; // random effect for Vp, Tp, Tc, and Tincub
  
  real<lower=0> eta_omega[nb_random];
  
  // Proportion of infected individuals in the population
  real Pinf; 
  
  // Infection status for each individual
  int infected[n_id];
}


transformed parameters{
  
  real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
  
  matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
  
  real<lower=10, upper=50> Vp[n_id];
  real<lower=0> Tp[n_id];
  real<lower=0> Tc[n_id];
  real tp[n_id];
  real tinf[n_id];
  
  real<upper=50> Ct[N];
  
  real num_arg[N,2];
  
  for(j in 1:nb_random){
    eta[,j] = eta_tilde[,j]*eta_omega[j];
  }
  
  for(i in 1:n_id){
    
    Vp[i] = mu_Vp*exp(eta[i,1]);
    
    Tp[i] = mu_Tp*exp(eta[i,2]);
    
    Tc[i] = mu_Tc*exp(eta[i,3]);
    
    Tincub[i] = mu_Tincub*exp(eta[i,4]);
    
    tinf[i] = tSS[i] - Tincub[i];

    tp[i] = tinf[i] + Tp[i];
    
  }
  
  // Looping on each observation of the data set
  for(i in 1:N){
    
    // Prediction of the model
    Ct[i] = CtVinffun(t[i], tinf[id[i]], lod, Vinf, tp[id[i]], Tp[id[i]], Tc[id[i]], Vp[id[i]]);
    
    // LL for infected individuals //
    if (infected[id[i]] == 1){
      if (t[i] < tinf[id[i]]){ // Observations before the start of infection
        
        num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
        num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (true negative PCR test)
        
      } else{
        
        num_arg[i,1] = log_is_obs[i] + normal_lpdf(y[i] | Ct[i], sigma); // likelihood that Ctobs value < LOD (positive PCR test)
        num_arg[i,2] = log_is_lod[i] + normal_lcdf(10 | Ct[i], sigma); // likelihood that Ctobs value == LOD (negative PCR test)
      
      }
    // If the observation is not at the LOD value, num_arg[i,2] equals -inf and drops out of
    // the likelihood target += log_sum_exp(num_arg[i]),
    // because log(exp(log(0)) + exp(num_arg[i,1])) = num_arg[i,1].
    
    // LL for non-infected individuals //
    } else{ 
        num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
        num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (negative PCR test)

    }
  }
}


model{
  // Priors //
  
  mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
  mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
  mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
  mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
  
  Pinf ~ beta(2,2); // probability of being infected => proportion of infected individuals in the population
  to_vector(infected) ~ bernoulli(Pinf); // whether each individual is infected
  
  sigma ~ std_normal(); // measurement error
  
  to_vector(eta_tilde) ~ std_normal();
  
  to_vector(eta_omega) ~ normal(0.15, 0.05); // SD of random effects; 95% of density in [0.05, 0.25]
  
  // Likelihood (looped on each observation) //
  for(i in 1:N){
    target += log_sum_exp(num_arg[i]);
  }
}

Much like your current code, you can use log_sum_exp() to achieve this. The code below shows a very simplified version of it, where I assume you’ve calculated the person-specific log-likelihoods assuming they are infected and not infected.

for(i in 1:N){
  vector[2] lp;
  lp[1] = log(Pinf) + ll_if_infected[i];
  lp[2] = log(1 - Pinf) + ll_if_not_infected[i];
  
  target += log_sum_exp(lp);
}

And a few additional, unconnected thoughts…

You will likely want to use the updated array syntax. Rather than

real num_arg[N,2];

use

array[N,2] real num_arg;

I don’t have a perfect understanding of your model (so ignore this if I’m missing something crucial), but it looks like you can avoid a lot of unnecessary computation by calculating only the relevant likelihood for each observation in num_arg. The code below is equivalent to an if statement: if censor[i] == 1, it uses the log-likelihood associated with log_is_lod; otherwise, it uses the likelihood for log_is_obs. So you don’t need to calculate two values for each n of num_arg. You can also pre-calculate log(0.0002) and log(0.9998) in the transformed data block.

So it might look something like this. (note that I’ve removed the marginalization over lod/obs but added the marginalization over infected status).

transformed data{
    real log_0002 = log(0.0002);
    real log_9998 = log(0.9998);
    array[N] int use_obs_ll;  // 0 = lod; 1 = obs
    for(i in 1:N){
        if(censor[i] == 1){
            use_obs_ll[i] = 0;
        } else{
            use_obs_ll[i] = 1;
        }
    }
}
transformed parameters{
    // first row stores log-likelihoods if infected
    // second row stores log-likelihoods if not infected
    matrix[2,N] num_arg;

    num_arg[1, ] = rep_row_vector(log(Pinf), N);
    num_arg[2, ] = rep_row_vector(log(1 - Pinf), N);

    for(n in 1:N){
        if(t[n] < tinf[id[n]]){
            // Observations before the start of infection
            if(use_obs_ll[n] == 1){
                // Data are below the LOD value
                num_arg[1,n] += log_0002;
                num_arg[2,n] += log_0002;
            } else{
                // Data are at the LOD value
                num_arg[1,n] += log_9998;
                num_arg[2,n] += log_9998;
            }
        } else{
            // Observations after the start of infection
            if(use_obs_ll[n] == 1){
                // Data are below the LOD value
                num_arg[1,n] += normal_lpdf(y[n] | Ct[n], sigma);
                num_arg[2,n] += log_0002;
            } else{
                // Data are at the LOD value
                num_arg[1,n] += normal_lcdf(10 | Ct[n], sigma);
                num_arg[2,n] += log_9998;
            }
        }
    }
}
model{
    for(n in 1:N){
        target += log_sum_exp(num_arg[,n]);
    }
}

I’ll just add that you can simplify the code further and get a bit more performance using the log_mix built-in; i.e., @simonbrauer’s code is equivalent to:

for(i in 1:N){
  target += log_mix(Pinf, ll_if_infected[i], ll_if_not_infected[i]);
}

Always use log1m(Pinf) in these cases; it’s more numerically stable than log(1 - Pinf) for small values of Pinf.

You can simplify the increments to

num_arg[ , n] += log_0002;

And then vectorize all of this:

  for (i in 1:n_id) {
    Vp[i] = mu_Vp*exp(eta[i,1]);
    Tp[i] = mu_Tp*exp(eta[i,2]);
    Tc[i] = mu_Tc*exp(eta[i,3]);
    Tincub[i] = mu_Tincub*exp(eta[i,4]);
    tinf[i] = tSS[i] - Tincub[i];
    tp[i] = tinf[i] + Tp[i];
  }

to much more efficient code:

Vp = mu_Vp * exp(eta[ , 1]);
...
tp = tinf + Tp;

You can vectorize CtVinffun so it’s a one-liner, which will probably let you vectorize within the function. Anything that can get converted to vectorized operations will be a win if it involves parameters as we can usually speed up the autodiff.

If you precompute your tests, like (infected[id[i]] == 1) (equivalently just (infected[id[i]]) if it can only take values 0 and 1), you can vectorize all of the loops that branch on these conditions by just operating on the indices.

If you’re not doing custom inits and you have this:

mu_Vp ~ normal(25, 0.5);

then you want to declare mu_Vp as

real<offset=25, multiplier=0.5> mu_Vp;

which will standardize the scale and make our default inits and adaptation work much better in early iterations. This is even more important for the narrowly constrained ones like eta_omega.

You can move a bunch of things you declared as transformed parameters, like num_arg, into local variables in the model to save output space and make the posterior analysis faster.

P.S. I’d recommend more horizontal space around operators and braces, and less vertical space between lines. We have a style guide at the end of the User’s Guide.


Thank you all for your answers! I’ve now understood how to marginalise latent classes in the posterior with Stan! ;)
The model works very well even with a rather sparse dataset (very fast run, well-mixed chains, unbiased estimates). However, it doesn’t correspond exactly to what I’d like to model. I’d like a single probability of infection (Pinf) per individual, shared across all of that individual’s observations. With the model you proposed, the infection status is marginalized separately for each observation of the same individual, which isn’t logical: either the individual is really infected, or he isn’t. So I wrote the model below to take into account a single Pinf for all the tests of an individual. It runs, but very slowly on the same datasets with the same priors, and I get rather unstable results with more or less significant biases. Not all individuals have the same number of observations; this information is contained in the vector Ntest[n_id], so we only loop over the observed data for each individual. Could you have a look? It’s likely that I’ve written the model in a way that doesn’t conform to Stan, which could explain the current situation.

functions {
  // CtVinffun returns dCt, the delta Ct below Vinf:
  real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
    if (t <= tinf)
      return Vinf; // Ct = 0
    else if (t >= tinf && t <= tp)
      return Vp*(t-tinf)/Tp; // Viral load rises before peak: Tp = tp-tinf
    else // Viral load falls after peak: (t > tp)
      return Vp+(lod-Vp)*(t-tp)/Tc; // Tc = tc-tp
  }
}


data{
  int<lower=0> N; // Number of concatenated data points
  int<lower=0> n_id; // Number of individuals 
  int lod; // Limit of detection of Ct (Ct=10)
  int Vinf; // Ct delta from lod at infection (Ct=0) 
  int<lower=0> id[N]; // Vector marking which datum belongs to which id
  real tSS[n_id]; // Vector of individual time of symptom onset 
  int nb_random; // Total number of parameters with a random effect
  int Ntest[n_id]; // Number of observations for each individual
  int max_Ntest; // Number of maximum of repeated test in the population
  matrix[n_id,max_Ntest] time; // matrix of dim n_id x max_Ntest containing observation times
  matrix[n_id,max_Ntest] y;  // matrix of dim n_id x max_Ntest containing observed Ct values
  matrix[n_id,max_Ntest] censor; // matrix of dim n_id x max_Ntest containing censoring indicators
}

transformed data{
    real log_0002 = log(0.0002); // log proba of false positive
    real log_9998 = log(0.9998); // log proba of true negative
    matrix[n_id,max_Ntest] use_obs_ll;  // 0 = lod; 1 = obs
    for(i in 1:n_id){
      for (n in 1:Ntest[i]){
        if(censor[i,n] == 1){
            use_obs_ll[i,n] = 0;
        } else{
            use_obs_ll[i,n] = 1;
        }
      }
    }
}


parameters{
  real mu_Vp; // fixed effect of Vp 
  real<lower=0> mu_Tp; // fixed effect of Tp
  real<lower=0> mu_Tc; // fixed effect of Tc
  real<lower=0, upper=14> mu_Tincub; // fixed effect of Tincub
  real<lower=0> sigma; // additive measurement error
  matrix[n_id,nb_random] eta_tilde; // Gaussian to center random effect to 0
  real<lower=0> eta_omega[nb_random]; // Variance of random effect for Vp, Tp, Tc, and Tincub
  real<lower=0, upper=1> Pinf; // Proportion of infected individuals in the population
}


transformed parameters{
  real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
  matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
  real Vp[n_id]; // Vector of individual Vp
  real<lower=0> Tp[n_id]; // Vector of individual Tp
  real<lower=0> Tc[n_id]; // Vector of individual Tc
  row_vector[n_id] tp; // Vector of individual tp
  row_vector[n_id] tinf; // Vector of individual tinf
  matrix<upper=50>[n_id,max_Ntest] Ct; // matrix of dim n_id x max_Ntest containing the Ct prediction for each observation
  real num_arg[n_id, max_Ntest, 2]; // array containing individual LL contributions
  num_arg = rep_array(0.0, n_id, max_Ntest, 2); // Initializing num_arg to 0 (default was NaN)
  Ct = rep_matrix(0, n_id, max_Ntest); // Initializing Ct to 0 (default was NaN)
  matrix[n_id,2] ll_Pinf; // matrix of two columns: log(Pinf) and log1m(Pinf) for each individual
  ll_Pinf[,1] = rep_vector(log(Pinf), n_id); // filling the matrix
  ll_Pinf[,2] = rep_vector(log1m(Pinf), n_id); // filling the matrix
  for(j in 1:nb_random){
    eta[,j] = eta_tilde[,j]*eta_omega[j];
  }
  Vp = to_array_1d(mu_Vp * exp(eta[ ,1]));
  Tp = to_array_1d(mu_Tp * exp(eta[ ,2]));
  Tc = to_array_1d(mu_Tc * exp(eta[ ,3]));
  Tincub = to_array_1d(mu_Tincub * exp(eta[ ,4]));
  tinf = to_row_vector(tSS) + to_row_vector(Tincub);
  tp = to_row_vector(tinf) + to_row_vector(Tp);
    for(i in 1:n_id){  // Looping on each individual
      for (n in 1:Ntest[i]){  // Each observation of the individual
        Ct[i,n] = CtVinffun(time[i,n], tinf[i], lod, Vinf, tp[i], Tp[i], Tc[i], Vp[i]); // Prediction of the model
        if(time[i,n] < tinf[i]){ // Observations before the start of infection
            if(use_obs_ll[i,n] == 1){  // Data are uncensored
                num_arg[i,n,1] += log_0002 + ll_Pinf[i,1]; // proba of false positive
                num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // proba of false positive
            } else{// Data are censored
                num_arg[i,n,1] += log_9998 + ll_Pinf[i,1]; // proba of true negative
                num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // proba of true negative
            }
        } else{ // Observations after the start of infection
            if(use_obs_ll[i,n] == 1){ // Data are uncensored
                num_arg[i,n,1] += normal_lpdf(y[i,n] | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected 
                num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // If not infected, proba of false positive 
            } else{ // Data are censored
                num_arg[i,n,1] += normal_lcdf(10 | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected 
                num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // If not infected, proba of true negative 
            }
        }
    }
  }
}

model{
  // Priors //
  mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
  mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
  mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
  mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
  Pinf ~ beta(2,2); // probability to be infected => proportion of infected individuals in the population
  sigma ~ std_normal(); // measurement error
  to_vector(eta_tilde) ~ std_normal();
  to_vector(eta_omega) ~ normal(0.15, 0.05); // SD of random effects; 95% of density in [0.05, 0.25]
  for(i in 1:n_id){ // Likelihood (looped on each observation of each individual)
    for (n in 1:Ntest[i]){
      target += log_sum_exp(num_arg[i,n,1], num_arg[i,n,2]);
    }
  }
}

PS : @Bob_Carpenter I’d like to thank you for all your advice and tips on optimising my code. I’ll look into it, but only once I’ve got a model that works the way I want it to.

Thank you in advance,
Maxime

Piecewise functions like your CtVinffun can be problematic for sampling unless you can guarantee the result is continuously differentiable.

I would try to write censoring so it doesn’t use dummy values. You don’t need use_obs_ll as it’s just 1 - censor.

That’s the right thing to do. There’s a chapter in the User’s Guide on ragged data—we keep meaning to add it as a built-in, but it’s hard to design, code, and plumb through all of our I/O.


Finally, it seems that the problem stemmed from the definition of the infection start time: tinf = tSS - Tincub (and not with a +); the infection comes before the symptoms, not after.

After correcting this typo in the script, the model runs very well. I get unbiased results, with well-mixed chains, on simulated datasets of 100 infected and 100 uninfected individuals. The data are very sparse, with 75% of individuals having 1 observation, 20% having 2 observations, and 5% having 3 or 4 observations. Despite this sparsity, the inference is satisfactory.

My final concern is the feasibility of using the model in my context, particularly in terms of computation time. I would like to use this model on a dataset of 1 million individuals (~1.3 million data points), as sparse as the one simulated previously. For 200 individuals, the current script takes between 15 and 25 minutes (see histogram) to run on a computing cluster (200 warm-up iterations, 200 sampling iterations, adapt_delta=0.95 and max_treedepth=15). The 200 warm-up iterations take 80% of the computation time and the remaining 20% goes to sampling; is this normal? I’m now running on a dataset of 10,000 infected and 10,000 uninfected individuals and the model seems to run forever (in 15 hours, it still hasn’t completed 40 warm-up iterations). So I can’t imagine how long it would take for 1 million individuals.

Apart from vectorising the code and intra-chain parallelisation (I only have a maximum of 16 cores available for 4 chains, so 4 cores per chain), is my model still feasible on a dataset of 1 million individuals in Stan?

I am adding @charlesm93 to the discussion, with whom I had already chatted a few months ago on this subject. Thanks again @Bob_Carpenter for your valuable advice.

This all depends on your model. I would say, though, that max_treedepth=15 is going to be problematic. That allows up to 2^15 = 32,768 leapfrog steps per iteration, each requiring a log density and gradient evaluation. In most cases, the root cause is some bad geometry in hierarchical models, which a non-centered parameterization can fix. In other cases, it’s much more complicated.

I would recommend trying the PyMC sampler Nutpie; it’s specifically faster than Stan in warmup. The other thing to try would be to run Pathfinder, our variational inference method, and use draws from that to initialize MCMC. Pathfinder can fail during optimization if the geometry’s too tricky.

At the very least, I would parameterize things like mu_Vp, which has a normal(25, 0.5) distribution as

real<offset=25, multiplier=0.5> mu_Vp;

This will cause initialization to be in the range (24, 26), i.e. 25 + 0.5 * (-2, 2), instead of (-2, 2). This can help immensely with initialization if the prior’s reasonable.

How did you derive the 0.0002 and 0.9998 numbers?

The other thing to do with 1M individuals is to try to group them somehow and reduce everything to sufficient statistics. This may or may not be possible depending on the size of your model.

Sounds like that’s a hard no unless you can reparameterize and do some data reduction. For example, you can do some data reduction this way:

to_vector(infected) ~ bernoulli(Pinf);

is equivalent to

sum(infected) ~ binomial(n_id, Pinf);

But I’m really confused how you have integer parameters!