Piecewise linear regression with horizontal shifts

Hi all,

I’ve been trying to simplify and modify the model described here (see section “2.4 Viral load and infectiousness time course”).

From what I understand, this is a piecewise linear model fit to time-series data from different individuals. The model is characterized by a steep linear increase followed by a slower linear decrease, with the peak (i.e., the intercept) coinciding with day 0. One special characteristic is that only the number of days between an individual’s data points is known, not where exactly they are placed in time, so the model has a shift parameter for each time series to estimate its positioning on the x-axis.

I’ve edited the model by removing all non-essential parameters that can affect the slopes and intercept (like age, gender, hospitalisation etc.). Also, I have replaced the imputation of censored data points by integrating them out (see here, section “Integrating out Censored Values”) and have modified some priors slightly.

I am simulating fake data and trying to fit it. This works reasonably well (although the Rhat values are not great); however, I have noticed that the model does not place many data points around day 0. You can see this in the plots below, which show, from top to bottom: 1) the estimated trajectories for each individual in blue and the mean trajectory in red, 2) the estimated positioning of the corresponding data points, 3) the true trajectories and 4) the true positioning of the data points.

When I make histograms of the estimated placements in time vs the true ones, you can clearly see the dip around day 0 in the estimation.

I am wondering why this happens and whether there is anything I can try to mitigate the problem. Is this just a hard area to fit because of the sudden change from up-slope to down-slope, i.e., because small time shifts can lead to big changes in the likelihood? Any other criticism of the model is also welcome (see Stan code below).

Thanks in advance!


data {
  // Data for time course analysis
  int N_DAY_pos;                                     // number of data points with log10 viral load > 2 (i.e. positive PCRs)
  vector<lower=2>[N_DAY_pos] Y_DAY_pos;              // outcome log10 viral load with values > 2
  vector[N_DAY_pos] X_DAY_pos;                       // day of measurement in time series for positive PCRs (with day 0 assigned to an individual's first PCR)
  int G;                                             // number of individuals
  int G_neg;                                         // number of individuals with negative PCR tests (i.e. censored data points)
  int gstart_DAY_pos[G];                             // first day in time series (positive PCRs) for each individual wrt X_DAY_pos
  int gend_DAY_pos[G];                               // last day in time series (positive PCRs) for each individual wrt X_DAY_pos
  int N_NegTests;                                    // total number of negative test results (i.e. censored data points)
  vector[N_NegTests] X_DAY_neg;                      // day of measurement in time series for negative PCRs (i.e. censored data points)
  int N_NegTests_individual[G];                      // a vector containing the number of negative PCR tests for each individual
  int gstart_DAY_neg[G_neg];                         // first day in time series (negative PCRs) for each individual wrt X_DAY_neg
  int gend_DAY_neg[G_neg];                           // last day in time series (negative PCRs) for each individual wrt X_DAY_neg
  int condition_on_data;                             // 1 for parameter estimation, 0 for prior predictive plots
}


parameters {
  real<upper=min(Y_DAY_pos)> L; // limit for censoring
  
  // grand means for key model parameters
  real log_slope_up_mu;
  real log_slope_down_mu;
  real log_intercept_mu;

  // individual random effects for key model parameters
  real<lower=0> intercept_sigma;
  vector[G] intercept_raw; 
  real<lower=0> slope_up_sigma;
  vector[G] slope_up_raw;
  real<lower=0> slope_down_sigma;
  vector[G] slope_down_raw;
  real<lower=0> sigma_sigma;
  vector[G] sigma_raw;
  
  // smoothness parameter for logistic weight function
  real<lower=5> beta_sweight_mu;
  
  // shifts for individuals
  vector<lower=-15,upper=20>[G] shift;
  
  // error variance
  real sigma_mu; // mean
}

transformed parameters {
  // error variance
  vector[G] sigma_sigma_times_raw = sigma_sigma * sigma_raw;
  vector[G] sigma = exp(sigma_mu + sigma_sigma_times_raw);
  // vectors for individual level parameters
  vector<lower=0>[G] intercept;
  vector<lower=0>[G] slope_up;
  vector<upper=0>[G] slope_down;
  vector<lower=0>[G] time2peak;
  { // calculate model parameters
     vector[G] intercept_individual = intercept_sigma * intercept_raw;
     vector[G] slope_up_individual = slope_up_sigma * slope_up_raw;
     vector[G] slope_down_individual = slope_down_sigma * slope_down_raw;
     vector[G] log_intercept = 
                 log_intercept_mu + // intercept (fixed effect [FE], group level mean)
                 intercept_individual; // individual level random effects
     vector[G] log_slope_up = 
                 log_slope_up_mu +
                 slope_up_individual; 
     vector[G] log_slope_down = 
                 log_slope_down_mu + 
                 slope_down_individual;
                 
     // apply link function
     intercept = exp(log_intercept);
     slope_up = exp(log_slope_up);
     slope_down = -exp(log_slope_down);
  }
  time2peak = intercept ./ slope_up;
}


model {
  // DAY
  log_intercept_mu ~ normal(2.1,.1); 
  log_slope_up_mu ~ normal(.8,.25); 
  log_slope_down_mu ~ normal(-1.5,.5); 
  beta_sweight_mu ~ normal(10,1); 
  
  intercept_sigma ~ gamma(4,10);
  intercept_raw ~ std_normal();
  
  slope_up_sigma ~ gamma(4,10);
  slope_up_raw ~ std_normal();
  
  slope_down_sigma ~ gamma(4,10);
  slope_down_raw ~ std_normal();
  
  sigma_mu ~ std_normal();
  sigma_sigma ~ std_normal();
  sigma_raw ~ std_normal();

  if (condition_on_data == 1) {
    // Counter of individuals that contain negative PCR tests (i.e. censored data points).
    int g2 = 0;

    // iterating over individuals
    for (g in 1:G) {
      // Positive PCR tests
      int N_g = gend_DAY_pos[g] - gstart_DAY_pos[g] + 1; // Number of positive PCRs
      vector[N_g] DAY_shifted1 = X_DAY_pos[gstart_DAY_pos[g]:gend_DAY_pos[g]] + shift[g];
      vector[N_g] weight_down1 = inv_logit(DAY_shifted1 * beta_sweight_mu);
      vector[N_g] est_up1 = intercept[g] + slope_up[g]*DAY_shifted1;
      vector[N_g] est_down1 = intercept[g] + slope_down[g]*DAY_shifted1;
      vector[N_g] yhat1 = (1-weight_down1) .* est_up1 + weight_down1 .* est_down1;
      
      target += normal_lpdf(Y_DAY_pos[gstart_DAY_pos[g]:gend_DAY_pos[g]] | yhat1, sigma[g]);
      
      // Negative PCR tests (censored data points)
      int N_neg_tests = N_NegTests_individual[g];
      if (N_neg_tests > 0){
        g2 = g2 + 1;
        
        vector[N_neg_tests] DAY_shifted1_neg = X_DAY_neg[gstart_DAY_neg[g2]:gend_DAY_neg[g2]] + shift[g];
        vector[N_neg_tests] weight_down1_neg = inv_logit(DAY_shifted1_neg * beta_sweight_mu);
        vector[N_neg_tests] est_up1_neg = intercept[g] + slope_up[g]*DAY_shifted1_neg;
        vector[N_neg_tests] est_down1_neg = intercept[g] + slope_down[g]*DAY_shifted1_neg;
        vector[N_neg_tests] yhat1_neg = (1-weight_down1_neg) .* est_up1_neg + weight_down1_neg .* est_down1_neg;
        
        target += normal_lcdf(rep_vector(L, N_neg_tests) | yhat1_neg, sigma[g]); // integrating out censored data points
      }
    }
  }
}

generated quantities {
  real slope_up_mu = exp(log_slope_up_mu);
  real slope_down_mu = exp(log_slope_down_mu);
  real intercept_mu = exp(log_intercept_mu);
  real time2peak_mu = intercept_mu/slope_up_mu;
}


It’s actually log-linear, which is much better. That’s because a log-linear regression corresponds to a one-compartment pharmacokinetic clearance model (i.e., exponential decay in load).
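
To spell out the correspondence: a one-compartment clearance model assumes the load decays at a rate proportional to itself,

\frac{dV}{dt} = -k V \quad\Rightarrow\quad V(t) = V_0 e^{-k t} \quad\Rightarrow\quad \log V(t) = \log V_0 - k t,

which is exactly a linear model on the log scale.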

That’s going to be due to how you simulate, not due to inference. I don’t understand what you’re plotting with “estimated days of sampling”. Aren’t you just simulating the tests?

When I fit something like this, I constrained the slope down to be negative and the log intercept to also be negative, so that exponentiating them gave a probability of a positive test. That’s easier than what you’re doing, which is exponentiating one parameter and negating the exponential of the other. Then you can put priors directly on the constrained parameters without having to do a manual Jacobian adjustment (you can also put priors on the log parameters as you do, which can be easier when building a hierarchical model).
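
Roughly, a minimal sketch of that alternative parameterization (the parameter names and priors here are just illustrative, not taken from your model):

parameters {
  // declared constraints mean Stan applies the transform (and its Jacobian)
  // automatically, so priors can go directly on these parameters
  real<upper=0> log_intercept;   // exp(log_intercept) lies in (0, 1)
  real<upper=0> slope_down;      // decay rate, forced negative by the constraint
}
model {
  // illustrative priors, placed directly on the constrained scale
  log_intercept ~ normal(-0.5, 0.5);
  slope_down ~ normal(-0.2, 0.2);
}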

I wasn’t clear on why you were calling negative Covid tests “censored”.

Your model’s going to be hard to identify with the redundant parameterization of intercepts/slopes and then individual differences in intercepts and slopes, because you won’t have a lot of data per individual.

I would advise against hard constraints like vector<lower=-15,upper=20>.

Rather than using the condition_on_data flag, you can alternatively just provide a G equal to 0.

A few complications that make this look more like a mixture model in practice, at least for Covid:

  1. Long Covid and short Covid have different clearance rates. I have PCR testing data from hospitalizations that shows many people hospitalized with Covid testing positive after 50 days. It was the basis for my ISBA poster this year, which just used the decay part, starting from day of symptom onset.

  2. Different variants have different ramp up rates to symptom onsets and different decay rates. There’s a paper we wrote with the UK Health Security Agency on this around delta and omicron (omicron ramped up faster and decayed more slowly, if I’m remembering correctly).

  3. Most of the measurements are in days, but this is something probably best measured in hours (for instance, 3.5 days vs. 4 days for symptom onset in different variants).

  4. We don’t have a lot of measurements before day 0, which is right around symptom onset, because nobody’s testing then.


Hi, thanks a lot for your reply!

Sorry if this was a bit unclear. By “estimated days of sampling” I am referring to when in time (relative to the day of peak viral load, day 0) the model places the tests for each individual. This information is not provided to the model; only the days between an individual’s tests are provided.

In the histogram with the title “Estimated days of sampling” I’m showing the posterior mean of when a sample was taken, for all samples/tests. In the histogram below that, I am showing when the sampling was actually done (because I am working with simulated data, I know the true days of sampling). As you can see there is a discrepancy around day 0 between the two plots, with the model (first plot) not placing a lot of tests in that area.

I don’t quite understand how this would work but I will have a look at the paper you mentioned (thanks!) and hopefully things will become clearer. Also, just to make sure, the way the model is implemented now doesn’t require a Jacobian adjustment, right?

This is because a negative PCR doesn’t necessarily mean that there was no RNA in the original swab or in the person being swabbed; it might just have been too little to be detected in that PCR run. If I understand it right, PCRs are only run long enough to detect log10 viral loads down to a certain threshold, so if there is less than that in the sample, the PCR will give a negative result.

Yes, although I don’t know which parts/parameters you could get rid of (i.e. which parts exactly are redundant).

Thanks for the tip; I will play around with that more. I already tried implementing the shifts as random effects (with a normal prior and no constraints), which gave me estimates closer to the true slopes and intercept but higher Rhat values. Also, the issue I’m observing (few data points placed around day 0) was even more pronounced (see plot below).

True. I guess this problem is somewhat mitigated by only including PCR tests from people who didn’t test positive for longer than x days (which is what was done in the paper the model is from; here’s the GitHub repo), but I was also wondering about a mixture approach.

Sounds interesting, thanks also for the other considerations! Do you mean this paper?

Again, thanks a lot for taking the time to read and reply!

Right.

There’s not a paper, just a poster for ISBA. I just took this as the time course of sensitivity for PCR tests over time t since symptom onset:

\textrm{sensitivity}(t) = \exp(\alpha + \beta \cdot t)

where we constrain \alpha, \beta < 0. Because t \geq 0, if \alpha < 0 and \beta < 0, then \alpha + \beta \cdot t < 0, so that \exp(\alpha + \beta \cdot t) \in (0, 1). This corresponds to exponential decay of sensitivity, which presumably corresponds to exponential decay of viral load (this I’m not sure about). In any case, this simple regression corresponds to a one-compartment pharmacokinetic clearance model (see, e.g., Clearance (pharmacology) - Wikipedia).
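
In Stan, that could look something like this minimal sketch (the data names and priors are hypothetical; y is a binary test result at time t since symptom onset):

data {
  int N;
  vector<lower=0>[N] t;                // days since symptom onset
  int<lower=0, upper=1> y[N];          // 1 = positive PCR, 0 = negative
}
parameters {
  real<upper=0> alpha;                 // log sensitivity at t = 0
  real<upper=0> beta;                  // log-scale decay rate (negative)
}
model {
  alpha ~ normal(-0.1, 0.5);           // hypothetical priors
  beta ~ normal(-0.1, 0.1);
  // alpha, beta <= 0 and t >= 0 guarantee exp(alpha + beta * t) is in (0, 1]
  y ~ bernoulli(exp(alpha + beta * t));
}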

I wasn’t clear on why you were calling negative Covid tests “censored”.

Thanks. The term “censoring” already has a specific (different) meaning. What you’re talking about is called a “false negative”. But then you have to be clear on what a test means and what it’s set up to detect.

Similarly, a positive test doesn’t mean there was RNA, so we can have false positives, too. This is all usually built into the observation model, where the probability of a positive test is a combination of the true positive and false positive probabilities, which is prev * sens + (1 - prev) * (1 - spec), where prev is disease prevalence, sens is sensitivity, and spec is specificity.
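
A minimal sketch of that observation model (not code from this thread; in practice sens and spec need informative priors or calibration data to be identified):

data {
  int N;
  int<lower=0, upper=1> y[N];          // observed test results
}
parameters {
  real<lower=0, upper=1> prev;         // disease prevalence
  real<lower=0, upper=1> sens;         // P(test positive | infected)
  real<lower=0, upper=1> spec;         // P(test negative | not infected)
}
model {
  // probability of a positive result mixes true positives and false positives
  real p_pos = prev * sens + (1 - prev) * (1 - spec);
  y ~ bernoulli(p_pos);
  // priors on prev, sens, spec omitted here for brevity
}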

Sorry about that—“redundant” was a bad choice of words. It’s more that it introduces non-identifiability. If I have an intercept and an effect for age group, I get the linear predictor \alpha + \beta_{\textrm{age}[n]} where \textrm{age}[n] is the age group of the n-th observation. If I add a constant c to \alpha and subtract from all the \beta, then I get the same linear predictor (i.e., it’s not identified). There are two popular strategies to try to identify—set \beta_1 = 0 (pin its value) or set \textrm{sum}(\beta) = 0. Those strategies properly identify the likelihood. Alternatively, you can identify the Bayesian posterior by imposing a prior like \beta \sim \textrm{normal}(0, 1). Both this approach and the sum-to-zero approach are at least symmetric in the parameters.
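
For example, a minimal sketch of the hard sum-to-zero option for an age-group effect (K is an assumed data variable for the number of groups):

data {
  int<lower=2> K;                      // number of age groups
}
parameters {
  vector[K - 1] beta_free;
}
transformed parameters {
  // the last effect is minus the sum of the others, so sum(beta) == 0
  vector[K] beta = append_row(beta_free, -sum(beta_free));
}
model {
  beta_free ~ normal(0, 1);
  // the soft alternative drops the constraint and instead identifies the
  // posterior through a prior such as beta ~ normal(0, 1)
}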

Yes. I should’ve put a citation in. The UK-HSA, where Tom works, is still tracking all the new variants in terms of their dynamics and their proportion of cases (the UK does a ton of genetic profiling on positive PCR tests through NHS).


Thanks for the clarifications and additional information, I really appreciate the feedback!

Yes (about the false negatives). In the paper the model is from, log10 viral loads from negative PCRs are imputed, with an upper limit of 3 (see Figure 2.17 below, taken from here). I thought this corresponds to estimating left-censored data as described here. Instead of imputing these values, I am trying to integrate them out in my version of the model.

That makes sense, thanks! It seems like in the original model, the priors for \beta are set like you suggested, i.e. quite tight and with a mean of 0 (if these are the important aspects?). However, I got rid of all “unnecessary” parameters like age in my model, to just have a very basic, simplistic version and see if I can fit that. I will keep these considerations in mind for when I start adding parameters though. The sum-to-zero approach sounds quite appealing.

I am still wondering about the weird behaviour I am seeing with the shift estimation. I think I will play around with different priors more and also try out different ways of dealing with negative PCRs, including the possibility of false negatives and positives. Still happy for any possible explanations of why I’m seeing what I’m seeing.

Thanks again!

Thanks for the clarification. That does sound like censoring: any value below the limit of 3 is only reported as being below 3 (i.e., as a negative test). You can integrate the censored values out either directly or by imputing them and then dropping the imputed values, which also integrates them out.
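
A minimal, self-contained sketch of those two routes for left-censored values (with a common mean mu just to keep it short; U is the detection limit, e.g. 3 on the log10 scale):

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real U;                              // detection / reporting limit
  vector<lower=U>[N_obs] y_obs;        // quantified values above the limit
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<upper=U>[N_cens] y_cens;      // imputation route: censored values as parameters
}
model {
  y_obs ~ normal(mu, sigma);

  // (a) impute, then drop y_cens afterwards; sampling it integrates it out
  y_cens ~ normal(mu, sigma);

  // (b) or delete y_cens above and integrate it out directly:
  // target += N_cens * normal_lcdf(U | mu, sigma);
}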
