Censored data is not required in modelling


#1

Hi
I am trying to understand the following example from stan reference guide as I would like to implement it to my data set which has some censored values for each subject.

For example:

y1_obs<- c(6, 2, 3 ,1, 1)
y2_obs <- c(3, 4, 1, 1 1)

The observed data are stored in a matrix. The censoring point is 1 in this case.

  1. I would like to know how the following code can be changed to fit the model using the above data set. :
  2. Why would the stan code not have y_cens data in the data block?
data {
int<lower=0> N_obs;
int<lower=0> N_cens;
real y_obs[N_obs];
}
parameters {
real<upper=min(y_obs)> L;
real mu;
real<lower=0> sigma;
}
model {
L ~ normal(mu, sigma);
y_obs ~ normal(mu, sigma);
target += N_cens * normal_lcdf(L | mu, sigma);
}

Thanks


#2

I think y_obs is the censored data. The observations are censored, I guess, so whoever wrote it must have just called it y_obs.

I think this model would work for a single vector of input. Is it that you have multiple vector inputs and you want a model to fit that?


#3

Thanks @bbbales2. I have a situation where each y_obs vector corresponds to observations for a specific individual and need to fit a hierarchical model for the observed data matrix with mu computed from an ODE solution for each individual . So, if I am correct would the code be modified for this situation as:

data {
int<lower=1> J;                 // number of individuals 
int<lower=0> N_obs;          // number of observations   
int<lower=0> N_cens[J];      //number of censored points for individual j = 1, ..., J
real y_obs[N_obs, J];           //observed measurements
real<upper=min(y_obs)> L;   //censored point      
}
parameters {
real<lower =0>sigma_raw;
real<lower=0> sigma[J];
}
transformed parameters {
real mu[N_obs, J];  
 
  for (j in 1:J){
     mu[, j] = ode_solution;
 }
}
model {
    sigma_raw ~ exponential(1);
for (j in 1:J)
   sigma[j] ~ exponential(sigma_raw);
   y_obs[, j]~ normal(mu[,j], sigma[j]);
target += N_cens[j] * normal_lcdf(L | mu[,j], sigma[j]);
}

And then the R code for the data block for the above data would be:

stan_data <- list(J = 2, 
                            N_obs = 5, 
                            N_cens = matrix(data = c(2, 3), 1, 2), 
                            y_obs = matrix(data = c(y1_obs, y2_obs), 5, 2), 
                            L = 1)

My confusion of implementing the example to this situation arose in the place of how the data block should be presented. And so would like to know if the changes to the stan code and stan_data list is correct for the described situation.

Many thanks


#4

You’ll want to put curly braces here:

for (j in 1:J) {
  sigma[j] ~ exponential(sigma_raw);
  y_obs[, j]~ normal(mu[,j], sigma[j]);
  target += N_cens[j] * normal_lcdf(L | mu[,j], sigma[j]);
}

If you have the same number of observations for each person, you can do:

int<lower=0> N_obs;          // number of observations per person
real y_obs[N_obs, J];           //observed measurements

Otherwise you’ll need to do something like:

int<lower=0> N_obs_total;          // Number of total observations   
int<lower=0> which_individual[N];      // Indicator of which person owns which observation
real y_obs[N];           // All observed measurements pushed together in a flat array

That probably isn’t so clear. Check here for more details: https://mc-stan.org/docs/2_18/stan-users-guide/sparse-data-structures.html

Regarding this:

sigma_raw ~ exponential(1);
for (j in 1:J)
   sigma[j] ~ exponential(sigma_raw);

Maybe start with a single noise parameter? Something like:

parameters {
   real<lower = 0.0> sigma;
   ...
}

model {
  sigma ~ normal(0, 1);
  ...
}

If that doesn’t make sense, you can add the different noise for each individual separately.

I assume you’ll add ODE parameters and such?


#5

Thanks @bbbales2
I do have same number of observations for each person. I am still confused about the N_obs in the stan_data list. Would N_obs = 5 for this example or would it need to be changed for each individual depending on if the observation is more than or equal to censored value?

And if so would coding be changed as

data {
int<lower=1> J;                 // number of individuals 
int<lower=0> N_obs[J];          // number of uncensored observations   for individual j = 1, ..., J
int<lower=0> N_cens[J];      //number of censored points for individual j = 1, ..., J
real y_obs[N_obs, J];           //observed measurements
real<upper=min(y_obs)> L;   //censored point      
}
parameters {
real<lower =0>sigma_raw;
real<lower=0> sigma[J];
}
transformed parameters {
real mu[N_obs, J];  
 
  for (j in 1:J){
     mu[, j] = ode_solution;
 }
}
model {
    sigma_raw ~ exponential(1);

for (j in 1:J){
   sigma[j] ~ exponential(sigma_raw);
   y_obs[, j]~ normal(mu[,j], sigma[j]);
target += N_cens[j] * normal_lcdf(L | mu[,j], sigma[j]);
}

}

And data would be like:

stan_data <- list(J = 2, 
                            N_obs = matrix(data = c(3,2), 
                            N_cens = matrix(data = c(2, 3), 1, 2), 
                            y_obs = matrix(data = c(y1_obs, y2_obs), 5, 2), 
                            L = 1)

I wanted to make sure that I get this right as the observations corresponds to time series measurements and thus mu[, j] needs to be computed for the time points corresponding to jth individual.

Also, there are some columns in the data matrix with no censoring points (ie. all observations >= 1).

Otherwise you’ll need to do something like:

int<lower=0> N_obs_total;          // Number of total observations   
int<lower=0> which_individual[N];      // Indicator of which person owns which observation
real y_obs[N];           // All observed measurements pushed together in a flat array

Did you mean to say N instead of N_obs_total here?

I assume you’ll add ODE parameters and such?

Yes I do have ODE parameters, some of which are fixed and some of them are individual specific.

Thanks


#6

Going by the manual, N_obs corresponds to the number of non-censored observations and N_cens is the number of censored measurements for each individual.

I think I used observations too loosely previously (my bad). If you have the same number of non-censored observations for every individual, you can use the 2d array. Otherwise you should do something else.

It’d be less efficient than the ragged structure but maybe easier to debug if you added another matrix that described what was censored or not:

data {
  int<lower=1> J;                 // number of individuals 
  int<lower=0> N_obs;          // number of uncensored observations for each individual
  real y_obs[N_obs, J];           //observed measurements
  int is_censored[N_obs, J];           //1 if censored, 0 if not
  real<upper=min(y_obs)> L;   //censored point
}
...
model {
  ...
  for (j in 1:J){
    for(n in 1:N_obs) {
      if(is_censored[n, j] == 0) {
        y_obs[n, j] ~ normal(mu[n, j], sigma[j]);
      } else {
        target += normal_lcdf(L | mu[n,j], sigma[j]);
      }
    }
  }
}

Does that look cleaner?

Yeah, good catch.

The only way to be sure you’re right is to set up some sorta test to make sure the thing you hope will happen is actually happening. The simplest thing would just be to use prints to make sure data is showing up where you expect, and the better thing is doing experiments on simulated data to make sure you can recover the parameters you hope to recover.

With something complicated like an ODE, you should do those. It’ll be less stressful in the long run. Apologies for the soapbox, but don’t trust what I say if you wanna make sure and be right :D. I make quite a lot of mistakes.


#7

Thanks @bbbales2.

Does that look cleaner?

This really helped to clear my confusion.

With something complicated like an ODE, you should do those.

I agree, the ODE is complex with 6 dimensions. So will simulate data and check if parameters can be recovered.

THanks a lot