Setting a prior for missing data where the prior mean is defined by another vector of data and the lower and upper bounds depend on the next and previous observed values

I’m analyzing data for a study where the treatment variable is partially missing. The treatment consisted of components that were installed over time. See the attached rough sketch. During the installation phase, the proportion of components installed is known. After the installation phase, only a few censuses were taken, so the precise proportion of components remaining is known only at certain times. After the installation phase, components failed or were outright removed over time. One solution would be to simply assume that the components disappeared in a linear fashion between each census (blue lines). However, I would prefer to use a missing data model, where each timepoint between censuses is a parameter bounded above by the previous census value and bounded below by the value at the next census. My idea was to specify a normal prior centered at the blue line (which could be passed in as a vector of data), with the previous and next census values as the upper and lower bounds, respectively. The intent is to put most of the probability mass near the blue line while still allowing for the possibility that many components fail or are removed at a single timepoint and the proportion then plateaus until the next census.
Question 1 - does this seem like a decent idea?
Question 2 - is there a good way to specify each timepoint’s custom prior without writing out each prior individually? The bounds (and mean in the normal prior) change with each timepoint, so I need a different prior for every timepoint that has missing data.
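
To make question 2 concrete, for each missing timepoint the prior would look something like the sketch below (the bounds and means are made-up numbers standing in for the neighboring census values and the blue line, and 0.05 is an arbitrary scale). I’d rather not have to write one of these out for every missing timepoint:

parameters {
  // bounds: the value at the next census (lower) and at the previous census (upper)
  real<lower = 0.42, upper = 0.71> x_mis_10;
  real<lower = 0.42, upper = 0.71> x_mis_11;
  // ...and so on, one declaration per missing timepoint
}
model {
  // normal centered on the blue (linear interpolation) line at each timepoint
  target += normal_lpdf(x_mis_10 | 0.66, 0.05);
  target += normal_lpdf(x_mis_11 | 0.61, 0.05);
  // ...and so on, one prior statement per missing timepoint
}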

Thanks for the help!

Note - edited because my original thinking about the upper bound was incorrect.

What is the statistical model that these data are used in? Is the response the census, and do you want to smooth predictions between the censuses? Or is the proportion of the components remaining being used to predict some other response?

Are there a very large number of components, so that the proportion is essentially continuous, or just a few, so that the proportion is strongly discrete?

Do the components fail independently with exponentially distributed times to failure, do they age and fail near the end of some lifetime, or do they depend on one another, so that when one fails it can cause cascading failures in others within a relatively short time window?

The components are the treatment variable and thus used as a predictor in a model. They are not the response variable.

The proportion during the installation phase was strongly discrete (though there are no missing data there), but in the failure phase, where the missing data occur, it is essentially continuous.

There’s no straightforward answer to these questions. The components likely do fail independently with exponentially distributed times to failure, but 1) I have no outside information on failure times, and 2) the data include both failures and outright removal of components by people. Sometimes people just decided to remove and replace components regardless of failure. So these are not straightforward data for some kind of failure-time model. I have no information about when outright removal happened. All I have is the known installation data and the few census points taken after installation was completed.

I think the idea of monotonicity constraints to interpolate between the points makes sense, but I don’t have a strong intuition about whether the interaction of your proposed truncated normal prior and monotonicity constraint would yield a good prior model or not. I suggest some prior predictive checks to see what you get out.

I’ll note that the idea of independent exponentially distributed wait times is not consistent with the cartoon that you drew above, since in that case the number of components should follow approximate exponential decay, rather than an accelerating crash towards zero.

Oops, I misread part of your post. I definitely think you should include a monotonicity constraint. I thought you were constraining so that each timestep had an upper bound given by the value in the previous timestep (not just the value at the previous census).

Thanks for the thoughts. Any ideas for making the prior specification easier when each prior is different for every missing datapoint?

Well, that’s because toward the end of the study there was likely a lot of removal of components that wasn’t due to failure.

What do you mean by a monotonicity constraint? You mean only decreasing? Given that this is a combination of failure and removal by people (two separate processes), it doesn’t have to be always decreasing. There could be a lot of removal, then a flat plateau, and then more removal. It certainly can’t increase, though.

That’s what I mean.

I’ll have to look around for how to specify something like that. Any ideas? Would that be separate from the priors I suggested, in addition to those priors, or would it replace the priors with upper and lower bounds?

If there’s truly a large number of components that semi-regularly fail at random, then it would be fine to assume strict monotonicity instead of just non-increasing. You could achieve a true monotonicity constraint in a couple of ways. I-splines would be one option, but it sounds like you don’t want a model that assumes smoothness. In that case, the first solution that occurs to me would be, for every inter-census interval, to declare a simplex parameter that gives the proportions of the difference between census i and census i+1 that occur during a given timestep. Then the cumulative sums along this simplex, translated and scaled so they start from the first census value and finish at the second census value, give the values at the end of the discrete timesteps. The challenge is to place an appropriate prior on the simplex, and the only guidance I can give is a suggestion to play around with prior predictive checks.

Thanks for the ideas!

I’ll have to think about how to program that. I guess it is nicer than the original idea with priors, since the priors I suggested wouldn’t have the monotonicity constraint.

Thinking about this more, I don’t readily see how this provides the monotonicity constraint. Do you have any Stan code that provides an example? Perhaps I don’t exactly understand what you mean, as I don’t see how this constrains the imputed value at time t[census i]+2 to be less than or equal to the imputed value at t[census i]+1. It would seem to simply constrain the values to sum to the previous census value.

The cumulative sums along a positive-constrained vector (like a simplex) are guaranteed to be monotonic. For example, if the simplex is (0.5, 0.3, 0.2), its cumulative sums are (0.5, 0.8, 1.0), which can only increase; subtracting a scaled version of those sums from the previous census value therefore yields values that can only decrease.

So maybe something along the lines of this simple example?? In this example, I made x a sequence that decreases by 0.01, to make it easy to follow. The missing data are the 6 points at positions 10:15.
I’m not entirely sure this is doing what I want, as I end up with all the same imputed values, but I also didn’t put any prior on the simplex.
Edit: something closer to imputing the actual values can be had by adding a prior like target += dirichlet_lpdf(x_mis | [1,3,5,7,9,11]'); but that’s just playing around with priors. I don’t know enough about the Dirichlet distribution to have any good idea of what I’m doing here, yet ;)

library(rstan)

x <- seq(from=0.9, to=0.1, by=-0.01)
length(x)
mu <- 2 + 1.5*x
y <- rnorm(length(x), mu, 1)
x[10:15] <- NA


d <- list(y = y, x_obs = c(x[1:9],x[16:81]), N = length(y), N_obs = 75, N_mis = 6, ii_obs = c(1:9,16:81), ii_mis = c(10:15))

stan_code <- "
data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  real x_obs[N_obs];
}
parameters {
  simplex[N_mis] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  real x[N];
  real x_mis_t[N_mis];
  x[ii_obs] = x_obs;
  for (i in 1:N_mis){
  x_mis_t[i] = x_obs[9] - (x_mis[i] * (x_obs[9] - x_obs[16]));
  }
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}
"

#fit model
simple_mis_simplex_mod <- stan(model_code = stan_code, data = d,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

Unless I’m missing something, I don’t see where you take the cumulative sums along x_mis.

Nope, I didn’t. I’ll have to look at how to incorporate that.

(In Edit 2 below, I figured out the problem)

Something like this is, I think, along the lines of what you are suggesting (using the same example data as I posted above).

data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  vector[N_obs] x_obs;
}
parameters {
  simplex[N_mis + 1] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  x[ii_obs] = x_obs;
  x_mis_t = x_obs[9] - ((cumulative_sum(x_mis)) * (x_obs[9] - x_obs[16]));
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}

However, I’m having a little trouble with the coding. The above model doesn’t sample when I declare the simplex to be length N_mis + 1. It will sample if I declare the simplex to be length N_mis, but that is incorrect, because the number of interior cutpoints of the simplex needs to equal the number of missing values, so the simplex must be one element longer. I think the problem comes from this line, x_mis_t = x_obs[9] - ((cumulative_sum(x_mis)) * (x_obs[9] - x_obs[16])); , since cumulative_sum(x_mis) has length N_mis + 1 but x_mis_t has length N_mis. How do I get only the cumulative sum up to length N_mis (one shorter than the length of the simplex)? Thanks!

Edit: I tried changing the transformed parameters block to this:

transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  vector[N_mis] x_mis_s;
  x[ii_obs] = x_obs;
  x_mis_s = x_mis[1:N_mis];
  x_mis_t = x_obs[9] - ((cumulative_sum(x_mis_s)) * (x_obs[9] - x_obs[16]));
  x[ii_mis] = x_mis_t;
}

Which samples and gives the correct values for x_mis_s, but x_mis_t is still well off (0.8, 0.78, 0.76, 0.75, 0.73, 0.71), so cumulative_sum doesn’t appear to work the way I think it works…

Edit 2 - The code was correct except the index was wrong. It should be x_mis_t = x_obs[9] - ((cumulative_sum(x_mis_s)) * (x_obs[9] - x_obs[10])); (the next observed value after the gap is x_obs[10], not x_obs[16], because x_obs skips the missing positions). Dang, I’m going to have to be careful with all the indexing that will go into code with multiple simplexes, but I think this is what you were suggesting and it seems to work well. Thanks for the idea! Full reproducible example below.

library(rstan)

x <- seq(from=0.9, to=0.1, by=-0.01)
length(x)
mu <- 2 + 1.5*x
y <- rnorm(length(x), mu, 1)
x[10:15] <- NA


d <- list(y = y, x_obs = c(x[1:9],x[16:81]), N = length(y), N_obs = 75, N_mis = 6, ii_obs = c(1:9,16:81), ii_mis = c(10:15))

stan_code <- "
data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  vector[N_obs] x_obs;
}
parameters {
  simplex[N_mis + 1] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  vector[N_mis] x_mis_s;
  x[ii_obs] = x_obs;
  x_mis_s = cumulative_sum(x_mis[1:N_mis]);
  x_mis_t = x_obs[9] - ((x_mis_s) * (x_obs[9] - x_obs[10]));
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}
"

#fit model
simple_mis_simplex_mod <- stan(model_code = stan_code, data = d,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

For sure. I suspect the default uniform Dirichlet prior on the simplex amounts to a prior centered on a straight line between the two censuses, but if needed it seems I could use something like that and then put an additional prior on each imputed value, like

  for (n in 1:N_mis) {
      target += normal_lpdf(x_mis_t[n] | mis_prior[n], 0.05);
    }

where x_mis_t are the imputation parameters and mis_prior is a data vector of length N_mis giving the values of a straight line between census i and census i+1, if I wanted to put most of the prior probability mass on a linear trend between points. I’ll have to think about it and play around with prior predictive checks.
Or I could just increase the Dirichlet concentration parameter alpha, which should pull the increments toward being equal (i.e., concentrate the prior around the straight line) if that is what I wanted.
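
For example, a symmetric Dirichlet with concentration greater than 1 (the 5 below is just a placeholder value to play with in prior predictive checks) should pull the increments toward being equal:

  target += dirichlet_lpdf(x_mis | rep_vector(5, N_mis + 1));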