Setting a prior for missing data where the prior mean is defined by another vector of data and the lower and upper bounds depend on the next and previous observed values

I’m analyzing data for a study where the treatment variable is partially missing. The treatment consisted of components that were installed over time. See the attached rough sketch. During the installation phase, the proportion of components installed is known. After the installation phase, only a few censuses were taken, so the precise proportion of components remaining is known only at certain times. After the installation phase, components failed or were outright removed over time. One solution would be to simply assume that the components disappeared in a linear fashion between each census (blue lines). However, I would prefer to use a missing data model, where each timepoint between censuses is a parameter bounded above by the previous census value and bounded below by the value at the next census. My idea was to specify a normal prior centered at the blue line (which could be passed in as a vector of data), with the previous and next census values as the upper and lower bounds, respectively. The intent is to put most of the probability mass near the blue line while still allowing for the possibility that many components fail or are removed at a single timepoint and the proportion then plateaus until the next census.
Question 1 - does this seem like a decent idea?
Question 2 - is there a good way to specify each timepoint’s custom prior without writing out each prior individually? The bounds (and mean in the normal prior) change with each timepoint, so I need a different prior for every timepoint that has missing data.
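
To make question 2 concrete, for each missing timepoint the prior would look something like the sketch below (the bounds and means are made-up numbers standing in for the neighboring census values and the blue line, and 0.05 is an arbitrary scale). I’d rather not have to write one of these out for every missing timepoint:

parameters {
  // bounds: the value at the next census (lower) and at the previous census (upper)
  real<lower = 0.42, upper = 0.71> x_mis_10;
  real<lower = 0.42, upper = 0.71> x_mis_11;
  // ...and so on, one declaration per missing timepoint
}
model {
  // normal centered on the blue (linear interpolation) line at each timepoint
  target += normal_lpdf(x_mis_10 | 0.66, 0.05);
  target += normal_lpdf(x_mis_11 | 0.61, 0.05);
  // ...and so on, one prior statement per missing timepoint
}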

Thanks for the help!

Note - edited because my original thinking about the upper bound was incorrect.

What is the statistical model that these data are used in? Is the response the census, and do you want to smooth predictions between the censuses? Or is the proportion of the components remaining being used to predict some other response?

Are there a very large number of components, so that the proportion is essentially continuous, or just a few, so that the proportion is strongly discrete?

Do the components fail independently with exponentially distributed times to failure, do they age and fail near the end of some lifetime, or do they depend on one another, so that when one fails it can cause cascading failures in others within a relatively short time window?

The components are the treatment variable and thus used as a predictor in a model. They are not the response variable.

The proportion during the installation phase was strongly discrete (though there are no missing data there), but in the failure phase, where the missing data occur, it is essentially continuous.

There’s no straightforward answer to these questions. The components likely do fail independently with exponentially distributed times to failure, but 1) I have no outside information on failure times, and 2) the data include both failures and outright removal of components by people. Sometimes people just decided to remove and replace components regardless of failure. So these are not straightforward data for some kind of failure-time model. I have no information about when outright removal happened. All I have is the known installation data and the few census points taken after installation was completed.

I think the idea of monotonicity constraints to interpolate between the points makes sense, but I don’t have a strong intuition about whether the interaction of your proposed truncated normal prior and monotonicity constraint would yield a good prior model or not. I suggest some prior predictive checks to see what you get out.

I’ll note that the idea of independent exponentially distributed wait times is not consistent with the cartoon that you drew above, since in that case the number of components should follow approximate exponential decay, rather than an accelerating crash towards zero.

Oops, I misread part of your post. I definitely think you should include a monotonicity constraint. I thought you were constraining so that each timestep had an upper bound given by the value in the previous timestep (not just the value at the previous census).

Thanks for the thoughts. Any ideas for making the prior specification easier when each prior is different for every missing datapoint?

Well, that’s because toward the end of the study there was likely a lot of removal of components that wasn’t due to failure.

What do you mean by a monotonicity constraint? You mean only decreasing? Given that this is a combination of failure and removal by people (two separate processes), it doesn’t have to be always decreasing. There could be a lot of removal, then a flat plateau, and then more removal. It certainly can’t increase, though.

That’s what I mean.

I’ll have to look around for how to specify something like that. Any ideas? Would that be separate from the priors I suggested, in addition to those priors, or would it replace the priors with upper and lower bounds?

If there’s truly a large number of components that semi-regularly fail at random, then it would be fine to assume strict monotonicity instead of just non-increasing. You could achieve a true monotonicity constraint in a couple of ways. I-splines would be one option, but it sounds like you don’t want a model that assumes smoothness. In that case, the first solution that occurs to me would be, for every inter-census interval, to declare a simplex parameter that gives the proportions of the difference between census i and census i+1 that occur during a given timestep. Then the cumulative sums along this simplex, translated and scaled so they start from the first census value and finish at the second census value, give the values at the end of the discrete timesteps. The challenge is to place an appropriate prior on the simplex, and the only guidance I can give is a suggestion to play around with prior predictive checks.

Thanks for the ideas!

I’ll have to think about how to program that. I guess it is nicer than the original idea with priors, since the priors I suggested wouldn’t have the monotonicity constraint.

Thinking about this more, I don’t readily see how this provides the monotonicity constraint. Do you have any Stan code that provides an example? Perhaps I don’t exactly understand what you mean, as I don’t see how this constrains the imputed value at time t[census i]+2 to be less than or equal to the imputed value at t[census i]+1. It would seem to simply constrain the values to sum to the previous census value.

The cumulative sums along a positive-constrained vector (like a simplex) are guaranteed to be monotonic. For example, if the simplex is (0.5, 0.3, 0.2), its cumulative sums are (0.5, 0.8, 1.0), which can only increase; subtracting a scaled version of those sums from the previous census value therefore yields values that can only decrease.

So maybe something along the lines of this simple example?? In this example, I made x a sequence that decreases by 0.01, to make it easy to follow. The missing data are the 6 points at positions 10:15.
I’m not entirely sure this is doing what I want, as I end up with all the same imputed values, but I also didn’t put any prior on the simplex.
Edit: something closer to imputing the actual values can be had by adding a prior like target += dirichlet_lpdf(x_mis | [1,3,5,7,9,11]'); but that’s just playing around with priors. I don’t know enough about the Dirichlet distribution to have any good idea of what I’m doing here, yet ;)

library(rstan)

x <- seq(from=0.9, to=0.1, by=-0.01)
length(x)
mu <- 2 + 1.5*x
y <- rnorm(length(x), mu, 1)
x[10:15] <- NA


d <- list(y = y, x_obs = c(x[1:9],x[16:81]), N = length(y), N_obs = 75, N_mis = 6, ii_obs = c(1:9,16:81), ii_mis = c(10:15))

stan_code <- "
data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  real x_obs[N_obs];
}
parameters {
  simplex[N_mis] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  real x[N];
  real x_mis_t[N_mis];
  x[ii_obs] = x_obs;
  for (i in 1:N_mis){
  x_mis_t[i] = x_obs[9] - (x_mis[i] * (x_obs[9] - x_obs[16]));
  }
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}
"

#fit model
simple_mis_simplex_mod <- stan(model_code = stan_code, data = d,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

Unless I’m missing something, I don’t see where you take the cumulative sums along x_mis.

Nope, I didn’t. I’ll have to look at how to incorporate that.

(In Edit 2 below, I figured out the problem)

Something like this is, I think, along the lines of what you are suggesting (using the same example data as I posted above).

data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  vector[N_obs] x_obs;
}
parameters {
  simplex[N_mis + 1] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  x[ii_obs] = x_obs;
  x_mis_t = x_obs[9] - ((cumulative_sum(x_mis)) * (x_obs[9] - x_obs[16]));
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}

However, I’m having a little trouble with the coding. The above model doesn’t sample when I declare the simplex to be length N_mis + 1. It will sample if I declare the simplex to be length N_mis, but that is incorrect, because the number of interior cutpoints of the simplex needs to equal the number of missing values, so the simplex must be one element longer. I think the problem comes from this line, x_mis_t = x_obs[9] - ((cumulative_sum(x_mis)) * (x_obs[9] - x_obs[16])); , since cumulative_sum(x_mis) has length N_mis + 1 but x_mis_t has length N_mis. How do I get only the cumulative sum up to length N_mis (one shorter than the length of the simplex)? Thanks!

Edit: I tried changing the transformed parameters block to this:

transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  vector[N_mis] x_mis_s;
  x[ii_obs] = x_obs;
  x_mis_s = x_mis[1:N_mis];
  x_mis_t = x_obs[9] - ((cumulative_sum(x_mis_s)) * (x_obs[9] - x_obs[16]));
  x[ii_mis] = x_mis_t;
}

Which samples and gives the correct values for x_mis_s, but x_mis_t is still well off (0.8, 0.78, 0.76, 0.75, 0.73, 0.71), so cumulative_sum doesn’t appear to work the way I think it works…

Edit 2 - The code was correct except the index was wrong. It should be x_mis_t = x_obs[9] - ((cumulative_sum(x_mis_s)) * (x_obs[9] - x_obs[10])); (the next observed value after the gap is x_obs[10], not x_obs[16], because x_obs skips the missing positions). Dang, I’m going to have to be careful with all the indexing that will go into code with multiple simplexes, but I think this is what you were suggesting and it seems to work well. Thanks for the idea! Full reproducible example below.

library(rstan)

x <- seq(from=0.9, to=0.1, by=-0.01)
length(x)
mu <- 2 + 1.5*x
y <- rnorm(length(x), mu, 1)
x[10:15] <- NA


d <- list(y = y, x_obs = c(x[1:9],x[16:81]), N = length(y), N_obs = 75, N_mis = 6, ii_obs = c(1:9,16:81), ii_mis = c(10:15))

stan_code <- "
data {
  int<lower = 0> N;
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N> ii_obs[N_obs];
  int<lower = 1, upper = N> ii_mis[N_mis];
  real y[N];
  vector[N_obs] x_obs;
}
parameters {
  simplex[N_mis + 1] x_mis;
  real<lower=0> sigma;
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] x;
  vector[N_mis] x_mis_t;
  vector[N_mis] x_mis_s;
  x[ii_obs] = x_obs;
  x_mis_s = cumulative_sum(x_mis[1:N_mis]);
  x_mis_t = x_obs[9] - ((x_mis_s) * (x_obs[9] - x_obs[10]));
  x[ii_mis] = x_mis_t;
}
model {
  for (n in 1:N) {
      target += normal_lpdf(y[n] | alpha + beta*x[n], sigma); 
    }

  target += normal_lpdf(sigma | 0, 2.5) - 1 * normal_lccdf(0 | 0, 2.5);
  target += normal_lpdf(alpha | 2, 2.5);
  target += normal_lpdf(beta | 0, 5);
  
}
"

#fit model
simple_mis_simplex_mod <- stan(model_code = stan_code, data = d,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

For sure. I suspect the default uniform Dirichlet prior on the simplex amounts to a prior centered on a straight line between the two censuses, but if needed it seems I could use something like that and then put an additional prior on each imputed value, like

  for (n in 1:N_mis) {
      target += normal_lpdf(x_mis_t[n] | mis_prior[n], 0.05);
    }

where x_mis_t are the imputation parameters and mis_prior is a data vector of length N_mis giving the values of a straight line between census i and census i+1, if I wanted to put most of the prior probability mass on a linear trend between points. I’ll have to think about it and play around with prior predictive checks.
Or I could just increase the Dirichlet concentration parameter alpha, which should pull the increments toward being equal (i.e., concentrate the prior around the straight line) if that is what I wanted.
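
For example, a symmetric Dirichlet with concentration greater than 1 (the 5 below is just a placeholder value to play with in prior predictive checks) should pull the increments toward being equal:

  target += dirichlet_lpdf(x_mis | rep_vector(5, N_mis + 1));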