Sampling efficiency when using a ragged array in a simplex-like way

TL;DR: I’m modeling missing data points in a time series with a simplex-like construction. It’s relatively fast, but I think it has low n_eff/sample, and I’d like to improve it.

I’m modeling a time series:

Y_t ~ normal(Y_t-1 + mu, sigma)

with missing observations:

Y[3] = nan
Y[4] = nan

I’m dealing with the missing observations by treating them as parameters to be estimated; if there’s an observation at time t=2, Y_2, and another observation at t=5, Y_5, then Y_3 and Y_4 are inferred as parameters. What do we know about those parameters? Since this is a time series, we know that the difference between the two observed time points (Y_5 and Y_2) must be explained by the steps between Y_2, Y_3, Y_4, and Y_5.

Y_5 - Y_2 == (Y_5 - Y_4) + (Y_4 - Y_3) + (Y_3 - Y_2)

(This is particularly natural if we know that Y is monotonically increasing, in which case the time series model is Y_t ~ normal(Y_t-1 + mu, sigma) T[Y_t-1,]. The monotonic case isn’t germane to my question today, but it’s the use case I’m eventually building up to.)

The way that I actually model this in Stan is not to treat Y_4 and Y_3 as parameters (the levels), but to treat (Y_5 - Y_4), (Y_4 - Y_3), and (Y_3 - Y_2) as parameters (the steps). We’re not modeling the levels; we’re modeling the steps. I can then constrain these steps using the equation above, and back out what the levels Y_3 and Y_4 should be. At the end of the post is my Stan model for this.
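To make the construction concrete, here’s a small Python sketch of what the transformed parameters block does for a single gap (toy numbers; `raw` stands in for the slice of restricted_latent_parameters belonging to that gap):

```python
import numpy as np

# Toy gap: Y_2 = 2.0 and Y_5 = 11.0 are observed; Y_3 and Y_4 are missing,
# so there are 3 unobserved steps between the two observed levels.
y_prev, y_next = 2.0, 11.0
raw = np.array([0.3, -0.1, 0.9])   # raw parameters for this gap

# Normalize the raw values so the steps sum to the observed difference,
# then take the cumulative sum to recover the latent levels.
steps = raw / raw.sum() * (y_next - y_prev)
levels = y_prev + np.cumsum(steps)

print(levels[:-1])   # inferred Y_3 and Y_4
print(levels[-1])    # reproduces the observed Y_5 (up to floating point)
```

The last cumulative sum necessarily lands back on the next observed level, which is why the model only assigns the first gap_width - 1 entries to Y_latent.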

The problem is as follows: One of the parameters is unnecessary. If you know Y_5 - Y_2, Y_5 - Y_4, and Y_4 - Y_3, then you know Y_3 - Y_2. If we could lop off one of these parameters, surely things would be better. Is there a way to do this? Or is there a better way to attack the problem?
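The redundancy is easy to demonstrate numerically: because the raw parameters are divided by their sum, any common rescaling of them produces exactly the same latent levels, so one whole direction in parameter space is unidentified. A minimal sketch mirroring the normalize-and-cumsum construction (function name is mine, not from the model):

```python
import numpy as np

def levels_from_raw(raw, y_prev, y_next):
    """Mirror of the model's construction: normalize raw steps, cumsum to levels."""
    steps = raw / raw.sum() * (y_next - y_prev)
    return y_prev + np.cumsum(steps)[:-1]

raw = np.array([0.3, -0.1, 0.9])
a = levels_from_raw(raw, 2.0, 11.0)
b = levels_from_raw(2.5 * raw, 2.0, 11.0)   # rescale all raw parameters

print(np.allclose(a, b))   # True: the overall scale of raw is unidentified
```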

Example output (with mu=3, sigma=1, and Y_3 and Y_4 missing)

Inference for Stan model: anon_model_4888889db6e52ba8f2aaf6c10bceb20d.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                                   mean  se_mean       sd   2.5%    25%      50%    75%  97.5%  n_eff   Rhat
mu                                 3.05   9.9e-4     0.05   2.96   3.02     3.05   3.09   3.14   2212    1.0
sigma                              1.03   7.7e-4     0.03   0.97   1.01     1.03   1.06    1.1   1896    1.0
restricted_latent_parameters[0] -3.8e-3     0.46     0.65  -0.95  -0.62    -0.03   0.61   0.95      2   3.07
restricted_latent_parameters[1] -6.6e-3     0.46     0.65  -0.97  -0.63     0.01   0.62   0.96      2    3.1
restricted_latent_parameters[2]  5.6e-4     0.46     0.65  -0.95  -0.62  -4.1e-3   0.62   0.95      2    3.1
Y_latent[0]                        2.73  3.1e-16  4.4e-16   2.73   2.73     2.73   2.73   2.73      2    nan
Y_latent[1]                        5.98  6.3e-16  8.9e-16   5.98   5.98     5.98   5.98   5.98      2    nan
Y_latent[2]                        8.49      0.0      0.0   8.49   8.49     8.49   8.49   8.49      2    nan
Y_latent[3]                       11.22     0.01     0.73   9.69  10.76    11.27  11.69  12.61   4000    1.0
Y_latent[4]                       13.98     0.01     0.72  12.56  13.51    13.96  14.44  15.41   4000    1.0
Y_latent[5]                       16.71      0.0      0.0  16.71  16.71    16.71  16.71  16.71      2    nan
Y_latent[6]                       19.79  5.0e-15  7.1e-15  19.79  19.79    19.79  19.79  19.79      2    nan
...

If there are no missing observations, n_eff is markedly higher; sometimes all the way up to 1 n_eff/sample.

And here’s the model:

data {
    int T; // number of time steps    
    vector[T] Y; // data to model
    int n_missing_steps; // number of unobserved steps between observed points (3 for a gap of 2 missing values)
}
parameters {
    real mu;
    real<lower=0> sigma;
    vector<lower=-1, upper=1>[n_missing_steps] restricted_latent_parameters;
}
transformed parameters {
    vector[T] Y_latent;
    
    {
    int restricted_param_counter;
    int gap_width;
    real previous_value;
    int previous_value_index;
    
    restricted_param_counter = 1;
    
    Y_latent[1] = Y[1];
    previous_value = Y[1];
    previous_value_index = 1;

    for (t in 2:T){
        if (!is_nan(Y[t])){
            Y_latent[t] = Y[t];
            
            gap_width = t-previous_value_index;
            if (gap_width>1){
                // These are the unobserved STEPS between observed time steps. 
                // I.e. if Y_3 and Y_1 are observed, but Y_2 is not, these are (Y_3 - Y_2) and (Y_2 - Y_1).
                // We will say that these steps have to sum up to the observed difference between Y_3 and Y_1.
                // The unobserved levels then have values that are the cumulative sum of these steps.

                // Divide the raw parameters by their sum, then multiply by the total
                // difference that needs to be accounted for; the result is a set of
                // steps that lead up to the next observed level.
                Y_latent[previous_value_index+1:t-1] =
                    cumulative_sum(
                        restricted_latent_parameters[restricted_param_counter:restricted_param_counter+gap_width-1]
                        / sum(restricted_latent_parameters[restricted_param_counter:restricted_param_counter+gap_width-1])
                        * (Y[t] - previous_value)
                    )[1:gap_width-1] + previous_value;

                // We don't include the last step in this sum, since we already explicitly grabbed the level
                // we ultimately get to from the data itself.

                restricted_param_counter = restricted_param_counter + gap_width-1;

            }

            previous_value = Y[t];
            previous_value_index = t;
        } 
    }
    }
}
model { 
    mu ~ normal(0, 1);
    sigma ~ cauchy(0, 4);
    for (t in 2:T){
        Y_latent[t] ~ normal(Y_latent[t-1]+mu, sigma);
    }
}

In terms of thinking about this, you just want to follow the generative model. You don’t want to think about all those differences, but just remember that y[t + 1] gets a distribution in terms of y[t] and that everything fits into a big joint density, so you can do inference on any of the unknown pieces you want.

So I’d implement this with the approach outlined in the manual for missing data. You might find it easier to store the total number of steps T along with the indexes of the observed entries and of the missing entries.

data {
  int<lower=0> T;
  int<lower=0> N_obs;
  vector[N_obs] Y_obs;
  int<lower=1, upper=T> idx_obs[N_obs];
  int<lower=1, upper=T> idx_miss[T - N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[T - N_obs] Y_miss;
}
transformed parameters {
  vector[T] Y;
  Y[idx_obs] = Y_obs;
  Y[idx_miss] = Y_miss;
} 
model {
   mu ~ normal(0, 1);
   sigma ~ cauchy(0, 4);
   Y[2:T] ~ normal(mu + Y[1:(T-1)], sigma);
}

with the idea being that you observe Y[idx_obs[n]] for n in 1:N_obs but don’t observe Y[idx_miss[n]] for n in 1:(T - N_obs). This isn’t the most efficient way to do it, but it should be OK unless your time series are very long and the fraction of missing data is relatively small.
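For what it’s worth, assembling the data for this model from a NaN-coded series is a few lines of NumPy on the PyStan side (the series here is made up, and the dict keys just mirror the data block above):

```python
import numpy as np

# Hypothetical series with missing values at t = 3 and t = 4 (1-based).
y = np.array([1.0, 2.7, np.nan, np.nan, 11.2, 14.0])

idx_obs = np.where(~np.isnan(y))[0] + 1    # 1-based indices for Stan
idx_miss = np.where(np.isnan(y))[0] + 1

data = {
    "T": len(y),
    "N_obs": len(idx_obs),
    "Y_obs": y[idx_obs - 1],
    "idx_obs": idx_obs,
    "idx_miss": idx_miss,
}
print(idx_obs.tolist(), idx_miss.tolist())   # [1, 2, 5, 6] [3, 4]
```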

A more efficient strategy is to split out the four cases (missing-missing, missing-observed, observed-missing, and observed-observed) into four sampling statements that could be vectorized.
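One way to build those four index sets, sketched in Python over a hypothetical NaN-coded series:

```python
import numpy as np

y = np.array([1.0, 2.7, np.nan, np.nan, 11.2, 14.0])
obs = ~np.isnan(y)

# Classify each transition t-1 -> t by whether the endpoints are observed
# (o) or missing (m).
cases = {"oo": [], "om": [], "mo": [], "mm": []}
for t in range(1, len(y)):
    key = ("o" if obs[t - 1] else "m") + ("o" if obs[t] else "m")
    cases[key].append(t + 1)   # 1-based index of the target time point

print(cases)   # {'oo': [2, 6], 'om': [3], 'mo': [5], 'mm': [4]}
```

Each list could then be passed to Stan and used to drive one vectorized sampling statement per case.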

It’s an odd time series with mu being added each step. Are you sure you want to do that without something downweighting the previous Y?

Thanks @Bob_Carpenter. I would very much love to do that, and it’s what I started with. The issue is that I will eventually be developing a monotonic time series, so the change from Y_t-1 to Y_t must be greater than or equal to 0. Inferring Y[t] directly falls down on this, while inferring Y[t] - Y[t-1] just requires setting bounds. Here’s the monotonic case:

data {
    int T; // number of time steps    
    vector[T] Y; // data to model
    int n_missing_observations;
}

parameters {
    real mu;
    real<lower=0> sigma;
    vector<lower=0, upper=1>[n_missing_observations] restricted_latent_parameters; // Change #1: lower=0
}


transformed parameters {
    vector[T] Y_latent;
    
    {
    int restricted_param_counter;
    int gap_width;
    real previous_value;
    int previous_value_index;
    
    restricted_param_counter = 1;
    
    Y_latent[1] = Y[1];
    previous_value = Y[1];
    previous_value_index = 1;

    for (t in 2:T){
        if (!is_nan(Y[t])){
            Y_latent[t] = Y[t];
            
            gap_width = t-previous_value_index;
            if (gap_width>1){
                // These are the unobserved UPDATES between observed time steps. 
                // I.e. if Y_3 and Y_1 are observed, but Y_2 is not, these are (Y_3 - Y_2) and (Y_2 - Y_1).
                // We will say that these updates have to sum up to the observed difference between Y_3 and Y_1.
                // The unobserved time steps then have values that are the cumulative sum of these updates.

                Y_latent[previous_value_index+1:t-1] = 
                cumulative_sum(
                 restricted_latent_parameters[restricted_param_counter:restricted_param_counter+gap_width-1]
                 / sum(restricted_latent_parameters[restricted_param_counter:restricted_param_counter+gap_width-1])
                 * (Y[t] - previous_value)
                 )[1:gap_width-1] + previous_value;

                // We don't include the last update in this sum, since we already explicitly grabbed the level
                // we ultimately get to from the data itself.

                restricted_param_counter = restricted_param_counter + gap_width-1;

            }

            previous_value = Y[t];
            previous_value_index = t;
        } 
    }
    }
}

model { 
    mu ~ normal(0, 1);
    sigma ~ cauchy(0, 4);
    for (t in 2:T){
        Y_latent[t] ~ normal(Y_latent[t-1]+mu, sigma) T[Y_latent[t-1],]; // Change #2: truncate at the previous value.
    }
}

Works great, though the n_eff isn’t what I’d like:

Inference for Stan model: anon_model_63d7a21a0bed00a4ce8f68be12d63c87.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                                  mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu                                 3.0  7.2e-4   0.04   2.91   2.97    3.0   3.03   3.09   3735    1.0
sigma                             0.97  6.1e-4   0.03   0.91   0.95   0.97   0.99   1.04   2697    1.0
restricted_latent_parameters[0]   0.62  4.7e-3   0.22   0.17   0.45   0.63   0.79   0.97   2121    1.0
restricted_latent_parameters[1]   0.62  4.7e-3   0.22   0.18   0.46   0.63   0.79   0.97   2073    1.0
restricted_latent_parameters[2]   0.62  4.7e-3   0.22   0.18   0.46   0.63   0.79   0.97   2151    1.0
Y_latent[0]                        0.0     0.0    0.0    0.0    0.0    0.0    0.0    0.0   4000    nan
Y_latent[1]                       4.65     0.0    0.0   4.65   4.65   4.65   4.65   4.65      2    nan
Y_latent[2]                       8.24     0.0    0.0   8.24   8.24   8.24   8.24   8.24      2    nan
Y_latent[3]                      11.44    0.01    0.7  10.01   11.0  11.46   11.9  12.81   4000    1.0
Y_latent[4]                      14.65    0.01   0.69  13.32  14.19  14.63  15.09  16.02   4000    1.0
Y_latent[5]                      17.85 5.0e-15 7.1e-15  17.85  17.85  17.85  17.85  17.85      2    nan
...

Whereas for inferring Y[t] directly, here’s the model:

data {
    int T; // number of time steps    
    vector[T] Y; // data to model
    int n_missing_observations;
}

parameters {
    real mu;
    real<lower=0> sigma;
    vector[n_missing_observations] latent_data; // can't put a lower bound on these, because they could be anything
}

transformed parameters {
    vector[T] Y_latent;
    
    {
    int latent_data_counter;
    latent_data_counter = 1;
    
    for (t in 1:T){
        if (is_nan(Y[t])){
            Y_latent[t] = latent_data[latent_data_counter];
            latent_data_counter = latent_data_counter + 1;
        } else{
            Y_latent[t] = Y[t];
        }
    }
    }
}

model { 
    mu ~ normal(0, 1);
    sigma ~ cauchy(0, 4);
    for (t in 2:T){
        Y_latent[t] ~ normal(Y_latent[t-1]+mu, sigma) T[Y_latent[t-1],];
    }
}

Just yields errors:

Initialization between (-2, 2) failed after 100 attempts.
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

I don’t understand the model, but Y_latent[1] is just getting set to a data value, so it doesn’t have a well-defined n_eff, as there’s zero variance in the sample. I’m also not sure where Y_latent[0] is coming from; I take it this is something other than RStan, since Stan itself indexes from 1.

Thanks, @Bob_Carpenter. This is output from PyStan, hence the 0-indexing. My issue isn’t the n_eff for the Y_latent values; they’re doing fine. It’s the n_eff for mu and sigma. If we were not modeling a monotonically increasing time series, and thus could use the method you previously described, the n_eff for mu and sigma would typically be more like 4000 (1 n_eff/sample). But in this case it’s lower.

As I wrote in my first post, I suspect the reason lies in how the latent data is constructed. For 2 missing observations there are 3 steps inferred as parameters, but those 3 values are not independent. If the 3 steps were collectively treated as a simplex, my understanding is that Stan would use only 2 underlying parameters. I wonder if there’s a way to get the same mechanics here.
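For reference, a K-step simplex has only K - 1 degrees of freedom; a rough sketch of the stick-breaking idea (simplified relative to Stan’s actual simplex transform, which also adds an index-dependent offset inside the logit):

```python
import numpy as np

def stick_breaking(z):
    """Map K-1 unconstrained values to a point on the K-simplex by repeatedly
    breaking off an inverse-logit fraction of the remaining stick."""
    k = len(z) + 1
    x = np.empty(k)
    remaining = 1.0
    for i, zi in enumerate(z):
        frac = 1.0 / (1.0 + np.exp(-zi))   # inverse logit: fraction to break off
        x[i] = remaining * frac
        remaining -= x[i]
    x[-1] = remaining
    return x

s = stick_breaking(np.array([0.2, -1.3]))   # 2 free parameters -> 3 steps
print(len(s), s.sum())                      # 3 steps that sum to 1
```

If the steps for each gap could be declared as a simplex scaled by (Y[t] - previous_value), the redundant direction would disappear; with ragged gaps that seems to require hand-rolling something like this transform.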

An n_eff/sample of 1/2 is fine. The underlying problem is that the geometry changes and becomes harder to explore. It’s when you get down to the 1/200 range that you want to worry and start using other checks to make sure the n_eff estimate itself is reasonable.

Good to know! Thanks.