# Sampling efficiency when using a ragged array in a simplex-like way

TL;DR: I’m modeling missing data points in a time series with a simplex-like construction. It’s relatively fast, but I think it has relatively low n_eff/sample, and I’d like to make it better.

I’m modeling a time series:

`Y_t ~ normal(Y_t-1 + mu, sigma)`

with missing observations:

`Y[3] = nan`
`Y[4] = nan`

I’m dealing with the missing observations by treating them as parameters to be estimated; if there’s an observation at time `t=2`, `Y_2`, and another observation at `t=5`, `Y_5`, then `Y_3` and `Y_4` are inferred as parameters. What do we know about those parameters? Since this is a time series, we know that the difference between the two observed time points (`Y_5` and `Y_2`) must be explained by the steps between `Y_2`, `Y_3`, `Y_4`, and `Y_5`.

`Y_5 - Y_2 == (Y_5 - Y_4) + (Y_4 - Y_3) + (Y_3 - Y_2)`

(This is particularly natural if we know that `Y` is monotonically increasing, in which case the time series model is `Y_t ~ normal(Y_t-1 + mu, sigma) T[Y_t-1,]`. The monotonic case isn’t germane to my question today, but it’s the use case I’m eventually building up to.)
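
To make the setup concrete, here’s a quick Python simulation of the series above (the seed and series length are arbitrary), checking that the telescoping identity holds whether or not the middle points are observed:

```python
import random

random.seed(1)
mu, sigma, T = 3.0, 1.0, 8

# Random walk with drift: Y_t ~ normal(Y_{t-1} + mu, sigma)
Y = [0.0]
for t in range(1, T):
    Y.append(Y[-1] + mu + random.gauss(0.0, sigma))

# Telescoping identity: the observed difference Y_5 - Y_2 is exactly the
# sum of the three intervening steps.
steps = [Y[5] - Y[4], Y[4] - Y[3], Y[3] - Y[2]]
assert abs(sum(steps) - (Y[5] - Y[2])) < 1e-9

# Mask the missing observations, as in the post.
Y_obs = list(Y)
Y_obs[3] = Y_obs[4] = float("nan")
```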

The way that I actually model this in Stan is not to treat `Y_4` and `Y_3` as parameters (the levels), but to treat `(Y_5 - Y_4)`, `(Y_4 - Y_3)`, and `(Y_3 - Y_2)` as parameters (the steps). We’re not modeling the levels; we’re modeling the steps. I can then constrain these steps using the equation above, and back out what the levels `Y_3` and `Y_4` should be. At the end of the post is my Stan model for this.

The problem is as follows: One of the parameters is unnecessary. If you know `Y_5 - Y_2`, `Y_5 - Y_4`, and `Y_4 - Y_3`, then you know `Y_3 - Y_2`. If we could lop off one of these parameters, surely things would be better. Is there a way to do this? Is there a better way to attack the problem?
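
To illustrate the redundancy in Python (a toy `steps` helper mirroring the normalize-and-scale logic of my model, not the Stan code itself): scaling the raw parameters by any constant produces exactly the same steps, so one degree of freedom does nothing.

```python
def steps(raw, total):
    # Divide the raw parameters by their sum, then scale so the steps
    # account for the observed total difference.
    s = sum(raw)
    return [total * r / s for r in raw]

a = steps([0.4, -0.1, 0.3], total=3.0)
b = steps([0.8, -0.2, 0.6], total=3.0)  # same direction, doubled scale
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```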

Example output (with `mu=3`, `sigma=1`, and `Y_3` and `Y_4` missing)

``````Inference for Stan model: anon_model_4888889db6e52ba8f2aaf6c10bceb20d.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                                   mean se_mean      sd   2.5%    25%     50%    75%  97.5%  n_eff   Rhat
mu                                 3.05  9.9e-4    0.05   2.96   3.02    3.05   3.09   3.14   2212    1.0
sigma                              1.03  7.7e-4    0.03   0.97   1.01    1.03   1.06    1.1   1896    1.0
restricted_latent_parameters[0] -3.8e-3    0.46    0.65  -0.95  -0.62   -0.03   0.61   0.95      2   3.07
restricted_latent_parameters[1] -6.6e-3    0.46    0.65  -0.97  -0.63    0.01   0.62   0.96      2    3.1
restricted_latent_parameters[2]  5.6e-4    0.46    0.65  -0.95  -0.62 -4.1e-3   0.62   0.95      2    3.1
Y_latent[0]                        2.73 3.1e-16 4.4e-16   2.73   2.73    2.73   2.73   2.73      2    nan
Y_latent[1]                        5.98 6.3e-16 8.9e-16   5.98   5.98    5.98   5.98   5.98      2    nan
Y_latent[2]                        8.49     0.0     0.0   8.49   8.49    8.49   8.49   8.49      2    nan
Y_latent[3]                       11.22    0.01    0.73   9.69  10.76   11.27  11.69  12.61   4000    1.0
Y_latent[4]                       13.98    0.01    0.72  12.56  13.51   13.96  14.44  15.41   4000    1.0
Y_latent[5]                       16.71     0.0     0.0  16.71  16.71   16.71  16.71  16.71      2    nan
Y_latent[6]                       19.79 5.0e-15 7.1e-15  19.79  19.79   19.79  19.79  19.79      2    nan
...
``````

If there are no missing observations then we have markedly higher `n_eff`; sometimes all the way up to 1 `n_eff/sample`.

And here’s the model:

``````
data {
  int T;               // number of time steps
  vector[T] Y;         // data to model
  int n_missing_steps; // total number of unobserved steps between observed points
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<lower=-1, upper=1>[n_missing_steps] restricted_latent_parameters;
}
transformed parameters {
  vector[T] Y_latent;

  {
    int restricted_param_counter;
    int gap_width;
    real previous_value;
    int previous_value_index;

    restricted_param_counter = 1;

    Y_latent[1] = Y[1];
    previous_value = Y[1];
    previous_value_index = 1;

    for (t in 2:T) {
      if (!is_nan(Y[t])) {
        Y_latent[t] = Y[t];

        gap_width = t - previous_value_index;
        if (gap_width > 1) {
          // These are the unobserved STEPS between observed time steps.
          // I.e., if Y_3 and Y_1 are observed but Y_2 is not, these are
          // (Y_3 - Y_2) and (Y_2 - Y_1).
          // We will say that these steps have to sum up to the observed
          // difference between Y_3 and Y_1. The unobserved levels then have
          // values that are the cumulative sum of these steps.

          // Take the parameters and divide by their sum, then multiply by the
          // total step that needs to be accounted for. You now have a set of
          // steps that lead up to the next observed level.
          Y_latent[previous_value_index + 1:t - 1] =
            cumulative_sum(
              restricted_latent_parameters[restricted_param_counter:restricted_param_counter + gap_width - 1]
              / sum(restricted_latent_parameters[restricted_param_counter:restricted_param_counter + gap_width - 1])
              * (Y[t] - previous_value)
            )[1:gap_width - 1] + previous_value;

          // We don't include the last step in this sum, since we already
          // explicitly grabbed the level we ultimately get to from the data itself.

          restricted_param_counter = restricted_param_counter + gap_width - 1;
        }

        previous_value = Y[t];
        previous_value_index = t;
      }
    }
  }
}
model {
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 4);
  for (t in 2:T) {
    Y_latent[t] ~ normal(Y_latent[t - 1] + mu, sigma);
  }
}
``````

In terms of thinking about this, you just want to follow the generative model. You don’t want to think about all those differences; just remember that `y[t + 1]` gets a distribution in terms of `y[t]` and that everything fits into a big joint density, so you can do inference on any of the unknown pieces you want.

So I’d implement this with the approach outlined in the manual for missing data. You might find it easier to store the total number of time steps `T` along with the observed and missing indexes.

``````
data {
  int<lower=0> T;
  int<lower=0> N_obs;
  vector[N_obs] Y_obs;
  int<lower=1, upper=T> idx_obs[N_obs];
  int<lower=1, upper=T> idx_miss[T - N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[T - N_obs] Y_miss;
}
transformed parameters {
  vector[T] Y;
  Y[idx_obs] = Y_obs;
  Y[idx_miss] = Y_miss;
}
model {
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 4);
  Y[2:T] ~ normal(mu + Y[1:(T - 1)], sigma);
}
``````

with the idea being that you observe `Y[idx_obs[n]]` for `n in 1:N_obs` but don’t observe `Y[idx_miss[n]]` for `n in 1:(T - N_obs)`. This isn’t the most efficient way to do this, but it should be OK unless your time series are very long and the fraction of missing data is relatively small.

A more efficient strategy is to split out the four cases (missing-missing, missing-observed, observed-missing, and observed-observed) into four sampling statements that could be vectorized.
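
A sketch of that bookkeeping in Python, done once before passing the data to Stan (the function name `transition_cases` and the two-letter codes are illustrative, not any fixed API):

```python
import math

def transition_cases(y):
    # Classify each transition t-1 -> t by whether its endpoints are
    # observed ("o") or missing ("m"), so each group can get its own
    # vectorized sampling statement.
    cases = {"oo": [], "om": [], "mo": [], "mm": []}
    for t in range(1, len(y)):
        key = ("m" if math.isnan(y[t - 1]) else "o") + \
              ("m" if math.isnan(y[t]) else "o")
        cases[key].append(t)
    return cases

nan = float("nan")
cases = transition_cases([1.0, nan, nan, 4.0, 5.0])
# -> {'oo': [4], 'om': [1], 'mo': [3], 'mm': [2]}
```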

It’s an odd time series with `mu` being added each step. Are you sure you want to do that without something downweighting the previous `Y`?

Thanks @Bob_Carpenter. I would very much love to do that, and it’s what I started with. The issue is that I will eventually be developing a monotonic time series, where the change from `Y_t-1` to `Y_t` must be greater than or equal to 0. Inferring `Y[t]` directly falls down on this, while inferring `Y[t] - Y[t-1]` just requires setting some bounds. Here are the monotonic cases:

``````
data {
  int T;       // number of time steps
  vector[T] Y; // data to model
  int n_missing_observations;
}

parameters {
  real mu;
  real<lower=0> sigma;
  // Change #1: lower=0
  vector<lower=0, upper=1>[n_missing_observations] restricted_latent_parameters;
}

transformed parameters {
  vector[T] Y_latent;

  {
    int restricted_param_counter;
    int gap_width;
    real previous_value;
    int previous_value_index;

    restricted_param_counter = 1;

    Y_latent[1] = Y[1];
    previous_value = Y[1];
    previous_value_index = 1;

    for (t in 2:T) {
      if (!is_nan(Y[t])) {
        Y_latent[t] = Y[t];

        gap_width = t - previous_value_index;
        if (gap_width > 1) {
          // These are the unobserved UPDATES between observed time steps.
          // I.e., if Y_3 and Y_1 are observed but Y_2 is not, these are
          // (Y_3 - Y_2) and (Y_2 - Y_1).
          // We will say that these updates have to sum up to the observed
          // difference between Y_3 and Y_1. The unobserved time steps then
          // have values that are the cumulative sum of these updates.

          Y_latent[previous_value_index + 1:t - 1] =
            cumulative_sum(
              restricted_latent_parameters[restricted_param_counter:restricted_param_counter + gap_width - 1]
              / sum(restricted_latent_parameters[restricted_param_counter:restricted_param_counter + gap_width - 1])
              * (Y[t] - previous_value)
            )[1:gap_width - 1] + previous_value;

          // We don't include the last update in this sum, since we already
          // explicitly grabbed the level we ultimately get to from the data itself.

          restricted_param_counter = restricted_param_counter + gap_width - 1;
        }

        previous_value = Y[t];
        previous_value_index = t;
      }
    }
  }
}

model {
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 4);
  for (t in 2:T) {
    // Change #2: truncate at the previous value.
    Y_latent[t] ~ normal(Y_latent[t - 1] + mu, sigma) T[Y_latent[t - 1], ];
  }
}
``````

Works great, though not with the n_eff I’d like:

``````Inference for Stan model: anon_model_63d7a21a0bed00a4ce8f68be12d63c87.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                                   mean se_mean      sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu                                  3.0  7.2e-4    0.04   2.91   2.97    3.0   3.03   3.09   3735    1.0
sigma                              0.97  6.1e-4    0.03   0.91   0.95   0.97   0.99   1.04   2697    1.0
restricted_latent_parameters[0]    0.62  4.7e-3    0.22   0.17   0.45   0.63   0.79   0.97   2121    1.0
restricted_latent_parameters[1]    0.62  4.7e-3    0.22   0.18   0.46   0.63   0.79   0.97   2073    1.0
restricted_latent_parameters[2]    0.62  4.7e-3    0.22   0.18   0.46   0.63   0.79   0.97   2151    1.0
Y_latent[0]                         0.0     0.0     0.0    0.0    0.0    0.0    0.0    0.0   4000    nan
Y_latent[1]                        4.65     0.0     0.0   4.65   4.65   4.65   4.65   4.65      2    nan
Y_latent[2]                        8.24     0.0     0.0   8.24   8.24   8.24   8.24   8.24      2    nan
Y_latent[3]                       11.44    0.01     0.7  10.01   11.0  11.46   11.9  12.81   4000    1.0
Y_latent[4]                       14.65    0.01    0.69  13.32  14.19  14.63  15.09  16.02   4000    1.0
Y_latent[5]                       17.85 5.0e-15 7.1e-15  17.85  17.85  17.85  17.85  17.85      2    nan
...
``````

Whereas for inferring `Y[t]` directly, here’s the model:

``````
data {
  int T;       // number of time steps
  vector[T] Y; // data to model
  int n_missing_observations;
}

parameters {
  real mu;
  real<lower=0> sigma;
  // Can't put a lower bound on these, because they could be anything.
  vector[n_missing_observations] latent_data;
}

transformed parameters {
  vector[T] Y_latent;

  {
    int latent_data_counter;
    latent_data_counter = 1;

    for (t in 1:T) {
      if (is_nan(Y[t])) {
        Y_latent[t] = latent_data[latent_data_counter];
        latent_data_counter = latent_data_counter + 1;
      } else {
        Y_latent[t] = Y[t];
      }
    }
  }
}

model {
  mu ~ normal(0, 1);
  sigma ~ cauchy(0, 4);
  for (t in 2:T) {
    Y_latent[t] ~ normal(Y_latent[t - 1] + mu, sigma) T[Y_latent[t - 1], ];
  }
}
``````

Just yields errors:

``````Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can't start sampling from this initial value.
``````
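
My guess at what’s going on, sketched in Python (the `trunc_normal_lpdf` helper is a hand-rolled stand-in for Stan’s `T[lower, ]` truncation, not Stan’s actual implementation): random inits on (-2, 2) almost surely put some latent level below its predecessor, where the truncated density is zero, hence `log(0)`.

```python
import math

def trunc_normal_lpdf(y, mu, sigma, lower):
    # Lower-truncated normal log density: zero mass (log density -inf)
    # below the truncation point.
    if y < lower:
        return -math.inf
    z = (y - mu) / sigma
    log_norm = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    # Normalize by P(Y > lower) under the untruncated normal.
    tail = 0.5 * math.erfc((lower - mu) / (sigma * math.sqrt(2.0)))
    return log_norm - math.log(tail)

# A latent level initialized near 0 but following an observed 8.49:
assert trunc_normal_lpdf(1.0, mu=11.5, sigma=1.0, lower=8.49) == -math.inf
# The same level above the truncation point has finite log density:
assert trunc_normal_lpdf(12.0, mu=11.5, sigma=1.0, lower=8.49) > -math.inf
```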

I don’t understand the model, but `Y_latent[1]` is just getting set to a data value, so it doesn’t have a well-defined n_eff as there’s zero variance in the sample. I’m also not sure where `Y_latent[0]` is coming from—I take it this is something other than RStan—Stan itself indexes from 1.

Thanks, @Bob_Carpenter. This is output from `pystan`, hence the 0-indexing. My issue is not the `n_eff` for the `Y_latent` values; those are doing fine. It’s the `n_eff` for `mu` and `sigma`. If we were not modeling a monotonically increasing time series, and thus used the method you previously described, then `n_eff` for `mu` and `sigma` would typically be more like 4000 (1 n_eff/sample). But in this case, `n_eff` is lower.

As I wrote in my first post, I suspect the reason lies in how the latent data is constructed. For 2 missing observations there are 3 steps inferred as parameters, but the 3 values are not independent. If the 3 steps were collectively treated as a simplex, my understanding is that Stan would actually only use 2 parameters. I wonder if there’s a way to achieve the same mechanics here.
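
For reference, here’s a Python sketch of the stick-breaking transform the Stan reference manual describes for `simplex` types (omitting the log-Jacobian term that Stan also accumulates): K-1 unconstrained values yield K simplex coordinates, which is exactly the parameter saving I’m after.

```python
import math

def simplex_from_unconstrained(y):
    # Stick-breaking: K-1 unconstrained values -> K nonnegative values
    # summing to 1. The log(1 / (K - k - 1)) offset makes y = 0 map to
    # the uniform simplex.
    K = len(y) + 1
    x, remaining = [], 1.0
    for k, yk in enumerate(y):
        z = 1.0 / (1.0 + math.exp(-(yk + math.log(1.0 / (K - k - 1)))))
        x.append(remaining * z)
        remaining -= x[-1]
    x.append(remaining)
    return x

x = simplex_from_unconstrained([0.0, 0.0])  # 2 free params -> 3 coordinates
assert abs(sum(x) - 1.0) < 1e-12            # x is roughly [1/3, 1/3, 1/3]
```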

1/2 is fine. The underlying problem is that the geometry changes and it becomes harder to explore. It’s when you get down to the 1/200 range that you want to worry and start using other checks to make sure the n_eff estimate itself is reasonable.

Good to know! Thanks.