Ragged array of simplexes

In exchange for substantial increases in computation – 2^17 leapfrog steps is a huge computational burden that will make many analyses impractical. I think that’s a pretty fair criterion for “not working well”.

That 2^17 was for over 4 billion observed counts, which pin the posterior distribution of pi down to many decimal places. If you only have a couple hundred thousand counts, then you don’t have to change max_treedepth to get the same essentially perfect estimates.

Thanks @bgoodri - it’s really helpful to see when these things work and how they break down (i.e. statistically valid but slow, or introducing bias).

Out of curiosity, I coded up the identifiable model

data {
  int<lower=2> K;
  int<lower=0,upper=2147483647> x[K];
}
parameters {
  vector<lower=0>[K-1] gamma;
}
transformed parameters {
  real sum_gamma = 1 + sum(gamma);
  vector[K] pi = append_row(1,gamma) / sum_gamma;
}
model {
  target += -K * log(sum_gamma);  // log Jacobian of the gamma -> pi transform
  target += dirichlet_lpdf(pi | rep_vector(1,K));
  target += multinomial_lpmf(x | pi);
}

which has comparable performance:

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
gamma[1]    2.00    0.00 0.00   2.00   2.00   2.00   2.00   2.00  1234    1
gamma[2]    3.00    0.00 0.00   3.00   3.00   3.00   3.00   3.00  1378    1
gamma[3]    4.00    0.00 0.00   4.00   4.00   4.00   4.00   4.00  1350    1
sum_gamma  10.00    0.00 0.00  10.00  10.00  10.00  10.00  10.00  1171    1
pi[1]       0.10    0.00 0.00   0.10   0.10   0.10   0.10   0.10  1171    1
pi[2]       0.20    0.00 0.00   0.20   0.20   0.20   0.20   0.20  2797    1
pi[3]       0.30    0.00 0.00   0.30   0.30   0.30   0.30   0.30  4000    1
pi[4]       0.40    0.00 0.00   0.40   0.40   0.40   0.40   0.40  4000    1
lp__      -38.69    0.03 1.16 -41.81 -39.26 -38.39 -37.84 -37.36  1851    1

but is nearly 1000 times faster:

sum(sapply(post.2@sim$samples, function(z) attr(z, "elapsed_time")))
[1] 0.33933
sum(sapply(post@sim$samples, function(z) attr(z, "elapsed_time")))
[1] 275.2609

I wonder if there would be situations where the non-identified version would perform better…


Thanks Ben for the suggestion to sample from gamma distributions and transform to avoid fussing with the Jacobians.
Sorry, I should have acknowledged your suggestion for this approach in the previous thread I referenced from last year. I went for the option of implementing the simplex functions with explicit constraints because I thought the extra parameters might reduce efficiency, but didn’t do any comparisons on it.

At some point I’ll try to implement both in my model and report how the performance compares (though at the moment I think my sampling issues have more to do with trying to fit a model that isn’t really well identified from the data than with the implementation of the simplex…).

Thanks also Aaron for the clever suggestion for returning the log Jacobian and transform together in the Stan function.

Thanks,
Jeff

The identified versions are generally better, but we obviously haven’t tried every case.

The distinguishing feature is that true unbounded non-identifiability leads to an improper posterior, as in the example from the problematic posteriors chapter of the manual:

y ~ normal(alpha + beta, sigma);

with no priors on alpha or beta. This is just the simplest case of a non-identified regression—it’s two intercepts. If you add a prior, the posterior becomes proper, but the likelihood is still non-identified. It’s of course better in this case just to use

y ~ normal(gamma, sigma);

I really like this approach of sampling from a normalized set of gamma random variables so that the lengths of the simplexes can vary. I’m not sure how to actually turn this into a ragged array, though. It seems like pi is just a single simplex in the code examples. Am I missing something?

If, for example, you need a simplex of size K1 and a simplex of size K2, declare a positive_ordered[K1 + K2] vector, separately normalize the first K1 and the last K2 elements, and put unit-scale gamma priors on things accordingly.

Actually, if you know that you need exactly two simplexes, you should just do

simplex[K1] pi_1;
simplex[K2] pi_2;

but this general idea works when the number of simplexes needed is not known until runtime.


Thanks for the input Ben. I’m trying to get as far as I can with my hierarchical model before attending StanCon next week so I can ask better questions when I get there :).

The number of simplexes is a function of the number of observations and they’ll number in the hundreds or thousands.

Can you give me another hint on how your proposal works? If I normalize the elements of the positive_ordered vector, won’t it lose its ordering?

Also, is the prior this function?

target += exponential_lpdf(gamma | 1);

I’m not sure how to split this off from the likelihood. Would it be something like:

data {
   int<lower=1> O;  // count of observations where each observation contains 1 or more sub-observations
   int<lower=1> S;  // count of all sub-observations
   int<lower=2> K;  // maximum size of the simplex for all sub-observations
}

parameters {
   vector<lower=0>[K] gamma[O];  //randomly sampled values from gamma distribution for each observation
}

transformed parameters {
   for (o in 1:O)   
      vector[K] pi[o] = gamma[o] / sum(gamma[o]);
}

model {
   for (o in 1:O) {
      pi[o] ~ exponential(1);
   }
   // use pi in the likelihood function
}

Am I close?

I meant to say just vector<lower=0>[K1 + K2]; in general, the elements do not need to be ordered. The exponential prior corresponds to a gamma prior with a shape of 1 and a scale of 1, which implies the resulting simplex vector has a uniform distribution. You can use other shapes but the prior is on the thing that you declare in parameters rather than the simplexes that you declare in the transformed parameters or the model block.
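
A minimal sketch of what that could look like for the fixed two-simplex case (K1, K2, and pi_1/pi_2 follow the names above; zeta and the shape vectors alpha_1 and alpha_2 are just illustrative):

data {
  int<lower=2> K1;
  int<lower=2> K2;
  vector<lower=0>[K1] alpha_1;   // Dirichlet shapes for the first simplex
  vector<lower=0>[K2] alpha_2;   // Dirichlet shapes for the second simplex
}
parameters {
  vector<lower=0>[K1 + K2] zeta;
}
transformed parameters {
  vector[K1] pi_1 = head(zeta, K1) / sum(head(zeta, K1));
  vector[K2] pi_2 = tail(zeta, K2) / sum(tail(zeta, K2));
}
model {
  // independent gamma(alpha, 1) on zeta implies pi_1 ~ dirichlet(alpha_1), pi_2 ~ dirichlet(alpha_2)
  target += gamma_lpdf(head(zeta, K1) | alpha_1, 1);
  target += gamma_lpdf(tail(zeta, K2) | alpha_2, 1);
  // ... use pi_1 and pi_2 in the likelihood
}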

Your example would not parse. There is a more complicated example at


where zeta is what you call gamma and I am using the segment function to extract the next nc elements after zeta_mark from it.


Thanks again Ben! So I think I’ve managed to incorporate your feedback into a model (that compiles). I’m pasting it below for anybody else who has the same question.

My remaining question is how to integrate the shape parameter like you’ve mentioned. As you point out, this example code is using a uniform distribution, but I really need to modulate the shape. Do I need an additional variable in transformed parameters or am I supposed to do some kind of initialization of the zeta?

data {
   int<lower=1> M;               // num observations
   int<lower=1> N;                // number of sub-observations for all observations
   real<lower=0> y[N];               //Observations where each row is a subobservation
   int<lower=1,upper=M> observation[N];    //observation id for subobservation n
   int<lower=1> row_to_subobservation_map[N];      //subobservation index for each row
   int<lower=1> count_of_subobservations_per_observation[M];
}

parameters {
   vector<lower=0,upper=1.0>[N] zeta;      // subobservation dist over observation
} 

model {
   
   for (n in 1:N) {
      
      int zeta_mark = n-row_to_subobservation_map[n]+1;
      int nc = count_of_subobservations_per_observation[observation[n]];
      
      vector[nc] pi = segment(zeta,zeta_mark,nc);  // gamma(zeta | shape, 1)
      pi = pi / sum(pi);                           // thus dirichlet(pi | shape)

      target += log(y[n]*pi[row_to_subobservation_map[n]]); // likelihood

  }
}

zeta should not have an upper bound of 1. You just need to do

target += gamma_lpdf(zeta | shape, 1);

where shape is a positive vector that you pass in as data.

I’m still struggling here. Do I really want to increment the target with both pi and zeta? For some reason that seems off to me, but I’m pretty new at this so it’s quite possible.

When I use (a multivariate version of) the model below I get warnings about divergent transitions. Is this the model you’re proposing? If so, I’ll look into issues with incorporating my additional feature variables; otherwise I’ll try to fix my model.

data {
   int<lower=1> M;               // num observations
   int<lower=1> N;                // number of sub-observations for all observations
   real<lower=0> y[N];               //Observations where each row is a subobservation
   int<lower=1,upper=M> observation[N];    //observation id for subobservation n
   int<lower=1> row_to_subobservation_map[N];      //subobservation index for each row
   int<lower=1> count_of_subobservations_per_observation[M];
   real shape;     // subobservation weight prior

}

transformed data {
   vector<lower=0>[N] shape_vec;
   for (n in 1:N){
      shape_vec[n] = shape; //symmetric dirichlet prior
   }  
}

parameters {
   vector<lower=0>[N] zeta;      // subobservation dist over observation
} 

transformed parameters {

}

model {
   
   for (n in 1:N) {
      
      int zeta_mark = n-row_to_subobservation_map[n]+1;
      int nc = count_of_subobservations_per_observation[observation[n]];
      
      vector[nc] pi = segment(zeta,zeta_mark,nc);  // gamma(zeta | shape, 1)
      pi = pi / sum(pi);                           // thus dirichlet(pi | shape)

      target += log(y[n]*pi[row_to_subobservation_map[n]]); // likelihood
      
  }
  target += gamma_lpdf(zeta | shape, 1.);
}

No, just by zeta.

So where does pi fit in if I’m only inferring zeta? How does this enforce the simplex?

The line pi = pi / sum(pi); implies pi sums to 1, and it was already nonnegative, so it is a simplex.


Bumping this old thread in hopes that someone can help me clarify one specific point. I would like to use @aaronjg’s identifiable formulation below, but I don’t entirely understand it.

data {
 int<lower=2> K;
 int<lower=0,upper=2147483647> x[K];
}
parameters {
 vector<lower=0>[K-1] gamma;
}
transformed parameters {
 real sum_gamma = 1 + sum(gamma);
 vector[K] pi = append_row(1,gamma) / sum_gamma;
}
model {
 target += -K *log(sum_gamma);
 target += dirichlet_lpdf(pi | rep_vector(1,K));
 target += multinomial_lpmf(x | pi);
}

Specifically, I gather that -K *log(sum_gamma) is the log Jacobian of the transform from gamma to pi, but I’m not sure why that’s the case. I’ve tried to work it out, but TBH I get a bit lost already with formulating an invertible mapping from K-1 dimensions to K. Clearly the simplex pi is effectively only K-1-dimensional, but if you drop, say, pi[1] it seems like you would still need to include sum_gamma in order to solve for gamma, and then you’re back to K dimensions. Regardless of how I set it up, the total derivative winds up with a bunch of 1/sum_gamma^2 terms and I don’t see how it simplifies to the expression in the code.

Any hints would be greatly appreciated.

Think about the gamma vector as K-dimensional - there are the original K-1 defined gammas and then the 1 which is implicitly there as well. The function from \mathbb{R}^K to \mathbb{R}^K is just dividing each component of this vector by sum_gamma, so the Jacobian is diag(1/sum_gamma), which is a K \times K matrix. The determinant of the Jacobian is therefore (1/sum_gamma)^K, and thus the log-determinant is -K * log(sum_gamma).

Thanks for the reply. That \mathbb{R}^K \to \mathbb{R}^K mapping makes sense, and it’s invertible because one element of the simplex \pi is the reciprocal of the sum \Gamma := \sum_{i=1}^K \gamma_{i}. I’m still not sure I understand the Jacobian, though; for example,

J_{11} = \frac{\partial \pi_{1}}{\partial \gamma_{1}} = \frac{\partial}{\partial \gamma_{1}} \left[\frac{\gamma_{1}}{\Gamma}\right] = \frac{1 \cdot \Gamma - \gamma_{1} \cdot 1}{\Gamma^2} = \frac{1}{\Gamma} - \frac{\gamma_{1}}{\Gamma^2}.

Maybe I just need remedial calculus?
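
For reference, here is a sketch of the same determinant computed directly on the (K-1)-dimensional map \gamma \mapsto (\pi_2, \ldots, \pi_K), using the code’s indexing \pi_{j+1} = \gamma_j / S with S = 1 + \sum_i \gamma_i (\pi_1 = 1/S is dropped because the other components determine it). The partial derivatives are

J_{ji} = \frac{\partial \pi_{j+1}}{\partial \gamma_i} = \frac{\delta_{ij}}{S} - \frac{\gamma_j}{S^2}, \qquad \text{i.e.} \qquad J = \frac{1}{S}\left(I_{K-1} - \frac{\gamma \mathbf{1}^\top}{S}\right),

and by the matrix determinant lemma \det(I + uv^\top) = 1 + v^\top u,

\det J = \left(\frac{1}{S}\right)^{K-1}\left(1 - \frac{\mathbf{1}^\top \gamma}{S}\right) = \left(\frac{1}{S}\right)^{K-1} \cdot \frac{1}{S} = S^{-K}, \qquad \log\lvert\det J\rvert = -K \log S.

So the 1/S^2 terms are real, but they form a rank-one correction that collapses to a single factor of 1/S in the determinant.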

I’ve been using a different approach based on simplex amalgamation. If K_{0} is the dimension of the largest simplex you’ll need, you can declare an array of length-K_{0} simplexes and form a smaller one, say of length K, by amalgamating the last (or first) K_{0} - K + 1 elements. An appropriate Dirichlet prior ensures that the amalgamation is simplex-uniform, which is what I want. Here’s a toy example:

data {
  int<lower=2> K0;          // length of original simplex
  int<lower=1,upper=K0> K;  // length of amalgamated simplex
  int<lower=0> x[K];        // multinomial counts
}
parameters {
  simplex[K0] gamma;
}
transformed parameters {
  vector<lower=0>[K] pi = append_row(head(gamma, K - 1), sum(tail(gamma, K0 - K + 1)));
}
model {
  // prior on gamma that implies pi ~ Dir(rep_vector(1,K))
  gamma ~ dirichlet(append_row(rep_vector(1, K - 1), rep_vector(1.0/(K0 - K + 1), K0 - K + 1)));

  // likelihood
  x ~ multinomial(pi);
}
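
The prior on gamma relies on the standard Dirichlet aggregation property: amalgamating components of a Dirichlet vector sums their concentration parameters, i.e. if \gamma \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_{K_0}) then

(\gamma_1, \ldots, \gamma_{K-1}, \gamma_K + \cdots + \gamma_{K_0}) \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_{K-1}, \alpha_K + \cdots + \alpha_{K_0}),

so setting \alpha_i = 1 on the first K - 1 components and \alpha_i = 1/(K_0 - K + 1) on the amalgamated ones gives the collapsed \pi a uniform Dirichlet(1, \ldots, 1) distribution.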

The amalgamated K_{0} - K + 1 elements of \gamma are nonidentified, but bounded because they’re just sampled from the simplex prior. Seems to work fine in practice, and I find it easier to wrap my head around (YMMV).

@ebuhle: For the case of N simplexes with different dimensions, I would need an array of simplexes (each of length K_0):

parameters {
  simplex[K0] gamma[N];
}

and then each will amalgamate differently. Would this approach cause identifiability problems?

@Fabian_Crespo, that is actually my real use case, and it works fine. I need multiple simplices of each “reduced” size K, so I construct the corresponding Dirichlet \alpha parameters in transformed data rather than doing it on the fly.
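
A sketch of how that transformed data construction might look (K_reduced is just an illustrative name for an integer array, declared in data alongside K0 and N, giving the reduced simplex size for each case):

transformed data {
  vector[K0] alpha[N];   // Dirichlet shapes for each amalgamated simplex
  for (n in 1:N) {
    int K = K_reduced[n];   // reduced size for case n
    alpha[n] = append_row(rep_vector(1, K - 1),
                          rep_vector(1.0 / (K0 - K + 1), K0 - K + 1));
  }
}

with gamma[n] ~ dirichlet(alpha[n]); as the prior in the model block.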