Ragged array of simplexes


#8

There’s a quick discussion in the manual. Stan should fail to fit if there’s unbounded non-identifiability in the model.

Usually there’s some kind of weak identifiability through priors (including bounds), in which case, your mileage will vary depending on the specifics.


#9

Thanks @bob_carpenter, I guess I'm still confused, since some sections of the manual (23.1, Mitigating the Invariances) say that non-identifiabilities are bad and should be eliminated, others say that it is fine as long as it is constrained (unit vector), and others seem to have no issue with it at all (reparameterizing a Student-t distribution).

I feel like there is some nuance between these different cases that I am missing…

@jeffeaton - if you have performance issues, you might be able to do something like this, where the log Jacobian and the simplex are calculated at the same time, but then you would have to split the result up in the transformed parameters block.

  // Stick-breaking transform: maps an unconstrained vector y of length K - 1
  // to a simplex x of length K; the first element of the return value is the
  // log Jacobian of the transform.
  vector simplex_constrain_new(vector y) {
    real lj = 0;                          // accumulates the log Jacobian
    vector[rows(y) + 1] x;
    int Km1 = rows(y);
    real stick_len = 1.0;

    for (k in 1:Km1) {
      real eq_share = -log(Km1 - k + 1);  // y[k] = 0 gives an equal share
      real adj_y_k = y[k] + eq_share;
      real z_k = inv_logit(adj_y_k);
      x[k] = stick_len * z_k;             // break off a piece of the remaining stick
      lj = lj + log(stick_len) - log1p_exp(-adj_y_k) - log1p_exp(adj_y_k);
      stick_len = stick_len - x[k];
    }
    x[Km1 + 1] = stick_len;               // last element gets whatever is left

    return append_row(lj, x);             // [log Jacobian, simplex]
  }
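
To use it, the return value would be split up in the transformed parameters block; here is a rough sketch, assuming K is declared in data and the function above sits in the functions block:

  parameters {
    vector[K - 1] y;                      // unconstrained parameters
  }
  transformed parameters {
    vector[K + 1] lj_and_pi = simplex_constrain_new(y);
    real lj = lj_and_pi[1];               // log Jacobian of the transform
    vector[K] pi = tail(lj_and_pi, K);    // the simplex itself
  }
  model {
    target += lj;                         // apply the Jacobian adjustment
    // ... use pi in the likelihood ...
  }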

#10

I think you may have misinterpreted my post as being literal.


#11

All well and good if you want to sample directly from a Dirichlet, which is not the relevant task here. When sampling directly there's no problem because the constraint is applied only after the fact. When using this as a prior, however, you have to confront the non-identifiability. The gamma priors will weakly identify the posterior to a neighborhood, but within that neighborhood the data will constrain the posterior onto a nonlinear surface that will eventually obstruct efficient sampling. Consequently, in the relevant task the softmax'd gamma parameterization will work poorly with MCMC. As I said.


#12

I didn't realize your relevant tasks required 64-bit integers. With multinomial data that Stan can actually ingest, the phenomenon you are referring to manifests itself in the leapfrogger taking more and smaller steps and yielding imprecise estimates of the irrelevant vector gamma, but MCMC sampling for the simplex pi is fine when you do

post <- stan("simplexes.stan", data = list(K = 4, x = 2 * 
             rmultinom(1, size = .Machine$integer.max, prob = 1:4 / 10)[,1]), 
             control = list(max_treedepth = 17L))

to get

> print(post, digits = 4)
Inference for Stan model: simplexes.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean     sd     2.5%      25%      50%      75%    97.5% n_eff   Rhat
gamma[1]   0.4009  0.0183 0.2186   0.0888   0.2469   0.3643   0.5115   1.0050   143 1.0113
gamma[2]   0.8018  0.0366 0.4373   0.1776   0.4938   0.7287   1.0232   2.0101   143 1.0113
gamma[3]   1.2027  0.0549 0.6560   0.2664   0.7407   1.0931   1.5347   3.0151   143 1.0113
gamma[4]   1.6036  0.0732 0.8746   0.3552   0.9877   1.4574   2.0462   4.0202   143 1.0113
pi[1]      0.1000  0.0000 0.0000   0.1000   0.1000   0.1000   0.1000   0.1000  2658 0.9998
pi[2]      0.2000  0.0000 0.0000   0.2000   0.2000   0.2000   0.2000   0.2000  4000 0.9994
pi[3]      0.3000  0.0000 0.0000   0.3000   0.3000   0.3000   0.3000   0.3000  4000 1.0005
pi[4]      0.4000  0.0000 0.0000   0.4000   0.4000   0.4000   0.4000   0.4000  4000 0.9994
lp__     -39.5860  0.0715 1.4965 -43.3719 -40.3148 -39.2481 -38.4819 -37.7376   438 1.0062

for simplexes.stan defined as

data {
  int<lower=2> K;
  int<lower=0,upper=2147483647> x[K];
}
parameters {
  vector<lower=0>[K] gamma;
}
transformed parameters {
  vector[K] pi = gamma / sum(gamma);
}
model {
  target += exponential_lpdf(gamma | 1);
  target += multinomial_lpmf(x | pi);
}

#13

In exchange for a substantial increase in computation: 2^17 leapfrog steps is a huge computational burden that will make many analyses impractical. I think that's a pretty fair criterion for "not working well".


#14

That 2^17 is for over 4 billion observed counts, which pin the posterior distribution of pi down to many decimal places. If you only have a couple hundred thousand counts, then you don't have to change max_treedepth to get the same essentially perfect estimates.


#15

Thanks @bgoodri - it's really helpful to see when these things work and how they break down (i.e. statistically valid but slow, or introducing bias).

Out of curiosity, I coded up the identifiable model

data {
  int<lower=2> K;
  int<lower=0,upper=2147483647> x[K];
}
parameters {
  vector<lower=0>[K-1] gamma;
}
transformed parameters {
  real sum_gamma = 1 + sum(gamma);
  vector[K] pi = append_row(1,gamma) / sum_gamma;
}
model {
  target += -K * log(sum_gamma);  // log Jacobian of the gamma -> pi transformation
  target += dirichlet_lpdf(pi | rep_vector(1,K));
  target += multinomial_lpmf(x | pi);
}

which has comparable performance:

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
gamma[1]    2.00    0.00 0.00   2.00   2.00   2.00   2.00   2.00  1234    1
gamma[2]    3.00    0.00 0.00   3.00   3.00   3.00   3.00   3.00  1378    1
gamma[3]    4.00    0.00 0.00   4.00   4.00   4.00   4.00   4.00  1350    1
sum_gamma  10.00    0.00 0.00  10.00  10.00  10.00  10.00  10.00  1171    1
pi[1]       0.10    0.00 0.00   0.10   0.10   0.10   0.10   0.10  1171    1
pi[2]       0.20    0.00 0.00   0.20   0.20   0.20   0.20   0.20  2797    1
pi[3]       0.30    0.00 0.00   0.30   0.30   0.30   0.30   0.30  4000    1
pi[4]       0.40    0.00 0.00   0.40   0.40   0.40   0.40   0.40  4000    1
lp__      -38.69    0.03 1.16 -41.81 -39.26 -38.39 -37.84 -37.36  1851    1

but is roughly 800 times faster:

sum(sapply(post.2@sim$samples, function(z) attr(z, "elapsed_time")))
[1] 0.33933
sum(sapply(post@sim$samples, function(z) attr(z, "elapsed_time")))
[1] 275.2609

I wonder if there would be situations where the underdefined solution would perform better…


#16

Thanks Ben for the suggestion to sample from gamma distributions and transform to avoid fussing with the Jacobians.
Sorry, I should have acknowledged your suggestion for this approach in the previous thread I referenced from last year. I went for the option of implementing the simplex functions with explicit constraints because I thought the extra parameters might reduce efficiency, but didn’t do any comparisons on it.

At some point I'll try to implement both in my model and report how the performance compares (though at the moment I think my sampling issues are more because I'm trying to fit a model that isn't really well identified from the data rather than the implementation of the simplex…).

Thanks also Aaron for the clever suggestion of returning the log Jacobian and the transform together in the Stan function.

Thanks,
Jeff


#17

The identified versions are generally better, but we obviously haven’t tried every case.

The distinguishing feature is that true unbounded non-identifiability leads to an improper posterior, as in the example from the problematic posteriors chapter of the manual:

y ~ normal(alpha + beta, sigma);

with no priors on alpha or beta. This is just the simplest case of a non-identified regression—it’s two intercepts. If you add a prior, the posterior becomes proper, but the likelihood is still non-identified. It’s of course better in this case just to use

y ~ normal(gamma, sigma);
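
For comparison, the weakly identified two-intercept version would look something like this sketch (the normal(0, 1) priors are just placeholders, and y is assumed to be declared in data):

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);   // priors make the posterior proper,
  beta ~ normal(0, 1);    // but the likelihood only identifies alpha + beta
  y ~ normal(alpha + beta, sigma);
}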

#18

I really like this approach of sampling from a normalized set of gamma random variables so that the lengths of the simplexes can vary. I’m not sure where to actually turn this into a ragged array. It seems like pi is just a single simplex in the code examples. Am I missing something?


#19

If, for example, you need a simplex of size K1 and a simplex of size K2, declare a positive_ordered[K1 + K2] vector, separately normalize the first K1 and the last K2 elements, and put unit-scale gamma priors on things accordingly.

Actually, if you know that you need exactly two simplexes, you should just do

simplex[K1] pi_1;
simplex[K2] pi_2;

but this general idea works when the number of simplexes needed is not known until runtime.
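
A minimal sketch of the two-simplex case might look like this (here with a plain positive vector, as clarified below; exponential(1) is the unit-scale gamma prior; K1 and K2 are assumed to be in data):

parameters {
  vector<lower=0>[K1 + K2] raw;      // unnormalized weights for both simplexes
}
transformed parameters {
  vector[K1] pi_1 = head(raw, K1) / sum(head(raw, K1));
  vector[K2] pi_2 = tail(raw, K2) / sum(tail(raw, K2));
}
model {
  raw ~ exponential(1);              // gamma(1, 1) priors => uniform simplexes
  // ... use pi_1 and pi_2 in the likelihood ...
}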


#20

Thanks for the input Ben. I’m trying to get as far as I can with my hierarchical model before attending StanCon next week so I can ask better questions when I get there :).

The number of simplexes is a function of the number of observations and they’ll number in the hundreds or thousands.

Can you give me another hint on how your proposal works? If I normalize the elements of the positive_ordered vector, won't it lose its ordering?

Also, is the prior this function?

target += exponential_lpdf(gamma | 1);

I’m not sure how to split this off from the likelihood. Would it be something like:

data {
   int<lower=1> O;  // count of observations where each observation contains 1 or more sub-observations
   int<lower=1> S;  // count of all sub-observations
   int<lower=2> K;  // maximum size of the simplex for all sub-observations
}

parameters {
   vector<lower=0>[K] gamma[O];  //randomly sampled values from gamma distribution for each observation
}

transformed parameters {
   for (o in 1:O)   
      vector[K] pi[o] = gamma[o] / sum(gamma[o]);
}

model {
   for (o in 1:O) {
      pi[o] ~ exponential(1);
   }
   // use pi in the likelihood function
}

Am I close?


#21

I meant to say just vector<lower=0>[K1 + K2]; in general, the elements do not need to be ordered. The exponential prior corresponds to a gamma prior with a shape of 1 and a scale of 1, which implies the resulting simplex vector has a uniform distribution. You can use other shapes, but the prior goes on the thing that you declare in parameters rather than on the simplexes that you construct in transformed parameters or the model block.

Your example would not parse. There is a more complicated example at


where zeta is what you call gamma and I am using the segment function to extract the next nc elements after zeta_mark from it.


#22

Thanks again Ben! So I think I’ve managed to incorporate your feedback into a model (that compiles). I’m pasting it below for anybody else who has the same question.

My remaining question is how to integrate the shape parameter like you’ve mentioned. As you point out, this example code is using a uniform distribution, but I really need to modulate the shape. Do I need an additional variable in transformed parameters or am I supposed to do some kind of initialization of the zeta?

data {
   int<lower=1> M;               // num observations
   int<lower=1> N;                // number of sub-observations for all observations
   real<lower=0> y[N];               //Observations where each row is a subobservation
   int<lower=1,upper=M> observation[N];    //observation id for subobservation n
   int<lower=1> row_to_subobservation_map[N];      //subobservation index for each row
   int<lower=1> count_of_subobservations_per_observation[M];
}

parameters {
   vector<lower=0,upper=1.0>[N] zeta;      // subobservation dist over observation
} 

model {
   
   for (n in 1:N) {
      
      int zeta_mark = n-row_to_subobservation_map[n]+1;
      int nc = count_of_subobservations_per_observation[observation[n]];
      
      vector[nc] pi = segment(zeta,zeta_mark,nc);  // gamma(zeta | shape, 1)
      pi = pi / sum(pi);                           // thus dirichlet(pi | shape)

      target += log(y[n]*pi[row_to_subobservation_map[n]]); // likelihood

  }
}

#23

zeta should not have an upper bound of 1. You just need to do

target += gamma_lpdf(zeta | shape, 1);

where shape is a positive vector that you pass in as data.
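
Concretely, the declarations would look something like this sketch:

data {
  // ... as before ...
  vector<lower=0>[N] shape;     // one shape per element of zeta
}
parameters {
  vector<lower=0>[N] zeta;      // note: no upper bound
}
model {
  target += gamma_lpdf(zeta | shape, 1);
  // ... normalize segments of zeta into pi and add the likelihood as before ...
}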


#24

I'm still struggling here. Do I really want to increment target with both pi and zeta? For some reason that seems off to me, but I'm pretty new at this, so it's quite possible.

When I use (a multivariate version of) the model below I get divergent transition warnings. Is this the model you're proposing? If so, I'll look into issues with incorporating my additional feature variables; otherwise, I'll try to fix my model.

data {
   int<lower=1> M;               // num observations
   int<lower=1> N;                // number of sub-observations for all observations
   real<lower=0> y[N];               //Observations where each row is a subobservation
   int<lower=1,upper=M> observation[N];    //observation id for subobservation n
   int<lower=1> row_to_subobservation_map[N];      //subobservation index for each row
   int<lower=1> count_of_subobservations_per_observation[M];
   real shape;     // subobservation weight prior

}

transformed data {
   vector<lower=0>[N] shape_vec;
   for (n in 1:N){
      shape_vec[n] = shape; //symmetric dirichlet prior
   }  
}

parameters {
   vector<lower=0>[N] zeta;      // subobservation dist over observation
} 

transformed parameters {

}

model {
   
   for (n in 1:N) {
      
      int zeta_mark = n-row_to_subobservation_map[n]+1;
      int nc = count_of_subobservations_per_observation[observation[n]];
      
      vector[nc] pi = segment(zeta,zeta_mark,nc);  // gamma(zeta | shape, 1)
      pi = pi / sum(pi);                           // thus dirichlet(pi | shape)

      target += log(y[n]*pi[row_to_subobservation_map[n]]); // likelihood
      
  }
  target += gamma_lpdf(zeta | shape, 1.);
}

#25

No, just by zeta.


#26

So where does pi fit in if I’m only inferring zeta? How does this enforce the simplex?


#27

The line pi = pi / sum(pi); makes pi sum to 1, and it was already nonnegative, so pi is a simplex.