Following up on several discussions of simplex adjustments and ragged arrays of simplexes

I am trying to code a model that requires defining ragged arrays of simplexes. But for simplicity, I want to ask how to code a vector transformation that applies the same constraints as a simplex (without using an explicit simplex) and can then be used for a prior distribution parameter?

Background: There have been several previous discussions that I can find on the topic (here, here, here, and here), but I believe that none of these have a clear and complete answer to the problem which is motivating my question. I think part of the confusion might be that there are multiple ways to go about this and the differences and similarities are not obvious. I am relatively new to Stan, and most of the details on this topic, such as the stickbreaking algorithm, are going over my head. It is possible that some of the previous discussions do actually have an answer to my question, but they seem to expect a level of familiarity with simplex transformations that I don’t have. My personal goal here is primarily to create a working model, and then maybe later I will dig into the weeds if necessary — but hopefully a thread combining previous discussions of this topic could also be useful to other users with many of the same questions as myself.

My use case: For the sake of simplicity, I have a working Stan model (based on this paper and available as an R package here) that uses simplex’s in the prior and as part of a parameter transformation. It could be a little cleaner but anyway this is a good starting point for my use case and I think a useful edge case for this problem of coding vector simplex adjustments.

Stan model using simplexes
// Note that the Metropolis proposals may be rejected during warmup, 
// this can be avoided by increasing the step size to 0.1
data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  // (this doesn't have to be a simplex but the condition should be met)
  array[n] simplex[R] p_r;       // marginal proportion for row r in unit i 
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}

parameters {
  matrix<lower=0>[R, C] alpha_r;            
  array[n, R] simplex[C] beta_r; // proportion of row r in column c for unit i
}

transformed parameters {
  array[n] simplex[C] theta_c; // estimated marginal proportion for column c in unit i
  array[n] matrix[R, C] beta_mat; // matrix of vectors beta_r to be used for transformation
  for (i in 1:n) {
    for (r in 1:R) {
      beta_mat[i, r] = to_row_vector(beta_r[i, r]);
    }
    theta_c[i] = beta_mat[i]' * p_r[i];
  }
}

model {
  for (i in 1:n) {
    for (r in 1:R) {
      alpha_r[r,] ~ gamma(lambda_1, lambda_2);
      beta_r[i, r] ~ dirichlet(alpha_r[r,]);
    }
    N_c[i] ~ multinomial(theta_c[i]);
  }
}

Example data:

p_r <- matrix(
  c(0.3259669, 0.6740331,
    0.3667954, 0.6332046,
    0.3578947, 0.6421053,
    0.3595166, 0.6404834,
    0.3571429, 0.6428571,
    0.3814103, 0.6185897,
    0.4117647, 0.5882353,
    0.3826087, 0.6173913,
    0.3631436, 0.6368564,
    0.4102564, 0.5897436),
  nrow = 10,
  ncol = 2
            )
N_c <- matrix(
  c(42, 139,
    63, 196,
    51,  44,
    131, 200,
    226, 236,
    74, 238,
    10,   7,
    163, 297,
    190, 179,
    18,  60),
  nrow = 10,
  ncol = 2
            )
data <- list(n = 10, R = 2, C = 2, p_r = p_r, N_c = N_c, lambda_1 = 4, lambda_2 = 2)

Main question: How would you replace all of the explicit simplex’s in this model with vector parameter transformations that apply the same constraints?

Related questions based on previous discussion threads on this topic (and a potential solution):

  1. The proposed _lp function from @Christopher-Peterson here although promising does not seem to work for my use case because it loses the adjusted simplex parameters. From some tests I’ve performed, it does not seem so simple as using simplex_constrain_softmax_lp in the transformed parameters block. There have not been any updates to that thread for a while, although it is implied that one could modify the function and define the adjustment in the transformed parameters block to be able to then use the adjusted simplex parameter in the model block. I just have not been able to do that successfully. Maybe I am missing something obvious and there is someone who can explain the correct way to do this?

  2. I have seen the suggestion in two places to use code that was put together by @mjhajharia and @Bob_Carpenter (here and here) but like with (1) I don’t know how I would actually implement these functions in the model. (I am also not quite sure which version is applicable for this use case?) Maybe these functions are in fact the obvious answer to the motivating question and the implementation is trivial, but so far I have not figured it out. How might you go about using these functions? And which function is preferable for my use case?

  3. The other option I found was this proposal from @aaronjg which is actually very similar to the proposal I mention in (1), so I thought that it could be a demonstration of how to perform the simplex adjustment in the transformed parameters block. I have included the implementation below. The new model does compile and samples successfully, although it consistently throws this exception during warmup, even after increasing the step size: Exception: dirichlet_lpdf: probabilities is not a valid simplex. sum(probabilities) = nan, but should be 1. I believe that this exception is similar to what you will see from the original model, except that the exception handling is different since I don’t actually use simplex in this new version — meaning that it should be safe to ignore I believe. How would you handle this exception? And to what extent do the differences between this approach to simplex adjustment and that suggested in (1) (apart from the question of implementation) matter for my use case?

Similar Stan model but with parameter adjustments instead of explicit simplexes
data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  array[n] simplex[R] p_r;       // marginal proportion for row r in unit i
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}

parameters {
  matrix<lower=0>[R, C] alpha_r;    

  array[n, R] vector<lower=0>[C - 1] gamma_r;
  array[n] vector<lower=0>[C - 1] gamma_c;
}

transformed parameters {
  array[n, R] vector[C] beta_r; // proportion of row r in column c for unit i
  array[n, R] real sum_gamma_r;
  for (i in 1:n) {
    for (r in 1:R) {
      sum_gamma_r[i, r] = 1 + sum(gamma_r[i, r]);
      beta_r[i, r] = append_row(1, gamma_r[i, r]) / sum_gamma_r[i, r];
    }
  }
  
  array[n] vector[C] theta_c;
  array[n] real sum_gamma_c;
  for (i in 1:n) {
    sum_gamma_c[i] = 1 + sum(gamma_c[i]);
    theta_c[i] = append_row(1, gamma_c[i]) / sum_gamma_c[i];
  }
  
  array[n] matrix[R, C] beta_mat;
  for (i in 1:n) {
    for (r in 1:R) {
      beta_mat[i, r] = to_row_vector(beta_r[i, r]);
    }
    theta_c[i] = beta_mat[i]' * p_r[i];
  }
}

model {
  for (i in 1:n) {
    for (r in 1:R) {
      target += -C * log(sum_gamma_r[i, r]);
    }
    target += -C * log(sum_gamma_c[i]);
  }

  for (i in 1:n) {
    for (r in 1:R) {
      alpha_r[r,] ~ gamma(lambda_1, lambda_2);
      beta_r[i, r] ~ dirichlet(alpha_r[r,]);
    }
    N_c[i] ~ multinomial(theta_c[i]);
  }
}
1 Like

I hope the below helps you. Expanded softmax doesn’t set one of the values and instead centers the simplex using a standard normal. We’ve found that this is better for the sampler.

I’ve added some comments in the code. The first comment is on theta_c because you calculate a simplex and then you overwrite the values in the following loop. The other is on constructing beta_mat and beta_c when you just need one of those.

I haven’t tested this. It’s possible that you need to transpose beta_mat in the model block as

beta_mat[i, r]' ~ dirichlet(alpha_r[r,]);

or fix some other syntax errors.

functions {
vector expanded_softmax_simplex_constrain_lp(vector y) {
  int N = rows(y);
  real r = log_sum_exp(y);

  target += std_normal_lpdf(r - log(N));
  target += sum(y) - N * r;

  return exp(y - r);
}
}
data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  array[n] simplex[R] p_r;       // marginal proportion for row r in unit i
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}

parameters {
  matrix<lower=0>[R, C] alpha_r;    

  array[n, R] vector<lower=0>[C] gamma_r;
  array[n] vector<lower=0>[C] gamma_c;
}

transformed parameters {
  // array[n, R] vector[C] beta_r; // proportion of row r in column c for unit i
  array[n] vector[C] theta_c;
  array[n] matrix[R, C] beta_mat;

  // you overwrite theta_c with beta_mat x p_r so don't construct that first
  // just construct beta_mat instead of beta_r
  for (i in 1:n) {
    // theta_c[i] = simplex_constrain_lp(gamma_c[i]);
    for (r in 1:R) {
       beta_mat[i, r] = simplex_constrain_lp(gamma_r[i, r])';
    }
    theta_c[i] = beta_mat[i]' * p_r[i];
  }
}

model {
  to_vector(alpha_r) ~ gamma(lambda_1, lambda_2);
  for (i in 1:n) {
    for (r in 1:R) {
      beta_mat[i, r] ~ dirichlet(alpha_r[r,]);
    }
    N_c[i] ~ multinomial(theta_c[i]);
  }
}
1 Like

Thank you for the response! I was able to tinker with the code you provided and got the model to work, so finally I might have a working solution for the larger question about coding ragged arrays of simplexes. (I did have to increase the step size to prevent exceptions during warmup, just like with the original version, but that’s benign as far as I’m aware). I’ll put the debugged code here for anyone else interested in the solution.

Model with simplex_constrain_lp
functions {
  vector simplex_constrain_lp(vector y) {
    int N = rows(y);
    real r = log_sum_exp(y);

    target += std_normal_lpdf(r - log(N));
    target += sum(y) - N * r;

    return exp(y - r);
  }
}

data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  array[n] simplex[R] p_r;       // marginal proportion for row r in unit i
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}

parameters {
  matrix<lower=0>[R, C] alpha_r;    

  array[n, R] vector<lower=0>[C] gamma_r;
  array[n] vector<lower=0>[C] gamma_c;
}

transformed parameters {
  array[n] vector[C] theta_c;
  array[n] matrix[R, C] beta_mat;

  for (i in 1:n) {
    theta_c[i] = simplex_constrain_lp(gamma_c[i]);
    for (r in 1:R) {
       beta_mat[i, r] = simplex_constrain_lp(gamma_r[i, r])';
    }
    theta_c[i] = beta_mat[i]' * p_r[i];
  }
}

model {
  to_vector(alpha_r) ~ gamma(lambda_1, lambda_2);
  for (i in 1:n) {
    for (r in 1:R) {
      beta_mat[i, r] ~ dirichlet(alpha_r[r,]);
    }
    N_c[i] ~ multinomial(theta_c[i]);
  }
}

Before calling this the solution, I just wanted to clarify one thing. Is that expanded softmax function the general solution to a problem like this? Could there be any reason to use the proposed function I mention in (1) or any of the functions mentioned in (2) as solutions for a similar problem, and if so, what considerations should be made?

1 Like

Any simplex function will work. I’m fact, we’re seeing that the invILR parameterization of the simplex performs the best over a wide range of scenarios (paper forthcoming).

1 Like

There are more reasons than faulting the simplex parameterization in the model. It may be possible to isolate parts of the model to find where this is occurring. A few things I would try are setting the alphas with data (removing the gamma prior with this as well) and trying the log Dirichlet lpdf with a log simplex parameterization (summing log beta with log p_r before exponentiating for theta).

1 Like

Thanks for the answer and suggestions for debugging!

Just want to clarify two things. Would you be able to point me to where I would find the code for the “invILR parameterization of the simplex” (if I don’t have to wait for the forthcoming paper)?

And I’m not quite sure I’m fully understanding the second recommendation you made for debugging (“summing log beta with log p_r before exponentiating for theta”). Would you be able to drop a code snippet? Thanks again!

The inv ilr is at transforms/simplex_transforms/stan/transforms/ILR/ILR_functions.stan at main · bob-carpenter/transforms · GitHub.

  1. Do things on the log-scale. Also swap in the inv_ILR simplex transform if you want. Note that the inv_ILR transform has one-less parameter than the softmax.
  matrix log_matrix_vector_product(matrix log_A, vector log_x) {
    int N = rows(log_A);
    vector[N] result;

    for (i in 1:n) {
        vector[cols(A)] sum_vector = log_A[i] + log_x;
        result[i] = log_sum_exp(sum_vector);
      }
    
    return result;
  }

  vector simplex_log_constrain_lp(vector y) {
    int N = rows(y);
    real r = log_sum_exp(y);
    vector[N] log_x = y - r;
  
    target += std_normal_lpdf(r - log(N));
    target += log_x[N];
    
    return log_x;
  }
real log_dirichlet_lpdf(vector log_theta, vector alpha) {
  int N = rows(log_theta);
  if (N != rows(alpha))
    reject("Input must contain same number of elements as alpha");
  return dot_product(alpha, log_theta) - log_theta[N]
         + lgamma(sum(alpha)) - sum(lgamma(alpha));
}
 real multinomial_log_lpmf(array int y, vector log_theta) {
    int N = sum(y);
    int K = num_elements(log_theta);
    real lp = lgamma(N + 1);

    for (k in 1:K)
      lp += log_theta[k] * y[k] - lgamma(y[k] + 1);
    
  return lp;
  }
}

data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  array[n] vector<upper=0>[R] log_p_r;       // marginal log proportion for row r in unit i
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}
parameters {
  matrix<lower=0>[R, C] alpha_r;    

  array[n, R] vector<lower=0>[C] gamma_r;
  array[n] vector<lower=0>[C] gamma_c;
}
transformed parameters {
  array[n] vector[C] log_theta_c;
  array[n] matrix[R, C] log_beta_mat;

  for (i in 1:n) {
   // you overwrite theta below
   // theta_c[i] = simplex_log_constrain_lp(gamma_c[i]);
    for (r in 1:R) {
       beta_mat[i, r] = simplex_log_constrain_lp(gamma_r[i, r])';
    }
   log_theta_c[i] = log_matrix_vector_product(beta_mat[i]', log_p_r[i]);
  }
}
model {
  to_vector(alpha_r) ~ gamma(lambda_1, lambda_2);
  for (i in 1:n) {
    for (r in 1:R) {
      log_beta_mat[i, r] ~ log_dirichlet(alpha_r[r,]);
    }
    N_c[i] ~ multinomial_log(log_theta_c[i]);
  }
}
  1. Try the dirichlet_multinomial but since you have p_r it’s a weighted dirichlet multinomial:

We want the marginal distribution after integrating out ( Z ):

P(x \mid n, \alpha, p) = \int \text{Mult}(x \mid n, Z \cdot p) \cdot \prod_{r=1}^R \text{Dir}(Z_{:,r} \mid \alpha_r) \, dZ

I believe this integration results in the following weighted Dirichlet-multinomial distribution:

P(x \mid n, \alpha, p) = \frac{n!}{x_1! \cdots x_C!} \prod_{c=1}^C \left( \frac{\prod_{r=1}^R \Gamma(x_c + \alpha_{c,r})}{\Gamma\left(\sum_{r=1}^R (x_c + \alpha_{c,r})\right)} \cdot \frac{\Gamma\left(\sum_{r=1}^R \alpha_{c,r} \right)}{\prod_{r=1}^R \Gamma(\alpha_{c,r})} \cdot p_r^{x_c} \right)

That’s enough for me now, you can get the matrix beta back out in the generated quantities.

functions {
 real log_multivariate_beta (vector alpha) {
   int N = num_elements(alpha);
   return sum(lgamma(alpha)) - N * lgamma(sum(alpha));
}

real weighted_dirichlet_multinomial_lpmf (array int y, vector alpha, vector p_r) {
   int N = sum(y);
   int K = num_elements(y);
   real lp = lgamma(N + 1);
  
   for (k in 1:K) {
     lp += y[k] * log(p_r[k]) + log_multivariate_beta(y + alpha) - log_multivariate_beta(alpha) - lgamma(y[k] + 1);
   }
   return lp;
 }
}
data {
  int<lower=1> n;           // number of election units i
  int<lower=2> R;           // number of candidates in election 1
  int<lower=2> C;           // number of candidates in election 2
  array[n] simplex[R] p_r;       // marginal proportion for row r in unit i
  array[n, C] int<lower=0> N_c; // marginal count for column c in unit i
  real<lower=0> lambda_1;    // gamma hyperprior shape parameter
  real<lower=0> lambda_2;    // gamma hyperprior scale parameter
}
parameters {
  matrix<lower=0>[R, C] alpha_r;    
}
model {
  to_vector(alpha_r) ~ gamma(lambda_1, lambda_2);

  for (i in 1:n) 
    N_c[i] ~ weighted_dirichlet_multinomial(alpha_r[i], p_r[i]);
}
1 Like

This is above and beyond, thank you!

It’s been a while but I just wanted to follow up to ask @spinkney what was your source for that specification of log_dirichlet_lpdf? I recently asked about the RNG version of this and Bob Carpenter pointed out that it looks like there is a stray log_theta[N] term. His math looks correct to me so I just want to make sure I’m not overlooking something important.

Annoyingly both @Bob_Carpenter and my code is right.

The key here is the simplex transform where I add target += log_x[N];. Now, if you stay on the log scale then this needs to be added back in which I do in the log_dirichlet_lpdf.

From a pedagogical perspective it makes more sense but not from a computational one. So just remove those two lines, which becomes:

  vector simplex_log_constrain_lp(vector y) {
    int N = rows(y);
    real r = log_sum_exp(y);
    vector[N] log_x = y - r;
  
    target += std_normal_lpdf(r - log(N));
 //   target += log_x[N];
    
    return log_x;
  }
real log_dirichlet_lpdf(vector log_theta, vector alpha) {
  int N = rows(log_theta);
  if (N != rows(alpha))
    reject("Input must contain same number of elements as alpha");
  return dot_product(alpha, log_theta) + // - log_theta[N]
         lgamma(sum(alpha)) - sum(lgamma(alpha));
}
1 Like

Thanks for the clarification. I’m still working my way through the math that Bob Carpenter wrote out, but if I’m understanding you correctly when you say “from a pedagogical perspective it makes more sense but not from a computational one,” would it be equivalent to leave in target += log_x[N] and -log_theta[N]? I think I’m not picking up on the distinction you’re making. And would the same logic apply with inv_ilr_log_simplex_constrain_lp in lieu of simplex_log_constrain_lp?

Hi @spinkney, I was just curious if your forthcoming paper introducing the invILR algorithm and other simplex constraints has a name that I can cite?

1 Like

There’s a bit more written up around our sum-to-zero constraint that @spinkney just added. We’ve just been citing Adrian Seyboldt’s discussion on the PyMC forums. There’s no writeup yet of which I’m aware other than our in-progress paper on simplex parameterizations.

You can cite this paper Isometric Logratio Transformations for Compositional Data Analysis | Mathematical Geosciences