How to constrain vector parameters in Stan

I’m trying to run a state-space model using cmdstanr. In this model I want the sum of alpha[k] and beta[k] to fall between 0 and 1, that is, 0 ≤ alpha[1]+beta[1] ≤ 1, 0 ≤ alpha[2]+beta[2] ≤ 1, and so on; in other words, I want to estimate alpha and beta such that 0 ≤ alpha[k]+beta[k] ≤ 1 for every k.

data {
  int<lower=0> N_data;
  int<lower=0> N_category_ism;
  
  matrix[N_data, N_category_ism] Rews_low;
  matrix[N_data, N_category_ism] Rews_med;
  matrix[N_data, N_category_ism] Rews_high;

  int<lower=1, upper=N_category_ism> Y[N_data];
}


parameters {
  real<lower=0, upper=1> alpha[N_data];
  real<lower=0, upper=1> beta[N_data];
  real<lower=0> sd_alpha;
  real<lower=0> sd_beta;
}

transformed parameters {
  matrix[N_data, N_category_ism] Q;
  for (k in 1:N_data){
    for (kk in 1:N_category_ism){
      Q[k, kk] = (1 - alpha[k] - beta[k]) * (Rews_low[k, kk] + Rews_med[k, kk] + Rews_high[k, kk])
                 + alpha[k] * Rews_low[k, kk]
                 + beta[k] * ((Rews_low[k, kk] + Rews_med[k, kk] + Rews_high[k, kk])
                              - ((Rews_med[k, kk] - Rews_low[k, kk])
                                 + (Rews_high[k, kk] - Rews_low[k, kk])
                                 + (Rews_high[k, kk] - Rews_med[k, kk])));
    }
  }
  
}


model {
    
  for (k in 1:N_data){
    Y[k] ~ categorical_logit(Q[k,]');
  }
  for (k in 2:N_data){
      if ((alpha[k]+beta[k]) > 1 || (alpha[k]+beta[k]) == 0){
          target += negative_infinity();
      }
      alpha[k] ~ normal(alpha[k-1], sd_alpha);
      beta[k] ~ normal(beta[k-1], sd_beta);
  }
}

generated quantities{
    int<lower=1, upper=N_category_ism> Y_generated[N_data];
    real log_lik[N_data];
    
    for (k in 1:N_data){
      Y_generated[k] = categorical_logit_rng(Q[k,]');
      log_lik[k] = categorical_logit_lpmf(Y[k]|Q[k,]');
    }
}

Initially, I thought of the following code:

parameters{
  real<lower=0, upper=1> alpha[N_data];
  real<lower=0, upper=1> beta[N_data];
}

transformed parameters{
	real<lower=0, upper=1> gamma[N_data];
	for (k in 1:N_data){
		gamma[k] = alpha[k] + beta[k];
	}
}

However, this approach of defining derived variables in the transformed parameters block and constraining them there is not recommended, because it is very inefficient for sampling.
So, I wrote the first code I presented, but I got the following error:

Chain 1 Rejecting initial value:
Chain 1   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1   Stan can't start sampling from this initial value.
Chain 1 Initialization between (-2, 2) failed after 100 attempts. 
Chain 1  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

I changed the init argument of cmdstanr's $sample() method to 0.5, 0.1, etc., but the same error appeared.
Is there any solution to this problem? Any comments are welcome. Thank you in advance!

R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1

cmdstanr_0.5.3

Welcome to the Stan community.

You should be able to use alpha to construct a constraint for beta to achieve this. Note also that the array syntax changed a while back (see below).

array[N_data] real<lower=0, upper=1> alpha;
array[N_data] real<lower=0, upper=1-alpha> beta;

Depending on how alpha and beta are used in the likelihood, an alternative parameterization that might be more efficient (or less efficient, depending on the modeling context) is

parameters{
   array[N_data] real<lower=0, upper=1> gamma;
   array[N_data] real<lower=0, upper=1> frac;
}
transformed parameters{
   array[N_data] real<lower=0, upper=1> alpha = gamma .* frac;
   array[N_data] real<lower=0, upper=1> beta = gamma - alpha;
}

This parameterization might also be easier (or harder, depending on domain knowledge) to set priors on.
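
For instance (a hedged sketch, not part of the suggestion above; the beta(2, 2) shapes are purely illustrative), explicit priors could then go directly on the new parameters in the model block:

model {
  // illustrative priors: gamma is the sum alpha + beta, frac is alpha's share of that sum
  gamma ~ beta(2, 2);
  frac ~ beta(2, 2);
}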


@simonbrauer @jsocolar Thank you very much for your prompt advice!

I tried the code suggested by @simonbrauer and @jsocolar and got the following compile errors, respectively.

   -------------------------------------------------
    17:  parameters {
    18:      array[N_data] real<lower=0, upper=1> alpha;
    19:      array[N_data] real<lower=0, upper=1-alpha> beta;
                                               ^
    20:      real<lower=0> sd_alpha;
    21:      real<lower=0> sd_beta;
   -------------------------------------------------

Ill-typed arguments supplied to infix operator -. Available signatures: 
(int, int) => int
(real, real) => real
(real, vector) => vector
(vector, real) => vector
(vector, vector) => vector
(complex, complex) => complex
(real, row_vector) => row_vector
(row_vector, real) => row_vector
(row_vector, row_vector) => row_vector
(real, matrix) => matrix
(matrix, real) => matrix
(matrix, matrix) => matrix
(complex, complex_vector) => complex_vector
(complex_vector, complex) => complex_vector
(complex_vector, complex_vector) => complex_vector
(complex, complex_row_vector) => complex_row_vector
(complex_row_vector, complex) => complex_row_vector
(complex_row_vector, complex_row_vector) => complex_row_vector
(complex, complex_matrix) => complex_matrix
(complex_matrix, complex) => complex_matrix
(complex_matrix, complex_matrix) => complex_matrix
Instead supplied arguments of incompatible type: int, array[] real.

Instead of array[N_data] real<lower=0, upper=1-alpha> beta;, I also tried array[N_data] real<lower=0, upper=1.0-alpha> beta;, but the same compilation error appeared.

   -------------------------------------------------
    24:  transformed parameters {
    25:    matrix[N_data, N_category_ism] Q;
    26:     array[N_data] real<lower=0, upper=1> alpha = gamma .* frac;
                                                         ^
    27:     array[N_data] real<lower=0, upper=1> beta = gamma - alpha;
    28:    for (k in 1:N_data){
   -------------------------------------------------

Ill-typed arguments supplied to infix operator .*. Available signatures: 
(int, int) => int
(real, real) => real
(row_vector, vector) => real
(real, vector) => vector
(vector, real) => vector
(matrix, vector) => vector
(complex, complex) => complex
(complex_row_vector, complex_vector) => complex
(real, row_vector) => row_vector
(row_vector, real) => row_vector
(row_vector, matrix) => row_vector
(real, matrix) => matrix
(vector, row_vector) => matrix
(matrix, real) => matrix
(matrix, matrix) => matrix
(complex, complex_vector) => complex_vector
(complex_vector, complex) => complex_vector
(complex_matrix, complex_vector) => complex_vector
(complex, complex_row_vector) => complex_row_vector
(complex_row_vector, complex) => complex_row_vector
(complex_row_vector, complex_matrix) => complex_row_vector
(complex, complex_matrix) => complex_matrix
(complex_vector, complex_row_vector) => complex_matrix
(complex_matrix, complex) => complex_matrix
(complex_matrix, complex_matrix) => complex_matrix
Instead supplied arguments of incompatible type: array[] real, array[] real.

Is it impossible to apply these operators to variables declared as arrays?
Thank you in advance for your continued help and advice.


Oof, that’s what we get for suggesting code without checking it. In both cases, your options are either to declare everything as vectors rather than 1-d arrays, to declare things as arrays but convert back and forth to vectors with to_vector and to_array_1d, or to declare arrays and then do all of the operations in loops over the elements. Naturally, the first option will be the most efficient, provided that you don’t need these things as arrays later on.
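
For instance, a minimal sketch of the all-vector option, mirroring the gamma/frac suggestion above (illustrative only, not checked against the full model):

parameters {
  vector<lower=0, upper=1>[N_data] gamma;
  vector<lower=0, upper=1>[N_data] frac;
}
transformed parameters {
  // elementwise product and subtraction are defined for vectors, unlike for 1-d arrays of reals
  vector<lower=0, upper=1>[N_data] alpha = gamma .* frac;
  vector<lower=0, upper=1>[N_data] beta = gamma - alpha;
}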


Is it possible to use a unit simplex parameter? That is, for one k:

parameters {
    simplex[3] theta;
}
transformed parameters {
    // Declare alpha/beta as the 1st/2nd component to put priors on?
    real alpha = theta[1];
    real beta = theta[2];
}

If I understand the code correctly, for each k,

Q[k, kk] = (1-alpha[k] - beta[k]) * D1[k,kk] + alpha[k] * D2[k,kk] + beta[k] * D3[k,kk];

where D1, D2 and D3 are functions of the data only, so they can be precomputed in a transformed data block.

So the alphas and betas, with their constraints, describe a decomposition between D1, D2 and D3 (i.e. a simplex).

PS: The composition is on the logit scale, since the likelihood is Y[k] ~ categorical_logit(Q[k]), and I’m wondering if that has any implications. I have no idea what the data is, but have you considered Y[k] ~ categorical(Q[k]) instead?
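
Extending this over all k might look like the following sketch (illustrative only; theta[k][3] implicitly plays the role of 1 - alpha[k] - beta[k]):

parameters {
  // one 3-simplex per observation; components are nonnegative and sum to 1
  array[N_data] simplex[3] theta;
}
transformed parameters {
  vector<lower=0, upper=1>[N_data] alpha;
  vector<lower=0, upper=1>[N_data] beta;
  for (k in 1:N_data) {
    alpha[k] = theta[k][1];
    beta[k] = theta[k][2];
  }
}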


@jsocolar I deeply appreciate you giving me such helpful advice!
I am unfamiliar with Stan's vectorization syntax, so for now I have done all the operations in a for loop over the elements.

transformed parameters {
	matrix[N_data, N_category_ism] Q;
    array[N_data] real<lower=0, upper=1> alpha;
    array[N_data] real<lower=0, upper=1> beta;
	for (k in 1:N_data){
		alpha[k] = gamma[k] * frac[k];
		beta[k] = gamma[k] - alpha[k];
	} 
}

As a result, I finally succeeded in getting the Stan code to run!
However, as expected, the computation is slow. If you know how to vectorize this code, I would be grateful if you could let me know.
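
For reference, one way to vectorize the alpha/beta computation while keeping the array declarations (a sketch, assuming gamma and frac are still declared as array[N_data] real) is to round-trip through vectors with to_vector and to_array_1d:

transformed parameters {
  // convert the arrays to vectors for the elementwise math, then convert back
  array[N_data] real<lower=0, upper=1> alpha
      = to_array_1d(to_vector(gamma) .* to_vector(frac));
  array[N_data] real<lower=0, upper=1> beta
      = to_array_1d(to_vector(gamma) - to_vector(alpha));
}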

@desislava Thank you for your advice! Your proposal also looks promising. In my model, alpha, beta, and gamma are vectors rather than single variables; is your method still applicable?

I’m glad I wasn’t the only one! At the risk of doing it again, here’s what it should have been. 1 - alpha works in this case because a scalar minus a vector is defined and returns a vector of the same length as alpha. Unfortunately, a scalar minus an array is not defined, which is why our respective code blocks didn’t work.

vector<lower=0, upper=1>[N_data] alpha;
vector<lower=0, upper=1-alpha>[N_data] beta;

I suspect @jsocolar's approach will be more useful in the long term, since it extends more easily to other data structures that may not work with vector bound constraints (e.g. an array of vectors). But if you learn anything about their relative efficiency or effectiveness, then I'd be curious to know.


@simonbrauer Thank you for fixing the code. I tried your code and it worked successfully. However, the estimation results for alpha[k] and beta[k] differed significantly from @jsocolar 's approach. Since the estimation results based on your code seem more intuitive and natural, I decided to adopt your approach this time.

@desislava Thank you for suggesting the code. alpha and beta can indeed be implemented as a simplex, but in my case they seem to need to be implemented as matrices, not vectors.
By the way, categorical_logit(Q[k]) means categorical(softmax(Q[k])).
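
For reference, the equivalence written as Stan statements (a minimal fragment; both lines define the same likelihood contribution):

// categorical_logit applies softmax to its argument internally
Y[k] ~ categorical_logit(Q[k, ]');
Y[k] ~ categorical(softmax(Q[k, ]'));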

@jsocolar @simonbrauer @desislava Once again, thank you from the bottom of my heart!

Interesting. I would have expected them to be pretty similar (assuming you’re using diffuse priors). Was one or the other clearly wrong (e.g., if you’re comparing to simulations)? I might have to play around with this just to get a better handle on it.

Note that in “my” version, you either need to put priors directly on gamma and frac, or, if putting priors on alpha and beta, you need a Jacobian adjustment. Perhaps this explains the difference.


@jsocolar @simonbrauer Thank you very much!
For future analyses, it would be nice to know why the estimation results from @simonbrauer's approach and @jsocolar's approach differ from each other.
I picked up some estimation results, and it seems that alpha is smaller and beta is larger under @jsocolar's approach than under @simonbrauer's approach.
Both alpha and beta, and gamma and frac, only have lower and upper bounds of 0 and 1, and no explicit priors are specified (i.e., the priors are left implicitly flat).

Without access to your likelihood, we can only talk in the most general terms, but the basic point is that the priors are different. There are two separate issues at play here, and both are worth understanding deeply. The first issue is that a lack of explicit priors doesn’t mean flat margins. Here, the version from @simonbrauer, without explicit priors, yields a prior that is jointly flat over alpha and beta within the region that satisfies the constraint. That means the margins for alpha and beta won’t be uniform, due to the non-square shape of the region of [alpha, beta] space that satisfies the constraint.

The second issue is that “flat” is parameterization-specific: even without constraints, what is flat in one parameterization need not be flat in another. For example, if I have a prior model that is flat over x, then the implied prior over x^3 is not flat. If I reparameterize in terms of x^3 and use a flat prior there, then the implied prior over x is not flat. So even without constraints, the fact that two parameterizations both lack explicit prior statements does not mean that the priors are identical; “flatness” is not parameterization-invariant.
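
As a concrete check of the x versus x^3 example (a small sketch, not from the thread): sampling the program below gives a flat marginal for x, but the marginal for y = x^3 piles up near 0, since by the change of variables p(y) = (1/3) * y^(-2/3).

parameters {
  real<lower=0, upper=1> x;  // no explicit prior: implicitly flat on (0, 1)
}
generated quantities {
  real y = x^3;  // the implied prior on y is far from flat
}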

“My” version without added explicit priors puts an untruncated flat prior on gamma and frac over the unit square, which yields some particular implied prior on alpha and beta. Simon’s version without additional explicit priors puts a flat prior on alpha and beta over the truncated unit square (truncated to satisfy the constraint), yielding different prior pushforwards on the margins alpha and beta. Here’s Stan code to investigate the implied priors from both versions (this is a complete Stan program, and samples from it will be samples from the relevant prior distributions):

parameters {
  real<lower=0, upper = 1> gamma;
  real<lower=0, upper = 1> frac;
  real<lower = 0, upper = 1> alpha;
  real<lower = 0, upper = 1-alpha> beta;
}

transformed parameters{
  real alpha2 = gamma * frac;
  real beta2 = gamma - alpha2;
}

Edit: to make this actionable for you, and assuming that you have no prior idea about whether alpha or beta should be larger, the question to ask yourself is: “Am I more comfortable assuming that any value of [alpha, beta] that satisfies the constraint is equally likely, or that any value between 0 and 1 of the sum alpha + beta is equally likely?” Notice that these are different, mutually exclusive assumptions.


@jsocolar I would like to express my sincere appreciation for your work on my problem.
I ran your Stan code, and the mean value of each parameter is as follows:

variable    mean
lp__       -8.493405102
gamma       0.50358781297596
frac        0.5056329864688
alpha       0.3322288283071
beta        0.334328422185725
alpha2      0.25467054311529100
beta2       0.248917262785932

And here is the corresponding R code:

cmdstanr_testcode_20240131$sample(data = data_stan_20240118,
                                  iter_sampling = 5000,
                                  iter_warmup = 500,
                                  thin = 1,
                                  chains = 4,
                                  parallel_chains = 4,
                                  seed = 1,
                                  refresh = 1000,
                                  max_treedepth = 15,
                                  init = 0.5, # seems to be irrelevant
                                  save_warmup = FALSE)

Again, thank you very much for your important point.
I assume that alpha and beta (and also 1-alpha-beta) change gradually over time as a time series, rather than jumping abruptly from time t to t+1. Therefore, it would be uncomfortable for me to assume that any value between 0 and 1 of the sum alpha + beta is equally likely.
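
Given that, one possibility (purely a sketch, not something proposed above; sd_gamma and sd_frac are hypothetical parameters mirroring sd_alpha and sd_beta in the original model) would be to keep the gamma/frac parameterization and place the random walk directly on those quantities:

model {
  for (k in 2:N_data) {
    // smooth evolution over time of the sum alpha + beta (gamma) and of alpha's share (frac)
    gamma[k] ~ normal(gamma[k-1], sd_gamma);
    frac[k] ~ normal(frac[k-1], sd_frac);
  }
}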

It’s worth mentioning that @desislava's simplex-based approach is equivalent to my approach: without prior information (beyond the parameterization), you end up with a uniform distribution across all valid combinations of alpha and beta. Per @jsocolar's important point, I think the most important difference between these approaches is which quantities you want to put priors on. Frankly, I think mine is the least intuitive to put priors on, whereas @desislava's approach would allow you to put a prior on the third element of the simplex, which would be substantively equivalent to putting a prior on gamma in @jsocolar's.

For the (entirely unnecessary) fun of math, note that you can transform jsocolar’s approach to mine via the function f(\gamma) = \sqrt{\gamma}, and mine to his via f^{-1}(\gamma) = \gamma^2. Both of our approaches sample frac uniformly, but mine doesn’t sample gamma uniformly. In mine, the probability of each value of gamma is proportional to the length of the diagonal from (0,\gamma) to (\gamma,0). Consequently, the implicit CDF of gamma in my case is F(g)=g^2.

(thanks to jsocolar for the base code block)

parameters {
    // jsocolar's parameterization
    real<lower=0, upper = 1> gamma_j;
    real<lower=0, upper = 1> frac_j;

    // simonbrauer's parameterization
    real<lower = 0, upper = 1> alpha_s;
    real<lower = 0, upper = 1 - alpha_s> beta_s;
}
transformed parameters{
    real alpha_j = gamma_j * frac_j;
    real beta_j = gamma_j - alpha_j;

    real gamma_s = alpha_s + beta_s;
    real frac_s = alpha_s / gamma_s;

    // Transform jsocolar's to simonbrauer's
    real gamma_js = sqrt(gamma_j);
    real frac_js = frac_j;
    real alpha_js = gamma_js * frac_js;
    real beta_js = gamma_js - alpha_js;

    // Transform simonbrauer's to jsocolar's
    real gamma_sj = (alpha_s + beta_s)^2;
    real frac_sj = alpha_s / gamma_s;
    real alpha_sj = gamma_sj * frac_sj;
    real beta_sj = gamma_sj - alpha_sj;
}


The math is cool, but I’m still trying to understand how much it matters once the linear transformation involving the alphas and betas (or alternatively the gammas and fracs) is pushed through the non-linear softmax of the categorical_logit likelihood. Isn’t it unusual to be constraining the parameters like this on the logit scale?


Good observation. In a purely mathematical sense, it still clearly matters when pushed through the softmax, as the OP reports that the posteriors are nontrivially different. But it’s a good observation that this might not in fact correspond to the model that the OP intends. @ykogr I think @desislava is right that this is worth taking a close look at.

@simonbrauer Thanks to your impressive demonstration, I now understand the relationship between the implicit prior distribution and the parameter estimates. I sincerely appreciate you devoting your time to helping me out.
@jsocolar @desislava I see that I should also consider how alpha and beta behave when they are pushed through the softmax function. Indeed, @simonbrauer's approach and @jsocolar's approach produce different Q values in addition to different alpha and beta estimates. Curiously, however, when Y is generated from the estimated parameters with the categorical_logit_rng function, the two approaches yield nearly identical output.

generated quantities{
    int<lower=1, upper=N_category_ism> Y_generated[N_data];
    real log_lik[N_data];
    
    for (k in 1:N_data){
      Y_generated[k] = categorical_logit_rng(Q[k,]');
      log_lik[k] = categorical_logit_lpmf(Y[k]|Q[k,]');
    }
}