# Sum-to-zero with multiple constraints

Does that SVD approach only work for normal distributions?

It should be possible to simulate from multivariate normal distributions constrained with multiple equality constraints.

1 Like

I think all these approaches only work with the normal distribution. I’m going to write up how we derive that inv_sqrt thing which shows the general constraint case but it is assuming normality. The number of equality constraints must be less than the number of dimensions you start with.

I didn’t know this was so interesting! What’s your use case with multiple constraints? Let’s move this conversation into a new thread.

2 Likes

@spinkney I took it as a given that the equality constraints must be less than the number of dimensions.

My use case was that I was implementing this paper  in Stan to get the sampling error for portfolio weights and seeing if I could place restrictions on the weight vector, beyond just that the weights sum to 1. So for instance, if the weighted mean sums to a particular value. I think I borrowed from methodology from xsample  to do it (focusing on the equality constraints because I recall the inequality ones being too annoying to get working).

I was following the other thread with high interest, as a model I am currently working with requires a number of (separate) sum-to-zero constraints for identification reasons. The model is a unidimensional 3PLN response time model (see Fox, Klotzke, & Simsek (2021), Eq. 4):

log(RT_{ij}) = \lambda_{i} - \phi_{i} \zeta_{j} + \epsilon_{ij}, with \epsilon_{ij} \sim N(0, \sigma^2_{\epsilon_{i}}),

where log(RT_{ij}) is the observed log-response time, \lambda_{i} is the time intensity (the average time needed to complete item i), \phi_{i} the time discrimination (the item-specific effect of working speed on the response time), and \zeta_{j} is the person-specific speed parameter. To identify the model, you need a) to restrict the sum of the time-intensities to zero, i.e. \sum^I_{i=1}\lambda_{i}=0, and b) to restrict the product of the time discriminations to 1, i.e. \prod^I_{i=1}\phi_{i}=1. With the QR method outlined in another thread, this works well.

Model 1 (Single Component 3PLN, QR)
data {
int<lower=1> I;                      // number of items
int<lower=1> J;                      // number of examinees
array[J,I] real log_rt;              // log response times
}

transformed data {
vector[I] Q_r_diag;
vector[I] Q_r_last;
for(i in 1:I){
Q_r_diag[i] = -sqrt((I-i)/(I-i+1.0));
Q_r_last[i] = inv_sqrt((I-i) * (I-i+1));
}
}

parameters {
vector[J] speed;
vector[I-1] lambda_raw;
vector[I-1] phi_raw;
vector<lower=0>[I] sigma;
}

transformed parameters {
vector[I] lambda;
vector[I] phi;
real lambda_aux = 0;
real phi_aux = 0;

for(i in 1:I-1){
lambda[i] = lambda_aux + lambda_raw[i] * Q_r_diag[i];
lambda_aux = lambda_aux + lambda_raw[i] * Q_r_last[i];
phi[i] = phi_aux + phi_raw[i] * Q_r_diag[i];
phi_aux = phi_aux + phi_raw[i] * Q_r_last[i];
}

lambda[I] = lambda_aux;
phi[I] = phi_aux;
}

model {

target += std_normal_lpdf( speed );
target += normal_lpdf(lambda_raw | 0, inv_sqrt(1 - inv(I)));
target += normal_lpdf(phi_raw | 0, inv_sqrt(1 - inv(I)));
target += exponential_lpdf( sigma | 5);

for(i in 1:I) target += normal_lpdf( log_rt[,i] |  lambda[i] - exp(phi[i]) * speed, sigma[i]);
}


Applying the SVD method allows me to model the mean of the time-intensity parameters explicitly,
which was not possible before. I have to, however, restrict the sum of the speed parameters to
zero. Parameter recovery is pretty good, without any sampling issues and high ESS.

Model 2 (Single Component 3PLN, SVD)
data {
int<lower=1> I;                      // number of items
int<lower=1> J;                      // number of examinees
matrix[I,I] item_mat;                // log response times
array[J,I] real log_rt;
}

parameters {
vector[J-1] speed_raw;
real mu_lambda;
real<lower=0> tau_lambda;
vector[I-1] lambda_raw;
vector[I-1] phi_raw;
vector<lower=0>[I] sigma;
}

transformed parameters {
vector[J] speed = append_row(speed_raw, -sum(speed_raw));
vector[I] phi = exp(append_row(phi_raw, -sum(phi_raw)));
vector[I] lambda = mu_lambda + item_mat * append_row(lambda_raw, 0);
}

model {
target += normal_lpdf(speed_raw | 0, inv_sqrt(1 - inv(J)));
target += normal_lpdf(mu_lambda | 0, 5);
target += exponential_lpdf(tau_lambda | 1);
target += normal_lpdf(lambda_raw | mu_lambda, inv_sqrt(1 - inv(I)) * tau_lambda);
target += normal_lpdf(phi_raw | 0, inv_sqrt(1 - inv(I)));
target += exponential_lpdf( sigma | 5);

for(i in 1:I) target += normal_lpdf( log_rt[,i] |  lambda[i] - phi[i] * speed, sigma[i]);
}


My actual problem materializes when I want to model a multidimensional response time model based on the 3PLN outlined above. Here, I observe the sum of two response time components (for simplicity,
assume that the time discrimination and the residual variance is the same for both components):

RT_{ij} = exp(\lambda_{1i} - \phi_{i} \zeta_{1j} + \epsilon_{ij}) + exp(\lambda_{2i} - \phi_{i} \zeta_{2j} + \epsilon_{ij})

My idea was to implement this model with a Fenton-Wilkinson-Approximation for the sum of lognormals (perhaps you remember the thread). However, with the original constraints the model does not work properly. Here’s what I have got so far:

Model 3 (Multi-Component 3PLN, QR)
functions {

row_vector fenton3(row_vector m, row_vector s, int D, int N_lower) {
vector[N_lower] term;
row_vector out;
real cache1;
real cache2 = log_sum_exp(m + 0.5 * s);
int iter = 1;

term[iter] = (m + m) + 0.5 * (s + s) + s;

for(i in 2:D) {
for(j in 1:(i - 1)) {
iter += 1;
term[iter] = (m[i] + m[j]) + 0.5 * (s[i] + s[j]) + log2();
}
iter += 1;
term[iter] = (m[i] + m[i]) + 0.5 * (s[i] + s[i]) + s[i];
}

cache1 = log_sum_exp(term);

return [2 * cache2 - 0.5 * cache1,
sqrt(cache1 - 2 * cache2)];
}

row_vector[ ] fenton_array (matrix m, matrix s, int D) {
int N = rows(m);
int lower = 3;
row_vector out[N];

for (n in 1:N) out[n] = fenton3(to_row_vector(m[n,]), to_row_vector(s[n,]), D, lower);

return out;
}
}

data {
int<lower=1> I;                      // number of items
int<lower=1> J;                      // number of examinees
matrix<lower=0>[J,I] rt;
}

transformed data {
vector[I] Q_r_diag;
vector[I] Q_r_last;

for(i in 1:I){
Q_r_diag[i] = -sqrt((I-i)/(I-i+1.0));
Q_r_last[i] = inv_sqrt((I-i) * (I-i+1));
}

}

parameters {
matrix[J,2] speed;
cholesky_factor_corr L_speed;
vector[I-1] lambda1_raw;
vector[I-1] lambda2_raw;
vector[I-1] phi_raw;
vector<lower=0>[I] sigma;
}

transformed parameters {
vector[I] lambda1;
vector[I] lambda2;
vector[I] phi;
real lambda1_aux = 0;
real lambda2_aux = 1;
real phi_aux = 0;

for(i in 1:I-1) {
lambda1[i] = lambda1_aux + lambda1_raw[i] * Q_r_diag[i];
lambda1_aux = lambda1_aux + lambda1_raw[i] * Q_r_last[i];
lambda2[i] = lambda2_aux + lambda2_raw[i] * Q_r_diag[i];
lambda2_aux = lambda2_aux + lambda2_raw[i] * Q_r_last[i];
phi[i] = phi_aux + phi_raw[i] * Q_r_diag[i];
phi_aux = phi_aux + phi_raw[i] * Q_r_last[i];
}

lambda1[I] = lambda1_aux;
lambda2[I] = lambda2_aux;
phi[I] = phi_aux;
}

model {

target += lkj_corr_cholesky_lupdf( L_speed | 1);
for(j in 1:J) target += multi_normal_cholesky_lupdf( speed[j] | [0, 0], diag_pre_multiply([1, 1], L_speed));

target += normal_lupdf( lambda1_raw | 0, inv_sqrt(1 - inv(I)));
target += normal_lupdf( lambda2_raw | 0, inv_sqrt(1 - inv(I)));
target += normal_lupdf( phi_raw | 0, inv_sqrt(1 - inv(I)));
target += exponential_lupdf( sigma | 5 );

for(i in 1:I) {
matrix[J,2] mu;
matrix[J,2] sds;
row_vector F[J];
mu[,1] = lambda1[i] - phi[i] * speed[,1];
mu[,2] = exp(phi[i]) * (lambda2[i] / exp(phi[i]) - speed[,2]);
sds[,1] = rep_vector(sigma[i], J);
sds[,2] = rep_vector(sigma[i], J);
F = fenton_array(mu, sds, 2);
target += lognormal_lupdf( rt[,i] | to_vector(F[,1]), to_vector(F[,2]) );
}
}


Please note that in this model, the second component uses an alternative specification of the 3PLN. It is, however, equivalent to the specification outlined above. This model works… somehow, sometimes, but not reliably (it feels quite hacky, especially with \lambda_{2aux} = 1). I am banging my head against a wall with this thing for quite some time now. The main issue is (I think) that I haven’t found a way yet to specify the model so that it is possible to distinguish between the two components sufficiently, and the model is not identified. Of course, the original identification constraints do not help in this case. So I was wondering if there is a way to add additional constraints to an already constrained vector of parameters, such as, for instance, the time intensity parameters \lambda_{1i} of the first component.

FWIW, here’s some code to generate data for the three models: data_gen.R (3.9 KB)

If I misunderstood the focus of this thread, or this is too off-topic, feel free to split the thread.

Best wishes
Chris

1 Like

I don’t understand the model and I’m not familiar with the literature. If you explain what you want in model 3 a bit more, I may be able to help. Can you elaborate on a few things:

• What do the the parameters mean? And why are the current constraints there?
• What does setting \lambda_{2\text{aux}} = 1 achieve?
• What constraint(s) you wish to have on \lambda_{1i}?

Sure, no problem!

(1) Suppose I have a test consisting of I = 30 Items, administered to J = 500 students. To solve each item, two abilities are required (for instance, reading and mathematics). Thus, the data consists of a IxJ matrix of lognormally distributed response times, where each observed response time is the sum of two lognormally distributed response time components. Each component is modeled with a three-parameter lognormal response time model outlined in my previous post. The time needed to complete an item depends a) on the working speed of the student \zeta_{j} (the ‘speed’ parameters in Model 3) and b) on two item parameters, namely the time intensity \lambda_{i} (that’s essentially an intercept indicating the average time needed to complete an item) and the time discrimination \phi_{i} (restricted to be positive; indicating the item-specific effect of working speed on the response time). \zeta and \lambda work on the same scale: if the former increases, there is a negative shift, if the latter increases, there is a positive shift of the location of the time distribution.

Model 3 - Model Block
model {

target += lkj_corr_cholesky_lupdf( L_speed | 1);
for(j in 1:J) target += multi_normal_cholesky_lupdf( speed[j] | [0, 0], diag_pre_multiply([1, 1], L_speed));

target += normal_lupdf( lambda1_raw | 0, inv_sqrt(1 - inv(I)));
target += normal_lupdf( lambda2_raw | 0, inv_sqrt(1 - inv(I)));
target += normal_lupdf( phi_raw | 0, inv_sqrt(1 - inv(I)));
target += exponential_lupdf( sigma | 5 );

for(i in 1:I) {
matrix[J,2] mu;
matrix[J,2] sds;
row_vector F[J];
mu[,1] = lambda1[i] - phi[i] * speed[,1];
mu[,2] = exp(phi[i]) * (lambda2[i] / exp(phi[i]) - speed[,2]);
sds[,1] = rep_vector(sigma[i], J);
sds[,2] = rep_vector(sigma[i], J);
F = fenton_array(mu, sds, 2);
target += lognormal_lupdf( rt[,i] | to_vector(F[,1]), to_vector(F[,2]) );
}
}


Thus, in Model 3 the \zeta, \lambda, \phi and the variances \sigma^2 of both response times components are sampled first. In the following loop, for each resoponse, I calculate the means of the lognormal distributions of the two response time components first, and store them in the mu-matrix. Similarly, I store the sampled variances of the lognormal distributions of the two response time components in the sds-matrix. Both matrices are then supplied to the fenton_array-function, the Fenton-Wilkinson-Approximation for the sum of lognormals. The resulting mean and variance is the stored in the F-array and used in the likelihood at the end of the loop.

The current constraints are there in order to identify the means and variances of the speed parameters. To identify the scales, the product of the time discriminations \phi_{i} is restricted to one, to identify the means, the sum of the time intensities \lambda_{i} is restricted to zero. Fox et al. (2021) mention another option to identify the means, namely by fixing the mean of the speed parameters \zeta_{j} to zero. These restrictions are carried over from the single component case.

(2) Setting \lambda_{2aux} = 1 was just me experimenting what can be done with the QR-based sum-to-zero constraint. It does nothing except that the loop for \lambda_{2} in the transformed parameters block starts from one, and not from zero. Well, the model samples rather quickly (800ms for 30 items and 500 individuals) and converges more often than not. When I subtract the 1 from the estimated time intensities and speed parameters, I recover their true values (more or less from the speed parameters). This may be completely wrong though, and I’d like to get rid of that.

(3) Convergence is problematic especially when the time intensities of the two response time components are too close to each other. So the constraints of the single component case do not seem to be sufficient. Pairs plots show bi-modal posteriors, also between the speed parameters (which may be not that surprising, as they are on the same scale). My idea was to use the SVD sum-to-zero constraint to be able to model the means \mu_{\lambda1} and \mu_{\lambda2} of the time intensities of the two response time components explicitly, so that I can give them an ordered-constraint and/or nudge them further apart. When I do that, however, sampling takes forever and the model does not converge. So, I think my main problem is how to identify the two reponse time components in Model 3, while retaining the original sum-to-zero constraints for the time intensity parameters \lambda_{1i} and \lambda_{2i}. Perhaps there is another way, perhaps there is none, and the model is just not feasible, I am not sure.

As a side note, when the model converges I recover the generating parameters quite well, especially for the \lambda, \phi and \sigma^2 parameters, and the speed parameters \zeta to a certain extent. Diagnostics show no issues, no divergences, no BFMI warning, no maximum treedepth warning, nothing, and good effective sample sizes. When the model does not convergence, I still have no diagnostic warnings except for Rhat being larger than the cutoff, but Rhat is something about 1.7 and the effective sample size is at 17-something for almost all parameters. So I wonder if the sampler just gets stuck right at the beginning, and does not move at all.

I hope all this is at least a little bit comprehensible. If anything’s unclear, or there are further questions, please let me know!

1 Like