Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain


#1

Hello,

I have done some tests comparing soft vs hard sum-to-zero constrain for a simplex regression. The simulated data is really defined having incredibly low error

image

HARD
148.82 seconds (Total)

               mean      se_mean          sd       2.5%        25%        50%        75%      97.5%    n_eff      Rhat
weights[1,1] -1.1458828 0.0005821308 0.007623225 -1.1599872 -1.1514000 -1.1455387 -1.1406438 -1.1312168 171.4891 0.9993008
weights[1,2] -0.4548668 0.0003671708 0.005209648 -0.4658918 -0.4585181 -0.4546926 -0.4514349 -0.4447093 201.3171 0.9986171
weights[1,3]  0.2378516 0.0004638308 0.005466593  0.2277121  0.2344816  0.2379877  0.2412291  0.2504590 138.9039 1.0068702
weights[1,4]  1.3904648 0.0003544976 0.004300609  1.3823475  1.3877333  1.3905326  1.3930715  1.3992900 147.1747 1.0046746
weights[1,5]  0.9263817 0.0002775686 0.003620667  0.9190944  0.9238915  0.9263782  0.9289837  0.9336622 170.1518 1.0000263
weights[2,1]  0.2195011 0.0008506887 0.010793117  0.1986463  0.2123535  0.2191074  0.2271519  0.2411544 160.9728 0.9996466
weights[2,2] -0.7747972 0.0002818525 0.005537706 -0.7853681 -0.7783540 -0.7751138 -0.7715074 -0.7628066 386.0255 0.9979987
weights[2,3] -2.7699840 0.0006716145 0.009954892 -2.7892922 -2.7763711 -2.7700132 -2.7634324 -2.7494017 219.7015 1.0006988
weights[2,4] -0.7760190 0.0003541681 0.005065385 -0.7863347 -0.7793836 -0.7757947 -0.7726554 -0.7662422 204.5530 1.0045940
weights[2,5]  1.2296639 0.0002174585 0.003480065  1.2226467  1.2274545  1.2297332  1.2320280  1.2366653 256.1072 0.9994862

SOFT | sum( weights[i] ) ~ normal(0,0.0001 * ncat)
274.33 seconds (Total)

                   mean      se_mean          sd        2.5%        25%         50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.3391406 2.293569e-04 0.004116341 -1.34689154 -1.3419479 -1.33901422 -1.33620911 -1.33157701 322.1062 1.0021981
weights[1,2] -0.6460621 1.665210e-04 0.003204347 -0.65266495 -0.6483706 -0.64596065 -0.64375407 -0.63989639 370.2892 0.9981509
weights[1,3]  0.0487602 1.747026e-04 0.003143532  0.04266207  0.0467480  0.04875449  0.05087039  0.05509691 323.7701 0.9989046
weights[1,4]  1.2004083 9.242863e-05 0.001855101  1.19674888  1.1992482  1.20052803  1.20162172  1.20403749 402.8302 0.9983294
weights[1,5]  0.7360301 9.226777e-05 0.001954185  0.73247408  0.7346718  0.73600215  0.73736124  0.73968258 448.5710 0.9988598
weights[2,1]  0.7972074 4.435837e-04 0.007603421  0.78291908  0.7918101  0.79735841  0.80248565  0.81130783 293.8102 0.9998841
weights[2,2] -0.2007952 3.901609e-04 0.006597414 -0.21296538 -0.2057596 -0.20103038 -0.19646600 -0.18663437 285.9302 0.9980394
weights[2,3] -2.2042454 4.989408e-04 0.008751689 -2.22082484 -2.2106547 -2.20442567 -2.19815306 -2.18722835 307.6703 0.9980393
weights[2,4] -0.1993981 2.050149e-04 0.003841183 -0.20697032 -0.2019425 -0.19949765 -0.19673615 -0.19200305 351.0421 0.9990791
weights[2,5]  1.8071866 1.857058e-04 0.003765082  1.79999651  1.8046481  1.80760927  1.80970585  1.81433016 411.0529 0.9987926

SOFT | sum( weights[i] ) ~ normal(0,0.001 * ncat)
79.38 seconds (Total)

                    mean      se_mean          sd        2.5%         25%         50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.33906679 0.0002355011 0.004312584 -1.34714862 -1.34190566 -1.33907841 -1.33642318 -1.33011286 335.3429 1.0079406
weights[1,2] -0.64618862 0.0001428191 0.003193533 -0.65236610 -0.64832035 -0.64620616 -0.64413122 -0.64008974 500.0000 0.9983157
weights[1,3]  0.04869102 0.0001630507 0.003289347  0.04241653  0.04658683  0.04871578  0.05059692  0.05539219 406.9810 0.9984773
weights[1,4]  1.20037635 0.0001118742 0.002209674  1.19604522  1.19900284  1.20043974  1.20183060  1.20452307 390.1188 1.0010383
weights[1,5]  0.73600669 0.0001068532 0.002141574  0.73174139  0.73448537  0.73612169  0.73737938  0.74006021 401.6900 1.0019666
weights[2,1]  0.79718099 0.0004392946 0.007917254  0.78134531  0.79165344  0.79744736  0.80230182  0.81249897 324.8159 1.0049828
weights[2,2] -0.20085779 0.0003315689 0.005952465 -0.21337641 -0.20493411 -0.20081094 -0.19664387 -0.18971984 322.2895 0.9980259
weights[2,3] -2.20422130 0.0004126981 0.008191094 -2.21970960 -2.20939100 -2.20411915 -2.19900825 -2.18731448 393.9298 0.9982997
weights[2,4] -0.19941701 0.0002299476 0.004042518 -0.20745008 -0.20198728 -0.19939631 -0.19665669 -0.19174443 309.0624 1.0004030
weights[2,5]  1.80712068 0.0001937383 0.003534819  1.80038185  1.80474004  1.80740252  1.80927082  1.81423255 332.8918 1.0021593

SOFT | sum( weights[i] ) ~ normal(0,0.01 * ncat)
140.97 seconds (Total)

                    mean      se_mean         sd        2.5%         25%         50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.33955063 0.0008641711 0.01110507 -1.36052415 -1.34727240 -1.33957167 -1.33235902 -1.31665335 165.1364 1.0017526
weights[1,2] -0.64652730 0.0008300391 0.01070266 -0.66796289 -0.65353296 -0.64644981 -0.64006590 -0.62298819 166.2593 1.0066967
weights[1,3]  0.04793137 0.0008624810 0.01088247  0.02781052  0.04031121  0.04761735  0.05478967  0.07136967 159.2047 1.0052243
weights[1,4]  1.19978972 0.0008158316 0.01044942  1.17947424  1.19230257  1.19978662  1.20652287  1.22099937 164.0527 1.0063044
weights[1,5]  0.73554133 0.0008443594 0.01048765  0.71558505  0.72851332  0.73577050  0.74265520  0.75798619 154.2771 1.0061033
weights[2,1]  0.79638481 0.0009657271 0.01240122  0.77180348  0.78893469  0.79593330  0.80412887  0.81997343 164.8996 1.0064321
weights[2,2] -0.20158482 0.0009524633 0.01252942 -0.22578230 -0.21016327 -0.20207057 -0.19368810 -0.17667237 173.0474 0.9994586
weights[2,3] -2.20447889 0.0008962483 0.01329836 -2.23148403 -2.21309907 -2.20474485 -2.19592306 -2.17677002 220.1604 0.9995204
weights[2,4] -0.19983153 0.0009872462 0.01170564 -0.22269266 -0.20701587 -0.20031971 -0.19219750 -0.17605516 140.5851 1.0014370
weights[2,5]  1.80641418 0.0009663525 0.01159318  1.78408569  1.79912203  1.80546777  1.81424173  1.82954490 143.9243 1.0020987

SOFT | sum( weights[i] ) ~ normal(0,0.1 * ncat)
858.51 seconds (Total)

                    mean     se_mean         sd       2.5%           25%         50%        75%        97.5%    n_eff      Rhat
weights[1,1] -1.32546963 0.008589777 0.09236797 -1.5237648 -1.3866248889 -1.32259656 -1.2604053 -1.161136991 115.6323 0.9986507
weights[1,2] -0.63236712 0.008551023 0.09215380 -0.8263866 -0.6930839905 -0.62824401 -0.5679193 -0.467462170 116.1423 0.9985147
weights[1,3]  0.06198528 0.008606207 0.09239113 -0.1359249 -0.0001658668  0.06443147  0.1283943  0.229030744 115.2489 0.9985782
weights[1,4]  1.21397869 0.008559155 0.09211254  1.0158682  1.1538334071  1.21748272  1.2796228  1.379548700 115.8179 0.9986014
weights[1,5]  0.74969554 0.008550612 0.09208181  0.5501582  0.6877385938  0.75286649  0.8144900  0.916024662 115.9720 0.9985180
weights[2,1]  0.78904096 0.008271534 0.10159131  0.5726654  0.7349224799  0.78935339  0.8587340  0.984562498 150.8485 1.0138684
weights[2,2] -0.20872677 0.008256790 0.10111547 -0.4265115 -0.2657348554 -0.20996668 -0.1389566 -0.016155017 149.9729 1.0122870
weights[2,3] -2.21062558 0.008298003 0.10127022 -2.4275077 -2.2657670090 -2.21437824 -2.1449375 -2.007571547 148.9417 1.0131808
weights[2,4] -0.20691604 0.008281673 0.10113992 -0.4194727 -0.2619759795 -0.20620171 -0.1383121 -0.008027167 149.1451 1.0133139
weights[2,5]  1.79936795 0.008289542 0.10118807  1.5851504  1.7427574705  1.79884520  1.8712273  1.994083503 149.0038 1.0127490

HARD model

data { 
  int<lower=1> N;  // total number of observations 
	int<lower=2> ncat;  // number of categories
	int<lower=2> input_dim; // number of predictor levels
	matrix[N,input_dim] X; // predictor design matrix
	vector[ncat] Y[N];  // response variable (simplex?)
}

parameters {
	matrix[input_dim, ncat - 1] weights_raw; // coefficients
	real<lower=1> v;
}
transformed parameters{

	matrix[input_dim, ncat] weights;
	
	for(n in 1:(ncat-1)) for(m in 1:input_dim) weights[m,n] = weights_raw[m,n];
	for(m in 1:input_dim) weights[m, ncat] = -sum(weights_raw[,m]);

}
model {
	matrix[N, ncat] logits;
	
	for (i in 1:input_dim) weights[i] ~ normal(0,2);

	logits = X * weights;
	
	for (n in 1:N) Y[n] ~ dirichlet(softmax(to_vector(logits[n])) * v); 

}

SOFT model

data { 
	int<lower=1> N;  // total number of observations 
	int<lower=2> ncat;  // number of categories
	int<lower=2> input_dim; // number of predictor levels
	matrix[N,input_dim] X; // predictor design matrix
	vector[ncat] Y[N];  // response variable (simplex?)
	real pr;
}
parameters {
	matrix[input_dim, ncat] weights; // coefficients
	real<lower=ncat> v;
}
model {
	matrix[N, ncat] logits;
	
	for (i in 1:input_dim) sum( weights[i] ) ~ normal(0,pr * ncat);
	for (i in 1:input_dim) weights[i] ~ normal(0,2);
	v ~ cauchy(0, 2.5);
	
	logits = X * weights;
	for (n in 1:N) Y[n] ~ dirichlet(softmax(to_vector(logits[n])) * v + 1); 

}

I hope this helps, I am interested in testing the best setting for bigger models, and update the reference manual with useful info.

Bw.


#2

I think the first question is why the soft constraint gets such different results from the hard constraint. Yet the result from the soft constraint doesn’t seem in any way sensitive to how hard the constraint is.

I don’t know why 0.001 * ncat is the magic scale. That’s why I suggested posting here. I was hoping @Daniel_Simpson or @mitzimorris or @avehtari would jump in. It’s related to the spatial ICAR model soft smoothing. I have to imagine it has to be set in balance to the parameter scales, which seem to be roughly on the unit scale here.


#3

There are quite big differences in n_eff’s and “SOFT | sum( weights[i] ) ~ normal(0,0.001 * ncat)” is near optimal performance. I would have assumed hard would work better since there is less parameters, while in soft the distribution is narrow in one dimension and mostly laying in lower dimensional manifold and it’s difficult to set step size well to work in that case, but then in the hard case I guess there is some difficult corners. You might learn something by looking at the joint distributions, treedepths and step sizes in different cases. If you use R, use shinystan for a quick exploration of these.


#4

and also

missing in one of the models. I’m lost and missed the message.


#5

Sorry for that,

I was doing several experiments and I published a inconsistent combinations of models. Here a consistent experiment. The results however are similar.

HARD
125.95 seconds (Total)

                    mean      se_mean          sd       2.5%        25%        50%        75%      97.5%    n_eff      Rhat
weights[1,1] -1.1465013 0.0004886820 0.006971312 -1.1601937 -1.1513639 -1.1463705 -1.1416479 -1.1333525 203.5056 1.0079304
weights[1,2] -0.4552810 0.0003040464 0.004944080 -0.4649425 -0.4586068 -0.4553149 -0.4515520 -0.4457402 264.4181 1.0049710
weights[1,3]  0.2377970 0.0003408599 0.004993808  0.2276239  0.2345214  0.2378231  0.2411309  0.2472629 214.6407 0.9998882
weights[1,4]  1.3900697 0.0002844863 0.004030123  1.3826840  1.3874061  1.3898325  1.3926650  1.3981216 200.6846 0.9999815
weights[1,5]  0.9259694 0.0002378072 0.003464150  0.9194198  0.9235811  0.9258864  0.9283375  0.9327450 212.1991 1.0047354
weights[2,1]  0.2205320 0.0007107085 0.009878573  0.2008913  0.2139086  0.2201455  0.2277367  0.2396755 193.1992 1.0078259
weights[2,2] -0.7746503 0.0003263312 0.005809345 -0.7855681 -0.7782734 -0.7748686 -0.7706206 -0.7631332 316.9110 0.9998456
weights[2,3] -2.7707700 0.0006061541 0.009914921 -2.7902098 -2.7780006 -2.7709846 -2.7642110 -2.7530808 267.5546 0.9983152
weights[2,4] -0.7757060 0.0002913793 0.004788293 -0.7856249 -0.7786280 -0.7755793 -0.7723950 -0.7677324 270.0499 0.9981574
weights[2,5]  1.2299313 0.0002209526 0.003570211  1.2225953  1.2274838  1.2301994  1.2323409  1.2365122 261.0895 1.0000729

SOFT | sum( weights[i] ) ~ normal(0,0.0001 * ncat)
290.33 seconds (Total)

                    mean      se_mean          sd        2.5%         25%        50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.33606452 2.233669e-04 0.003956647 -1.34411258 -1.33864764 -1.3359613 -1.33356419 -1.32858988 313.7741 1.0028179
weights[1,2] -0.64592482 2.063405e-04 0.003223662 -0.65206824 -0.64813095 -0.6456662 -0.64397375 -0.63922112 244.0788 1.0016192
weights[1,3]  0.04662318 1.695684e-04 0.003066734  0.04044836  0.04465157  0.0467398  0.04884645  0.05232253 327.0863 1.0003346
weights[1,4]  1.19968766 9.557302e-05 0.001813214  1.19622700  1.19848782  1.1997581  1.20091828  1.20311270 359.9377 0.9998433
weights[1,5]  0.73565608 9.445477e-05 0.001968848  0.73203101  0.73427374  0.7356579  0.73696070  0.73974253 434.4870 0.9980832
weights[2,1]  0.79304821 4.310247e-04 0.007267103  0.77863455  0.78837851  0.7927893  0.79770727  0.80777386 284.2616 1.0028358
weights[2,2] -0.20047982 4.242823e-04 0.006329122 -0.21304798 -0.20456771 -0.2009334 -0.19644000 -0.18728606 222.5241 0.9997584
weights[2,3] -2.19435188 4.687465e-04 0.007774505 -2.20866714 -2.20003388 -2.1944085 -2.18950641 -2.17902065 275.0866 1.0008160
weights[2,4] -0.20183196 1.987749e-04 0.003853673 -0.20952413 -0.20433952 -0.2019033 -0.19904516 -0.19421178 375.8605 0.9987363
weights[2,5]  1.80359497 1.978071e-04 0.003752877  1.79626007  1.80137237  1.8038111  1.80594705  1.81062420 359.9524 0.9983332

SOFT | sum( weights[i] ) ~ normal(0,0.001 * ncat)
64.91 seconds (Total)

                    mean      se_mean          sd        2.5%         25%         50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.33663371 2.287285e-04 0.004189502 -1.34434299 -1.33970760 -1.33660746 -1.33411704 -1.32830485 335.4937 0.9985142
weights[1,2] -0.64571019 1.742974e-04 0.003520531 -0.65248899 -0.64819238 -0.64564299 -0.64310874 -0.63894220 407.9760 0.9979985
weights[1,3]  0.04702615 2.085083e-04 0.003201486  0.04114344  0.04473525  0.04708823  0.04934167  0.05306461 235.7525 1.0041226
weights[1,4]  1.19954926 9.704660e-05 0.002135452  1.19512535  1.19813002  1.19959893  1.20093319  1.20380155 484.1933 1.0049865
weights[1,5]  0.73551910 9.957487e-05 0.002180175  0.73124466  0.73421800  0.73551283  0.73695210  0.74039517 479.3836 0.9999369
weights[2,1]  0.79409044 4.375994e-04 0.007419255  0.77951596  0.78915406  0.79401714  0.79958507  0.80814339 287.4532 0.9993718
weights[2,2] -0.20067554 3.611469e-04 0.006820617 -0.21507234 -0.20527549 -0.20085887 -0.19613695 -0.18691095 356.6807 0.9981562
weights[2,3] -2.19590990 4.747673e-04 0.007757611 -2.20963550 -2.20142379 -2.19614786 -2.19089721 -2.18068420 266.9896 1.0068678
weights[2,4] -0.20148228 2.021728e-04 0.004071686 -0.20908088 -0.20399383 -0.20173291 -0.19905511 -0.19266087 405.6048 1.0045090
weights[2,5]  1.80388372 1.812201e-04 0.003673086  1.79670112  1.80151058  1.80381776  1.80651506  1.81067079 410.8182 0.9992769

SOFT | sum( weights[i] ) ~ normal(0,0.01 * ncat)
154.85 seconds (Total)

                    mean      se_mean         sd        2.5%         25%        50%         75%       97.5%    n_eff      Rhat
weights[1,1] -1.33654686 0.0007855698 0.01113060 -1.35945171 -1.34378431 -1.3364848 -1.32881822 -1.31573457 200.7556 0.9985261
weights[1,2] -0.64545211 0.0007980272 0.01105569 -0.66760509 -0.65286856 -0.6453719 -0.63819605 -0.62419660 191.9271 0.9991639
weights[1,3]  0.04727457 0.0008349868 0.01101993  0.02490617  0.03966358  0.0469636  0.05523526  0.06782581 174.1802 0.9981578
weights[1,4]  1.19989106 0.0008174362 0.01060791  1.17930789  1.19288929  1.1995758  1.20730877  1.22016364 168.4039 0.9985818
weights[1,5]  0.73577012 0.0008165924 0.01066193  0.71427278  0.72889687  0.7353582  0.74306747  0.75729097 170.4751 0.9990746
weights[2,1]  0.79394836 0.0008634479 0.01286363  0.77119308  0.78498017  0.7931229  0.80287852  0.81938392 221.9498 1.0002287
weights[2,2] -0.20091307 0.0007702690 0.01181080 -0.22537387 -0.20832089 -0.2014652 -0.19237443 -0.17877137 235.1117 1.0044895
weights[2,3] -2.19628513 0.0007586940 0.01305487 -2.21847612 -2.20552955 -2.1969898 -2.18732020 -2.17091780 296.0819 0.9985651
weights[2,4] -0.20189928 0.0007101005 0.01050309 -0.22012201 -0.20965404 -0.2023093 -0.19398454 -0.18158648 218.7735 1.0021871
weights[2,5]  1.80357788 0.0007374207 0.01069541  1.78423600  1.79569560  1.8029403  1.81145146  1.82321161 210.3606 1.0050114

HARD model

data { 
  int<lower=1> N;  // total number of observations 
	int<lower=2> ncat;  // number of categories
	int<lower=2> input_dim; // number of predictor levels
	matrix[N,input_dim] X; // predictor design matrix
	vector[ncat] Y[N];  // response variable (simplex?)
}
parameters {
	matrix[input_dim, ncat - 1] weights_raw; // coefficients
	real<lower=1> v;
}
transformed parameters{
	matrix[input_dim, ncat] weights;
	for(n in 1:(ncat-1)) for(m in 1:input_dim) weights[m,n] = weights_raw[m,n];
	for(m in 1:input_dim) weights[m, ncat] = -sum(weights_raw[,m]);
}
model {
	matrix[N, ncat] logits;
	v ~ cauchy(0, 2.5);

	for (i in 1:input_dim) weights[i] ~ normal(0,2);
	logits = X * weights;
	for (n in 1:N) Y[n] ~ dirichlet(softmax(to_vector(logits[n])) * v); 
}

SOFT model

data { 
	int<lower=1> N;  // total number of observations 
	int<lower=2> ncat;  // number of categories
	int<lower=2> input_dim; // number of predictor levels
	matrix[N,input_dim] X; // predictor design matrix
	vector[ncat] Y[N];  // response variable (simplex?)
	real pr;
}
parameters {
	matrix[input_dim, ncat] weights; // coefficients
	real<lower=ncat> v;
}
model {
	matrix[N, ncat] logits;
	v ~ cauchy(0, 2.5);

	for (i in 1:input_dim) sum( weights[i] ) ~ normal(0,pr * ncat);
	for (i in 1:input_dim) weights[i] ~ normal(0,2);
	logits = X * weights;
	for (n in 1:N) Y[n] ~ dirichlet(softmax(to_vector(logits[n])) * v); 

}

Bw.


#6

Considering the HARD model:

I’d expect weights_raw ~ normal(0, 2);

According to Stan manual 2.17
9.7. Parameterizing Centered Vectors


#7

Thanks @andre.pfeuffer .

the results with this edit are analogous.


#8

Comparing hard vs soft centering on different model, I am starting to comcplude that the soft centering is not really robust on varying the scaling prior @Bob_Carpenter @mitzimorris

  • For example, in my previous test the results show that the performance vary with the scale

  • For new tests the results show that while the HARD centring converges

    mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
    alpha[1]   -2.50    0.00 0.03 -2.56 -2.52 -2.50 -2.48 -2.44   359 1.01
    alpha[2]    0.87    0.00 0.01  0.86  0.87  0.87  0.88  0.88   691 1.00
    alpha[3]    1.95    0.00 0.00  1.94  1.94  1.95  1.95  1.96   371 1.01
    alpha[4]    0.43    0.00 0.01  0.42  0.43  0.43  0.44  0.45   812 1.00
    alpha[5]    1.83    0.00 0.00  1.82  1.83  1.83  1.83  1.84   347 1.00
    alpha[6]   -0.84    0.00 0.01 -0.87 -0.85 -0.84 -0.83 -0.81  1000 1.00
    alpha[7]    0.33    0.00 0.01  0.31  0.32  0.33  0.33  0.34   817 1.00
    alpha[8]   -0.49    0.00 0.01 -0.51 -0.50 -0.49 -0.48 -0.47  1000 1.00
    alpha[9]   -0.03    0.00 0.01 -0.05 -0.04 -0.03 -0.02 -0.01   776 1.00
    alpha[10]  -1.30    0.00 0.02 -1.33 -1.31 -1.30 -1.28 -1.26  1000 1.00
    alpha[11]   0.73    0.00 0.01  0.72  0.72  0.73  0.73  0.74  1000 1.00
    alpha[12]  -6.07    0.01 0.20 -6.49 -6.20 -6.06 -5.94 -5.70  1000 1.00
  • the SOFT centring converges or not depending on the scaling.

sum(alpha) ~ normal(0, 0.001); // soft identification of control - Divergent!

     mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff  Rhat
    alpha[1]   -2.03    0.23 0.37 -2.63 -2.35 -1.94 -1.67 -1.49     3  5.18
    alpha[2]    0.53    0.08 0.16  0.20  0.40  0.50  0.70  0.78     4  2.54
    alpha[3]    1.55    0.12 0.21  1.20  1.38  1.54  1.77  1.85     3  3.44
    alpha[4]   -0.06    0.27 0.39 -0.73 -0.25  0.06  0.26  0.35     2  8.30
    alpha[5]    1.45    0.11 0.21  1.04  1.28  1.46  1.66  1.74     4  2.99
    alpha[6]   -0.97    0.08 0.14 -1.13 -1.08 -1.01 -0.92 -0.58     3  2.98
    alpha[7]   -0.07    0.19 0.28 -0.56 -0.20 -0.02  0.15  0.24     2  8.07
    alpha[8]   -0.66    0.03 0.11 -0.81 -0.75 -0.64 -0.60 -0.32    11  1.47
    alpha[9]   -0.66    0.38 0.54 -1.58 -0.93 -0.54 -0.20 -0.12     2 10.10
    alpha[10]  -1.35    0.29 0.41 -1.81 -1.60 -1.52 -1.17 -0.65     2 15.33
    alpha[11]  -0.03    0.58 0.83 -1.51 -0.27  0.36  0.55  0.64     2 13.55
    alpha[12]  -2.91    1.37 2.16 -5.99 -5.32 -2.05 -0.84 -0.17     2  4.69
    alpha[13]   0.89    0.25 0.43 -0.12  0.88  1.01  1.20  1.28     3  4.07
    alpha[14]   0.06    0.51 0.76 -1.28 -0.60  0.47  0.65  0.73     2  6.39
    alpha[15]   0.93    0.21 0.45 -0.42  0.93  1.05  1.24  1.32     5  3.36
    alpha[16]  -0.25    0.06 0.16 -0.75 -0.30 -0.25 -0.13 -0.06     7  2.27
    alpha[17]   1.26    0.14 0.27  0.48  1.16  1.29  1.49  1.57     4  2.82

sum(alpha) ~ normal(0, 0.01); // soft identification of control - Converges, less efficiently than hard

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1]   -2.50    0.00 0.03 -2.57 -2.52 -2.50 -2.48 -2.44   258 1.01
alpha[2]    0.87    0.00 0.01  0.86  0.87  0.87  0.88  0.89   134 1.05
alpha[3]    1.95    0.00 0.01  1.94  1.94  1.95  1.95  1.96    20 1.12
alpha[4]    0.43    0.00 0.01  0.42  0.43  0.43  0.44  0.45   111 1.04
alpha[5]    1.83    0.00 0.01  1.82  1.83  1.83  1.83  1.84    20 1.12
alpha[6]   -0.84    0.00 0.02 -0.87 -0.85 -0.84 -0.83 -0.81   941 1.01
alpha[7]    0.33    0.00 0.01  0.31  0.32  0.33  0.33  0.35   150 1.03
alpha[8]   -0.49    0.00 0.01 -0.51 -0.50 -0.49 -0.48 -0.47  1000 1.02
alpha[9]   -0.03    0.00 0.01 -0.05 -0.04 -0.03 -0.02 -0.01  1000 1.02
alpha[10]  -1.30    0.00 0.02 -1.33 -1.31 -1.30 -1.28 -1.26  1000 1.01
alpha[11]   0.73    0.00 0.01  0.72  0.73  0.73  0.74  0.75   103 1.04
alpha[12]  -6.07    0.01 0.18 -6.43 -6.20 -6.06 -5.94 -5.72  1000 1.00
alpha[13]   1.37    0.00 0.01  1.36  1.37  1.37  1.38  1.38    30 1.08

So in my opinion while in some occasions soft centring confers benefit, in some cases it is not worth the time spent to find the “sweet spot” for such prior.

The model for soft (you can imagine the model for hard)

    data {
      int<lower = 0> F;                   // "fixed" genes
      int<lower = 0> T;                   // tube
      int<lower = 0> y[T, F];             // RNA-seq counts
      int<lower = 0, upper = 1> cond[T];  // condition 0 or 1
    }
    parameters {
      vector[F] alpha;  // control log relative expression
    }
    model {
    	vector[F] phi;
    	
    	phi = softmax( alpha );
      for (t in 1:T) y[t] ~ multinomial(phi);

      alpha ~ normal(0, 1);    // log relative expression in control
      sum(alpha) ~ normal(0, 0.01);  // soft identification of control


    }

#9

I did some tests on the scale of the soft-centering using an ICAR prior and a simple Poisson model:

functions {
 void icar_normal_lp(int N, int[] node1, int[] node2, real s, vector phi) {
   target += -0.5 * dot_self(phi[node1] - phi[node2]);
  // soft sum-to-zero constraint on phi
  // more efficient than mean(phi) ~ normal(0, s)
  sum(phi) ~ normal(0, s * N);
 }
}
data {
  real<lower=0, upper=0.1> s;  // scale close to zero
  int<lower=0> N;
  int<lower=0> N_edges;
  int<lower=1, upper=N> node1[N_edges];  // node1[i] adjacent to node2[i]
  int<lower=1, upper=N> node2[N_edges];  // and node1[i] < node2[i]

  int<lower=0> y[N];              // count outcomes
  vector<lower=0>[N] E;           // exposure
}
transformed data {
  vector[N] log_E = log(E);
}
parameters {
  real beta0;             // intercept
  real<lower=0> sigma;    // overall standard deviation
  vector[N] phi;         // spatial effects
}
model {
  y ~ poisson_log(log_E + beta0 + phi * sigma);
  beta0 ~ normal(0.0, 1.0);
  sigma ~ normal(0.0, 1.0);
  icar_normal_lp(N, node1, node2, s, phi);

}

I ran this over the NYC pedestrian traffic data used in the ICAR case study, and tried scale of 0.1, 0.01, and 0.001, 3 chains, 2000 iterations (default). when the scale was 0.1 the Rhat values indicated failure to converge, however there were no divergences or other warnings, only Rhat values above 1.1 (close but no cigar Rhats).
because the scale 0.1 didn’t really converge well, sampling took much longer, and the overall warmup times seemed longer too. there wasn’t much difference between scale 0.01 and 0.001 - here’s the times for these latter two:

NYC:  1921 regions, intercept only, pois + icar
s = 0.01
3 parallel chains:
 Elapsed Time: 505.259 seconds (Warm-up)
 Elapsed Time: 555.046 seconds (Warm-up)
 Elapsed Time: 575.119 seconds (Warm-up)
               262.570 seconds (Sampling)
               250.358 seconds (Sampling)
               243.568 seconds (Sampling)
               767.829 seconds (Total)
               805.404 seconds (Total)
               818.687 seconds (Total)

****************

s = 0.001
3 parallel chains:
 Elapsed Time: 528.069 seconds (Warm-up)
 Elapsed Time: 543.13 seconds (Warm-up)
 Elapsed Time: 556.438 seconds (Warm-up)
               273.678 seconds (Sampling)
               268.504 seconds (Sampling)
               265.584 seconds (Sampling)
               801.747 seconds (Total)
               811.634 seconds (Total)
               822.021 seconds (Total)

I reran the model with scale 0.1 several times - there the warmup took on the order of 600 - 700 seconds, as did the sampling - given that there were 1000 iterations in both sampling and warmups, the fact that sampling iterations took as long as warmups is also an indication of failure to converge.


#10

this is missing a factor - scale should be multiplied by the length of alpha, because what you’re trying to do is put the soft constraint on the mean of alpha.


#11

True! I forgot about that, let me check again with the right model


#12

re-run soft models

sum(alpha) ~ normal(0, 0.01 * F); - 54.02 seconds (Total)

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1]   -2.50    0.00 0.03 -2.57 -2.52 -2.50 -2.48 -2.44  1000 1.01
alpha[2]    0.86    0.00 0.01  0.85  0.86  0.87  0.87  0.88    29 1.16
alpha[3]    1.94    0.00 0.01  1.92  1.93  1.94  1.94  1.95    18 1.30
alpha[4]    0.43    0.00 0.01  0.41  0.42  0.43  0.43  0.45    34 1.14
alpha[5]    1.82    0.00 0.01  1.81  1.82  1.82  1.83  1.84    20 1.28
alpha[6]   -0.85    0.00 0.01 -0.88 -0.86 -0.85 -0.84 -0.82    74 1.04
alpha[7]    0.32    0.00 0.01  0.30  0.31  0.32  0.33  0.34    38 1.11
alpha[8]   -0.50    0.00 0.01 -0.53 -0.51 -0.50 -0.49 -0.47    66 1.06
alpha[9]   -0.04    0.00 0.01 -0.06 -0.04 -0.04 -0.03 -0.02    42 1.10
alpha[10]  -1.30    0.00 0.02 -1.35 -1.32 -1.30 -1.29 -1.26  1000 1.03
alpha[11]   0.72    0.00 0.01  0.71  0.72  0.72  0.73  0.74    27 1.17
alpha[12]  -6.08    0.01 0.20 -6.46 -6.22 -6.08 -5.94 -5.72  1000 1.00
alpha[13]   1.36    0.00 0.01  1.35  1.36  1.36  1.37  1.38    22 1.22
alpha[14]   0.82    0.00 0.01  0.80  0.81  0.82  0.82  0.83    26 1.18

sum(alpha) ~ normal(0, 0.001 * F); - 61.32 seconds (Total)

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1]   -2.50    0.00 0.03 -2.56 -2.52 -2.50 -2.48 -2.43  1000 1.00
alpha[2]    0.87    0.00 0.01  0.86  0.87  0.87  0.88  0.88  1000 1.04
alpha[3]    1.95    0.00 0.00  1.94  1.94  1.95  1.95  1.95    42 1.09
alpha[4]    0.43    0.00 0.01  0.42  0.43  0.43  0.44  0.45  1000 1.02
alpha[5]    1.83    0.00 0.00  1.82  1.82  1.83  1.83  1.84    30 1.11
alpha[6]   -0.85    0.00 0.01 -0.87 -0.86 -0.85 -0.84 -0.82  1000 1.00
alpha[7]    0.33    0.00 0.01  0.31  0.32  0.33  0.33  0.34  1000 1.01
alpha[8]   -0.49    0.00 0.01 -0.52 -0.50 -0.49 -0.48 -0.47  1000 1.01
alpha[9]   -0.03    0.00 0.01 -0.05 -0.04 -0.03 -0.02 -0.01  1000 1.02
alpha[10]  -1.30    0.00 0.02 -1.33 -1.31 -1.30 -1.28 -1.26  1000 1.00
alpha[11]   0.73    0.00 0.01  0.71  0.72  0.73  0.73  0.74  1000 1.02
alpha[12]  -6.06    0.01 0.19 -6.45 -6.19 -6.05 -5.93 -5.70  1000 1.00
alpha[13]   1.37    0.00 0.01  1.36  1.37  1.37  1.37  1.38   173 1.05
alpha[14]   0.82    0.00 0.01  0.81  0.82  0.82  0.83  0.84  1000 1.05
alpha[15]   1.41    0.00 0.01  1.40  1.41  1.41  1.41  1.42    94 1.05

So in this case I saw an overall increase of performances compared to hard constrain, for NON hierarchical


#13

compared same pois + ICAR model with hard sum-to-zero constraint - results:

NYC:  1921 regions, intercept only, pois + icar

****************  soft center, s = 0.01
3 parallel chains:
 Elapsed Time: 505.259 seconds (Warm-up)
 Elapsed Time: 555.046 seconds (Warm-up)
 Elapsed Time: 575.119 seconds (Warm-up)
               262.570 seconds (Sampling)
               250.358 seconds (Sampling)
               243.568 seconds (Sampling)
               767.829 seconds (Total)
               805.404 seconds (Total)
               818.687 seconds (Total)

****************  hard center
3 parallel chains:
 Elapsed Time: 1226.47 seconds (Warm-up)
 Elapsed Time: 1251.58 seconds (Warm-up)
 Elapsed Time: 1317.24 seconds (Warm-up)
               1060.43 seconds (Sampling)
               1075.43 seconds (Sampling)
               1045.03 seconds (Sampling)
               2301.9 seconds (Total)
               2312.01 seconds (Total)
               2362.27 seconds (Total)

hard center is 3x’s slower.

here’s the model:

data {
  real<lower=0, upper=0.1> s;  // scale close to zero
  int<lower=0> N;
  int<lower=0> N_edges;
  int<lower=1, upper=N> node1[N_edges];  // node1[i] adjacent to node2[i]
  int<lower=1, upper=N> node2[N_edges];  // and node1[i] < node2[i]

  int<lower=0> y[N];              // count outcomes
  vector<lower=0>[N] E;           // exposure
}
transformed data {
  vector[N] log_E = log(E);
}
parameters {
  real beta0;             // intercept
  real<lower=0> sigma;    // overall standard deviation
  vector[N-1] phi_raw;         // spatial effects
}
transformed parameters {
  vector[N] phi;
  phi[1:N-1] = phi_raw;
  phi[N] = -sum(phi_raw);
}
model {
  y ~ poisson_log(log_E + beta0 + phi * sigma);
  beta0 ~ normal(0.0, 1.0);
  sigma ~ normal(0.0, 1.0);
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
}
generated quantities {
  vector[N] eta = log_E + beta0 + phi * sigma;
  vector[N] mu = exp(eta);
}

#14

I realize this isn’t really establishing much here - we’re comparing different models and different data

maybe @Daniel_Simpson has some insight on when hard sum-to-zero constraint is just fine and when a soft constraint would work better and if so, what scale?


#15

Thanks for running all this. @betanalpha may also have some insight—he was also recommending the soft centering.


#16

It’s all going to come down to the geometry. Hard constraints can induce awkward geometries if the data inform the unconstrained parameters in a nonlinear way. At the same time a soft constraint might allow too much slack such that the posterior stretches into unwelcome neighborhoods too much. This is particularly troublesome in high dimensional spaces where a soft constraint has to fight against concentration of measure.


#17

could you elaborate on this? examples? rules of thumb? anything that would help users like me who don’t have a very good sense of geometry.


#18

It’s hard to give specifics because these problems can manifest is so many different ways. Incidentally this is why the diagnostics are so tricky – it’s easy to spot problems, but because the problems could have been caused by a myriad of pathologies there’s no way to isolate the cause.

Whenever I’m worried about the geometry of a model in which I’m not familiar I take a thorough look at the unconstrained space. Running with diagnostic_file=<file_name> will generate <file_name> in the local directory containing comma separated values of the unconstrained parameters, their momenta, and their gradients for each sample. Pairs plots of the unconstrained parameters gives you a sense of what geometry the sampler is struggling with, and correlations with the gradients can help you identify regions that are hard to explore and may be causing divergences.


#19

@stemangiola @mitzimorris

There might be another way to do this from Fraser CJM 1951.
Something like:

transformed data{
  matrix [N,N] A = rep_matrix(-inv(x-1),N,N);
  for(i in 1:N)
    A[i,i] = 1;
  matrix [N,N] A_sqrt = chol(A);
//or
//matrix [N,N] A_sqrt = eigenvectors_sym(A) * diagonal(sqrt(eigenvalues(A)))* inverse(eigenvectors_sym(A));

}
parameters {
  vector [N-1] x_raw; 
}
transformed parameters{
   x =  A_sqrt * append_row(x,0); //structure of the chol matrix, means that this could actually be done in O(n) time.
}
model {
  x_raw ~ normal(0,1/sqrt(1-inv(N)));
}

Some more references:
https://cms.math.ca/openaccess/cjm/v3/cjm1951v03.0363-0366.pdf
http://r.789695.n4.nabble.com/A-vector-of-normal-distributed-values-with-a-sum-to-zero-constraint-td4687954.html


#20

I didn’t understood the Fraser approach in short time.
What speak against using a QR approach?

N <- 10
M <- diag(N)
M <- M - M[,N]
K<-matrix(rnorm((N-1)*100000,0,1/sqrt(1-1/N)),100000,N-1) %*% t(qr.Q(qr(M))[,1:(N-1)])
var(K)
sum(K[1,])