Single Valued Chains

Hello,

I have been trying to use Stan (via the Julia interface) for my thesis work, but I keep running into the problem that the chains are all single valued. I think it has something to do with the large range of values in my inputs, but I can’t be sure since this is the first real problem I have tried to solve using Stan. I have listed my model and run summary below.

Any help would be appreciated.

Stan Summary:

Inference for Stan model: model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (565, 392, 432, 356) seconds, 29 minutes total
Sampling took (70, 184, 162, 195) seconds, 10 minutes total

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -3.0e+15  3.1e+13  1.9e+15  -6.3e+15  -1.8e+15  -1.7e+15   4000  6.5e+00  7.7e+05
accept_stat__    5.0e-01  1.4e-04  8.8e-03   5.0e-01   5.0e-01   5.0e-01   4000  6.5e+00  1.0e+00
stepsize__       1.8e-08  5.8e-09  8.2e-09   8.2e-09   2.0e-08   3.1e-08    2.0  3.3e-03  3.1e+14
treedepth__      7.5e-04  5.6e-04  3.5e-02   0.0e+00   0.0e+00   0.0e+00   4000  6.5e+00  1.0e+00
n_leapfrog__     1.0e+00  7.1e-04  4.5e-02   1.0e+00   1.0e+00   1.0e+00   4000  6.5e+00  1.0e+00
divergent__      1.0e+00  2.5e-04  1.6e-02   1.0e+00   1.0e+00   1.0e+00   4000  6.5e+00  1.0e+00
energy__         3.0e+15  3.1e+13  1.9e+15   1.7e+15   2.0e+15   6.3e+15   4000  6.5e+00  1.5e+05
F[1]            -5.5e-02  5.0e-01  7.1e-01  -1.2e+00   3.1e-01   6.7e-01    2.0  3.3e-03  4.8e+08
F[2]             8.3e-02  7.3e-01  1.0e+00  -1.4e+00   3.0e-01   1.4e+00    2.0  3.3e-03  4.7e+14
F[3]            -1.6e-02  4.2e-01  6.0e-01  -7.7e-01   4.8e-01   6.5e-01    2.0  3.3e-03  6.1e+14
F[4]            -2.0e-01  7.4e-01  1.0e+00  -1.3e+00   2.7e-01   1.3e+00    2.0  3.3e-03  1.2e+15
F[5]            -7.2e-02  3.6e-01  5.1e-01  -9.2e-01   1.5e-01   4.2e-01    2.0  3.3e-03  2.2e+06
F[6]             2.5e-01  1.5e-01  2.2e-01  -1.7e-02   3.0e-01   5.7e-01    2.0  3.3e-03  3.3e+05
F[7]            -2.5e-01  6.2e-01  8.8e-01  -1.6e+00   4.0e-01   6.8e-01    2.0  3.3e-03  1.5e+06
F[8]            -8.6e-02  3.8e-01  5.4e-01  -5.6e-01   9.2e-03   7.6e-01    2.0  3.3e-03  4.7e+06
F[9]            -8.2e-02  4.0e-01  5.6e-01  -7.4e-01   1.7e-01   7.1e-01    2.0  3.3e-03  6.4e+14

Stan Model

data {
    int<lower=1> N;   // Number of Data Points = 8883
    int<lower=1> K;   // Number of Parameters = 500 but could be up to 4000
    real rho;      // range [0.01,0.3]
    real sig_F;    // 1e16
    vector[N] d;   // extrema(d) = (-2.2e13, 9.7e14) 
    vector[N] err; // extrema(err) = (6.2e12, 2.8e13)
    vector[K] x;   // range [0, 1]
    vector[K] y;   // range [0, 1]
    vector[K] z;   // range [0, 1]
    matrix[N,K] W; // extrema(W[W .> 0]) = (4.17e-23,9e-2), mean(W[W .> 0]) = 2.8e-4
}

transformed data {
    vector[K] mu;
    matrix[N,K] A;
    real matrix_min;
    cov_matrix[K] Sigma_F;
    
    // Rescale the transfer matrix by its smallest non-zero entry to avoid an initialization error;
    // start the search for that entry at max(W)
    matrix_min = max(W);
    for (i in 1:K){
        for (j in 1:N){
            if ( (W[j,i] != 0.0) && (W[j,i] < matrix_min)){
                matrix_min = W[j,i];
            }
        }
    }   
    for (i in 1:K){
        A[:,i] = (1/matrix_min)*W[:,i];
    }
    
    mu = rep_vector(0.0,K);
    
    // Create Covariance Matrix
    for (i in 1:K){
        for (j in i:K){
            Sigma_F[j,i] = ((sig_F*matrix_min)^2)*exp(-0.5*((x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)/rho^2);
            Sigma_F[i,j] = Sigma_F[j,i];
        }
    }
}

parameters {
    //real sig_F;
    //real rho; would also like to infer these
    vector[K] F;
}


model {

    F ~ multi_normal(mu, Sigma_F);
    
    d ~ normal(A*F,err);
}

generated quantities {
    vector[K] X;
    
    X = F/matrix_min;
}

In terms of what’s happening with the sampler, your stepsize is crashing during warmup, so as you work on fixing this you can tell whether a change worked within the first <100 iterations. I think the only thing that could be going wrong here is the construction of the covariance matrix. Have you tried simulating data using that covariance matrix elsewhere? Does it match your data at all?
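
For example, here is a minimal sketch of a prior simulation, run with algorithm=fixed_param, that reuses the data names and covariance construction from your model above (the matrix_min factor is dropped here, so set sig_F to whatever scale you intend; F_sim is just an illustrative name):

data {
    int<lower=1> K;
    real rho;
    real sig_F;
    vector[K] x;
    vector[K] y;
    vector[K] z;
}

transformed data {
    vector[K] mu;
    cov_matrix[K] Sigma_F;

    mu = rep_vector(0.0,K);
    for (i in 1:K){
        for (j in i:K){
            Sigma_F[j,i] = (sig_F^2)*exp(-0.5*((x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)/rho^2);
            Sigma_F[i,j] = Sigma_F[j,i];
        }
    }
}

model {
}

generated quantities {
    // one draw of F from the GP prior per saved iteration
    vector[K] F_sim;

    F_sim = multi_normal_rng(mu, Sigma_F);
}

Pushing those draws through your transfer matrix and comparing against d should tell you whether the prior scale is anywhere near the data.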

K

Just glancing through this model makes me think that you need rescaling here. The numbers you quote are huge. Try to scale them to be on the order of unity if possible.

Sebastian


@sakrejda This form of the model has an analytic posterior and I have used that covariance matrix without issues. I ran the diagnostic (output below) and it said the derivatives computed using finite differences were zero. Is there some setting I should use to avoid this problem?

 ./model diagnose data file=model_1.data.R
method = diagnose
  diagnose
    test = gradient (Default)
      gradient
        epsilon = 9.9999999999999995e-07 (Default)
        error = 9.9999999999999995e-07 (Default)
id = 0 (Default)
data
  file = model_1.data.R
init = 2 (Default)
random
  seed = 3785454428
output
  file = output.csv (Default)
  diagnostic_file =  (Default)
  refresh = 100 (Default)


TEST GRADIENT MODE

 Log probability=-1.29458e+32

 param idx           value           model     finite diff           error
         0       -0.925556     6.47106e+15               0     6.47106e+15
         1         1.43776     1.42541e+16               0     1.42541e+16
         2        -0.92038     1.07926e+16               0     1.07926e+16
         3        0.937891     4.03935e+15               0     4.03935e+15
...

@wds15 I have edited my model (shown below) so that the inputs are scaled to be more reasonable, but the problem persists.

data {
    int<lower=1> N;   // Number of Data Points = 8883
    int<lower=1> K;   // Number of Parameters = 500 but could be up to 4000
    real rho;      // range [0.01,0.3]
    real sig_F;    // 6e15
    vector[N] d;   // extrema(d) = (-2.2e13, 9.7e14) 
    vector[N] err; // extrema(err) = (6.2e12, 2.8e13)
    vector[K] x;   // range [0, 1]
    vector[K] y;   // range [0, 1]
    vector[K] z;   // range [0, 1]
    matrix[N,K] W; // extrema(W[W .> 0]) = (4.17e-23,9e-2), mean(W[W .> 0]) = 2.8e-4
}

transformed data {
    real sig_X;
    vector[K] mu;
    matrix[N,K] A;
    vector[N] b;
    real matrix_scale;
    cov_matrix[K] Sigma_X;
    
    // Scale Transfer matrix
    matrix_scale = (1e18)/K;     // sum(F) ~ 1e18, so this should make X ~ 1
    for (i in 1:K){
        A[:,i] = (matrix_scale*W[:,i]) ./ err; // extrema(matrix_scale(W./err)[W .> 0]) = (6.7e-20, 81) 
    }
    sig_X = sig_F/matrix_scale;  // around ~ 3
    
    b = d ./ err;                // mean(b) ~ 7, std(b) ~ 8
    mu = rep_vector(0.0,K);
    
    // Create Covariance Matrix
    for (i in 1:K){
        for (j in i:K){
            Sigma_X[j,i] = (sig_X^2)*exp(-0.5*((x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)/rho^2);
            Sigma_X[i,j] = Sigma_X[j,i];
        }
    }
}

parameters {
    vector[K] X;
}


model {

    X ~ multi_normal(mu, Sigma_X);
    
    d ~ normal(A*X,1);
}

generated quantities {
    vector[K] F;
    
    F = matrix_scale*X;
}

There’s no setting that will save you. The stepsize keeps dropping because either the sampler can’t hit the target acceptance rate or you keep getting divergences during warmup. Both indicate problems with actually computing the posterior density you are sampling from, or its gradient. If the scale of X is too large it could cause this problem (I did see that you normalized, but “should make” isn’t very convincing), so quantiles of X from the posterior would be good to look at. The gradient test also seems like a problem. Here’s an example from a model that works:

TEST GRADIENT MODE

Log probability=-7697.67

 param idx           value           model     finite diff           error
         0        -1.68578         6.88236         6.88236    -6.55174e-07
         1        -1.52712         4.26397         4.26398    -4.34777e-06
         2       0.0599942         -717.12         -717.12     7.99157e-06
         3        -1.32833         989.971         989.971    -2.56843e-06

So… your gradient calculations are clearly off the rails, and I think the problem is that your absolute gradient is so large that you lose precision (and get zero as the answer) when you try to do finite differences. Why your gradient is so large I’m not sure, but it makes it seem like at least your default inits are way off from “typical” values, and that’s, again, an issue with scaling the model. It certainly could kill adaptation.
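
To make that concrete with rough numbers read off the diagnose output above: lp__ is around -1.3e+32 and a typical gradient component is around 6e+15, so over the finite-difference step epsilon = 1e-6 the true change in the log density is only about 6e+15 * 1e-6 = 6e+9. Double precision carries roughly 16 significant digits, so the rounding granularity near 1.3e+32 is about 1.3e+32 * 2.2e-16 ≈ 3e+16, which completely swamps that change; the difference of the two log-density evaluations comes out as exactly zero, matching the finite diff column.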

Thanks for your help. I found a bug in my version of the Stan code that caused the bad gradient: I had the unscaled data d in the likelihood for the scaled parameters, instead of the scaled data b.

The gradients are fine now and the model is working, although it seems to be running slower than I expected. Pre-computing the Cholesky factorization helped, but it is still slow. It’s currently in the warm-up phase, averaging around 2 iterations per minute (single chain). Is this normal performance for ~500 dimensions? How many dimensions can Stan handle?
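
Roughly, the Cholesky change looks like this against the scaled model above (a sketch with only the relevant pieces shown; b is the scaled data, per the bug fix mentioned above):

transformed data {
    // ... declarations from the scaled model, plus:
    cholesky_factor_cov[K] L_X;

    // ... definitions of A, b, mu, and Sigma_X as before, then:
    L_X = cholesky_decompose(Sigma_X);   // factor once, at data load time
}

model {
    // multi_normal_cholesky avoids re-factorizing Sigma_X at every leapfrog step
    X ~ multi_normal_cholesky(mu, L_X);
    b ~ normal(A*X, 1);
}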

To get a sense of how many dimensions Stan can handle, try a model where the posterior is multivariate normal. Hundreds of thousands of parameters are fine and can be fast. Posterior geometry is the main determinant of speed, then the efficiency of the autodiff. Once you have some samples in hand, look at the scaling using the mass matrix to verify you have that problem taken care of. Check stepsize, tree depth, divergences, and acceptance rate. Something there is still making your model slow.
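
As a concrete version of that multivariate-normal test, here’s a sketch of a model whose posterior is exactly a K-dimensional standard normal; timing it at K = 500 and beyond gives a baseline:

data {
    int<lower=1> K;
}

parameters {
    vector[K] theta;
}

model {
    // posterior is an iid standard normal in K dimensions, so runtime
    // reflects overhead rather than difficult geometry
    theta ~ normal(0, 1);
}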

The run finally finished in about 15 hours. I looked at the diagonals of the mass matrix and all the values are of order ~1. Below is the Stan summary.

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -4.9e+03  8.1e-01  1.6e+01  -4.9e+03  -4.9e+03  -4.9e+03    374  1.4e-02  1.0e+00
accept_stat__    8.7e-01  3.7e-03  1.2e-01   6.3e-01   9.1e-01   9.9e-01   1000  3.6e-02  1.0e+00
stepsize__       5.7e-03  1.3e-17  9.5e-18   5.7e-03   5.7e-03   5.7e-03   0.50  1.8e-05  1.0e+00
treedepth__      1.0e+01  1.2e-15  3.7e-14   1.0e+01   1.0e+01   1.0e+01   1000  3.6e-02  1.0e+00
n_leapfrog__     1.0e+03  7.9e-14  2.5e-12   1.0e+03   1.0e+03   1.0e+03   1000  3.6e-02  1.0e+00
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   1000  3.6e-02     -nan

So it looks like it’s hitting the tree depth limit.

The manual suggested using the QR Reparameterization to make the columns of the transfer matrix orthogonal.

data {
    int<lower=1> N;
    int<lower=1> K;
    real rho;         
    real sig_F;
    vector[N] d;
    vector[N] err;
    vector[K] x;
    vector[K] y; 
    vector[K] z;
    matrix[N,K] W;
}

transformed data {
    real matrix_scale;
    matrix[N,K] A;
    vector[N] b;
    vector[K] mu;
    real sig_X;
    real mu_X;
    cov_matrix[K] Sigma_X;
    
    matrix[N,K] Q_ast;
    matrix[K,K] R_ast;
    matrix[K,K] R_ast_inv;
    cov_matrix[K] Sigma_RX;
    cholesky_factor_cov[K] L_RX;
    

    //Scale matrix such that X ~ 1
    matrix_scale = (1e18)/K;
    for (i in 1:K){
        A[:,i] = (matrix_scale*W[:,i]) ./ err;
    }
    
    sig_X = sig_F/(matrix_scale);
    
    Q_ast = qr_Q(A)[, 1:K] * sqrt(N - 1);
    R_ast = qr_R(A)[1:K, ] / sqrt(N - 1);
    R_ast_inv = inverse(R_ast);

    
    b = d ./ err;
    mu = rep_vector(0.0,K);
    
    for (i in 1:(K-1)){
        for (j in (i+1):K){
            Sigma_X[j,i] = (sig_X^2)*exp(-0.5*((x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)/rho^2);
            Sigma_X[i,j] = Sigma_X[j,i] ;
        }
    }
    for (i in 1:K){
        Sigma_X[i,i] = (sig_X^2) + 0.01; //jitter
    }
    
    //Transform Sigma_X into new coordinates
    Sigma_RX = R_ast*Sigma_X*R_ast';
    
    L_RX = cholesky_decompose(Sigma_RX);
}

parameters {
    vector[K] RX;
}

model {

    RX ~ multi_normal_cholesky(mu, L_RX);
    b ~ normal(Q_ast*RX,1);
}

generated quantities {
    vector[K] F;
    
    F = matrix_scale*R_ast_inv*RX;
}

Originally this didn’t work because it kept saying that Sigma_RX was not positive definite. This was because my design/transfer matrix W is very ill-conditioned, cond(W) ~ 1e11. I got around this issue by adding a small amount of normally distributed noise to W to reduce the condition number. Doing this got me to 3 iterations per minute. I noticed that increasing the amount of noise I added to W (i.e. lowering the condition number) made it run faster. So it looks like the calculation speed depends on the condition number of W.
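
As a side note, here is a sketch of how the conditioning could be monitored from inside the transformed data block (diagnostic print only; A is the scaled design matrix as above, and sv is just an illustrative name):

transformed data {
    vector[K] sv;   // singular values of the scaled design matrix A
    // ... A built as above ...
    sv = singular_values(A);
    print("cond(A) = ", max(sv)/min(sv));
}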

Is there a simple-ish way to improve the conditioning of W? I know about preconditioners, but I never know how to choose them.

  1. I don’t know about improving conditioning; maybe Ben will have a suggestion.
  2. You could try raising the treedepth; sometimes that will at least get it to run so you can look at the posterior and see the problems (see the command sketch after this list).
  3. A stepsize of 1e-3 is not great (it’s often on the order of 1e-1). There might still be more to do with the parameterization to speed this up, but I don’t have an immediate suggestion.
  4. I think what’s likely is that you have a lot of correlations near zero and some near +1/-1, and I guess that might make things difficult.
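
For point 2, the CmdStan invocation would look roughly like this (reusing the data file name from the diagnose call earlier in the thread; max_depth=12 is just an example value):

 ./model sample algorithm=hmc engine=nuts max_depth=12 data file=model_1.data.R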

Yeah, if your design matrix is that poorly conditioned then you have bigger problems. And adding noise to the diagonal isn’t going to help.

In a very real sense this isn’t even a statistical problem but a computational one – floating point arithmetic only has so much accuracy and if you’re trying to build a model using data with such extreme scales then you’re never going to get around this limitation.

Standardize your covariates and perhaps even your variates. Or, even better, figure out the appropriate units for each covariate and use those so that everything is O(1).
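
For example, a sketch of column-wise standardization in transformed data (W_std is an illustrative name; whether this is sensible for a matrix with many exact zeros, and how to handle constant columns, is left open):

transformed data {
    matrix[N,K] W_std;

    for (k in 1:K){
        real col_mean;
        real col_sd;

        col_mean = mean(W[:,k]);
        col_sd = sd(W[:,k]);
        // assumes no constant columns; guard against col_sd == 0 if needed
        W_std[:,k] = (W[:,k] - col_mean)/col_sd;
    }
}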