Avoiding extreme values without putting hard boundaries

Hi all, I am using Stan to build a Bayesian hierarchical model in which individual-level parameters are a function of individual characteristics. The structure of my model is similar to the code below. I am trying to use cholesky_factor_corr to reparameterize it. My question is: how should I set constraints so that the individual-level parameters gamma_i do not take extremely large or small values? I have already put bounds on parameters such as lambda_raw and the scales, but the model still produces values as small as -10^100. Should I change the priors?


data{ //with reparameterization.
    //other irrelevant part...

    matrix[ncust,nchar] Demo;
}

transformed data {
   matrix[nchar+1,ncust] Demo_tran;
   row_vector[ncust] ones = rep_row_vector(1,ncust);
   Demo_tran = append_row(ones,transpose(Demo));
}

parameters{
    vector<lower=-1,upper=1>[3*(nchar+1)] lambda_raw;
    vector<lower=0,upper=1>[3*(nchar+1)] sigma_lambda;
    cholesky_factor_corr[3*(nchar+1)] L_lambda;

    matrix<lower=-1,upper=1>[3,ncust] gamma_raw;

    vector<lower=0,upper=3>[3] tau_gamma;
    cholesky_factor_corr[3] L_gamma;
    //...other irrelevant part
}

transformed parameters {
    vector[3*(nchar+1)] lambda_vec;
    lambda_vec = sigma_lambda.*(L_lambda*lambda_raw);

    matrix[3,nchar+1] Lambda_mat;
    Lambda_mat = to_matrix(lambda_vec,3,nchar+1);

    matrix[3,3] L_V_gamma = diag_pre_multiply(tau_gamma,L_gamma);
    matrix[3,ncust] mu_gamma;
    mu_gamma = Lambda_mat * Demo_tran;

    matrix<lower=-700,upper=50>[3,ncust] Gamma = mu_gamma + L_V_gamma*gamma_raw;
   
//.... other irrelevant part

}

model{
    lambda_raw ~ std_normal();
    sigma_lambda ~ cauchy(0,5);
    L_lambda ~ lkj_corr_cholesky(2);

    tau_gamma ~ cauchy(0,2.5);
    L_gamma ~ lkj_corr_cholesky(2);

    to_vector(gamma_raw) ~ std_normal();

///....
}

Sorry this didn’t get answered sooner—I just saw it as it got promoted to the unanswered category.

I would suggest removing those hard upper and lower interval constraints on parameters like lambda_raw and sigma_lambda—if probability bunches up against the boundary, it will skew parameter estimates vs. softer parameterizations.
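For example, here is a sketch of the same parameters block with the interval constraints dropped and only positivity kept on the scales (assuming the rest of your model stays as written):

parameters {
    vector[3*(nchar+1)] lambda_raw;               // was <lower=-1,upper=1>
    vector<lower=0>[3*(nchar+1)] sigma_lambda;    // keep only positivity
    cholesky_factor_corr[3*(nchar+1)] L_lambda;

    matrix[3,ncust] gamma_raw;                    // standard-normal residuals, no bounds
    vector<lower=0>[3] tau_gamma;                 // was <lower=0,upper=3>
    cholesky_factor_corr[3] L_gamma;
    //...other irrelevant part
}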

Similarly, if you’re getting estimates of parameters so near zero that they’re close to underflowing, it means your model is consistent with the value being zero. You might want to just set such parameters to zero if that’s where most of the mass is. Otherwise, if you really do want values of 1e-100, then it might help to reparameterize so that the unconstrained values are more unit scaled.
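One way to do that is Stan's affine offset/multiplier transform, so the sampler works with a roughly unit-scale unconstrained variable. A minimal sketch with a hypothetical parameter theta expected to be on the order of 1e-3:

parameters {
    real<multiplier=1e-3> theta;   // unconstrained variable is theta / 1e-3, roughly unit scale
}
model {
    theta ~ normal(0, 1e-3);
}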

The following constraint,

matrix<lower=-700,upper=50>[3,ncust] Gamma = mu_gamma + L_V_gamma*gamma_raw;

is probably not doing what you want it to do. This will evaluate whether the right-hand side satisfies the bound and, if not, just reject the draw. This can reduce HMC to a form of super inefficient rejection sampling. Instead, you need to jointly constrain the variables; for example, setting the bounds on mu_gamma to be <lower=-700 - L_V_gamma*gamma_raw, upper=50 - L_V_gamma*gamma_raw> will ensure Gamma satisfies the right constraints without any rejection. But again, I'd probably try to avoid upper and lower bounding this altogether.
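Here is a minimal scalar sketch of parameter-dependent bounds (hypothetical names a and b, not your actual parameters):

parameters {
    real b;
    real<lower=-700 - b, upper=50 - b> a;   // bounds can reference previously declared parameters
}
transformed parameters {
    real gamma = a + b;                     // lies in (-700, 50) by construction, no rejections
}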

This, sigma_lambda ~ cauchy(0,5);, is a very wide prior that does pretty much nothing given your constraint, vector<lower=0,upper=1>[3*(nchar+1)] sigma_lambda;. You might as well just let that be uniform. A more stable alternative would be to use a weakly informative half-normal prior with only a lower=0 constraint on the parameter.
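Something like this, where the scale of 1 is just a placeholder you'd match to the expected size of sigma_lambda:

parameters {
    vector<lower=0>[3*(nchar+1)] sigma_lambda;
}
model {
    sigma_lambda ~ normal(0, 1);   // half-normal because of the lower=0 constraint
}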

To figure out what’s going on with parameters on the scale of 1e-100 after warmup, you want to figure out how they’re getting there. Is it that the lambda_vec values want to be near zero and you get there by pushing lambda_raw to zero? When the fit crashes, it’s an opportunity to investigate and sharpen up modeling assumptions. Samplers work much better when the model’s well specified for the data—when the model and data don’t match well, the sampler has to struggle with challenging geometry to get an answer.

P.S. You can put these on a single line and reorganize to get rid of the need for parens (arithmetic is left associative in Stan, so a * b * c is evaluated as (a * b) * c):

vector[3*(nchar+1)] lambda_vec = L_lambda * lambda_raw .* sigma_lambda;

Hi, this is actually very helpful. I didn't know that we could put a joint constraint on the variables, and I also vaguely remember reading somewhere that a uniform can be an improper prior (I forget the source). I will definitely try this.
I ran into another issue when estimating the model. In the paper I am following, there are many levels of hierarchy, as shown below:

where Q_ijt for t = 0, 1, …, T, together with Q_bar_ijt and v^2_ijt, can be thought of as transformed parameters based on the data and the parameters Q_ij and sigma^2_D_i.

Here is the function that specifies the relationship (for now we can ignore the term that contains sigma^2_F and just focus on sigma^2_D_i for illustration):

I have also tried to reparameterize the Q_ijt, but the draws still seem inefficient. The model was fine when I only added heterogeneity for Q_j; the minute I add heterogeneity in other parameters (ones that could interact with the variance), I run into issues.

Here is the relevant part of the code, where M_cus corresponds to the Q_ijt. This line (in the transformed parameters block) corresponds to equation (15):

 Ms[up] = fmin(fmax(Ms_bar[up] + Ms_std[up]*sqrt_Vbar[up],-1e6),1e6);

and these two lines (in the transformed parameters block) correspond to equations (13) and (14):

    Vs_bar[up] = (K[up]*square(v0_safe)*V)/square(denominator);
    sqrt_Vbar[up] = sqrt(Vs_bar[up]+1e-10);
...
    Ms_bar[up] = exp(log_vs-log_v0)*m0 + exp(log_vs-log(V)+log(K[up]))*M;

I have used v0_safe to avoid NaN values.

data{
///...
    matrix[ncust,ndemo] Demo;
}

parameters {
    vector[ncust] M_cus_std;
    real M_mean;
    real<lower=0> M_sig;

    vector[ndemo] m_par;

    real<lower=0> r;
    vector[ndemo] r_par;
    array[nupdt] real Ms_std;
    real<lower=0> V;
//...    
}


transformed parameters {
    vector[ncust] M_cus = M_mean + M_sig * M_cus_std + Demo * m_par;
    vector[ncust] r_cus = r + Demo * r_par;

    array[nupdt] real Ms;
    array[nupdt] real Ms_bar;

    array[nupdt] real<lower=0> Vs_bar;
    array[nupdt] real<lower=0> Vs;
    array[nupdt] real<lower=0> sqrt_Vbar;
    
    for (up in 1:nupdt){
        int cus = Uid_updt[up];
        real M = M_cus[cus];

        real m0 = (Itd_updt[up]==1)? M_mean:Ms[up-1];
        real v0 = (Itd_updt[up]==1)? ini_var:Vs[up-1];
        
        real v0_safe = fmax(v0,1e-8);  //Ensure v0 is not too close to zero
        real denominator = fmax(v0_safe*K[up]+V,1e-8); //prevent division from very small numbers

        Vs[up] = (v0_safe*V)/denominator;
        Vs_bar[up] = (K[up]*square(v0_safe)*V)/square(denominator);

        sqrt_Vbar[up] = sqrt(Vs_bar[up]+1e-10);
        real log_vs = log(Vs[up]);
        real log_v0 = log(v0_safe);
        Ms_bar[up] = exp(log_vs-log_v0)*m0 + exp(log_vs-log(V)+log(K[up]))*M;

        Ms[up] = fmin(fmax(Ms_bar[up] + Ms_std[up]*sqrt_Vbar[up],-1e6),1e6);
   }
}

model{
    
    M_mean ~ normal(0, 1);
    M_sig ~ normal(0,2.5);//exponential(3);
    M_cus_std ~ std_normal();//normal(0, 1);

    m_par ~ normal(1,1);

    r ~ lognormal(log(1.5),0.5);
    r_par ~ normal(1,1);

    V ~ lognormal(log(4),0.5);
    ini_var ~ lognormal(log(10),1);

    Ms_std ~ normal(0,1);
    intcept ~ normal(2,1);
    restpar ~ normal(0,0.5);
//...
}
I got the following error message, and the R-hat values are really large.

I am getting desperate after trying many different approaches, and it still doesn't work. Any advice is deeply appreciated!!

Uniform is only proper on a bounded interval. Otherwise the integral diverges.
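In symbols: $\int_a^b c \, d\theta = c\,(b-a) < \infty$, so a flat density on a bounded interval can be normalized, whereas $\int_{-\infty}^{\infty} c \, d\theta$ diverges for any $c > 0$, so a flat prior over the whole real line has no normalizing constant.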

This can potentially be the model and not the way you coded it. When there are a lot of things to estimate, you need a lot of data.

Doing hard mins and maxes like you have is a bad idea in Stan because the gradient is zero whenever the clamp returns the constant -1e6 or 1e6, which cuts those parameters off from the target. Models usually fit better with softer constraints, and softer constraints usually make more sense unless there's a hard physical constraint. Whenever you have to start fiddling with epsilons, there's usually a better solution.
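For example, a sketch of that update with the clamp removed (assuming the clamp was only there to guard against overflow and the priors on Ms_std keep things in range):

        Ms[up] = Ms_bar[up] + Ms_std[up]*sqrt_Vbar[up];   // smooth everywhere, gradients always defined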

The divergences usually come from varying curvature in hierarchical models when the data is consistent with complete pooling (that drives the hierarchical variance parameter toward zero, which requires very small step sizes, leading to most of the divergences).
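As a toy illustration of that geometry (hypothetical group effects alpha with scale tau, not your model), this is the non-centered form you are already using for M_cus; it keeps the unconstrained geometry closer to unit scale even when tau is near zero:

data {
    int<lower=1> J;
    vector[J] y;
}
parameters {
    real<lower=0> tau;
    vector[J] alpha_std;                  // unit-scale residuals
}
transformed parameters {
    vector[J] alpha = tau * alpha_std;    // implies alpha ~ normal(0, tau)
}
model {
    tau ~ normal(0, 1);                   // half-normal given the lower=0 constraint
    alpha_std ~ std_normal();
    y ~ normal(alpha, 1);
}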

Then you might be ready to try simulating data from your model and fitting it. If that doesn’t work, you know something’s wrong other than the model being misspecified. It helps to start from a simpler model and build up, but it looks like you’re already doing that.
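One way to simulate within Stan (a sketch with hypothetical names, not tied to your model) is a separate program whose generated quantities block draws parameters from the priors and then data from the likelihood; you then fit the real model to y_sim and check that it recovers mu and sigma:

data {
    int<lower=1> N;
}
generated quantities {
    real mu = normal_rng(0, 1);                   // draw a parameter from its prior
    real<lower=0> sigma = abs(normal_rng(0, 1));  // half-normal draw for the scale
    array[N] real y_sim;
    for (n in 1:N)
        y_sim[n] = normal_rng(mu, sigma);         // draw data from the likelihood
}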

If you look at trace plots of the parameters with high R-hat, are the chains separated? If so, it may be a multimodal posterior.