How to consider a process in the model without it being a parameter or data? (Spatio-temporal process))

Hello!
I am trying to simulate a 3-state Markov chain and the transition probabilities come from a Gaussian process (we call it X) with spatio-temporal covariance matrix and some other variables.
So far, the program is able to estimate the parameters of the the “fake” or the simulated data. But now that I am trying to test it on the real data, it does not converge. I believe the reason is that before, we were able to generate X process and feed it into the program as data, but now we moved it to the parameter block since we do not have it.

Now, my question is: is there another way of considering X in stan but not as a parameter nor data? I tried defining it as a function or in the transformed parameters but was not successful.

AND, if the answer is no :( , is there any suggestion to make the code converge? I have read the stan document regarding divergence error as well as some tips online any I believe I need to change something in my model since it was working before.

Below is the code for the real data which diverges.

Thanks!


functions {

    array[] matrix cor_sep(real nu, real c, real a, real alpha,
    data array[] matrix u_ar, data array[] matrix h_ar,data int n_var, data int n_block_row) {

        array[n_block_row] matrix[n_var, n_var] corr;

        for (n in 1:n_block_row) {

            corr[n] = (a * u_ar[n] ^ (2 * alpha) + 1) ^ (-1) .* ((1 - nu) * exp(-c * h_ar[n]));

            for (j in 1:n_var) {
                for (i in 1:n_var) {
                    if (i == j)
                        corr[n][i, j] = (a * (u_ar[n][i, j]) ^ (2 * alpha) + 1) ^ (-1);

                }
            }
        }
        return corr;
    }

    matrix cor_joint(real nu, real c, real a, real alpha, array[] matrix corr_ar,
    data int n_var, data int n_block_row) {

        int ind_i_start;
        int ind_i_end;
        int ind_j_start;
        int ind_j_end;
        matrix[n_var * n_block_row, n_var * n_block_row] corr;

        for (i in 1:n_block_row) {

            ind_i_start = ((i - 1) * n_var + 1);
            ind_i_end = (n_var * i);

            for (j in 1:n_block_row) {

                ind_j_start = ((j - 1) * n_var + 1);
                ind_j_end = (n_var * j);

                if (j >= i)
                    corr[ind_i_start:ind_i_end, ind_j_start:ind_j_end] = corr_ar[j - i + 1];
                else
                    corr[ind_i_start:ind_i_end, ind_j_start:ind_j_end] = corr_ar[i - j + 1]';
            }
        }
        return corr;
    }
    
    vector repeat_vector(vector input, int reps) {
        int N = rows(input);
        vector[N*reps] repvec; // stack N-vector reps times
  
        for (k in 1:reps) {
            for (i in 1:N) {
                repvec[ i + (k - 1) * N] = input[i]; 
                // assign i-th value of input to i+(k-1)*N -th value of repvec
            }
        }
      return repvec;
    }
}

data {
    int<lower=1> N;
    int<lower=1> N_mu;
    int<lower=1> lag_u;
    int<lower=1> n_ahead;

    //number of wind farms
    int<lower=1> n_var;

    int<lower=1> lag_max;
    int<lower=1> n_block_row;


    matrix[n_var, n_var] dist;
    
    array[n_block_row] matrix[n_var, n_var] h_ar;
    array[n_block_row] matrix[n_var, n_var] u_ar;



    //the final y
    int<lower = 1, upper = 3> y[N,n_var];
}

transformed data {
    vector[3] zeros = rep_vector(0, 3);
}

parameters {
    // model parameters
    real<lower=0, upper=1> nu;
    real<lower=0> c;
    real<lower=0> a;
    real<lower=0, upper=1> alpha;


    //new_parameters
    matrix[3, 2] beta_raw;

    // mean
    vector[n_var] mu;
    
    //cos & sin coeficients
    vector[4] B;
    
    //X
    vector[N_mu] X[N];
    


}

transformed parameters {

    array[n_block_row] matrix[n_var, n_var] c_sep;
    matrix[n_var * n_block_row, n_var * n_block_row] c_joint;
    matrix[3, 3] beta;
    array[N, n_var] matrix[3, 3] theta;

    c_sep = cor_sep(nu, c, a, alpha, u_ar, h_ar, n_var, n_block_row);
    c_joint = cor_joint(nu, c, a, alpha, c_sep, n_var, n_block_row);


    beta = append_col(beta_raw, zeros);
    
    //mu
    vector[N_mu] mu_final;
    mu_final = repeat_vector(mu, n_block_row);

    for(i in 1:N){
        for(j in 1:n_var){
            for(t in 1:3){
                    // periodic trends for ramp downs
                    theta[i,j,t,1] = beta[t,1] * X[i,j] +B[1]*cos((2*pi()*i*60*12)/(60*60*24))+B[2]*sin((2*pi()*i*60*12)/(60*60*24));
                    theta[i,j,t,2] = beta[t,2] * X[i,j];
                    // periodic trends for ramp ups
                    theta[i,j,t,3] = beta[t,3] * X[i,j] +B[3]*cos((2*pi()*i*60*12)/(60*60*24))+B[4]*sin((2*pi()*i*60*12)/(60*60*24));
                    
                
            }
        }
    }
}


model {
    // flat priors
    nu ~ uniform(0, 1);
    c ~ uniform(0.0001, 5);
    a ~ uniform(0.0001, 5);
    alpha ~ uniform(0, 1);
    mu ~ uniform(-10,10);
    
    B~ uniform(-10,10);

    to_vector(beta) ~ uniform(-10, 10);

    for(i in 2:N){
        X[i] ~ multi_normal(mu_final, c_joint);
        for(j in 1:n_var){
            y[i,j] ~ categorical_logit(to_vector(theta[i,j,y[i-1,j],:]));
        }
    }
}
`

Hi, @MaryamMGh. Sorry this wasn’t answered sooner. I’m afraid I may not be much help, either, as this is a complicated question where it’s not immediately clear to me what you’re asking. Let’s see if we can get to the bottom of it.

I’m not sure what you mean here. What is X—you’re calling it a Gaussian process, by which I assume you mean it is a random variable distributed as a GP. We need a stochastic matrix for an HMM, so you put it through some kind of link?

If it’s a value you observe, it should be data. If it’s an unknown you want to infer, it has to be a parameter. Every variable in Stan is either a data variable, a parameter value, or a transform of data and/or parameters.

In general, a Bayesian model looks like a joint density p(\theta, y) for parameters \theta and data y and then we observe y and sample from p(\theta \mid y)?

When you say you simulate data and it recovers values, I assume you mean you generate data according to the model and then fit it, and the posterior intervals roughly contain the true values at the nominal coverage? Is the real data of exactly the same shape? If the simulated data fits and the real data doesn’t, it may just mean the model is not well specified for the data.

When you say it doesn’t converge, what happens?

This is wrong as defined:

    vector[4] B;
...
    B ~ uniform(-10,10);

Stan requires every parameter to have support wherever it satisfies its constraints. What you should do is define <lower=-10, upper=10> for B. This is a problem for almost every other journal.

This is inefficient as written:

for(i in 2:N){
        X[i] ~ multi_normal(mu_final, c_joint);

because it does a solve of c_joint for every iteration. Instead, you want to do make X an array of vectors and do this:

X ~ multi_normal(mu_final, c_joint);

It’d be even better I you could define c_joint as a Cholesky factor, but I’m not sure that’s tractable given how it’s defined. Then the solves are quadratic rather than cubic.

1 Like

As Bob says, you should constrain the parameters according to their priors.

parameters {
    real<lower=0, upper=1> nu;
    real<lower=0.0001, upper=5> c;
    real<lower=0.0001, upper=5> a;
    real<lower=0, upper=1> alpha;
    ...
}
model {
    nu ~ uniform(0, 1);
    c ~ uniform(0.0001, 5);
    a ~ uniform(0.0001, 5);
    alpha ~ uniform(0, 1);
    ...
}

Note that uniform(-10, 10) prior is really wide in the context of a categorical-logit model. For exaxmple, even B = 1 would imply the cos/sin component has a very strong effect on the transition probabilities.
It’s especially important to make sure beta can’t be too large–otherwise the model might try to explain essentially all the variation in the data as being noise from the gaussian process.

    mu ~ normal(0, 1);
    B ~ uniform(0, 1);
    to_vector(beta) ~ uniform(0, 0.5);

Thanks a lot for taking the time to reply! I am still stuck on this and your insights have been very useful.

When you say you simulate data and it recovers values, I assume you mean you generate data according to the model and then fit it, and the posterior intervals roughly contain the true values at the nominal coverage?

Yes this is right. That is what I meant.

Is the real data of exactly the same shape? If the simulated data fits and the real data doesn’t, it may just mean the model is not well specified for the data.

I don’t think they have the exact same shape. But I don’t know how to confirm that. Since at the end of the day, my data is a bunch of -1,0 and 1s.

I applied the changes you mentioned, thanks again. But I don’t see much improvement.

When you say it doesn’t converge, what happens?

It gives me all sort of warnings including the one about convergence. I examined the pairs plot and through the iterations, they explore all the space for priors and never converge. Increasing the number of iterations don’t help either.

Warning messages:
1: There were 2426 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. 
2: There were 9 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: There were 5 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low 
4: Examine the pairs() plot to diagnose sampling problems
 
5: The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat 
6: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess 
7: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess 

I am thinking of 2 things:

  1. Change something in the stan code. Maybe the way I am defining my process is not efficient, Based on my research these are my options: De-center X (read a blog post about this), Use Laplace appx. (saw it in a book and in a blog) , Use first order Markov chain appx. for X (my supervisor’s suggestion. do not know the details yet)

  2. The model is to complicated to be able to find all those parameters based on the data I have which is not very descriptive.

This is my first statistical model and I do not have a strong background in statistics so any help is appreciated!