Why does my hierarchical GP model fail?

Hi,
I’m trying to fit the inhomogeneous intensity function of a Poisson process using a GP.
I created four different intensity functions and assigned one of them to each of five countries. Observations for a country are then generated from its assigned intensity function.
The following plots are intensity functions for different countries.
[plot: the intensity functions assigned to each country]
When I fit GPs separately for each country (no hierarchy), the resulting posterior predictive check looks good.
[plot: posterior predictive check for the separate, non-hierarchical fits]

But after reading
Multilevel Gaussian Processes and Hidden Markov Models with Stan | Brendan Hasz,
I tried adding a hierarchical structure, where the GP parameters and the constant term, a, are drawn from common population distributions. But the model fails to capture the non-linearity, e.g.
[plot: hierarchical model fit failing to capture the non-linearity]
And the posterior values for the GP parameters \alpha and \rho are extremely small (values like 1e-100 and 0.00005, respectively). In the non-hierarchical model, posterior mean \rho values range from 7 to 30, and posterior mean \alpha values range from 0.5 to 1.2.

I initially thought this was because of the priors I used, so I changed the inv_gamma(2, 0.5) prior on \rho_m to normal(30, 5), but the parameters still converge to values near zero.

Is anything wrong with my code?

functions {
  // Latent GP values via the non-centered parameterization:
  // f = L_K * eta, where L_K is the Cholesky factor of the
  // squared-exponential covariance (with jitter for numerical stability).
  vector gp_f(int N,
              real[] x,
              real alpha,
              real rho,
              vector eta) {
    vector[N] f;
    {
      matrix[N, N] K;
      matrix[N, N] L_K;
      K = cov_exp_quad(x, alpha, rho)
          + diag_matrix(rep_vector(1e-6, N));
      L_K = cholesky_decompose(K);
      f = L_K * eta;
    }
    return f;
  }
}

data {
  int<lower=0> J; // Number of countries 
  int<lower=1> N; // Total number of obs
  int<lower=1,upper=J> country[N];
  real x[N]; // Input
  int<lower=0> y[N]; // Poisson observations
  
  int<lower=1> N1; // Number of country 1 observations
  int<lower=1> N2; 
  int<lower=1> N3; 
  int<lower=1> N4; 
  int<lower=1> N5;
  
  int<lower=0> N_tilde;
  real x_tilde1[N_tilde]; // prediction grid, np.linspace(0, max(x1)+10, N_tilde); concatenated with x1 and used as GP input
  real x_tilde2[N_tilde];
  real x_tilde3[N_tilde];
  real x_tilde4[N_tilde];
  real x_tilde5[N_tilde];

}

transformed data {
  int<lower=1> N1_m = N1 + N_tilde;
  int<lower=1> N2_m = N2 + N_tilde;
  int<lower=1> N3_m = N3 + N_tilde;
  int<lower=1> N4_m = N4 + N_tilde;
  int<lower=1> N5_m = N5 + N_tilde;

  
  real x1[N1] = x[1:N1]; // country 1 data
  int y1[N1] = y[1:N1];
  
  real x2[N2] = x[N1+1:N1+N2];
  int y2[N2] = y[N1+1:N1+N2];  
  
  real x3[N3] = x[N1+N2+1:N1+N2+N3];
  int y3[N3] = y[N1+N2+1:N1+N2+N3];
  
  real x4[N4] = x[N1+N2+N3+1:N1+N2+N3+N4];
  int y4[N4] = y[N1+N2+N3+1:N1+N2+N3+N4];
  
  real x5[N5] = x[N1+N2+N3+N4+1:N1+N2+N3+N4+N5];
  int y5[N5] = y[N1+N2+N3+N4+1:N1+N2+N3+N4+N5];

  real x1_m[N1_m];
  real x2_m[N2_m];
  real x3_m[N3_m];
  real x4_m[N4_m];
  real x5_m[N5_m];

  
  for (n1 in 1:N_tilde) {
    x1_m[n1] = x_tilde1[n1];
    x2_m[n1] = x_tilde2[n1];
    x3_m[n1] = x_tilde3[n1];
    x4_m[n1] = x_tilde4[n1];
    x5_m[n1] = x_tilde5[n1];
  }
  
  for (n2 in 1:N1) x1_m[N_tilde + n2] = x1[n2];
  for (n2 in 1:N2) x2_m[N_tilde + n2] = x2[n2];
  for (n2 in 1:N3) x3_m[N_tilde + n2] = x3[n2];
  for (n2 in 1:N4) x4_m[N_tilde + n2] = x4[n2];
  for (n2 in 1:N5) x5_m[N_tilde + n2] = x5[n2];

}


parameters {

  // Per-subject parameters (non-centered parameterization)
  real rho_tilde[J];   // non-centered (standardized) deviations for the length scale
  real alpha_tilde[J]; // non-centered deviations for the signal std dev
  real a_tilde[J];     // non-centered deviations for the GP constant

  vector[N1_m] eta1;
  vector[N2_m] eta2;
  vector[N3_m] eta3;
  vector[N4_m] eta4;
  vector[N5_m] eta5;

  real a_m;
  real<lower=0> a_s;
  
  real<lower=0> alpha_m;
  real<lower=0> rho_m;
  
  real<lower=0> alpha_s;
  real<lower=0> rho_s;
}


transformed parameters {
  
  // Per-subject parameters
  real<lower=0> rho[J];   //length scale
  real<lower=0> alpha[J]; //signal standard deviation
  real a[J]; // GP constant
   
  // Non-centered parameterization of per-subject parameters
  for (s in 1:J) {
    rho[s] = exp(log(rho_m) + rho_s * rho_tilde[s]);
    alpha[s] = exp(log(alpha_m) + alpha_s * alpha_tilde[s]);
    a[s] = a_m + a_s * a_tilde[s];
  }
  vector[N1_m] log_f1 = gp_f(N1_m, x1_m, alpha[1], rho[1], eta1);
  vector[N2_m] log_f2 = gp_f(N2_m, x2_m, alpha[2], rho[2], eta2);
  vector[N3_m] log_f3 = gp_f(N3_m, x3_m, alpha[3], rho[3], eta3);
  vector[N4_m] log_f4 = gp_f(N4_m, x4_m, alpha[4], rho[4], eta4);
  vector[N5_m] log_f5 = gp_f(N5_m, x5_m, alpha[5], rho[5], eta5);

}


model {    
    
  // Priors (on population-level params)
  target += inv_gamma_lpdf(rho_m | 2, 0.5);  // tried normal_lpdf(rho_m | 30, 5) but also fails.
  target += normal_lpdf(alpha_m | 0, 4) + log(2);
  target += normal_lpdf(a_m | 0, 2) + log(2);
  target += normal_lpdf(rho_s   | 0, 4) + log(2);
  target += normal_lpdf(alpha_s | 0, 4) + log(2);
  target += normal_lpdf(a_s | 0, 2) + log(2);

  // Subject-level parameters drawn from pop-level distributions
  // (non-centered parameterizations)
  target += normal_lpdf(rho_tilde   | 0, 1); // implies log(rho) ~ normal(log(rho_m), rho_s)
  target += normal_lpdf(alpha_tilde | 0, 1); // implies log(alpha) ~ normal(log(alpha_m), alpha_s)
  target += normal_lpdf(a_tilde | 0, 1);     // implies a ~ normal(a_m, a_s)
  

  // Jacobian adjustments for GLM parts of model
  target += -sum(log(rho));
  target += -sum(log(alpha));

  target += normal_lpdf(eta1 | 0, 1);
  target += normal_lpdf(eta2 | 0, 1);
  target += normal_lpdf(eta3 | 0, 1);
  target += normal_lpdf(eta4 | 0, 1);
  target += normal_lpdf(eta5 | 0, 1);
   
  // Accumulate evidence over trials
  target += poisson_lpmf(y1 | exp(a[1] + log_f1[N_tilde+1:]));
  target += poisson_lpmf(y2 | exp(a[2] + log_f2[N_tilde+1:]));
  target += poisson_lpmf(y3 | exp(a[3] + log_f3[N_tilde+1:]));
  target += poisson_lpmf(y4 | exp(a[4] + log_f4[N_tilde+1:]));
  target += poisson_lpmf(y5 | exp(a[5] + log_f5[N_tilde+1:]));
}

Any help would be appreciated!

You have a bunch of indexes expressed as a range via :, with computations on either side of the :, and you don’t use parentheses to make the order of operations clear. I always parenthesize those, and I’m not sure whether Stan will interpret the range as you intend.
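For example, making the bounds explicit with parentheses (same variables as in your transformed data block):

real x2[N2] = x[(N1+1):(N1+N2)];
int y2[N2] = y[(N1+1):(N1+N2)];

If Stan’s precedence already matches your intent, this changes nothing; if not, it fixes a silent indexing bug.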


A few more things:

  • The Jacobians you’re adding are also not necessary; Jacobians are only needed when you assert a prior on a quantity that is a non-linear one-to-one transform of a parameter. Here your priors are on rho_tilde and alpha_tilde directly, so no adjustment applies.
  • You’re adding +log(2) in the prior structure unnecessarily (I think).
  • You’re using the _lpdf functions, which are slightly more computationally expensive than their _lupdf counterparts; do you really need the normalized densities? (If so, maybe that’s also why you’re adding the +log(2)?) See the sketch just below.
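For instance, a sketch of the first few priors with unnormalized densities and without the +log(2) terms:

target += inv_gamma_lupdf(rho_m | 2, 0.5);
target += normal_lupdf(alpha_m | 0, 4);
target += normal_lupdf(a_m | 0, 2);

The sampling-statement form (e.g. rho_m ~ inv_gamma(2, 0.5);) likewise drops the normalizing constants.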

You also have a bunch of bookkeeping code, presumably because x is not measured at the same input values for each country? If so, here’s a version that’s more code-efficient:

functions {
  vector gp_f(int N,
              real[] x,
              real alpha,
              real rho,
              vector eta) {
    vector[N] f;
    {
      matrix[N, N] K;
      matrix[N, N] L_K;
      K = cov_exp_quad(x, alpha, rho)
          + diag_matrix(rep_vector(1e-6, N));
      L_K = cholesky_decompose(K);
      f = L_K * eta;
    }
    return f;
  }
}

data {
  int<lower=1> J; // Number of countries 
  int<lower=1> maxN; // max number of observations per country
  array[J] int<lower=1> N; // number of observations per country
  array[J,maxN] int<lower=0> y; // Poisson observations (dummy-filled ragged array)
  array[J,maxN] real x; // Input  (dummy-filled ragged array)
}
transformed data{
  // start/end indices of each country's segment in the concatenated eta vector
  array[J] int eta_start;
  array[J] int eta_end;
  eta_start[1] = 1;
  eta_end[1] = N[1];
  for(j in 2:J){
    eta_start[j] = eta_end[j-1] + 1;
    eta_end[j] = eta_start[j] + N[j];
  }
}
parameters {

  // Per-subject parameters (non-centered parameterization)
  matrix[3,J] country_gp_par_scaled_deviations; // non-centered deviations; rows: rho, alpha, a

  vector[eta_end[J]] eta; // std_normal() noise for GP

  real a_m;
  real<lower=0> a_s;
  
  real<lower=0> alpha_m;
  real<lower=0> rho_m;
  
  real<lower=0> alpha_s;
  real<lower=0> rho_s;
}

model {    
    
  // Priors (on population-level params)
  target += inv_gamma_lupdf(rho_m | 2, 0.5); 
  target += normal_lupdf(alpha_m | 0, 4);
  target += normal_lupdf(a_m | 0, 2);
  target += normal_lupdf(rho_s   | 0, 4);
  target += normal_lupdf(alpha_s | 0, 4);
  target += normal_lupdf(a_s | 0, 2);

  // Subject-level parameters drawn from pop-level distributions
  // (non-centered parameterizations)
  target += std_normal_lupdf(to_vector(country_gp_par_scaled_deviations));
  
  // GP noise as std_normal
  target += std_normal_lupdf(eta) ;

  // Non-centered parameterization of per-subject parameters
  vector[J] country_rho = exp(log(rho_m) + rho_s * country_gp_par_scaled_deviations[1]');       // length scale
  vector[J] country_alpha = exp(log(alpha_m) + alpha_s * country_gp_par_scaled_deviations[2]'); // signal standard deviation
  vector[J] country_a = a_m + a_s * country_gp_par_scaled_deviations[3]';                       // GP constant

  // observations as poisson   
  for (j in 1:J) {
    target += poisson_lpmf(
      y[j, 1:N[j]]
      |
      exp(
        country_a[j]
        + gp_f(
          N[j]
          , x[j, 1:N[j]]
          , country_alpha[j]
          , country_rho[j]
          , eta[eta_start[j]:eta_end[j]]
        )
      )
    );
  }
}
generated quantities{
  // so we can viz the country functions:
  array[J] vector[maxN] a_plus_f;
  for (j in 1:J) {
    a_plus_f[j, 1:N[j]] = country_a[j] + gp_f(
      N[j]
      , x[j, 1:N[j]]
      , country_alpha[j]
      , country_rho[j]
      , eta[eta_start[j]:eta_end[j]]
    );
  }
}

For the data, you’d put y and x in matrices with each country’s data filling its row as far as it can, finishing the row with an arbitrary value (e.g. 0); N then records the number of columns containing real data for each country.
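As a concrete sketch (made-up numbers) with J = 3 and maxN = 4:

N = {4, 2, 3}
x: [ 0.1, 0.5, 0.9, 1.3 ]
   [ 0.2, 0.7, 0.0, 0.0 ]  // last two entries are dummy fill
   [ 0.3, 0.6, 1.1, 0.0 ]  // last entry is dummy fill

Only x[j, 1:N[j]] and y[j, 1:N[j]] are ever read, so the dummy values never touch the likelihood.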

Note that I’m a bit worried by the double exponentiation implied by your model, but I’m not used to doing GPs with Poisson observations, so maybe this is standard.
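(If the outer exp() ever causes numerical trouble, one standard option is Stan’s poisson_log_lpmf, which takes the log rate directly; a sketch using the same names as my version above:

target += poisson_log_lpmf(
  y[j, 1:N[j]]
  |
  country_a[j]
  + gp_f(
    N[j]
    , x[j, 1:N[j]]
    , country_alpha[j]
    , country_rho[j]
    , eta[eta_start[j]:eta_end[j]]
  )
);

Same model, just avoiding the explicit exponentiation.)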


Thank you so much for the help!!!
I’ve made some minor changes (arrays to vectors where appropriate, added -1 to eta_end, and replaced country_a, country_rho, and country_alpha in the generated quantities with values from the parameters block, etc.) to make the code runnable.
But it seems that a_plus_f in the generated quantities block evaluates the function at the observed input points x, and my x input data consists of repeated integer values, for example 1, 2, 3, 4, …, 61, 62, 1, 2, …, 48, 1, 2, …, 78, and so on. This is why I concatenated the x_tilde array to the x input array (to plot a smooth line) and plugged that into the GP function. For the purpose of generating posterior predictive plots, wouldn’t it make sense to concatenate the arrays and plug that into the parameters block?
Also, if we use the gp_f function in the generated quantities block, wouldn’t it sample function values randomly given the converged GP parameters, rather than producing an interpolated function (kriging)?

I will let you know how the code performs when it finishes sampling. Thanks again!

As I said above, my reformatting of your model makes it expect both x and y to be matrices with one row per country, where, within the row for a given country, the value in the i^{th} column of x is paired with the value in the i^{th} column of y. Maybe it would help to build up intuition: consider a model with just one country. For observations you have a vector[N] y paired with a vector[N] x, such that y holds all your outcome observations and x contains, for each value in y, the associated value of the covariate over which you’re performing the GP. It’d be perfectly fine to have repeated values in x if that indeed reflects that the associated values in y were sampled at the same location of the covariate.

Now, let’s say you have multiple (K) countries but the same number of observations per country, N. Then you can structure the outcome and covariate data as matrix[K,N] y and matrix[K,N] x, and in the relevant parts of the model index into these by row for a given country. Again, it’s perfectly fine to have repeated x values within and across rows.

In the more general case, where the number of observations may differ across countries, my trick is to construct a matrix with as many columns as needed to fit the country with the maximum number of observations. For each country, fill its row starting from the first column until you run out of data, take note of which column contains the last real datum for that country, and fill the rest with a dummy value (e.g. 0). Then, when it comes time to index into the data for a given country, we additionally index the columns using the data variable that records how many real observations were made for that country (N[j] in my version). Yet again, it’s perfectly OK to have repeated x values across columns and rows.

Hm, I think you might not grasp the function of the different blocks. Nothing special occurs in the GQ block relative to the model/TP blocks when it comes to computing a quantity derived from the parameters. You’re not sampling anything in any of the blocks. The sampling takes place outside the code you write in a Stan file; the Stan file simply specifies a mapping from the parameters to a scalar log-probability value that the HMC sampler uses as the target of its MCMC exploration.

One can use the _rng(...) functions in the GQ block to add stochasticity of some sort, usually for things like posterior predictive sampling. But in my code, a_plus_f is a deterministic product of the parameters alone, and indeed is identical to what is used to compute the log probability in the model block. You’d get the same values if you expressed the computation leading to a_plus_f in the transformed parameters block and omitted the GQ block altogether. The only reason I did it this way is that I’ve come to understand that declaring a new variable in either the TP or model blocks can slow things down (something to do with the AD voodoo). So if one doesn’t absolutely need to declare a new variable (some sequential computations involving loops would be hard to achieve without new variable declarations), it can be more performant to do the computation “twice”: once “in-place” in the model block to use its value, and again in the GQ block (and the latter only if you really want to look at the value’s posterior).
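As a minimal sketch of that trade-off (one country, hypothetical names, assuming the gp_f function from above):

transformed parameters {
  // saved with every draw, but every element also participates in autodiff:
  vector[N] a_plus_f = a + gp_f(N, x, alpha, rho, eta);
}
model {
  y ~ poisson(exp(a_plus_f));
}

versus writing exp(a + gp_f(N, x, alpha, rho, eta)) in-place in the model block and repeating the expression in generated quantities, which runs without autodiff and only once per saved draw.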


Just got my sampling results; it now works perfectly.
I really appreciate your help!!!


Awesome! We’re actually looking for a hierarchical GP example to put in posteriorDB; if you feel up to it, feel free to submit yours!
