# Multiple continuous predictors in hierarchical Gaussian process

Hello! I’m looking for a bit of general advice or some examples to learn from rather than model triage (at this point). My question relates to best practices for the specification of multiple continuous predictors in a hierarchical Gaussian process. I have been working from the excellent 2017 StanCon hierarchical GP example from @rtrangucci. The vote-share model example has been extremely helpful in specifying hierarchical GPs, and I’ve successfully used this framework to specify GPs for tree count data with size class (bole diameter) as a continuous predictor across spatial and community hierarchies. A simple version of such a model is here, where count by diameter class is collected on each plot:


data {
int<lower=1> N;
int<lower=1> n_diam;
int<lower=1> n_plots;
int<lower=1> plot_idx[N];
int<lower=1> diam_idx[N];
int<lower=1> diam[N];
real<lower=0> exposure[N];
int<lower=0> y[N];
int<lower=0> prior_only;
}

transformed data {
real diams[n_diam];
for (d in 1:n_diam) {
diams[d] = d;
}
}

parameters {
real mu;
real<lower=0> inverse_phi;

vector[n_diam] GP_mu_std;
real<lower=0> alpha_mu;
real<lower=0> rho_mu;

matrix[n_diam, n_plots] GP_plot_std;
real<lower=0> alpha_plot;
real<lower=0> rho_plot;

vector[n_plots] plot_std;
real<lower=0> sigma_plot;
}

transformed parameters {
vector[n_diam] GP_mu;

vector[n_plots] plot_re;  // plot random effect
matrix[n_diam, n_plots] GP_plot; // plot GP

real eta[N]; // log lambda for likelihood
real phi;

plot_re = sigma_plot * plot_std;
phi = 1 / inverse_phi;

// Plot-level non-centered Gaussian Process (mvnorm std * L_cov)
{
matrix[n_diam, n_diam] cov_mu;
matrix[n_diam, n_diam] L_cov_mu;

matrix[n_diam, n_diam] cov_plot;
matrix[n_diam, n_diam] L_cov_plot;

for (d in 1:n_diam) {
cov_mu[d, d] = cov_mu[d, d] + 1e-9;
cov_plot[d, d] = cov_plot[d, d] + 1e-9;
}

L_cov_mu = cholesky_decompose(cov_mu);
L_cov_plot = cholesky_decompose(cov_plot);

GP_mu = L_cov_mu * GP_mu_std;
GP_plot = L_cov_plot * GP_plot_std;
}

for (n in 1:N) {
eta[n] = mu
+ GP_mu[diam_idx[n]]
+ GP_plot[diam_idx[n], plot_idx[n]]
+ plot_re[plot_idx[n]]
+ log(exposure[n]);
}

}

model {

target += normal_lpdf(mu | 0, 0.1);

// Mean GP
target += std_normal_lpdf(to_vector(GP_mu_std) | );
target += gamma_lpdf(alpha_mu | 5, 20);
target += inv_gamma_lpdf(rho_mu | 5, 40);

// Plot random effect
target += std_normal_lpdf(sigma_plot | );
target += normal_lpdf(plot_std | 0, 0.1);

// Plot GP
target += std_normal_lpdf(to_vector(GP_plot_std) | );
target += gamma_lpdf(alpha_plot | 5, 20);
target += inv_gamma_lpdf(rho_plot | 5, 40);

// Overdispersion
target += gamma_lpdf(inverse_phi | 5, 20);

// Likelihood
if (!prior_only) {
target += neg_binomial_2_log_lpmf(y | eta, phi);
}
}

generated quantities {
int<lower=0> y_new[N];

for (n in 1:N) {
y_new[n] = neg_binomial_2_log_rng(eta[n], phi);
}

}



At this point I would like to learn to extend such a model to include continuous nonlinear effects that occur “above” (and by extension “below” or within) and have interaction with the continuous effect. In the vote-share example I guess this could be variables like median income or median age for regions. For my application, this might be a plot-level variable that I expect to have some interaction with the size-effect in predicting count, like number of years since disturbance, or a climatic variable, or a derived remotely sensed predictor.

In this case, are there better methods than simply including the additional variable(s) in the X matrix? In addition to the fact that the observations occur at different levels in the data, I have vague, somewhat uninformed concerns about increasing the dimensionality of the X matrix, variable scaling for the distance computation, and the practical impacts on computation for large data sets if I can no longer lean on the neatly binned diameter class information to efficiently build the covariance matrix (there are many more plots than diameter bins). Any critiques of my approach, general guidance, or concrete examples of GP models that incorporate continuous predictors at multiple hierarchies would be appreciated.

Thanks!

2 Likes

If you can’t do binning, you can check out a basis function representation to keep the dimensions of your linear algebra under control: Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming

1 Like

Thanks, @bbbales2, this paper looks very interesting and helpful! I will check it out more thoroughly and see if I can come up with an implementation for my application.

(By the way, last year you helped me identify some (now obvious) issues with earlier strategies I had for modeling a related problem, which eventually brought me here - so thanks for your earlier help as well.)

I am still wondering if there is an issue using predictors in the X matrix for the Gaussian process that are measured at different levels in the data structure. Or maybe “issue” is not quite right, but some way of articulating the model to take advantage of the structured nature of the measurements as one might, for example, in a multilevel GLM.

By multiple levels do you mean something like:

\mu \sim N(X_\mu \beta_\mu, \sigma_\mu)\\ y \sim N(X_y \beta_y + \mu, \sigma_y)

(Assuming there are N data points y and J groups \mu and the indexing above makes sense in some way)

Then the alternative would be something like:

\mu \sim N(0, \sigma_\mu)\\ y \sim N(X_y \beta_y + X_\mu \beta_\mu + \mu, \sigma_y)

Where now X_\mu \beta_\mu has tons of duplicate rows (so the dimensions make sense)?

Oh yeah I remember that haha. I never felt totally happy with what we came to (or at least what I came to [edit: in the sense that it sounds like you came to a more satisfactory solution – I don’t ever thing it worked out totally in my head]) but glad it worked out.

I think I’m following your notation, but that looks like additive effects from different levels to me, which is not quite what I’m after.

For example if there are J plots, in addition to the count and bole diameter data that I am modeling hierarchically with a mean GP and a plot-level GP, I am also interested in incorporating a plot-level variable such as time since disturbance that I expect to have an interaction with the size effect in predicting count (you would expect higher count of smaller stems in more recently disturbed areas, more larger trees with longer times since disturbance). The disturbance time variable is just one example of a potential variable, though - there could be many more with potentially complex interactions.

My naive approach is that I could just add that disturbance time variable to the predictor matrix in the GP (and then try the Hilbert space approximation you suggest to get around the increased computation).

Then in that case my data might look something like:

plot_id dist_time size count
1       10        1    5
1       10        2    2
1       10        3    1
...
2       15        1    1
2       15        2    1
2       15        3    5
...


with lots of repeated measurements at the plot-level for disturbance time.

My question is whether or not there is a more clever way to specify the model that still preserves the interaction but also makes use of the structure and different spatial scales of the measurement data.

Can you write out equations for what you mean here?

For instance I could imagine an interaction effect with a gp might be:

y \sim N(g(x_0 \beta_0 + x_1 \beta_1 + x_0 x_1 \beta_{01}), \sigma)

Or:

y \sim N(g(x_0, x_1), \sigma)

Where g is a GP.

I will give it a shot!

My current model is:

y \sim \textbf{NegBinom}(e^{\eta}, \phi)\\ \eta = \mu + \textsf{offset}_{plot} + g_{\mu} + g_{plot} + log(exposure)\\ \mu \sim N(0, 0.1)\\ \textsf{offset}_{plot} \sim N(0, \sigma_{plot})\\ \sigma_{plot} \sim \textbf{Normal}(0, 1) \\ g_{\mu} \sim \textbf{MVN}(0, K(x_0 \mid \alpha_{\mu}, \rho_{\mu})) \\ g_{plot} \sim \textbf{MVN}(0, K(x_0 \mid \alpha_{plot}, \rho_{plot})) \\ \alpha_{mu} \sim \textbf{Gamma}(5, 20) \\ \alpha_{plot} \sim \textbf{Gamma}(5, 20) \\ \rho_{mu} \sim \textbf{InvGamma}(5, 40) \\ \rho_{plot} \sim \textbf{InvGamma}(5, 40) \\ \phi^{-1} \sim \textbf{Gamma}(5, 20)

Here y is the tree count, x_0 is the bole diameter bin, \mu is the expected count across all plots and diameter bins, \textsf{offset}_{plot} is the plot-level random intercept for counts, g_{\mu} is the mean diameter effect on counts across plots (general trend) and g_{plot} is the plot-level diameter effect on counts, the exposure term is an adjustment for plot size, \phi is the overdispersion for the negative binomial, and K is the exponentiated quadratic kernel. The priors come from pretty extensive simulation-based calibration.

My question relates to the best way to add another variable x_1 to the model where x_0 is still bole diameter and x_1 is collected not at the individual tree level (as diameter is), but at the plot-level., e.g. time since disturbance, elevation, average annual rainfall, etc. So two variables, but collected at different spatial scales, that I expect to interact with each other.

My understanding of Gaussian processes is imperfect to say the least, but from what I understand of the quadratic exponentiated kernel, the Euclidean distance computation between two input points in d dimensions x_{i,d} - x_{j,d} in the covariance function:

K(x \mid \alpha, \rho)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right)

means that effectively expanding from the univariate GP by just including additional predictors is basically modeling multivariate nonlinear interactions between x_0, x_1,...x_d so that

y \sim N(g(x_0, x_1), \sigma)

or for my model formulation, e.g.

g_{plot} \sim MVN(0, K(x_0, x_1 \mid \alpha_{plot}, \rho_{plot}))

is sufficient to code an interaction without needing to include terms and coefficients explicitly as you have written here:

y \sim N(g(x_0 \beta_0 + x_1 \beta_1 + x_0 x_1 \beta_{01}), \sigma)

But perhaps that is a misconception, or I am missing the nuance between these two formulations, or something more basic about GPs, in which case I would like to be corrected!

And possibly your formulation of the interaction as a linear model inside the GP gives something of a clue to my question, which is basically what considerations (if any) do I need to make regarding the fact that x_1 is collected at a different spatial scale than x_0, or if there is any more efficient way of including such an additional variable in the model than adding it directly (repeated data and all) to the existing covariance structure that I have specified in my existing model.

I’m having trouble writing or even thinking of the equation that I want to encode, and that is I am sure the root of my problem, so maybe I just need to think about this harder. In very plain, general terms, however, I expect these plot level variables to change the shape of the function relating count to diameter, but I’m unsure the best way to include that expected behavior in the model. Maybe a linear model that represents the hierarchical structure inside the GP is a way of doing that?

So in this case, if the GPs model a function that we’re trying to estimate, then g_{\mu} is a function tree diameter to a log density of trees (trying to think of what the units are) and g_{plot} is a function from X to log density of trees.

What is X here? Is this a spatial term, like plots closer to each other produce more trees?

Both g_{\mu} and g_{plot} currently model log tree density (say stems per hectare) as a function of stem density, actually, it’s just that I expect a given observation to be the sum of an average diameter -> log stem density function across many plots g_{\mu} and a plot-level diameter -> log stem density function g_{plot}. I am thinking of it as analogous to the annual effect and the month effect in the birthday problem or the region / state effects in the case study I mentioned in the OP.

The new variable that I would like to add is not a term to model spatial autocorrelation (yet…), but another continuous variable that has a single measurement per plot (like time since disturbance or elevation), whereas there is a count for each diameter class within each plot going into g_{\mu} and g_{plot}. I expect this new variable to impact g_{plot} by changing the shape of the diameter -> log stem density function at each plot.

Oh right, that makes sense.

Oh okay, I think the default thing to do here would be the two dimensional kernel.

You can model the length scales in each dimension separately (if you’re fitting these). That means \rho in your kernel equation gets a d subscript and moves inside the sum.

The:

y \sim N(g(x_0 \beta_0 + x_1 \beta_1 + x_0 x_1 \beta_{01}), \sigma)

model is corresponds to fitting a linear function from your two inputs (tree diameter + the second one) to some sort of intermediate quantity, and then a GP function from that intermediate to the log density of your trees.

Given we don’t really have an intermediate function in mind, I think the other solution is the preferred thing.

1 Like

Great, I will try the straight two dimensional kernel and look into the HS approximation. I also want to try out automatic relevance determination in this case. Thanks again for your time and advice, @bbbales2. I really appreciate it.

Oh yeah if it becomes too difficult to model the full function f(x, y), I think the first thing you could do to simplify that is model f(x, y) as two functions f(x) f(y) (before switching to linear models).

1 Like