User-defined models for Gaussian processes with non-Gaussian likelihoods (and other generic models)

Greetings, all,

I am somewhat new to Stan, having used it a few times mostly for test purposes, so please bear with me for the lack of code detail at this point. I am having trouble implementing user-defined models, which is one of the things that prevented me from using Stan more extensively before.

Specifically, section 18.3 of the Stan reference guide on fitting Gaussian processes describes the implementation of a GP with discrete outcomes by replacing

y ~ normal(f, sigma);

with

y ~ poisson_log(a + f);

All descriptions I could find of Gaussian processes with non-Gaussian likelihoods require either an approximation of the latent function f or an MCMC estimation of the y_i targets (which I think may be the most straightforward way, since we are already inferring the rest of the parameters that way). Maybe I’m missing something, but I don’t understand how this step is taken in the Stan model code.

More generally (and maybe this belongs in a different thread): can C++ code be used to specify a model likelihood (and its HMC gradient, for instance)? I assume that can be done, because Stan is written in C++, but I haven’t seen anything that makes it as straightforward as the Python and R interfaces, which on the other hand seem less flexible in cases like this one.
Thank you.

I hope to be able to help you with the first question. If you are an R user, you can install the brms package and investigate the Stan code returned by

df <- data.frame(y = 1:30, x = rnorm(30))
make_stancode(y ~ gp(x), data = df, family = poisson())

Thanks for the quick reply. I am normally not an R user, so it took a little while to get around some errors when installing the ‘brms’ package for the first time. The output is pasted below. I still don’t see any approximation of the latent f function, but a vector zgp_1 is being estimated; are those the targets of the latent Gaussian process?
In the Stan reference guide (section 18.3 again) there is a reference to an eta variable, but at one point it seems to describe random Gaussian draws, and later an estimated parameter in the Stan model. Is the notation inconsistent, or am I missing something here too? Thanks.

// generated with brms 2.2.0
functions { 

  /* compute a latent Gaussian process
   * Args:
   *   x: array of continuous predictor values
   *   sdgp: marginal SD parameter
   *   lscale: length-scale parameter
   *   zgp: vector of independent standard normal variables 
   * Returns:  
   *   a vector to be added to the linear predictor
   */ 
  vector gp(vector[] x, real sdgp, real lscale, vector zgp) { 
    matrix[size(x), size(x)] cov;
    cov = cov_exp_quad(x, sdgp, lscale);
    for (n in 1:size(x)) {
      // deal with numerical non-positive-definiteness
      cov[n, n] = cov[n, n] + 1e-12;
    }
    return cholesky_decompose(cov) * zgp;
  }
} 
data { 
  int<lower=1> N;  // total number of observations 
  int Y[N];  // response variable 
  int<lower=1> Kgp_1; 
  int<lower=1> Mgp_1; 
  vector[Mgp_1] Xgp_1[N]; 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
} 
parameters { 
  real temp_Intercept;  // temporary intercept 
  // GP hyperparameters 
  vector<lower=0>[Kgp_1] sdgp_1; 
  vector<lower=0>[Kgp_1] lscale_1; 
  vector[N] zgp_1; 
} 
transformed parameters { 
} 
model { 
  vector[N] mu = rep_vector(0, N) + gp(Xgp_1, sdgp_1[1], lscale_1[1], zgp_1) + temp_Intercept; 
  // priors including all constants 
  target += student_t_lpdf(temp_Intercept | 3, 3, 10); 
  target += student_t_lpdf(sdgp_1 | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += inv_gamma_lpdf(lscale_1 | 0.741722, 0.001792); 
  target += normal_lpdf(zgp_1 | 0, 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += poisson_log_lpmf(Y | mu); 
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_Intercept = temp_Intercept; 
} 

There is no “approximation” of the Gaussian process f, but a direct computation of it via gp(Xgp_1, sdgp_1[1], lscale_1[1], zgp_1). Here, Xgp_1 is the covariate you want to model via a Gaussian process, sdgp_1 is the standard deviation parameter of the Gaussian process, lscale_1 is the length-scale parameter of the Gaussian process, and zgp_1 is the parameter vector that is multiplied by the Cholesky factor of the GP-implied covariance matrix to form the predictions. Indeed, zgp_1 is called eta in the Stan manual.
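In case it helps to see why no approximation is needed: if zgp_1 \sim N(0, I) and L is the Cholesky factor of the GP covariance matrix K (so L L^T = K), then f = L zgp_1 is distributed exactly as N(0, K), i.e. as the Gaussian process evaluated at the covariate values. Estimating zgp_1 and transforming it is therefore equivalent to estimating the latent GP values themselves.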

https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html

Yes, I think you’re clearly right that no approximation of the latent f function is being used, but in that case the problem remains of getting the posterior of the Gaussian process for non-Gaussian observations y.
In the previous section on simulating a Gaussian process (18.2), eta is a standard Gaussian variable, and multiplying it by the L matrix gives you a simulation of the process.
In the following section, where the observations y exist, they can be described as y ~ multi_normal_cholesky(mu, L_K); and there is no eta. eta appears again in the next example, which codes a latent function f still with normal observations, but it doesn’t seem to be the same thing as in the previous section, where it is simply a random draw; here I think it is being estimated (in the example below), although it wouldn’t have to be.
Finally, in the case of the Poisson this would make sense, because you need the targets to get the latent function, and we can’t use y because the observations are non-Gaussian; so we’d infer eta as “pseudo-Gaussian” targets in order to get f, and only then be able to evaluate the likelihood. Without that, an approximation would be needed, which, as you said, is not what is happening here. Does that make sense? Thanks

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = cov_exp_quad(x, alpha, rho);
    // diagonal elements
    for (n in 1:N)
      K[n, n] = K[n, n] + delta;
    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  eta ~ normal(0, 1);
  y ~ normal(f, sigma);
}

Thanks. I had seen that, but doesn’t that example only cover Gaussian-distributed observations?

A non-Gaussian observation model, and the difficulty relative to the conjugate model, are discussed in Section 3.3, https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html#33_simulating_from_a_gaussian_process_conditional_on_non-gaussian_observations. The utility of the conjugacy of a Gaussian observation model is discussed in Section 3.2.

It makes sense generally, but it’s still not totally clear to me how you get the value of f without being able to use the y. Are you sampling f_tilde values with MCMC and using those normal variates to get an estimate of the latent f and compute its likelihood?
Can you explain it mathematically? I’m new to Stan and its notation is still somewhat confusing to me at times. Thanks a lot in advance.

Something to keep in mind is that in Stan, outside of the generated quantities block (which doesn’t affect inference), there is no drawing of random numbers in the way C’s rand function works. y ~ normal(f, sigma) is not drawing random numbers; it’s evaluating the log density of a normal distribution and adding that to a hidden log density accumulator (target).
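For example, a minimal illustration of the equivalence (the two lines below differ only in whether constant terms are kept):

  y ~ normal(f, sigma);                  // sampling statement, drops constant terms
  target += normal_lpdf(y | f, sigma);   // explicit increment, keeps the constants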

All you need to do is write down \ln(p(y | \theta) p(\theta)), and then, if all goes well, you’ll get posterior estimates for all your parameters \theta. Since y goes into that log density, it’ll definitely affect estimates of all the parameters (that includes eta). Swapping that normal distribution for something else doesn’t change how the inference works; it’s all incrementing a log density, and the algorithm works off that. Does that make sense?
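For example, here is a minimal sketch (not copied from the manual, just the latent-f program quoted above with the Gaussian likelihood swapped out: sigma is dropped, an intercept a is added as in your first post, and the last line becomes a Poisson likelihood; the prior on a here is only illustrative):

data {
  int<lower=1> N;
  real x[N];
  int y[N];                          // counts instead of real-valued observations
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;                            // intercept on the log scale
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = cov_exp_quad(x, alpha, rho);
    // jitter on the diagonal for numerical stability
    for (n in 1:N)
      K[n, n] = K[n, n] + delta;
    L_K = cholesky_decompose(K);
    f = L_K * eta;                   // latent GP values, exactly as before
  }
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  a ~ normal(0, 1);                  // illustrative prior choice
  eta ~ normal(0, 1);
  y ~ poisson_log(a + f);            // Poisson instead of normal
}

Everything about how eta and f are handled is identical; only the observation model changes.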

I’m not sure where you are coming from, but there is definitely some confusion about how Stan, and its Hamiltonian Monte Carlo implementation of Markov chain Monte Carlo, works. In particular, in Stan all of the parameters are sampled jointly so that we can compute expectations with respect to a posterior distribution consistent with the information in both the data and the model itself. The goal of Stan’s design is that you don’t have to worry about how this is done: you just build your model and let Stan do all of the work, checking the diagnostics at the end to verify that nothing wonky happened.

So, sorry for the confusion; maybe explaining where I come from will help in this case. Normally I would compute the solution of the model given the parameters, f(x,\theta), and evaluate the likelihood of the data, p(y|f(x,\theta)), which combined with the priors gives the posterior density, and the MCMC algorithm would do its thing. That’s pretty straightforward, and like you said, I’d rather not have to care about how it’s done unless I have to.

Gaussian processes seem to work a little differently with non-Gaussian likelihoods, because we can’t compute f(x,\theta) (the posterior of the GP itself, not of the inference) from the parameters alone: it is a function of the data y. The most common approach to get around this seems to be approximating f via a Laplace approximation or expectation propagation, but that is not what you are doing here. The other, less discussed, way is sampling (inferring) f directly with MCMC, which would fit right into the (joint) inference of the hyperparameters. There seems to be no way around one of those strategies for non-Gaussian likelihoods, unless there is something else I’ve never heard of.

I’m guessing that’s what you’re doing when you have f_tilde in the parameters block and its prior in the model block, so my concrete question is: are you sampling that with MCMC?

I hope I made that understandable. Thanks.

The bit about the random draws makes perfect sense; in the example I mentioned from the user guide the random draws are indeed in generated quantities, and that’s what it seems to be doing to simulate from the Gaussian process prior.

The problem with this part is that we cannot compute p(y|\theta) as “usual”, because we cannot get f(x,\theta) and then put whatever distribution we like on top of it to compute the likelihood of the data y. Instead we have f(x, \theta, y): if it is Gaussian there is a closed form for the likelihood, otherwise we cannot compute f to begin with. I’m assuming some \hat{f} is being sampled, because this parameter eta is now in the parameters section, and multiplying the Cholesky decomposition by something filling in for the real y would finally let us compute f(x,\theta) and then put whatever likelihood distribution we want on top of it.
Thanks again for the reply.

Conditioned on the covariates x we can absolutely write down the joint density over the observations and the realizations of the Gaussian process, regardless of the observation model. The posterior is no longer a multivariate Gaussian, but then really what is these days?

Gaussian processes are no different from any other model in Stan. We write a Stan program that specifies a joint density over the data and all unknown parameters, which the underlying MCMC algorithm then uses to draw posterior samples.
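To spell that out for the latent-f program above: the log density Stan accumulates is, up to constants, \ln p(y, \eta, \theta) = \ln p(y | f(\eta, \theta)) + \ln p(\eta) + \ln p(\theta), where \theta are the hyperparameters and f(\eta, \theta) = L(\theta) \eta is the deterministic Cholesky transform. HMC explores \eta and \theta jointly, and f never needs its own closed-form posterior.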

In the conjugate case where the hyperparameters are assumed fixed, we have a multivariate Gaussian posterior and we can draw realizations using exact sampling in generated quantities. This is all discussed in the case study.

That does make sense, but I don’t see where this is discussed explicitly in the case study. So the unknown parameters include what you call f_tilde and those are sampled by the MCMC, is that right?

For the example in the case study where f_tilde is declared in the parameters block, yes, that’s right.

Thanks, that was the missing piece, I guess.

And in the case when f is actually a function of other parameters and computed in the transformed parameters or model blocks, then f is not directly sampled via MCMC but rather deterministically constructed from the MCMC samples of the more primitive parameters.
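For example, a minimal sketch of that pattern (not taken from the case study, just illustrating the idea, and assuming x and N are declared in data as in the program above): f can be declared in transformed parameters as a deterministic function of the primitive parameters, so it is also saved with the posterior draws.

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[N] eta;                  // primitive parameters actually explored by HMC
}
transformed parameters {
  vector[N] f;                    // deterministic function of eta, rho, alpha
  {
    matrix[N, N] K = cov_exp_quad(x, alpha, rho);
    for (n in 1:N)
      K[n, n] = K[n, n] + 1e-9;   // jitter for numerical stability
    f = cholesky_decompose(K) * eta;
  }
}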

That again makes complete sense. Thanks a lot.

Glad that makes sense. And welcome to the Stan Forums!
