Adaptation issues for the prior predictive distribution with a prior on non-linear, non-invertible function of the data

Say I want to fit a linear regression Y = X\beta + \varepsilon, \beta \sim N(0, 1) and \varepsilon \sim N(0, 1). I also have information on \phi = f(Y) \sim N(29.5, 0.973^2) (where 29.5 and 0.973^2 are arbitrary, precise numbers), and f is some non-linear, non-invertible function. I could specify the following simplified Stan model:

data {
  int <lower = 1> N;
  vector [N] Y;
  int <lower = 1> K;
  matrix [N, K] X;
  int prior_only;

parameters {
  vector [K] beta;
  real <lower = 0> sigma;
  vector [N] latent_y;

transformed parameters {
  real phi = some_nonlinear_noninvertible_function(latent_y);

model {
  if (!prior_only) {
    target += normal_lpdf(Y | X * beta, sigma);
  target += normal_lpdf(beta | 0.0, 1.0);
  target += normal_lpdf(sigma | 0.0, 1.0);

  // ideally
  target += normal_lpdf(phi | 29.5, 0.973);

  // because no _rng's in the transformed_parameters block, use MCMC to sample
  // the prior/posterior distribution.
  // this is a terminal node in the DAG of the model, so has no impact 
  // on the prior/posterior of beta/sigma.
  target += normal_lpdf(latent_y | X * beta, sigma)

The induced / actual / pushforward prior for \phi is a combination of the specific normal prior and the priors for \beta / \sigma. In the process of checking the prior predictive distribution I ran into sampling issues (the prior predictive distribution is important to me). Because I have to include latent_y in the parameters block I have increased the dimension of my model by N (which in my actual use-case is 10,000). Note that we don’t normally have to use MCMC for this, because we can use a suitable _rng function in the generated quantities block, but this is not an option here. The problem is that including latent_y as a parameter, and the corresponding entries in the model block, means my prior-predictive-pairs-plot goes from this:

to this:

where the bottom right block of the latter plot is the same as the upper plot.
There are no divergence/treedepth warnings, but clearly the optimisation/adaptation procedures are struggling with the much expanded parameter space.

Two questions:

  • Can I take the mass matrix from the sampler that produced the ‘good’ pairs plot, and fix the relevant entires in the larger mass matrix (from the sampler that produced the ‘bad’ pairs plot) to the aforementioned?
  • Am I trying to do something completely idiotic here, with no possible hope of success? (almost certainly)

‘Solved’ via a non-centered parameterisation for latent_y.

1 Like