Confusing sampling observations with challenging prior

A problem I am studying has a bivariate latent variable whose prior is uniform on a regular hexagon. Working toward a satisfactory sampling formulation for this variable, I have run into some behavior that I find odd and confusing, and I would appreciate some assistance/advice.

Background:

It is very challenging to come up with a sampling scheme for this variable in which neither the prior density nor its first derivative has a discontinuity. After trying several approaches, I came up with the following.

  1. The hexagon is centered at the origin with a pair of opposite sides parallel to the x axis and a distance 2 apart, so that pair of sides intersects the y axis at (0, 1) and (0, -1) respectively. The bivariate density on this region is uniformly 1 / (2 \sqrt{3}), so the total probability is unity.

  2. Find the marginal density of y by integrating out the x variable. This density turns out to have a discontinuous derivative at y=0.

f_Y(y)= \begin{cases} (2+y)/3 & -1 < y < 0 \\ (2-y)/3 & 0 \le y < 1 \end{cases}
  3. However, its CDF should be one order smoother. I calculated it (the CDF is written out just after this list) and inverted it to get the quantile function.
F^{-1}_Y(u_y) = \begin{cases} \sqrt{1 + 6 u_y} - 2 & 0 < u_y < 1/2 \\ 2 - \sqrt{7 - 6 u_y} & 1/2 \le u_y < 1 \end{cases}

This is a piecewise function whose branch depends on the random variable U_Y. However, the function and its first derivative are continuous over the whole range. Therefore the coordinate y can ostensibly be sampled in Stan by inverse-transform sampling: draw a standard uniform random variable U_Y, then plug it into the above to get a sample y.

  4. The conditional density of x|y turns out to be a complicated-looking expression, but for fixed y it is simply a uniform distribution in x. However, its domain depends on y, so I expect difficulty when sampling in Stan.
f_{X|Y}(x|y) = \begin{cases} \frac{\sqrt{3}}{2 y + 4} & -1 \le y \le 0 , -\frac{y+2}{\sqrt{3}} < x < \frac{y+2}{\sqrt{3}} \\ \frac{\sqrt{3}}{4 - 2 y} & 0 < y \le 1 , -\frac{2-y}{\sqrt{3}} < x < \frac{2-y}{\sqrt{3}} \end{cases}
  5. However, one can sample x by drawing a \mathrm{Uniform}(-1/2, 1/2) random variable U_X and then multiplying the resulting u_x by 2 (y+2) / \sqrt{3} if y \le 0, or by 2 (2 - y) / \sqrt{3} if y > 0. (The 2s in the numerators come from the max and min of the uniform being only \pm 1/2.) Sampling U_X should be straightforward, but note that the scaling factor does not have a first derivative at y=0.
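
For reference, the intermediate CDF from step 3, obtained by integrating the marginal density above, is
F_Y(y) = \begin{cases} (y^2 + 4 y + 3)/6 & -1 < y < 0 \\ 1/2 + (4 y - y^2)/6 & 0 \le y < 1 \end{cases}
Setting u_y = F_Y(y) and solving the resulting quadratic on each branch gives the quantile function in step 3.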

While not perfect with respect to smooth derivatives etc., this seems pretty close. One point I like is that the only sampling statements needed have been constructed to add only a constant (actually zero) to the log likelihood, which is desirable for these latent variables. The only other way to achieve that would be to use actual lpdf functions for both y and x|y, so that they telescope to the constant density of the region; but those functions are clearly not suitable for this.
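
Before moving to Stan, a quick R sanity check of this construction (plain Monte Carlo, separate from the Stan programs below) suggests that the transformed draws do land inside the hexagon and fill it roughly uniformly:

# Monte Carlo check of the inverse-CDF construction, outside of Stan
set.seed(1)
n  <- 1e5
uY <- runif(n)              # standard uniform to feed into the inverse CDF of y
uX <- runif(n, -0.5, 0.5)   # centered standard uniform to be scaled to x

# inverse CDF of the y marginal (step 3)
y <- ifelse(uY <= 0.5, sqrt(1 + 6 * uY) - 2, 2 - sqrt(7 - 6 * uY))

# scale the centered uniform to the conditional range of x given y (step 5)
half_width <- ifelse(y <= 0, (y + 2) / sqrt(3), (2 - y) / sqrt(3))
x <- uX * 2 * half_width

# every draw should lie strictly inside the hexagon
stopifnot(all(abs(y) < 1), all(abs(x) < half_width))

# uniformity spot check: the box |x| < 1/sqrt(3), |y| < 1/2 sits inside the
# hexagon and covers 1/3 of its area, so about a third of the draws should land in it
mean(abs(x) < 1 / sqrt(3) & abs(y) < 0.5)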

Now for the actual issues I noticed!


The Stan code to perform this sampling is as follows. Notice it is stripped down as far as I could manage; in particular, there are no range declarations on any of the parameters or transformed parameters:


// HexUnif.stan

data {} // data


parameters {
  
  real uY; // latent uniform RV to plug into inverse cdf of y
  real uX; // centered standard uniform to be scaled to x
  
} // parameters


transformed parameters {
  
  real y;
  real x;
  
   // draw from inverted y CDF
  if(uY <= 0.5) y = sqrt(1.0 + 6.0 * uY) - 2.0; else
                y = 2.0 - sqrt(7.0 - 6.0 * uY);  

   // multiply centered standard uniform RV to final range of x
  if(y <= 0.0) x = uX * (y + 2.0) / sqrt(3.0); else
               x = uX * (2.0 - y) / sqrt(3.0);
  
} // transformed parameters


model {
  
   // both uniforms have domain of length 1 so should not contribute to target()
  target += uniform_lpdf(uY |  0.0, 1.0);
  target += uniform_lpdf(uX | -0.5, 0.5);
  
} // model

And it is run as follows (using cmdstanr in R, via RStudio):


HexUnif.mod <-  cmdstan_model("stan/HexUnif.stan", compile = FALSE)

#  HexUnif.mod[['check_syntax']]()

HexUnif.model <- HexUnif.mod[['compile']](pedantic = TRUE)

HexUnif.MCMC <- local({
  
    set.seed(1406686775)
    inits <- {function() list(uY = rbeta(1L, shape1=20.0, shape2=20.0),
                              uX = rbeta(1L, shape1=20.0, shape2=20.0) - 0.5) }
  
    out <- {HexUnif.model[['sample']](
      
                      data = list(),
                      seed = 426361962,
                   refresh = 1000,
                      init = inits,
                output_dir = "<localdir>/HexUnif",
                  sig_figs = 8,
                    chains = 10L,
           parallel_chains = 1L,
               iter_warmup = 200,
             iter_sampling = 2000,
             max_treedepth = 10,
               adapt_delta = 0.80    ) } # out
    
    out
    
})



[Output deleted] This had many problems…


> HexUnif.MCMC[['cmdstan_diagnose']]()
Processing csv files: <local files>

Checking sampler transitions treedepth.
880 of 20000 (4.40%) transitions hit the maximum treedepth limit of 10, or 2^10 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
For optimal performance, increase this limit.

Checking sampler transitions for divergences.
19120 of 20000 (95.60%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete.

OK, not unexpected. Let's see about the log likelihood.


>  HexUnif.MCMC[['summary']]() 
# A tibble: 5 × 10
  variable     mean   median    sd   mad      q5   q95  rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 lp__      0        0       0     0      0      0     NA         NA       NA 
2 uY        0.489    0.482   0.288 0.370  0.0481 0.949  1.01    1374.    2103.
3 uX       -0.00146 -0.00851 0.291 0.378 -0.446  0.457  1.01    1223.    2080.
4 y        -0.0195  -0.0269  0.527 0.620 -0.865  0.857  1.01    1374.    2103.
5 x        -0.00158 -0.00726 0.265 0.327 -0.413  0.422  1.01    1372.    3196.

Identically zero, so far so good. Well, not really: the sampling itself is very poor and untrustworthy.

Yet this was not my first attempt at running this sampling.

I originally included range declarations on all variables [note: the following is also observed if the range declarations are omitted for the transformed parameters]. This is the Stan code:


// HexUnifLimits.stan

data {} // data


parameters {

  real<lower= 0.0, upper=1.0> uY; // latent uniform RV to plug into inverse cdf of y
  real<lower=-0.5, upper=0.5> uX; // centered standard uniform to be scaled to x
  
} // parameters


transformed parameters {
  
  real<lower=-1.0          , upper=1.0          > y;
  real<lower=-2.0/sqrt(3.0), upper=2.0/sqrt(3.0)> x;
  
   // draw from inverted y CDF
  if(uY <= 0.5) y = sqrt(1.0 + 6.0 * uY) - 2.0; else
                y = 2.0 - sqrt(7.0 - 6.0 * uY);  

   // multiply centered standard uniform RV to final range of x
  if(y <= 0.0) x = uX * (y + 2.0) / sqrt(3.0); else
               x = uX * (2.0 - y) / sqrt(3.0);
  
} // transformed parameters


model {
  
   // both uniforms have domain of length 1 so should not contribute to target()
  target += uniform_lpdf(uY |  0.0, 1.0);
  target += uniform_lpdf(uX | -0.5, 0.5);
  
} // model

The same R code as above was used to run this, apart from the names, local output directory, and random seeds. Here are the diagnostics:


> HexUnifLimits.MCMC[['cmdstan_diagnose']]()
Processing csv files: <local csvs>

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

So no apparent problems, at least not yet. Here is the fit summary:


> HexUnifLimits.MCMC[['summary']]() 
# A tibble: 5 × 10
  variable       mean    median    sd   mad      q5    q95  rhat ess_bulk ess_tail
  <chr>         <dbl>     <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -3.99      -3.64     1.19  0.892 -6.39   -2.84   1.00    7630.    9569.
2 uY        0.503      0.504    0.290 0.370  0.0462  0.954  1.00   14041.   10540.
3 uX        0.000391  -0.000591 0.286 0.364 -0.446   0.448  1.00   13894.   11255.
4 y         0.00519    0.00655  0.530 0.619 -0.870   0.871  1.00   14041.   10540.
5 x         0.0000206 -0.000428 0.261 0.316 -0.416   0.416  1.00   14788.   13064.

As we saw from the diagnostics, the ESS stats look OK, and the rhats are unity up to rounding. But the log likelihood is now accumulating something that is most certainly not zero.

Also, the ACF plots [not shown] show substantial autocorrelation in the draws, with an AC "half life" of about lag 5. This is not totally unexpected, given that the scaling multiplication for the x variable has a discontinuous derivative. I would expect the other program to be worse (not checked).
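
For reference, this is a sketch of the kind of ACF check I did (not the exact code I ran, but along these lines, using the fit object from the second program):

# sketch of an ACF check on the draws from the HexUnifLimits fit
draws <- HexUnifLimits.MCMC[['draws']](variables = c("x", "y"))
x_chain1 <- as.numeric(draws[, 1, "x"])   # iterations from the first chain only
acf(x_chain1, lag.max = 30)               # autocorrelation of the x draws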


Both of these methods seem to sample \{x, y\} correctly, albeit not always with great efficiency. Here are my questions:

  1. Why does the program without range declarations give such terrible convergence, much, much worse than the one with limits?

  2. Why does the program with range declarations accumulate log likelihood? The only two sampling statements in both programs are uniform lpdfs, both of whose distributions have domains of length 1.

  3. Lastly, a more general question which others might find helpful: is there any material online regarding best practices for inverse-transform sampling of a variable whose prior density has a discontinuous derivative?

I appreciate being able to sample reasonably well from these somewhat awkward priors, but now I worry that they will add spurious log likelihood to the full modeling problem and contaminate my results. Any help or clarity is appreciated.

Thanks for taking the time to read - JZ

RStudio 2024.04.2+764 “Chocolate Cosmos” Release (e4392fc9ddc21961fd1d0efd47484b43f07a4177, 2024-06-05) for macOS
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2024.04.2+764 Chrome/120.0.6099.291 Electron/28.3.1 Safari/537.36, Quarto 1.5.56 (/Applications/quarto/bin/quarto)


> version
               _                           
platform       x86_64-apple-darwin20       
arch           x86_64                      
os             darwin20                    
system         x86_64, darwin20            
status                                     
major          4                           
minor          4.1                         
year           2024                        
month          06                          
day            14                          
svn rev        86737                       
language       R                           
version.string R version 4.4.1 (2024-06-14)
nickname       Race for Your Life 

> packageVersion("cmdstanr") 
[1] ‘0.8.1.9000’

I'm curious as to how this arises in practice: is there some kind of physical constraint on the variables?

Another way to do this would be to sample x ~ p(x) and then y ~ p(y | x). It turns out the marginal for x isn't too bad: it's proportional to the height of the hexagon above x. To figure this out, you need the coordinates of the hexagon, which I got here: https://www.quora.com/How-can-you-find-the-coordinates-in-a-hexagon. Then the conditional for y given x is uniform over its bounds, so we don't need to increment the target for it (we get it for free with the transform).

So we only need one function, for the upper bound of y given x. We then declare y to have those bounds (it's symmetric, so the lower bound is the negation of the upper bound).

functions {
  real ub_y(real x) {
    if (x < -0.5) {
      return (sqrt(3) / 2) * ((x + 1) / 0.5);
    }
    else if (x > 0.5) {
      return (sqrt(3) / 2) * ((1 - x) / 0.5);
    }
    else {
      return sqrt(3) / 2;
    }
  }
}
parameters {
  real<lower=-1, upper=1> x;
  real<lower=-ub_y(x), upper=ub_y(x)> y;
}
model {
  target += ub_y(x);
}

This assumes a hexagon with unit-length sides, so it needs to be multiplied by 2 / sqrt(3) if you want the distance between opposite edges to be 2. I took from the Quora post that the positions of the vertices along the top of the hexagon are (-1, 0), (-1/2, sqrt(3) / 2), (1/2, sqrt(3)/2), and (1, 0) (I cheated and looked it up online rather than trying to remember which way around sines and cosines go). This is all you need for the bounds because the bottom is symmetric. The density of x is proportional to the height of the hexagon above the x axis at that value of x. That height is sqrt(3) / 2 along the top edge (the else case), and on the left or right it is sqrt(3) / 2 times the proportion of the way along the sloped edge, which is (x + 1) / 0.5 or (1 - x) / 0.5.
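
In symbols, the upper boundary computed by ub_y above is
\mathrm{ub}(x) = \begin{cases} \sqrt{3} (x + 1) & -1 \le x < -1/2 \\ \sqrt{3}/2 & -1/2 \le x \le 1/2 \\ \sqrt{3} (1 - x) & 1/2 < x \le 1 \end{cases}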

Here's code to drive it in Python.

import cmdstanpy as csp
import pandas as pd
import plotnine as pn

m = csp.CmdStanModel(stan_file='hexagon.stan')
f = m.sample()

x_draws = f.stan_variable('x')
y_draws = f.stan_variable('y')
df = pd.DataFrame({'x': x_draws, 'y': y_draws})
scatterplot = (
    pn.ggplot(df, pn.aes(x='x', y='y'))
    + pn.geom_point()
    + pn.xlim(-1.2, 1.2)
    + pn.ylim(-1.2, 1.2)
    + pn.coord_fixed()
)
pn.ggsave(scatterplot)

Here's the summary (there are 4K draws: 1K per chain over 4 chains).

>>> f.summary(sig_figs=2)
        Mean    MCSE  StdDev    5%     50%   95%   N_Eff  N_Eff/s  R_hat
lp__ -1.7000  0.0260    1.10 -4.00 -1.3000 -0.70  1800.0  27000.0    1.0
x     0.0062  0.0081    0.43 -0.67  0.0160  0.66  2700.0  43000.0    1.0
y     0.0031  0.0086    0.46 -0.74  0.0093  0.75  2900.0  46000.0    1.0

And here's the scatterplot.


Thanks, Bob, for the feedback. Lots of good stuff here.

My context for this polygon sampling is coping with rounding measurement error in compositional data, building on the second approach in this section of the User's Guide.


It looks like you approached the telescoping conditional density from the other direction, x then y|x rather than vice versa. A big difference, though, is that you didn't try to sweep the continuity issues as far downstream into the processing as you could: you kept them in the sampling statement itself (ub_y has two derivative discontinuities) and in the declared variable limits (y's domain depends on x). Yet your convergence etc. was comparable to mine.

I am intrigued by this, as I had previously coded a very direct y then x|y method, quite analogous to yours, and when actually applying it to my problem the convergence was awful. Hence my contortions here with the CDFs and so on. However, I didn't run a pilot on just the region by itself as we are doing here, which raises the possibility that, once either approach is used in the actual problem, things will break down.

I alluded to this briefly in my OP, but you have the same issue as my successful method: the target is accumulating something when we do not want any contribution from this process. The actual "prior" distribution appears fine for both of us, but I do not know how this will affect the posteriors of the important parameters in the full problem. And whereas yours comes from the variability in ub_y, I cannot tell where mine comes from, as both of my increments were from uniform RVs.

And I am also puzzled as to why adding declared parameter bounds, when they are already implicit in the sampling statements, improves convergence so much, yet those bounds are what contribute to the non-constant target() that I would like to avoid.

Thanks again for the feedback.

Stan's sampling algorithms assume support on all of \mathbb{R}^N. If you declare a as real and then apply a ~ uniform(0, 1), there will be problems whenever the sampler steps outside the support; in the worst case, it devolves to an inefficient form of rejection sampling.

A constrained variable in Stan is transformed to something on \mathbb{R}^N; then, when Stan runs, it is inverse-transformed back to the constrained space and the log Jacobian determinant correction is implicitly added to the log density. For instance, we log transform positive-constrained variables, which means that when we map them back, we add \log \exp(x^{\textrm{unc}}) to the log density, where x^{\textrm{unc}} is the unconstrained parameter. This ensures the distribution of x = \exp(x^\textrm{unc}) is uniform on (0, \infty).
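
For an interval-bounded parameter like the ones here, declared with lower bound a and upper bound b, the transform and its implicit log Jacobian term are
x = a + (b - a) \operatorname{logit}^{-1}(x^{\textrm{unc}}), \qquad \log \left| \frac{dx}{dx^{\textrm{unc}}} \right| = \log(b - a) + \log \operatorname{logit}^{-1}(x^{\textrm{unc}}) + \log \left( 1 - \operatorname{logit}^{-1}(x^{\textrm{unc}}) \right),
which, written in terms of the constrained value x, is \log \big( (x - a)(b - x) / (b - a) \big).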

Together with the declared bounds, this correction ensures that, absent any further density increments, the distribution is uniform over the space of values satisfying the constraint.

I'm not sure what you mean by "accumulating log likelihood". When there are constraints, we add the Jacobian correction to the target density, but it's not a log-likelihood per se (the term "likelihood" refers to the data sampling distribution viewed as a function of the parameters, i.e., \mathcal{L}(\theta) = p(y \mid \theta) for fixed data y).
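
To make that concrete for HexUnifLimits.stan: the uniform_lpdf statements contribute zero, so lp__ should just be the sum of the two Jacobian terms for the declared bounds. Here is a rough R sketch of that sum (a back-of-the-envelope check, not the Stan output itself):

# (uY, uX) end up uniform over the declared box, so lp__ should behave like
# the sum of the two bound-transform log Jacobians evaluated at the draws
set.seed(1)
uY <- runif(1e6, 0, 1)        # bounds (0, 1)
uX <- runif(1e6, -0.5, 0.5)   # bounds (-0.5, 0.5)
lp <- log(uY * (1 - uY)) + log((uX + 0.5) * (0.5 - uX))
mean(lp)   # about -4, in line with the lp__ mean reported in the summary above
max(lp)    # about -2.77 = 2 * log(0.25), an upper bound for lp__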

The discontinuous derivatives here shouldn't be a big deal: (a) the value won't differ by much on either side of the discontinuity (the derivative is either 0 or a small constant), and (b) locality in constrained space is similar to locality in unconstrained space, by which I just mean that if points are near each other in constrained space, they'll be near each other in unconstrained space.

There's no continuity issue from declaring the limits of y to depend on x; we use logit and inverse logit (scaled) to do this, which is smooth.

Thanks for the explanations; this was very helpful. I'm also cautiously optimistic that I do not need to be so strict about some of these issues… I was taking my cues from pedantic mode's warnings.