Jacobian adjustment for parameter with parameter-based upper+lower bounds

I am trying to estimate a copula model where some of the variables are distributed as zero-one-inflated beta. My goal is to use the Chib and Greenberg approach (as with blavaan) where discrete values are modeled as continuous values within the bounds implied by the other model parameters. In this case…

  • Observations where X=0 have a continuous parameter associated with them that are bounded between 0 and the probability that X=0
  • Observations where 0<X<1 are directly transformed by the scaled beta CDF
  • Observations where X=1 have a continuous parameter associated with them that are bounded between 1 - the probability that X=1 and 1

The code snipped below demonstrates this idea.

parameters{
    // Zero-one inflated beta parameters
    simplex[3] theta;
    real<lower = 0, upper = 1> mu;
    real<lower = 0> sigma;

    // Transformation of raw values to cumulative probabilities
    vector<lower = 0, upper = theta[1]>[N0] P0;
    vector<lower = 1.0 - theta[3], upper = 1>[N1] P1;
}
transformed parameters{
    real a = mu * sigma;
    real b = (1.0 - mu) * sigma;
    vector[N_not01] P_not01_implied; // Direct estimates of CDF value based on theta, a, and b

    for(n in 1:N_not01){
        P_not01_implied[n] = theta[1] + beta_cdf(X_not01[n], a, b) * theta[2];
    }
}

However, this produces biased estimates (see attached model2.stan, "Modeled P(X=0) P(X=1) in the plot below). My understanding is that I need a Jacobian adjustment because the parameters are bounded by another parameter and I am only modeling 2 of the 3 segments (X=0 and X=1 but not 0<X<1). If I include a set of parameters for 0<X<1, then I get unbiased estimates (model3.stan, “Modelled P(X=0) P(0<X<1) P(X=1)”). So there seems to be some Jacobian adjustment I need that is a function of those probabilities which should be more efficient than modeling the continuous values for P(0<X<1) which are ignored anyway.

I suspect that the information I need is described here in the user guide or here in the reference manual, but I’m not understanding practically what I need to be doing.

Thanks for your help and time!

example.R (2.1 KB)
model1.stan (842 Bytes)
model2.stan (1.0 KB)
model3.stan (1.1 KB)

[edit: sorry for the initial fragment]

I’m not familiar with their approach. Why use continuous values rather than modeling discrete variables as discrete? Anyway, let’s not worry about that and just try to answer your question.

I took the liberty of refactoring Model 2 to make its structure cleraer and to make it more efficient.

data{
  int<lower = 0> N0;
  int<lower = 0> N_not01;
  int<lower = 0> N1;
  vector<lower = 0, upper = 1>[N_not01] X_not01;
}
parameters{
  simplex[3] theta;
  real<lower = 0, upper = 1> mu;
  real<lower = 0> kappa;
  vector<lower = 0, upper = theta[1]>[N0] P0;
  vector<lower = 1.0 - theta[3], upper = 1>[N1] P1;
}
model {
  X_not01 ~ beta_mean(mu, kappa);
  {N0, N_not01, N1} ~ categorical(theta);
}
generated quantities {
  vector[N_not01] P_not01_implied;
  for(n in 1:N_not01) {
    P_not01_implied[n]
        = theta[1]
        + beta_cdf((X_not01[n] - theta[1])
                   / (theta[2]), a, b);
  }
}

I made the following changes:

  • removed definition of a and b and instead used our built-in beta_proportion distribution (see the doc.
  • replaced the hand-coded categorical distirbution and replaced it with the built-in categorical.
  • renamed sigma to kappa so nobody gets confused thinking is a standard deviation
  • moved the calculation of P_not01_implied to the generated quantities block because it’s not used in the model block. I also removed the _implied because all of our parameters are implied in some sense. Then I conerted to log scale because we don’t actually have the cdf for some reason for beta_proportion (also, it’ll be more stable arithmetically)

As far as a Jacobian adjustment, you don’t need one. You aren’t giving P_not01 a distribution. In fact you’re not doing anything with it in the model rather than reporting its value, which means it would be much moe efficient to move the definition into the generated quantities block where it won’t have to do expensive gradients.

If you had a statement like

P_not01 ~ ...;

in the model block, then you’d need to leave the definition in the transformed parameter block and you’d need to define the transform, say

f:[theta1, theta2] -> [theta1, P_not01]

You wouldn’t use theta3 because that’s determined by theta1 and theta2. Now you need to compute the log of the absolute value of the determinant of the Jacobian of f. I don’t think that’s going to be easy with the cdf in the middle being differentiated against its arguments rather than its outcome.

Thanks for the comments, Bob. I appreciate you taking the time to clean things up (and I apologize in advance for throwing together an updated model for illustration that doesn’t yet take those things into account).

The Chib and Greenberg part is core to the issue, so I apologize for not elaborating. It’s closely related to the Albert and Chib approach described in the manual for multivariate probit. The idea is that multivariate discrete distributions can be intractable, but you can instead model an underlying continuous variable (that would have generated the discrete variable) as multivariate normal.

In the multivariate probit example, all observations get a continuous but bounded parameter based on whether they are 0 or 1. In the ordered example described in the blavaan manual referenced above, the same applies but the bounds are dependent on the estimated cutoff values.

The zero-one-inflated beta case is a little different, since only 0s and 1s are discrete and values between 0 and 1 can deterministically be transformed to standard normal. So if we have N observations total, N_0 of which are 0 and N_1 of which are 1, then we are left with N - (N_0 + N_1) that do not have a corresponding continuous parameter associated with them. It’s this scenario that seems to be leading to biased estimates that happen to go away if you arbitrarily specify N - (N_0 + N_1) continuous parameters that are appropriately bounded between Pr(X=0) and 1 - Pr(X=1). The point here being that, at least as far as I can tell, simply specifying some (but not all) of the underlying continuous parameters can lead to biased results. That’s what I’m trying to address in a more efficient way.

I’m attaching two complete models as examples so that the main intent is (hopefully) clearer. Once again where one is biased and other is unbiased. But a very simple, contrived example looks like this. First, the biased model.

data{
    int N0;
    int N_not01;
    int N1;
}
parameters{
    simplex[3] theta;
    vector<lower = 0, upper = theta[1]>[N0] P0;
    vector<lower = 1-theta[3], upper = 1>[N1] P1;
}
model{

}

And the unbiased model.

data{
    int N0;
    int N_not01;
    int N1;
}
parameters{
    simplex[3] theta;
    vector<lower = 0, upper = theta[1]>[N0] P0;
    vector<lower = theta[1], upper = 1-theta[3]>[N_not01] P_not01;
    vector<lower = 1-theta[3], upper = 1>[N1] P1;
}
model{

}

So the goal is to make the same adjustment that P_not01 is doing without having to specify N-(N_0 + N_1) unused continuous parameters.

Edit: corrected the attached models to correctly subtract the normal marginals.

biased.stan (4.6 KB)
unbiased.stan (4.8 KB)

I’m clearly making at least one other error because the “unbiased” distribution of theta that I’m plotting is still not correct, as it doesn’t align with a Dirichlet([1,1,1]) distribution. I appreciate any other insight others have, but I’ll do some more digging in the meanwhile.

Do you have a definition of the cdfs for the distributions that you’re interested in using? You can use these to estimate the copula model fairly easily,

Here’s an example of this with gamma and exponential distributions: Copulas

1 Like

Hey Andrew. First, thank you for doing all the work of putting together the guide you linked. It, along with @spinkney 's documentation, has been immensely helpful as I’ve been working on this project.

The PDF is

zoib(x) \begin{cases} \theta_1 & \text{if } x=0 \\ \theta_2 \times beta(x | \alpha, \beta) & \text{if } 0<x<1 \\ \theta_3 & \text{if } x = 1 \end{cases}

So the equivalent “CDF” would be something like

zoib(x) \begin{cases} \theta_1 & \text{if } x=0 \\ \theta_1 + \theta_2 \times beta\_cdf(x | \alpha, \beta) & \text{if } 0<x<1 \\ 1 & \text{if } x = 1 \end{cases}

I use the data augmentation approach that you describe to fill in values when x=0 and x=1. I directly transform values 0<x<1 from being beta-distributed to standard normal.

Looking back over your guide and what I’ve done so far, I think I understand the issue that initially motivated my question. From my understanding, the bounded parameters increment the log-likelihood the same as the categorical likelihood. (This is, in retrospect, obvious.) It’s just much less efficient in this simple case because it’s estimating N additional parameters. This also explains why there is no explicit marginal distribution for the discrete variables in your example. I clearly missed your point that, “we only need to specify the joint density function. This means that we simply specify the uniform parameters and use them in the copula density function that we defined above.”!

I think the model below addresses the issue. Since I don’t specify a bounded continuous parameter for the values 0<X<1, it is as if I am doing {N0, 0, N1} ~ categorical(theta). So I need to increment the log-likelihood by N_not01 * log(theta[2]).

data{
    int N0;
    int N_not01;
    int N1;
    vector<lower = 0, upper = 1>[N_not01] X;
}
parameters{
    simplex[3] theta;
    vector<lower = 0, upper = theta[1]>[N0] P0;
    vector<lower = 1-theta[3], upper = 1>[N1] P1;
    real<lower = 0, upper = 1> mu;
    real<lower = 0> sigma;
}
model{
    target += N_not01 * log(theta[2]);
    X ~ beta_proportion(mu, sigma);
}

This is just a data and parameter definition. You can run it, but what you’re calling “biased” and “unbiased” will give the same posteriors for P0 and P1. So I’m not sure what you mean by “bias” here.

Note on terminology: A Bayesian model, by itself, is never biased—Bayes is calibrated by construction. If I generate data from the model and fit using the model, the posteriors will be well calibrated. I take it what you mean is that the model is not well specified for some data? Even if that’s the case, the estimates you get are not biased in the technical sense. Bias assumes systematic variation from their expected values—but the expectations are given in the posterior, and that will work even if the model is poorly specified for the data.

What are the three facets on the plots—they mention X but there is no mention of that in the Stan programs? What is the histogram plotting? The horizontal axis is labeled theta but theta is a simplex.

In both of the Stan programs you supplied (which have no content to their model block), the posterior for theta should be Dirichlet([1, 1, 1]), i.e., uniform.

The densities are the posterior draws of theta from what I had called the unbiased and biased models. The three facets align with each value of theta, which correspond to the ranges X=0, 0<X<1, and X=1 of the original data. As you point out, there is no explicit adjustment to the log-likelihood. So the densities should align with Dirichlet([1,1,1]). The posterior draws from the “biased” model are clearly wrong, since Pr(0<X<1) is concentrated at 0. The posterior draws from the “unbiased” model are unbiased in the sense that they are at least centered at [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]. But they are clearly not Dirichlet([1,1,1]).

That is what I had assumed based on the documentation but not the results that I am seeing, as the figure was meant to illustrate. Maybe it’s because of the version of Stan I’m using or some setting? I’m using cmdstan version 2.34.1, cmdstanr 0.5.3.

Apologies for the lazy use of terminology. My intention was that the posterior estimates for the “biased” model do not approximate the sample proportions while those from the “unbiased” model do.

I was talking about your post that had two models, what you called “biased” and “unbiased” and they had empty model blocks in the post. I had assumed that’s what you were fitting, and in that case, none of the definitions of P0, P_not01 and P1 will affect theta. But then when I looked at your actual model, there’s all kinds of stuff going on that’s going to pull theta away from a uniform simplex posterior.

Also, when you say “bias”, what do you mean? In statistics, it’s usually the expected (i.e., systematic) error. Bayes is unbiased by construction if the data comes from the data-generating process defined by the model.

The only difference between what you listed as biased and unbiased is whether P_not01 is defined.

$ diff unbiased.stan biased.stan 
103d102
<     vector<lower = theta[1,1], upper = 1.0 - theta[1,3]>[N-(M10+M11)] P1_not01; // Thrown away
107d105
<     vector<lower = theta[2,1], upper = 1.0 - theta[2,3]>[N-(M20+M21)] P2_not01; // Thrown away
Downloads$ 

So it’s just the addition of this variable. There’s only one way this can affect the posterior, which is if the constraints become inconsistent and cause rejections. This will happen whenever

theta[1, 1] >= 1 - theta[1, 3]

or

theta[2, 1] >= 1 - theta[2, 3]

This can happen in simplexes theta, e.g., theta = [0.5, 0.2, 0.3] triggers the first violation. So it’s going to chop off part of your posterior, essentially using rejection sampling.

There’s clearly some things I’m not understanding about the sampler. I’ll try and simplify things and make it clearer what I’m showing.

On bias, I made things more confusing than they needed to be. Let’s pretend I didn’t use that word. More concretely I mean “not what I intended/expected.” In the models where I have a simplex[3] parameter and an empty model block, I expected the posterior distribution to approximate Dirichlet([1,1,1]). I expected that to be true even after adding the constrained parameters, which it did not. This was the source of my confusion.

On sampler dynamics, I hadn’t considered rejection sampling as a potential cause. Is there a way to evaluate whether the rejection process you’re describing is occurring?

Based on the Reparameterization and Change of Variables section of the Users Guide, I assumed that the issue arose because of the way that Stan makes Jacobian adjustments for constrained variables. For example

Stan allows scalar and non-scalar upper and lower bounds to be declared in the constraints
for a container data type. The transforms are calculated and their log Jacobians added to the
log density accumulator

In the Varying Upper and Lower Bounds section

it is important that L and U are constants, otherwise a Jacobian would be required when
multiplying by U-L.

This indicated to me that the problem was because of a missing Jacobian adjustment.

On my models and output, I’ll simplify things further and show results from four models with a Bernoulli-distributed variable with a single parameter theta. I’m hard-coding the sample sizes in the transformed data block to make things more concrete. The intention here is only to demonstrate the impact of specifying the bounded parameters rather than show the full copula-based structure of the model. The posterior distributions of theta are shown in the plot below, followed by descriptions of each model.

In the first model (“No likelihood” in the plot), I only declare theta without any constrained parameters or terms in the model block. As expected, the posterior distribution of theta approximates Uniform(0,1).

transformed data{
    int N0 = 25;
    int N1 = 50;
}
parameters{
    real<lower = 0, upper = 1> theta;
}

The second model (“Bernoulli”) adds the Bernoulli log-likelihood. As expected, the posterior distribution approximates Beta(N1+1,N0+1).

transformed data{
    int N0 = 25;
    int N1 = 50;
}
parameters{
    real<lower = 0, upper = 1> theta;
}
model{
    target += N0 * log(1.0 - theta);
    target += N1 * log(theta);
}

The third model (“Bounded parameters”) adds the constrained parameters to the first model while maintaining an empty model block. My initial expectation was that the results should approximate the first model. However, the results approximate the second model.

transformed data{
    int N0 = 25;
    int N1 = 50;
}
parameters{
    real<lower = 0, upper = 1> theta;
    vector<lower = 0, upper = 1-theta>[N0] P0;
    vector<lower = 1-theta, upper = 1>[N1] P1;
}

Finally, the fourth model (“Hybrid”) is directly relevant to my final model. Here, I specify bounded parameters for the N0 zero observations but include the log-likelihood in the model block for the N1 one observations. The results approximate the previous two models.

transformed data{
    int N0 = 25;
    int N1 = 50;
}
parameters{
    real<lower = 0, upper = 1> theta;
    vector<lower = 0, upper = 1-theta>[N0] P0;
}
model{
    target += N1 * log(theta);
}

The conclusion I draw from this is that the specifying the bounded parameters is proportionally equivalent to the corresponding Bernoulli log-likelihood.

Does that all make sense? Or is there something I’m not understanding with regards to rejection sampling, how bounded parameters work, etc.?

1 Like

Yes, that is true for theta. When P0 and P1 are parameters there’s an additional adjustment for the change of variables there.

For clarity,

transformed data{
    int N0 = 25;
    int N1 = 50;
}
parameters{
    real<lower = 0, upper = 1> theta;
 // model 3 includes the following
 // vector<lower = 0, upper = 1-theta>[N0] P0;
 // vector<lower = 1-theta, upper = 1>[N1] P1;
}
model{
// comment out the target statement for model 3
// because Stan does the adjustment
    target += N0 * log(1.0 - theta);
    target += N1 * log(theta);

  // for model 3 the adjustment that Stan does under the hood is equivalent to
  // target += log(ub - lb) + log_inv_logit(y) + log1m_inv_logit(y)
  // where y are unbounded parameters
  // so the adjustment for P0 is
  //
  // assuming 
  // vector[N0] P0_raw and vector[N1] P1_raw 
  // are declared as parameters is
  // 
  // target += N0 * log(1 - theta) + log_inv_logit(P0_raw) + log1m_inv_logit(P0_raw);
  // target += N1 * log(theta) + log_inv_logit(P1_raw) + log1m_inv_logit(P1_raw);
}
3 Likes

@spinkney gave the authoritative math answer. I just want to chime in and connect this to some intuition.

Let’s consider a simpler model, with only 2 parameters:

parameters {
  real<lower=0, upper=1> theta1;
  real<lower=0, upper=theta1> theta2;
}

@simonbrauer 's initial expectation was that this model should yield a uniform posterior between 0 and 1 for the theta1 margin.

But it doesn’t:

library(cmdstanr)
mod <- cmdstan_model("/Users/Jacob/Desktop/test3.stan")
samps <- mod$sample(iter_sampling = 100000)$draws() |> posterior::as_draws_df()
hist(samps$theta1)

Why not? Because this theta1 plot only shows a margin of a higher-dimensional parameter space. This model yields output that is uniform over the allowable configurations of [theta1, theta2] that respect the declared constraints.

library(ggplot2)
density_plot <- ggplot(samps, aes(x=theta1, y=theta2) ) +
  geom_bin2d() +
  theme_bw()
density_plot

In @simonbrauer 's example, we are looking at the margin for theta of a 1 + N0 + N1 dimensional parameter space, over which we are uniform, truncated by the constraints supplied in the parameter declarations.

Edit to clarify: so there’s no extra or missing Jacobian adjustment at issue here. The adjustments applied under-the-hood by Stan are achieving exactly what they’re supposed to, which is yielding a uniform density over the constrained parameter space, ready to be modified by any further increments to the target.

6 Likes

Thanks @spinkney and @jsocolar, that clarifies a lot!

This is surprising, but I see it clearly now. I recall this coming up in a fairly recent post (which, of course, I can no longer find) in which there was confusion why

theta1 <- runif(1000, 0, 1)
theta2 <- runif(1000,  0, theta1)

did not produce the same result as the “equivalent” Stan model as you showed. But I see the connection now between my problem at that original one.

Thanks again, everyone.

1 Like

Happy to help! And I think the thread you’re looking for is Parameter bounds that depend on other parameters.

1 Like