Observing a non-linear function of latent variables

Consider data generated by the data-generating process below. Note that x and y are not observable; V is observable, but it is a non-linear combination of x and y.

library(tidyverse)

# Generative model ####
# Model parameters (we will try to estimate these later)
mean_x <- 1
sd_x <- 0.2
mean_y <- 2
sd_y <- 0.4

# Generate data
set.seed(123)
n <- 100
# Non-observable data
x <- rnorm(n, mean = mean_x, sd = sd_x)
y <- rnorm(n, mean = mean_y, sd = sd_y)

# Observable data
Data <- tibble(w = runif(n), V = w*exp(x) - (1-w)*exp(y))

A very low-tech way to try to estimate the model parameters is as follows:

#| A very low-tech way to estimate parameters is to filter the data down to
#| observations which are mostly composed of x (or y) and then estimate the
#| mean and sd from that.
threshold <- 0.9

Data |>
  filter(w > threshold) |>
  mutate(x1 = log(V/w)) |>
  summarise(n(), mean_x = mean(x1), sd_x = sd(x1)) # 0.955, 0.149

Data |>
  filter(1-w > threshold) |>
  mutate(y1 = log(-V/(1-w))) |>
  summarise(n(), mean_y = mean(y1), sd_y = sd(y1)) # 1.76, 0.455

I’d like to estimate the parameters properly using Stan. Below is an attempt at a model:

data {
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
}
parameters {
  real mean_x, mean_y;
  real<lower=0> sd_x, sd_y;
  vector[n] y;
}
model {
  mean_x ~ normal(0, 1);
  sd_x ~ normal(0, 1);
  mean_y ~ normal(0, 1);
  sd_y ~ normal(0, 1);

  y ~ normal(mean_y, sd_y);
  (V + (1-w).*exp(y))./w ~ lognormal(mean_x, sd_x); // Jacobian needed?
}

However, sampling from this model fails with messages like “Exception: lognormal_lpdf: Random variable[2] is -20.1569, but must be nonnegative!” This is no surprise since coming up with a valid starting y is hard.

One can cheat and initialize y with the unobservable actual values of y; this allows sampling to succeed (with lots of warnings) but the ESS are low and estimates are poor.

Any advice on how to model this in Stan?

Thanks.

The model you are trying to fit does not seem statistically valid to me. That is, you do not have a consistent likelihood function.

This part is trying to define the likelihood using a normal prior on x?

If V is observable (i.e. the data of the model), I would have formulated the problem using something along the lines of
Prior on x:

P(x|\mu_x,\sigma_x)=\mathcal{N}(x|\mu_x,\sigma_x)

Prior on y:

P(y|\mu_y,\sigma_y)=\mathcal{N}(y|\mu_y,\sigma_y)

Likelihood:

P(V|x,y,\sigma_V)=\mathcal{N}(V|w\,\exp(x)-(1-w)\exp(y),\sigma_V)

And some prior on \sigma_V if required.

The normal likelihood may not be sufficient in your case

Thanks for your comments. I’m not sure I follow your comment on the likelihood. Note that given x, y, and w the value of V is completely determined. I.e., from the data generating process we have that

V = w*exp(x) - (1-w)*exp(y)

You’ll notice that there is no parameter \sigma_V!

In other words, we observe a deterministic function of the latent variables x and y.

If you have any thoughts I’m all ears!

If V is deterministic, then you may only provide a single density. That is, by defining a density on y, the density of x is fully determined, since x can be seen as a transformation of y.

We have that:

P(y)=\mathcal{N}(y|\mu_y,\sigma_y)

We then determine x:

x=\ln(V+(1-w)\exp(y))-\ln(w)

That is x=f(y) or y=f^{-1}(x). By determining the derivative \frac{dy}{dx} we can obtain the density on x:

P(x)=|\frac{dy}{dx}|P(y)
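
Concretely, differentiating the expression for x above (this worked step is added for illustration, with y understood as f^{-1}(x)):

\frac{dx}{dy}=\frac{(1-w)\exp(y)}{V+(1-w)\exp(y)}

so that

P(x)=\left|\frac{dy}{dx}\right|P(y)=\frac{V+(1-w)\exp(y)}{(1-w)\exp(y)}\,\mathcal{N}(y|\mu_y,\sigma_y)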

Hence the model you are defining does not make sense in a Bayesian setting. A problem also arises from this in that we need V+(1-w)\exp(y)>0 for the transformation to be valid.

If we however treat V as a random variable with noise associated with it (hence the incorporation of \sigma_V), we are able to derive a fully Bayesian model as I have it above. For noise-free models \sigma_V tends towards zero. For stability of the sampling process, we can set \sigma_V to a small constant, which would prefer values of x,y such that V\approx w\,\exp(x) - (1-w)\exp(y).
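
For concreteness, a minimal Stan sketch of this soft-constraint formulation (my illustration, untested; here \sigma_V is supplied as data and fixed to a small constant rather than given a prior) might look like:

data {
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
  real<lower=0> sigma_V; // small constant, e.g. 1e-3
}
parameters {
  real mean_x, mean_y;
  real<lower=0> sd_x, sd_y;
  vector[n] x;
  vector[n] y;
}
model {
  mean_x ~ normal(0, 1);
  sd_x ~ normal(0, 1);
  mean_y ~ normal(0, 1);
  sd_y ~ normal(0, 1);

  x ~ normal(mean_x, sd_x);
  y ~ normal(mean_y, sd_y);

  // soft constraint: V treated as a noisy observation of w*exp(x) - (1-w)*exp(y)
  V ~ normal(w .* exp(x) - (1 - w) .* exp(y), sigma_V);
}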


That’s because you need a joint constraint on the parameters y that ensures (V + (1-w).*exp(y))./w is greater than zero. This shouldn’t be too hard given that V and w are data. Since w[n] > 0, the constraint

( V[n] + (1 - w[n]) * exp(y[n]) ) / w[n] > 0

is equivalent to

V[n] + (1 - w[n]) * exp(y[n]) > 0

(1 - w[n]) * exp(y[n]) > -V[n]

exp(y[n]) > -V[n] / (1 - w[n])

y[n] > log(-V[n] / (1 - w[n]))

so you can create the vector of lower bounds for y and add it as a constraint on the declaration. Where the argument of the log is negative (i.e., V[n] > 0), there is no finite lower bound to impose. Then you need the change-of-variables adjustment, which is mainly about the exp().

Right. And V is deterministic here in the sense of being given as data.


(wrote this before seeing other replies, and purposefully not reading them to not have the answer spoiled ;])

Your intuitions are correct – you need a Jacobian adjustment – and the exceptions and warnings arise because you haven’t been explicit about bounds; being explicit about them also removes any need for special initialization.

The model is well specified – you have some joint probability distribution Pr(x, y, w, V, mean_x, mean_y, sd_x, sd_y), and having observed w and V, you want to sample from a target distribution Pr(x, y, mean_x, mean_y, sd_x, sd_y | w, V). The model is:

mean_x ~ normal(0, 1)
mean_y ~ normal(0, 1)
sd_x ~ half-normal(0, 1)
sd_y ~ half-normal(0, 1)
x_i ~ normal(mean_x, sd_x)
y_i ~ normal(mean_y, sd_y)
w_i ~ uniform(0, 1)
V_i = w_i * exp(x_i) - (1-w_i) * exp(y_i)

We can do some algebra to express x in terms of w, V, and y:

V_i = w_i * exp(x_i) - (1-w_i) * exp(y_i)
w_i * exp(x_i) = V_i + (1 - w_i) * exp(y_i)
exp(x_i) = ( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i
x_i = log( ( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i )

to go in the transformed parameters block as:

vector[n] x = log( ( V + ( 1 - w ) .* exp( y ) ) ./ w );

As we’ll be putting a normal() on this, we need a Jacobian adjustment. Because the transformation is element-wise, the Jacobian of the transform is diagonal, which means the determinant of the Jacobian is the product of its diagonal elements. With w and V being observed, these diagonal elements are

dx_i/dy_i = d/dy_i log( ( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i )

where V_i and w_i are constants.

This can be rewritten as

d/dy_i log( V_i / w_i + ( 1 - w_i ) / w_i * exp( y_i ) )

and we can specify new constants to be:

A_i = V_i / w_i
B_i = ( 1 - w_i ) / w_i

to get

d/dy_i log( A_i + B_i * exp( y_i ) )

We can take this derivative with the chain rule: dx/dy = dx/du * du/dy, where u_i = A_i + B_i * exp( y_i ) and x_i = log( u_i ).

So

dx_i / du_i = d / du_i ( log (u_i ) ) = 1 / u_i

and

du_i / dy_i = d / dy_i ( A_i + B_i * exp( y_i ) )
= 0 + B_i * exp( y_i )
= B_i * exp( y_i )

multiplying dx_i / du_i * du_i / dy_i, we get:

dx_i / dy_i = 1 / u_i * B_i * exp( y_i )

substituting back in for u, we get:

dx_i / dy_i = 1 / ( A_i + B_i * exp( y_i ) ) * B_i * exp( y_i )

we can rearrange this to get:

= B_i * exp( y_i ) / ( A_i + B_i * exp( y_i ) )
= exp ( y_i ) / (A_i / B_i + exp( y_i ) )

and then simplify A_i / B_i:

A_i / B_i = ( V_i / w_i ) / ( (1 - w_i ) / w_i )
= V_i / ( 1 - w_i )

substituting this back in, we at long last obtain:

dx_i / dy_i = exp ( y_i ) / ( V_i / ( 1 - w_i ) + exp( y_i ) )

The product of these across all i in 1:n is our determinant, but since for numerical stability we’re working with the log density of the target distribution, we add the sum of the logs.

log( exp( y_i ) / ( V_i / ( 1 - w_i ) + exp( y_i ) ) )

which simplifies:

y_i - log ( ( V_i / ( 1 - w_i ) + exp( y_i ) ) )

so our Jacobian adjustment is:

target += sum(y - log ( ( V ./ ( 1 - w ) + exp( y ) ) ) );

We’re not done yet, though! Recall that:

x_i = log( ( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i )

You’re seeing warnings about Stan trying to take the log of a negative number – this is because your y are allowed to take on values that allow for the inner part of the rhs above:

( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i

to be negative. So you need to put a lower bound on each y_i so that the expression above is always positive, enforcing

( V_i + ( 1 - w_i ) * exp( y_i ) ) / w_i > 0

We can do a bit more algebra here:

V_i + ( 1 - w_i ) * exp( y_i ) > 0    (multiplying both sides by w_i > 0)
( 1 - w_i ) * exp( y_i ) > -V_i
exp( y_i ) > -V_i / ( 1 - w_i )

So the lower bound of each y_i is

log( -V_i / ( 1 - w_i ) )

(or, where this bound doesn’t exist because V_i is positive, y_i simply has no lower bound)

You could put this in the transformed data block like this:

transformed data {
  vector[n] lb_y;
  for(i in 1:n){
    lb_y[i] = V[i] > 0 ? negative_infinity() : log(-V[i] / (1-w[i])); 
  }
}

and when declaring y as a parameter, write:

vector<lower=lb_y>[n] y;

Putting it all together, we get:

data {
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
}

transformed data {
  vector[n] lb_y;
  for(i in 1:n){
    lb_y[i] = V[i] > 0 ? negative_infinity() : log(-V[i] / (1-w[i])); 
  }
}

parameters {
  real mean_x;
  real mean_y;
  real<lower=0> sd_x;
  real<lower=0> sd_y;
  vector<lower=lb_y>[n] y;
}
transformed parameters {
  vector[n] x = log( ( V + ( 1 - w ) .* exp( y ) ) ./ w );
}

model {
  // priors on hyperparameters
  mean_x ~ std_normal();
  sd_x ~ std_normal();
  mean_y ~ std_normal();
  sd_y ~ std_normal();
  
  // centered distributions on latent variables
  x ~ normal(mean_x, sd_x);
  y ~ normal(mean_y, sd_y);

  // Jacobian adjustment
  target += sum(y - log ( ( V ./ ( 1 - w ) + exp( y ) ) ) );
}

which looks to reliably recover your mean_x, mean_y, sd_x, and sd_y when I run it. At n = 100 and adapt_delta = 0.9, Stan does flag a couple divergent iterations, so non-centering would help (will leave that to you :p), and increasing n to 1000 (to tighten up the target geometry) samples fine with no errors or warnings. The centered model with both n = 100 and n = 1000 fits on my old macbook in a few seconds for 2E3 iterations, and in < 1 minute with 1E4 iterations. So there are lots of avenues to improve things there.
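
(For reference, one possible way to non-center y while keeping the element-wise lower bounds is sketched below. This is an untested illustration on my part: it relies on Stan allowing bounds to be expressions involving previously declared parameters, keeps the same Jacobian term as above, and leaves the data and transformed data blocks unchanged.)

parameters {
  real mean_x;
  real mean_y;
  real<lower=0> sd_x;
  real<lower=0> sd_y;
  // standardized y; bound chosen so that y = mean_y + sd_y * y_raw satisfies y > lb_y
  vector<lower=(lb_y - mean_y) / sd_y>[n] y_raw;
}
transformed parameters {
  vector[n] y = mean_y + sd_y * y_raw;
  vector[n] x = log( ( V + ( 1 - w ) .* exp( y ) ) ./ w );
}
model {
  mean_x ~ std_normal();
  sd_x ~ std_normal();
  mean_y ~ std_normal();
  sd_y ~ std_normal();

  y_raw ~ std_normal(); // implies y ~ normal(mean_y, sd_y)
  x ~ normal(mean_x, sd_x);

  // same Jacobian adjustment as in the centered model
  target += sum(y - log ( ( V ./ ( 1 - w ) + exp( y ) ) ) );
}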


Thank you to everyone who answered. I found your answers very helpful. In particular: I realized I could do more with upper and lower bounds than I had thought; in addition, they can be \pm\infty! Also, I tried the model with \sigma_V (a “soft constraint”?) and it did work, although in the end I preferred an approach more similar to what I started with. The code of @NikVetr (which is also what @Bob_Carpenter suggested) worked nicely.

However, there is also a mystery! My code differed from that of @NikVetr in that instead of x I had a variable which was its exponential, and hence I gave it a lognormal distribution. So, I went back and added the lower bounds and, critically, added a suitable Jacobian adjustment to my code. The mystery is this: if one adds/removes the Jacobian adjustment to the normal model (@NikVetr’s model), then its estimates are much better with the adjustment than without (as expected). Weirdly, adding the appropriate Jacobian adjustment to the lognormal model (my code) worsens the estimates. This is not expected!

Below, I plot the differences between the MCMC mean and the true value of the parameters based on ten generated datasets with different seeds that used different \mu_x,\sigma_x,\mu_y,\sigma_y chosen at random according to the hyperparameters. As you can see, the lognormal comes up with much better estimates if the Jacobian adjustment is omitted! (In fact, there is a second mystery: the lognormal model without a Jacobian produces better estimates than the normal model! I.e., the best model by far is the lognormal model without a Jacobian adjustment.)

After the plot you can see the exact Stan code I used for this test. I include it, among other reasons, so you can see my Jacobian adjustment for the lognormal model. I also include the exact code I used to generate the data fed to Stan.

I’m happy to share my entire R script if someone wants to easily reproduce this experiment.

// LOGNORMAL MODEL
data {
  // hyperparameters
  real mean_mean_x, mean_sd_x, mean_mean_y, mean_sd_y;
  real<lower=0> sd_mean_x, sd_sd_x, sd_mean_y, sd_sd_y;

  // observed data
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
}
transformed data {
  vector[n] lb_y = log(fmax(-V./(1 - w), 0)); // note: log(0) = -Inf
}
parameters {
  real mean_x, mean_y;
  real<lower=0> sd_x, sd_y;
  vector<lower=lb_y>[n] y;
}
transformed parameters {
  vector<lower=0>[n] exp_x = (V + (1 - w).*exp(y))./w;
}
model {
  mean_x ~ normal(mean_mean_x, sd_mean_x);
  sd_x ~ normal(mean_sd_x, sd_sd_x);
  mean_y ~ normal(mean_mean_y, sd_mean_y);
  sd_y ~ normal(mean_sd_y, sd_sd_y);

  y ~ normal(mean_y, sd_y);
  exp_x ~ lognormal(mean_x, sd_x);
  target += sum(y); // Jacobian
}
// NORMAL MODEL
data {
  // hyperparameters
  real mean_mean_x, mean_sd_x, mean_mean_y, mean_sd_y;
  real<lower=0> sd_mean_x, sd_sd_x, sd_mean_y, sd_sd_y;

  // observed data
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
}
transformed data {
  vector[n] lb_y = log(fmax(-V./(1 - w), 0)); // note: log(0) = -Inf
}
parameters {
  real mean_x, mean_y;
  real<lower=0> sd_x, sd_y;
  vector<lower=lb_y>[n] y;
}
transformed parameters {
  vector[n] x = log( ( V + ( 1 - w ) .* exp( y ) ) ./ w );
}
model {
  mean_x ~ normal(mean_mean_x, sd_mean_x);
  sd_x ~ normal(mean_sd_x, sd_sd_x);
  mean_y ~ normal(mean_mean_y, sd_mean_y);
  sd_y ~ normal(mean_sd_y, sd_sd_y);

  x ~ normal(mean_x, sd_x);
  y ~ normal(mean_y, sd_y);

  target += sum(y - log ( ( V ./ ( 1 - w ) + exp( y ) ) ) ); // Jacobian
}
# R code used to generate each of the datasets
hyperparameters <- lst(
  mean_mean_x = 0,
  sd_mean_x = 1,
  mean_sd_x = 0,
  sd_sd_x = 0.5,
  mean_mean_y = 0,
  sd_mean_y = 1,
  mean_sd_y = 0,
  sd_sd_y = 0.5,
)

n <- 2000

  set.seed(seed) # seed = 1, ..., 10

  hyperparameters |>
    with(lst(
      mean_x = rnorm(1, mean = mean_mean_x, sd = sd_mean_x),
      sd_x = abs(rnorm(1, mean = mean_sd_x, sd = sd_sd_x)),
      mean_y = rnorm(1, mean = mean_mean_y, sd = sd_mean_y),
      sd_y = abs(rnorm(1, mean = mean_sd_y, sd = sd_sd_y)),
    )) ->
    parameters

  # Non-observable data
  parameters |>
    with(lst(
      x = rnorm(n, mean = mean_x, sd = sd_x),
      y = rnorm(n, mean = mean_y, sd = sd_y),
    )) ->
    latents

  # Observable data
  latents |>
    with(lst(
      w = runif(n),
      V = w*exp(x) - (1 - w)*exp(y),
    )) ->
    data

One further observation: the normal model with its Jacobian produces very similar results to the lognormal model also with its Jacobian. So, this is as expected.

Given this, the mystery is why the lognormal model without a Jacobian adjustment produces estimates that are so much closer to the true values than the other models.

@banbh hm, so the simplest explanation here is that I messed up somewhere 🤔 but looking back, the derivatives check out, and both Wolfram Alpha and R’s own:

stats::D(expression(log((V + (1 - w) * exp(y)) / w)), "y")

seem to agree.

if you parameterize both variables on the exponentiated scale, though, no Jacobian adjustment is necessary because the transformation is linear – the transformation being

exp_x = (V + (1 - w) .* exp_y) ./ w;

so the partial derivatives are all

(1-w_i) / w_i

which are constants. This seems to get the same fit as the exp_x ~ lognormal model, y ~ normal model, without the Jacobian adjustment, and is thus the “best” fit while also being theoretically correct afaict. So I’m confused here too!

I agree! I actually made the exact change you suggested (i.e., a model with exp_x, exp_y) and it produces very similar (and very good) results to the model you proposed but without the Jacobian adjustment. Also, I agree that your computation of the Jacobian adjustment seems correct.

In short, the mystery is that your model (“NORMAL MODEL” above) seems to require a Jacobian adjustment, yet it performs noticeably better without the adjustment. I also don’t understand this behavior.

@banbh so I think I figured out the problem, but don’t really have a clear solution. It resembles a problem I’d had a few years ago myself but wasn’t able to figure out. I think it stems from a failure to appropriately encode information about the data-generative process into the model – namely, the independence of variables x and y.

This leads the conditional distribution of x|V,y to not be independent of y, as we’ve only specified the marginal distributions of x and y, rather than their joint distribution. And even if we had, that wouldn’t itself enforce their independence in the target distribution (because specifying e.g. a multi-normal with an identity correlation matrix is equivalent to incrementing target by the univariate log densities). One hack might be to say the studentized sample correlation is t-distributed, or that the sample correlation matrix is some rescaled Wishart or whatever, but idk if that’s theoretically or empirically justified (and it would not work as written for scalar parameters).

I think the trouble is clearer with a simpler example, no Jacobians necessary. Suppose:

a ~ normal(0, 1)
b ~ normal(0, 1)
c = a + b

Implicitly,

c ~ normal(0, sqrt(2))

The goal is to use Stan to sample {a, b, c} and retrieve the joint distribution above while only specifying distributions on (or observing outright) one of a OR b (not both), along with c. So expressing c = a + b; as a transformed parameter and giving c a sampling statement c ~ ...; or [a, c] a sampling statement.

My earlier attempts at this were not successful, eg the first and obvious try, not enforcing independence between a and b:

parameters {
    real a;
    real b;
}
transformed parameters {
    real c = a + b;
}
model {
    // priors
    a ~ std_normal();
    c ~ normal(0,sqrt(2));
}

yields the correct (and specified) marginals on a and c, but not the desired marginal on b nor the desired joint.

I can even “cheat” and specify the implied joint on a and c… but without any constraint on b, this also does not produce the correct joint on {a, b, c}:

parameters {
    real a;
    real b;
}
transformed parameters {
    real c = a + b;
}
model {
    // priors
    matrix[2,2] Sigma = [[1, 1/sqrt(2)], [1/sqrt(2), 2]];
    [a, c] ~ multi_normal([0,0], Sigma);
}

If we observe a vector of length n of standard normal draws a and provide them as data (as in your initial example), we “miss” even harder!

data {
  int<lower=0> n;
  vector[n] a;
}
parameters {
  vector[n] b;
}
transformed parameters {
  vector[n] c = a + b;
}
model {
  // implicit marginal distribution if a ~ std_normal()
  c ~ normal(0,sqrt(2));
}

yields posterior means that “miss” the target even harder (plots omitted), and likewise with the ~ multi_normal() version.

Thank you @NikVetr, I believe you have identified a (or perhaps the) problem with my approach. I need to mull it over a bit, though. However, I wanted to comment that your “cheat” does, in fact, work if one uses the correct covariance matrix, namely:

matrix[2,2] Sigma = [[1, 1], [1, 2]];

With that Sigma I believe I get the correct joint on \{a,b,c\}. In particular, the correlation between a and b is about zero.

Very interesting and thank you!

@NikVetr, I’m not sure I understand how (or whether) your (very interesting!) comments apply to the models in this thread. As I understand it, the issue you raise is that statements like a ~ std_normal(); c ~ normal(0,sqrt(2)); (from your simple example above) may not encode the desired correlation structure. In this case a and c will be independent, while in the true generative model they are not (but a and b are independent). But I think in the models that started this thread (e.g., the “LOGNORMAL MODEL”) the sampling statements end up making the right variables independent. Below is a simple model in the style of your most recent models, but where we provide data and estimate a parameter. The idea is that we don’t observe a or b directly, but we do observe their sum. The model makes the vector a a parameter, then it can recover b (as b = c - a), and finally it says a ~ normal(0, 1); b ~ normal(0, sigma_b);, which I assume results in a and b being independent, which is indeed what the generative model does. The model seems to estimate sigma_b reasonably well. So, I guess I’m confused why this toy model does not have a problem with the joint distribution, but the earlier models (e.g., the LOGNORMAL MODEL) do.

Thanks for giving this so much thought!

n <- 1e3
sigma_b <- 1.5
set.seed(123)
a <- rnorm(n, sd = 1)
b <- rnorm(n, sd = sigma_b)
c <- a + b

'
data {
  int<lower=0> n;
  vector[n] c;
}
parameters {
  vector[n] a;
  real<lower=0> sigma_b;
}
transformed parameters {
  vector[n] b = c - a;
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, sigma_b);
}
' |> 
  write_stan_file() |> 
  cmdstan_model() ->
  model3

lst(n, c) |> 
  model3$sample(refresh = 1e3, iter_sampling = 1e4) ->
  mcmc3

# Ok estimate, but true value is outside (q5,q95).
tibble(sigma_b) |> 
  pivot_longer(everything(), names_to = 'variable') |> 
  full_join(mcmc3$summary(variables = 'sigma_b'), join_by(variable))

@banbh oh whoops! hah, that’s right, I had misspecified the covariance matrix. Good catch! That correction does seem to fix things!

I also realized another conceptual error on my end – namely (in the toy example w/ a, b, and c), it is not the posterior means of the bs that we expect to have 1/sqrt(2) correlation with the cs, but rather the posterior cor(b, c) that we expect to be centered on 1/sqrt(2). So the scatterplots and marginal histograms above are looking at the wrong things, and we do in fact expect the posterior means of each b_i to line up nicely with each c_i and a_i.

As such, I think the model below samples correctly:

data {
  int<lower=0> n;
  vector[n] a;
}
parameters {
  vector[n] b;
}
transformed parameters {
  vector[n] c = a + b;
  array[n] row_vector[2] ac;
  for(i in 1:n){
    ac[i] = [a[i], c[i]];
  }
}
model {
  matrix[2,2] Sigma = [[1, 1], [1, 2]];
  ac ~ multi_normal([0,0], Sigma);
}

and it looks to produce equivalent samples to when we reparameterize the bivariate normal as two standard normals:

data {
  int<lower=0> n;
  vector[n] a;
}
transformed data {
  matrix[2,2] sigma = [[1, 1], [1, 2]];
  matrix[2,2] sigma_chol_inv = [[1, -1], [0, 1]];
}
parameters {
  vector[n] b;
}
transformed parameters {
  vector[n] c = a + b;
  matrix[n, 2] ac;
  ac[,1] = a;
  ac[,2] = c;
  matrix[n, 2] ac_norm = ac * sigma_chol_inv;
}
model {
  ac_norm[,1] ~ std_normal();
  ac_norm[,2] ~ std_normal();
}

with the distribution of correlations between a and each sampled b being about what you’d expect for the sampling distribution of the correlation coefficient between two independent standard normals.

so, uhh, scratch what I had written in the previous message! I think the independence constraint is appropriately encoded into the target density. Sorry for the confusion! 🙃

I’m curious to get to the bottom of the earlier question, though, pertaining to the “goodness” of the model with the Jacobian adjustment vs the one without. One thing to try out would be to check calibration under the prior – if you resample mean_x, sd_x, mean_y, and sd_y from their priors, and then sample x and y conditional on those values, sample w, and transform to V, and then fit the models, do we see appropriate coverage properties in the posterior?

(e.g., if we repeat this procedure a bunch of times, is the distribution of each true hyperparameter value’s quantile in its corresponding marginal posterior uniform? Averaged across replicate prior samples, does the {1, 2, 3… 98, 99}% credible interval cover the true parameter value {1, 2, 3… 98, 99}% of the time, falling along a 1-to-1 line?)

That’ll be a more definitive validation of the implementation of the model than checking if for fixed

mean_x <- 1
sd_x <- 0.2
mean_y <- 2
sd_y <- 0.4

the recovered posterior mean is closer (since a misspecified model can just happen to be favorable to those true parameters by chance, though I’d not have expected it to be so strongly favorable!)

I can have a go at it later this week, but you might want to try it out yourself first!

@banbh ah it looks like you had written another response. I think you’re right in that it is sampling things appropriately – I’d messed up the order of some non-commutative operations. Well that’s one mystery solved from my past haha. So I don’t think the independence thing is actually influencing things here.

@NikVetr, thanks for your further comments. The “goodness” check you propose (when you say “you might want to try it out yourself first”) is a good idea. In an earlier post in this thread I did a version of what you propose (see my code that starts # R code used to generate each of the datasets and includes the line set.seed(seed) # seed = 1, ..., 10); however, as you can see, I only did 10 iterations. So, if I understand your suggestion, your proposal is to iterate over say 100 (or more) seeds so we can see what the coverage properties are; is that right? This should be easy, although it may take several hours to run.

FWIW, increasingly I wonder if the explanation for the mysterious accuracy of the model without a Jacobian adjustment is just “luck” (and this coverage test should answer whether this hunch is correct).

Since the “goodness” check with the above model will take many hours to complete, I wanted to check I understand your suggestion. Does the code below carry out the same idea for a much simpler model? (See chart after the code. Even this code took 2h to run.)

library(tidyverse)
library(cmdstanr)
options(mc.cores = parallel::detectCores()) # So we don't need `parallel_chains = 4`

'
data {
  // hyperparameters
  real mu_sigma;
  real<lower=0> sigma_sigma;
  // data
  int<lower=0> n;
  vector[n] x;
}
parameters {
  real<lower=0> sigma_x;
}
model {
  sigma_x ~ normal(mu_sigma, sigma_sigma);
  x ~ normal(0, sigma_x);
}
' |> 
  write_stan_file() |> 
  cmdstan_model() ->
  model

hyperparameters <- lst(
  mu_sigma = 1,
  sigma_sigma = 0.5,
)

set.seed(123)

tibble(id = 1:1000) |>
  mutate(p = map_dbl(id, \(x) {
    hyperparameters |> 
      with(lst(
        sigma_x = abs(rnorm(1, mean = mu_sigma, sd = sigma_sigma))
      )) ->
      parameters
    
    parameters |> 
      with(lst(
        n = 100,
        x = rnorm(n, mean = 0, sd = sigma_x)
      )) ->
      data
    
    c(hyperparameters, data) |> 
      model$sample(show_exceptions = FALSE, show_messages = FALSE) ->
      mcmc
    
    mean(mcmc$draws(variables = 'sigma_x') < parameters$sigma_x)
  },
  .progress = TRUE)) ->
  Results

Results |> 
  ggplot(aes(p)) +
  geom_histogram(breaks = seq(0, 1, len = 20))

@banbh yep! Though over all the parameters to be estimated, not just one, and constructing separate plots for each (set of) parameter(s) (in this case the model only has one parameter, but the model of interest has the means, sds, and both x and y).

to get the curve perspective you can do something like:

p <- rbeta(1E4, 1, 1) #quantiles of true value
wp <- abs(0.5 - p) * 2 #minimum width of middle X% credible interval to cover true value
ws <- 1:99/100 #width of credible intervals to plot
coverage_props <- rowMeans(outer(ws, wp, ">")) #coverage proportions

which would look something like:

par(mfrow = c(2,1), mar = c(5,5,1,2))
hist(p, freq = F, main = "", xlab = "quantile of true value in marginal posterior")
abline(h = 1, lty=2, lwd=2, col = 2)
plot(ws, coverage_props, type = "l", lwd = 2,
     xlab = "width of (middle) confidence interval",
     ylab = "coverage frequency")
abline(0, 1, lty=2, lwd=2, col = 2)

You can formally “test” these “frequentist properties” in lots of ways (e.g. a 1-sample KS test), but chi-by-eye-ing the histogram and curve is usually enough to satisfy me. I find the histogram to be a bit more sensitive personally :] but the two views are complementary.

I did a run with 200 repetitions of choosing parameters according to the hyperparameters, then MCMC sampling, and then determining the quantile of the true value in the Stan draws. I checked the lognormal model (as defined earlier in this thread) with and without a Jacobian adjustment. At the end of this post I include my testing script so my results should be easy to reproduce (although it did take an hour and a half on my machine).

Summary

Based on the smallish number of repetitions it seems that the model without a Jacobian adjustment is behaving well (and hence may be correct) while the model with a Jacobian adjustment has lots of problems and does not produce good estimates.

Below are some plots. I show histograms and QQ-plots (as opposed to the coverage plot that @NikVetr suggested).

Model Without a Jacobian Adjustment


Model With a Jacobian Adjustment


library(tidyverse)
library(cmdstanr)
options('mc.cores' = parallel::detectCores()) # So we don't need `parallel_chains = 4`
library(qqplotr)

# Test the lognormal model only
model_source <- '
data {
  // hyperparameters
  real mean_mean_x, mean_sd_x, mean_mean_y, mean_sd_y;
  real<lower=0> sd_mean_x, sd_sd_x, sd_mean_y, sd_sd_y;

  // observed data
  int<lower=0> n;
  vector<lower=0,upper=1>[n] w;
  vector[n] V;
}
transformed data {
  vector[n] lb_y = log(fmax(-V./(1 - w), 0)); // note: log(0) = -Inf
}
parameters {
  real mean_x, mean_y;
  real<lower=0> sd_x, sd_y;
  vector<lower=lb_y>[n] y;
}
transformed parameters {
  vector<lower=0>[n] exp_x = (V + (1 - w).*exp(y))./w;
}
model {
  mean_x ~ normal(mean_mean_x, sd_mean_x);
  sd_x ~ normal(mean_sd_x, sd_sd_x);
  mean_y ~ normal(mean_mean_y, sd_mean_y);
  sd_y ~ normal(mean_sd_y, sd_sd_y);

  y ~ normal(mean_y, sd_y);
  exp_x ~ lognormal(mean_x, sd_x);
  target += sum(y); // Jacobian
}
'

tibble(has_jacobian = c(FALSE, TRUE)) |>
  rowwise() |>
  mutate(model = {
    src <- if (has_jacobian) {
      model_source

    } else {
      model_source |> str_remove('\n.*// Jacobian\n')
    }

    src |>
      write_stan_file() |>
      cmdstan_model() ->
      stan_model

    stopifnot(sum(str_detect(stan_model$code(), 'target')) == as.integer(has_jacobian))

    list(stan_model)
  }) |>
  ungroup() ->
  StanModels

hyperparameters <- lst(
  mean_mean_x = 0,
  sd_mean_x = 1,
  mean_sd_x = 0,
  sd_sd_x = 0.5,
  mean_mean_y = 0,
  sd_mean_y = 1,
  mean_sd_y = 0,
  sd_sd_y = 0.5,
)

n <- 1000

set.seed(123)
results <- list()
for (id in 1:200) {
  cat('id:', id, '\n')

  hyperparameters |>
    with(lst(
      mean_x = rnorm(1, mean = mean_mean_x, sd = sd_mean_x),
      sd_x = abs(rnorm(1, mean = mean_sd_x, sd = sd_sd_x)),
      mean_y = rnorm(1, mean = mean_mean_y, sd = sd_mean_y),
      sd_y = abs(rnorm(1, mean = mean_sd_y, sd = sd_sd_y)),
    )) ->
    parameters

  parameters |>
    as_tibble() |>
    pivot_longer(everything()) ->
    Parameters

  # Non-observable data
  parameters |>
    with(lst(
      x = rnorm(n, mean = mean_x, sd = sd_x),
      y = rnorm(n, mean = mean_y, sd = sd_y),
    )) ->
    latents

  # Observable data
  latents |>
    with(lst(
      w = runif(n),
      V = w*exp(x) - (1 - w)*exp(y),
    )) ->
    data

  for (model_ix in seq_len(nrow(StanModels))) {
    StanModels |>
      slice(model_ix) ->
      StanModel

    cat('has_jacobian:', StanModel$has_jacobian, '\n')

    system.time(
      c(hyperparameters, lst(n), data) |>
        StanModel$model[[1]]$sample(show_messages = FALSE, show_exceptions = FALSE) ->
        mcmc
    ) |>
      pluck('elapsed') ->
      time_sampling

    system.time(
      X <- mcmc$draws(variables = names(parameters))
    ) |>
      pluck('elapsed') ->
      time_draws

    tibble(variable = names(parameters)) |>
      mutate(id, time_sampling, time_draws,
             has_jacobian = StanModel$has_jacobian,
             p = map_dbl(variable, \(v) mean(X[,,v] < parameters[[v]]))) ->
      results[[length(results) + 1]]
  }
}

results |>
  bind_rows() ->
  Results

Results |>
  summarise(time_sampling = unique(time_sampling),
            time_draws = unique(time_draws),
            .by = c(id, has_jacobian)) |>
  summarise(time = sum(time_sampling) + sum(time_draws)) |>
  pull() |>
  prettyunits::pretty_sec() # 1h 27m

num_bins <- 20
breaks <- seq(0,1,len=num_bins + 1)
num_samples <- max(Results$id)
conf_int <- c(5, 95)/100

caption <- str_glue('n={n}',
                    'num_samples={num_samples}',
                    'conf_int={toString(conf_int*100)}%',
                    .sep = '; ')

Results |>
  separate(variable, into = c('type','model_var')) |>
  filter(has_jacobian) |>
  ggplot(aes(p)) +
  facet_grid(type ~ model_var) +
  geom_histogram(breaks = breaks) +
  geom_hline(yintercept = num_samples/num_bins, col = 2, linetype = 'dashed') +
  geom_hline(yintercept = qbinom(conf_int, size = num_samples, prob = 1/num_bins),
             col = 2, linetype = 'dotted') +
  labs(title = 'With Jacobian Adjustment', caption = caption)

Results |>
  separate(variable, into = c('type','model_var')) |>
  filter(!has_jacobian) |>
  ggplot(aes(p)) +
  facet_grid(type ~ model_var) +
  geom_histogram(breaks = breaks) +
  geom_hline(yintercept = num_samples/num_bins, col = 2, linetype = 'dashed') +
  geom_hline(yintercept = qbinom(conf_int, size = num_samples, prob = 1/num_bins),
             col = 2, linetype = 'dotted') +
  labs(title = 'Without Jacobian Adjustment', caption = caption)

Results |>
  separate(variable, into = c('type','model_var')) |>
  filter(!has_jacobian) |>
  ggplot(aes(sample = p)) +
  facet_grid(type ~ model_var) +
  geom_qq_band(distribution = 'unif') +
  stat_qq_point(distribution = 'unif',  size = .1) +
  stat_qq_line(distribution = 'unif', color = 2) +
  labs(title = 'Without Jacobian Adjustment', caption = caption)

Results |>
  separate(variable, into = c('type','model_var')) |>
  filter(has_jacobian) |>
  ggplot(aes(sample = p)) +
  facet_grid(type ~ model_var) +
  geom_qq_band(distribution = 'unif') +
  stat_qq_point(distribution = 'unif',  size = .1) +
  stat_qq_line(distribution = 'unif', color = 2) +
  labs(title = 'With Jacobian Adjustment', caption = caption)

I repeated the above experiment with 1,000 simulations (rather than just 200). It took seven and a half hours. Below are the charts. The conclusions remain the same:

  • Without a Jacobian Adjustment: Seems compatible with the model being correct
  • With a Jacobian Adjustment: Either the model is incorrect, or it is correct but hard for the Stan sampler to sample from

The reason I am hedging my bets above is that it’s possible the Stan model without a Jacobian adjustment is not exactly right, or perhaps “accidentally right” (in the sense of having two offsetting errors). Also note that many of the simulations for both models issued warnings (“chains had an E-BFMI less than 0.2”). However, in spite of the above caveats, for me the mystery remains: the model seems to require a Jacobian adjustment, but the model without the adjustment seems to behave as if it is the correct model.

Without a Jacobian Adjustment


With a Jacobian Adjustment