Biased estimate from model with latent variables

Below I create two similar Stan models; the first seems to yield a biased parameter estimate, while the second does not. I would like to understand what is wrong with the first model.

Consider the following data-generating process.

set.seed(123)

parameters <- lst(
  sigma = abs(rnorm(1)),
)

latents <- with(parameters, lst(
  n = 2e3,
  R1 = rlnorm(n, sdlog = sigma),
  d = runif(n),
  R2 = rlnorm(n, sdlog = sigma),
))

data <- with(latents, lst(
  n,
  D1 = d*R1,
  D2 = (R1 - D1)*R2, # also = (1-d)*R1*R2
))

where I am interested in estimating the parameter \sigma in parameters, and data contains the data visible to the model. (What makes it interesting is that it combines a multiplicative process, involving R_1 and R_2, with an arithmetic process, involving d.)

If a Stan model includes d in the parameters block then one can solve for R_1 and R_2 and hence derive likelihood statements. It turns out one can solve for these latent variables in a few different ways, and some ways seem to result in biased estimates.

For R_1 there seems to be only one way to proceed. D_1/d = R_1 \sim \mathcal{LN}(0,\sigma) (where X\sim\mathcal{LN}(m, s) means \log X has mean m and standard deviation s). Hence

\boxed{ D_1 \sim \mathcal{LN}(\log d,\, \sigma). }

For R_2 there seem to be a few ways to proceed. One way is to note that D_2/(1-d)=R_1R_2 \sim \mathcal{LN}(0, \sqrt 2 \sigma) hence

\boxed{ D_2 \sim \mathcal{LN}(\log(1-d), \, \sqrt2 \sigma). }

This results in the following Stan code.

data {
  int<lower=0> n;
  vector<lower=0>[n] D1, D2;
}
parameters {
  real<lower=0> sigma;
  vector<lower=0, upper=1>[n] d;
}
model {
  sigma ~ normal(0, 1);
  D1 ~ lognormal(log(d), sigma);
  D2 ~ lognormal(log1m(d), sqrt(2)*sigma); // version 1
}
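
As a quick sanity check of the two boxed statements above, one can simulate the data-generating process with d held fixed and compare the empirical log-moments of D_1 and D_2 against the claimed lognormal parameters. A minimal sketch (the values of sigma and d are arbitrary, chosen only for illustration):

# sanity check of the two boxed statements, with arbitrary illustrative values
sigma <- 0.7
d     <- 0.3
R1 <- rlnorm(1e6, sdlog = sigma)
R2 <- rlnorm(1e6, sdlog = sigma)
D1 <- d*R1
D2 <- (1 - d)*R1*R2
c(mean(log(D1)), log(d))        # both approximately log(d)
c(sd(log(D1)), sigma)           # both approximately sigma
c(mean(log(D2)), log(1 - d))    # both approximately log(1 - d)
c(sd(log(D2)), sqrt(2)*sigma)   # both approximately sqrt(2)*sigma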

A variation on the above approach is to note that D_2/(D_1/d - D_1)=R_2 \sim \mathcal{LN}(0,\sigma) hence

\boxed{ D_2 \sim \mathcal{LN}(\log D_1 + \log(1/d-1),\, \sigma). }

This results in a Stan model similar to the previous one but where the line ending in // version 1 is replaced with

  D2 ~ lognormal(log(D1) + log(1/d - 1), sigma); // version 2
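
The version 2 statement can be checked in the same way. Continuing the simulation sketch above (reusing the simulated R1, R2, D1, D2 and the illustrative d and sigma), conditional on D_1 and d the quantity \log D_2 - \log D_1 - \log(1/d - 1) should be \mathcal{N}(0, \sigma):

# check of the version-2 statement, reusing the simulated values from above
z <- log(D2) - log(D1) - log(1/d - 1)
c(mean(z), 0)     # approximately 0
c(sd(z), sigma)   # approximately sigma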

Sampling from both these models results in the following plot. The vertical lines include the true value of \sigma as well as a few classical estimates of it, for comparison purposes. It seems to me that in this example (and in other examples I have tried by varying the seed and n) version 1 yields an estimate of \sigma that is biased high. Hence my question is:

Is there something wrong with Stan model version 1? And, if so, what is it?

I include my R code in its entirety at the end of this post in case anyone wants to quickly copy, paste, and run it.

R code

library(tidyverse)
library(cmdstanr)
options(mc.cores = parallel::detectCores())

# Data generating process ####
set.seed(123)

parameters <- lst(
  sigma = abs(rnorm(1)),
)

latents <- with(parameters, lst(
  n = 2e3,
  R1 = rlnorm(n, sdlog = sigma),
  d = runif(n),
  R2 = rlnorm(n, sdlog = sigma),
))

data <- with(latents, lst(
  n,
  D1 = d*R1,
  D2 = (R1 - D1)*R2,
))

# MCMC ####
create_and_sample <- function(parameter_code, model_code) {
  model_code <- str_flatten(model_code, collapse = ';')
  
  '
data {
  int<lower=0> n;
  vector<lower=0>[n] D1, D2;
} parameters {
  real<lower=0> sigma;
  {{parameter_code}};
} model {
  sigma ~ normal(0, 1);
  {{model_code}};
}' |>
    str_glue(.open = '{{', .close = '}}') |> 
    write_stan_file() |> 
    cmdstan_model() ->
    model
  
  data |> 
    model$sample(refresh = 0) ->
    mcmc
  
  mcmc$draws(variables = names(parameters), format = 'df') |> 
    as_tibble()
}

bind_rows(
  `version 1` = create_and_sample(parameter_code = 'vector<lower=0, upper=1>[n] d', 
                             model_code = c('D1 ~ lognormal(log(d), sigma)',
                                            'D2 ~ lognormal(log1m(d), sqrt(2)*sigma)')),
  `version 2` = create_and_sample(parameter_code = 'vector<lower=0, upper=1>[n] d', 
                             model_code = c('D1 ~ lognormal(log(d), sigma)',
                                            # 'D2 ~ lognormal(log1m(d) - log(d) + log(D1), sigma)'
                                            'D2 ~ lognormal(log(D1) + log(1/d - 1), sigma)'
                                            )),
  .id = 'model') ->
  AllDraws

# Plot ####
AllDraws |> 
  ggplot(aes(sigma, after_stat(ndensity))) +
  facet_grid(rows = vars(str_c('D2 likelihood: ', model))) +
  geom_density(fill = 'grey', color = NA) +
  geom_vline(aes(xintercept = value, color = name), 
             data = tribble(~name, ~value,
                            'true', parameters$sigma,
                            'cheat1', sqrt(mean( log(latents$R1)^2 )),
                            'cheat2', sqrt(mean( log(latents$R2)^2 )),
                            'classical1', sqrt(mean( log(data$D1)^2 ) - 2),
                            'classical2', sqrt(mean( log(data$D2)^2 )/2 - 1),
             )) +
  labs(caption = str_glue('n = {data$n}, iter_sampling = {iter_sampling}',
                          iter_sampling = AllDraws |> 
                            summarise(n = n(), .by = c(model, .chain)) |> 
                            summarise(toString(unique(n))) |> 
                            pull())) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_blank(), 
        axis.ticks.y = element_blank())

It looks like you slipped a sign. If D_1 / d \sim \textrm{lognormal}(0, \sigma), then D_1 \sim \textrm{lognormal}(-\log d, \sigma).

My reasoning (as mathematical statements, as opposed to as snippets of Stan code) was that

\begin{alignat}{2}
&& D_1/d &\sim \mathcal{LN}(0,\sigma) \\
&\iff\qquad & \log(D_1/d) &\sim \mathcal{N}(0,\sigma) \\
&\iff\qquad & \log D_1 - \log d &\sim \mathcal{N}(0,\sigma) \\
&\iff\qquad & \log D_1 &\sim \mathcal{N}(\log d,\sigma) \\
&\iff\qquad & D_1 &\sim \mathcal{LN}(\log d,\sigma).
\end{alignat}

So I believe what I wrote was correct.

However, I did want to make a simple observation about my two models. One can check that the two likelihood functions implied by the two D2 ~ ... statements are numerically different (*). Hence, they define different probabilistic models. So both can’t be right! (And I assume version 1 is wrong and version 2 is right, but I want to think about it a bit more.)

(*) In fact, if n=1 then the model has two parameters: sigma and d[1], so one can plot the joint probability density as a heatmap. If one subtracts the densities from the two models one gets the plot below, which I thought was somewhat pretty.
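
One way to see the numerical difference without plotting is to evaluate the two implied log-likelihood contributions at a single point. A minimal sketch in R (the particular numbers are arbitrary, chosen only for illustration):

# evaluate both implied log-likelihoods at one arbitrary point
D1 <- 0.4; D2 <- 1.3; d <- 0.25; sigma <- 0.7
ll1 <- dlnorm(D1, log(d), sigma, log = TRUE) +
  dlnorm(D2, log(1 - d), sqrt(2)*sigma, log = TRUE)        # version 1
ll2 <- dlnorm(D1, log(d), sigma, log = TRUE) +
  dlnorm(D2, log(D1) + log(1/d - 1), sigma, log = TRUE)    # version 2
c(ll1, ll2)  # generically different values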

I think I have a partial explanation of what is going on here. The question is what is wrong with “version 1” of the model. The answer, I believe, is that in the data observed by the model, D1 and D2 are correlated, but the model is unaware of this fact.

Concretely, version 1 of the model says:

  D1 ~ lognormal(log(d), sigma);
  D2 ~ lognormal(log1m(d), sqrt(2)*sigma); // version 1

which implicitly says that, conditional on d, D1 and D2 are independent. However, they are not, since D1 = d*R1 and D2 = (1-d)*R1*R2, so they share a dependence on R1 of which the model is unaware. At the end of the post I include R code for a slightly simplified version of this issue that allows one to explore what happens as one changes the correlation (while leaving the model oblivious to this fact). As you can see in the plot below, when the data are uncorrelated (rho = 0) the estimate seems reasonable (the black vertical line is the true value of \sigma). However, as soon as their correlation is even 0.25 the estimates are quite biased.
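
A quick way to see this dependence is to simulate D1 and D2 with d held fixed and look at the correlation of their logs; conditional on d, the correlation works out to 1/\sqrt{2} \approx 0.71 rather than zero. A minimal sketch (the values of sigma and d are arbitrary, for illustration only):

# conditional on d, log(D1) and log(D2) are correlated because both contain log(R1)
sigma <- 0.7
d     <- 0.3
R1 <- rlnorm(1e6, sdlog = sigma)
R2 <- rlnorm(1e6, sdlog = sigma)
cor(log(d*R1), log((1 - d)*R1*R2))  # approximately 1/sqrt(2), not 0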

This does raise one further question: Why does unmodeled correlation in a model like this result in biased estimates?

R Code

library(tidyverse)
library(cmdstanr)
options(mc.cores = parallel::detectCores())

'
data {
  int<lower=0> n;
  vector[n] y1, y2;
}
parameters {
  real<lower=0> sigma;
  vector<lower=0, upper=1>[n] u;
}
model {
  // prior
  sigma ~ normal(0, 1);
  
  // likelihood
  y1 ~ normal(log(u), sigma);
  y2 ~ normal(log1m(u), sigma);
}
' |>
  write_stan_file() |> 
  cmdstan_model() ->
  model

set.seed(123)
sigma <- abs(rnorm(1))
n <- 1e3
x1 <- rnorm(n, sd = sigma)
x2 <- rnorm(n, sd = sigma)
u <- runif(n)
y1 <- log(u) + x1

tibble(rho = c(-0.25, 0, 0.25, 0.5, 1)) |> 
  rowwise() |> 
  reframe(rho, {
    x2_corr <- rho*x1 + sqrt(1 - rho^2)*x2 # has correlation rho to x1

    lst(n, y1, y2 = log(1 - u) + x2_corr) |> 
      model$sample(refresh = 0) ->
      mcmc
    
    mcmc$draws(variables = 'sigma', format = 'df')
  }) ->
  Draws

Draws |> 
  ggplot(aes(sigma)) +
  facet_grid(vars(rho), labeller = label_both) +
  geom_density(fill = 'grey') +
  geom_vline(xintercept = sigma) +
  labs(caption = str_glue('n = {n}'))

Hmm. Your reasoning looks right, but it doesn’t agree with Wikipedia or with simulation, unless I have something backward (easy to do with all this sign-based algebra). The general rule is that if

X \sim \textrm{lognormal}(\mu, \sigma),

then

c \cdot X \sim \textrm{lognormal}(\mu + \log c, \sigma).

Here your c is 1/d, so \log c = - \log d.

Assuming d is positive, this makes sense, as it will make the values sampled smaller.

I verified with R in case the Wikipedia page on lognormal was wrong:

> y = rlnorm(1e6, 0, 1)
> y2 = y / 10
> y3 = rlnorm(1e6, -log(10), 1)
> mean(y2)
[1] 0.1646022
> mean(y3)
[1] 0.1648226
> sd(y2)
[1] 0.2157505
> sd(y3)
[1] 0.2159368

So it looks like if y \sim \textrm{lognormal}(0, 1), then y /10 \sim \textrm{lognormal}(-\log 10, 1).

I actually agree with your (@Bob_Carpenter) latest post. As you say,

X \sim \text{lognormal}(\mu,\sigma) \iff \frac{X}{d} \sim \mathrm{lognormal}(\mu - \log d, \sigma).

Then if we set \mu=\log d we get

X \sim \text{lognormal}(\log d,\sigma) \iff \frac{X}{d} \sim \mathrm{lognormal}(0, \sigma)

which is what I was arguing in my penultimate post, namely that D_1/d\sim\mathcal{LN}(0,\sigma) \iff \cdots \iff D_1\sim\mathcal{LN}(\log d,\sigma).
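
The same kind of simulation check as above, but with a scaling constant below 1 (an arbitrary d chosen for illustration), confirms this direction of the equivalence:

# if X ~ lognormal(log d, 1) then X/d ~ lognormal(0, 1)
d <- 0.3
x <- rlnorm(1e6, meanlog = log(d), sdlog = 1)
c(mean(log(x/d)), sd(log(x/d)))  # approximately 0 and 1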

***

Aside from the above, I’m hoping to write up a post on “Why does unmodeled correlation in a model like this result in biased estimates?”. I believe I have a decent explanation of what’s going on and I think it’s somewhat interesting.

One of the nice things about Bayesian inference is that if the model matches the data-generating process, Bayesian posterior mean estimates are unbiased (averaged over parameter draws from the prior). This is pretty easy to see if you work through the simulation-based calibration argument.
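
A minimal illustration of that calibration argument, using a conjugate normal-normal toy model where the posterior mean is available in closed form (this is just a sketch, not the latent-variable model from this thread): averaging the posterior-mean error over parameter draws from the prior gives approximately zero.

# theta ~ N(0, 1); y_i | theta ~ N(theta, 1); posterior mean = sum(y) / (n + 1)
set.seed(1)
n <- 10
err <- replicate(1e5, {
  theta <- rnorm(1)               # draw the parameter from its prior
  y     <- rnorm(n, mean = theta) # simulate data given theta
  sum(y) / (n + 1) - theta        # posterior-mean error for this draw
})
mean(err)  # approximately 0: no systematic bias when averaged over the prior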