Below I create two similar Stan models, the first of which seems to yield a biased parameter estimate, and the second does not. I would like to understand what is wrong with the first model.
Consider the following data-generating process.
set.seed(123)
parameters <- lst(
  sigma = abs(rnorm(1)),
)
latents <- with(parameters, lst(
  n = 2e3,
  R1 = rlnorm(n, sdlog = sigma),
  d = runif(n),
  R2 = rlnorm(n, sdlog = sigma),
))
data <- with(latents, lst(
  n,
  D1 = d*R1,
  D2 = (R1 - D1)*R2, # also = (1-d)*R1*R2
))
where I am interested in estimating the parameter \sigma in parameters, and data contains the data visible to the model. (What makes this interesting is that it combines a multiplicative process, involving R_1 and R_2, with an arithmetic process, involving d.)
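For example (with illustrative numbers of my own choosing): if R_1 = 2, d = 0.3, and R_2 = 1.5, then D_1 = 0.3 \times 2 = 0.6 and D_2 = (2 - 0.6) \times 1.5 = 2.1.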
If a Stan model includes d in the parameters block, then one can solve for R_1 and R_2 and hence derive likelihood statements. It turns out that these latent variables can be solved for in a few different ways, and some of those ways seem to result in biased estimates.
For R_1 there seems to be only one way to proceed: D_1/d = R_1 \sim \mathcal{LN}(0,\sigma) (where X\sim\mathcal{LN}(m, s) means \log X has mean m and standard deviation s). Hence
\boxed{ D_1 \sim \mathcal{LN}(\log d,\, \sigma). }
For R_2 there seem to be a few ways to proceed. One way is to note that D_2/(1-d)=R_1R_2 \sim \mathcal{LN}(0, \sqrt 2\, \sigma), since \log R_1 + \log R_2 is the sum of two independent \mathcal{N}(0,\sigma^2) variables. Hence
\boxed{ D_2 \sim \mathcal{LN}(\log(1-d),\, \sqrt 2\, \sigma). }
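As a quick sanity check of these two boxed claims (my addition, using the simulated latents, which the Stan models of course cannot see):
# Sanity check (not from the original post): both tests should come back
# unremarkable if the boxed distributional claims are right.
ks.test(log(data$D1 / latents$d), 'pnorm', 0, parameters$sigma)                # R_1 claim
ks.test(log(data$D2 / (1 - latents$d)), 'pnorm', 0, sqrt(2)*parameters$sigma)  # R_1*R_2 claim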
These two statements result in the following Stan code.
data {
  int<lower=0> n;
  vector<lower=0>[n] D1, D2;
}
parameters {
  real<lower=0> sigma;
  vector<lower=0, upper=1>[n] d; // implicit uniform(0,1) prior from the bounds, matching d ~ runif(n)
}
model {
  sigma ~ normal(0, 1);
  D1 ~ lognormal(log(d), sigma);
  D2 ~ lognormal(log1m(d), sqrt(2)*sigma); // version 1
}
A variation on the above approach is to note that D_1/d - D_1 = R_1 - D_1, so that D_2/(D_1/d - D_1)=R_2 \sim \mathcal{LN}(0,\sigma), hence
\boxed{ D_2 \sim \mathcal{LN}(\log D_1 + \log(1/d-1),\, \sigma). }
This results in a Stan model identical to the previous one, except that the line ending in // version 1 is replaced with
D2 ~ lognormal(log(D1) + log(1/d - 1), sigma); // version 2
(Equivalently, the location can be written as log1m(d) - log(d) + log(D1), which appears as a commented-out alternative in the full code below.)
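The same kind of sanity check applies to this parameterisation (again my addition):
# Sanity check for version 2's location: D1*(1/d - 1) = R1 - D1, so the ratio recovers R_2.
ks.test(log(data$D2 / (data$D1 * (1/latents$d - 1))), 'pnorm', 0, parameters$sigma)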
Sampling from both of these models results in the following plot. The vertical lines include the true value of \sigma as well as a few classical estimates of it, for comparison purposes. In this example (and in other examples I have tried by varying the seed and n), version 1 seems to yield an estimate of \sigma that is biased high. Hence my question:
Is there something wrong with Stan model version 1? And, if so, what is it?
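For reference, the classical estimates drawn in the plot come from a method-of-moments argument (spelled out here by me; the plot code below uses the resulting formulas without derivation): \log D_1 = \log d + \log R_1 with the two terms independent and \mathbb{E}[\log R_1] = 0, and \mathbb{E}[\log(d)^2] = \int_0^1 (\log x)^2\,dx = 2 for d \sim U(0,1), so \mathbb{E}[\log(D_1)^2] = 2 + \sigma^2; similarly \log D_2 = \log(1-d) + \log R_1 + \log R_2 gives \mathbb{E}[\log(D_2)^2] = 2 + 2\sigma^2. Inverting these gives the estimators used below:
# Method-of-moments estimators of sigma, as used for the plot's vertical lines:
sqrt(mean(log(data$D1)^2) - 2)     # 'classical1': from E[log(D1)^2] = 2 + sigma^2
sqrt(mean(log(data$D2)^2)/2 - 1)   # 'classical2': from E[log(D2)^2] = 2 + 2*sigma^2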
I include my R code in its entirety at the end of this post in case anyone wants to quickly copy, paste, and run it.
R code
library(tidyverse)
library(cmdstanr)
options(mc.cores = parallel::detectCores())

# Data generating process ####
set.seed(123)
parameters <- lst(
  sigma = abs(rnorm(1)),
)
latents <- with(parameters, lst(
  n = 2e3,
  R1 = rlnorm(n, sdlog = sigma),
  d = runif(n),
  R2 = rlnorm(n, sdlog = sigma),
))
data <- with(latents, lst(
  n,
  D1 = d*R1,
  D2 = (R1 - D1)*R2,
))

# MCMC ####
create_and_sample <- function(parameter_code, model_code) {
  model_code <- str_flatten(model_code, collapse = ';')
  '
  data {
    int<lower=0> n;
    vector<lower=0>[n] D1, D2;
  } parameters {
    real<lower=0> sigma;
    {{parameter_code}};
  } model {
    sigma ~ normal(0, 1);
    {{model_code}};
  }' |>
    str_glue(.open = '{{', .close = '}}') |>
    write_stan_file() |>
    cmdstan_model() ->
    model
  data |>
    model$sample(refresh = 0) ->
    mcmc
  mcmc$draws(variables = names(parameters), format = 'df') |>
    as_tibble()
}

bind_rows(
  `version 1` = create_and_sample(parameter_code = 'vector<lower=0, upper=1>[n] d',
                                  model_code = c('D1 ~ lognormal(log(d), sigma)',
                                                 'D2 ~ lognormal(log1m(d), sqrt(2)*sigma)')),
  `version 2` = create_and_sample(parameter_code = 'vector<lower=0, upper=1>[n] d',
                                  model_code = c('D1 ~ lognormal(log(d), sigma)',
                                                 # 'D2 ~ lognormal(log1m(d) - log(d) + log(D1), sigma)'
                                                 'D2 ~ lognormal(log(D1) + log(1/d - 1), sigma)'
                                  )),
  .id = 'model') ->
  AllDraws

# Plot ####
AllDraws |>
  ggplot(aes(sigma, after_stat(ndensity))) +
  facet_grid(rows = vars(str_c('D2 likelihood: ', model))) +
  geom_density(fill = 'grey', color = NA) +
  geom_vline(aes(xintercept = value, color = name),
             data = tribble(~name, ~value,
                            'true', parameters$sigma,
                            'cheat1', sqrt(mean( log(latents$R1)^2 )),
                            'cheat2', sqrt(mean( log(latents$R2)^2 )),
                            'classical1', sqrt(mean( log(data$D1)^2 ) - 2),
                            'classical2', sqrt(mean( log(data$D2)^2 )/2 - 1),
             )) +
  labs(caption = str_glue('n = {data$n}, iter_sampling = {iter_sampling}',
                          iter_sampling = AllDraws |>
                            summarise(n = n(), .by = c(model, .chain)) |>
                            summarise(toString(unique(n))) |>
                            pull())) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
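(To quantify the apparent bias numerically rather than visually, one could append something along these lines; this is my addition, not part of the original script:)
# Posterior mean of sigma per model, next to the true value:
AllDraws |>
  summarise(posterior_mean = mean(sigma), .by = model) |>
  mutate(true_sigma = parameters$sigma)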