Correlation between bias and true parameter values

This is a follow-up question from my previous post.

I would basically like to examine the effects of a predictor on the differences between two proportions (\frac{n_1}{N_1} - \frac{n_2}{N_2}). The input data are the four count variables from which to calculate the two proportions (n_1, N_1, n_2, N_2) and a predictor variable (x).

Based on the suggestions I received on my previous post, I am trying to follow @betanalpha’s simulation-based calibration and a principled Bayesian workflow as explained in Schad et al. (2020), and found that even though the model overall may look acceptable (I’m actually not very sure of this – Do extremely high contraction ratios in the upper left panel in the figure below suggest something like I should use a more informative prior?), there is a negative correlation (r = -0.47) between the true values of beta (i.e., slope for x) and the bias, as shown in the lower left panel.

I thought it might be because the prior is dragging the posterior towards 0, so I tested it again with a fixed true value (0.5) and the prior whose mean is the fixed value. As shown in the lower right panel, however, the bias still remains. An even more strongly negative correlation between bias and true values (r = -0.65) is observed in the intercept parameter (figure not shown).

I am wondering if there is a way to figure out possible reasons for the correlation and bias, and how I can go about fixing them.

Here is the Stan code used (identical to the code in my earlier post):

data {
int<lower=0> N; // total number of observations
int<lower=1> denom1[N]; // denominator of the first proportion
int<lower=1> denom2[N]; // denominator of the second proportion
int<lower=0> nom1[N]; // nominator of the first proportion
int<lower=0> nom2[N]; // nominator of the second proportion
real x[N]; // predictor variable
}

parameters {
real<lower=0, upper=1> theta1[N]; // the first proportion
real<lower=0, upper=1> theta2[N]; // the second proportion
real alpha; // intercept
real beta; // slope parameter for x
real<lower=0> sigma; // SD of the error term
}

transformed parameters {
real<lower=-1, upper=1> deltaP[N]; // Delta P
for (i in 1:N) {
deltaP[i] = theta1[i] - theta2[i];
}
}

model {
// priors
theta1 ~ beta(1, 1);
theta2 ~ beta(1, 1);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);   //beta ~ normal(0.5, 1); for the lower right panel
sigma ~ normal(0, 1) T[0, ];

for (i in 1:N) {
// estimating thetas based on denoms and noms
nom1[i] ~ binomial(denom1[i], theta1[i]);
nom2[i] ~ binomial(denom2[i], theta2[i]);
// deltaP is sampled from the truncated normal distribution whose mean is alpha + beta * x and the SD is sigma
deltaP[i] ~ normal(alpha + beta * x[i], sigma) T[-1, 1];
}
}


And here is the R cod used to perform simulation:

library("rstan")
library("brms")

# number of simultations
nsim <- 1000
# sample size per simulation
N <- 100
alpha_cf <- beta_cf <- sigma_cf <- Rhats <- div_trans <- neffRatio <- sbc_rank <- z_score <- contraction <- sbc_rank_alpha <- z_score_alpha <- contraction_alpha <- numeric(nsim)
x_cf <- deltaP_cf <- theta1_cf <- theta2_cf <- denom1_cf <- denom2_cf <- nom1_cf <- nom2_cf <- vector("list", nsim)
t <- proc.time()
for (j in 1:nsim) {
set.seed(j)

while (TRUE) {
# predictor values
x_cf[[j]] <- runif(N, -1, 1)
# True parameter values
alpha_cf[j] <- rnorm(1, 0, 1)
beta_cf[j] <- rnorm(1, 0, 1) # 0.5 for the lower right panel
sigma_cf[j] <- abs(rnorm(1, mean = 0, sd = 1))
deltaP_cf[[j]] <- alpha_cf[j] + beta_cf[j] * x_cf[[j]] + rnorm(N, sd = sigma_cf[j])
if (all(deltaP_cf[[j]] <= 1) & all(deltaP_cf[[j]] >= -1)) break
}

# theta values
theta1_cf[[j]] <- theta2_cf[[j]] <- numeric(N)
for (i in 1:N) {
if (deltaP_cf[[j]][i] > 0) {
theta1_cf[[j]][i] <- runif(1, deltaP_cf[[j]][i], 1)
theta2_cf[[j]][i] <- theta1_cf[[j]][i] - deltaP_cf[[j]][i]
} else {
theta2_cf[[j]][i] <- runif(1, abs(deltaP_cf[[j]][i]), 1)
theta1_cf[[j]][i] <- theta2_cf[[j]][i] + deltaP_cf[[j]][i]
}
}

# denoms and noms
denom1_cf[[j]] <- sample(10:100, size = N, replace = TRUE)
denom2_cf[[j]] <- sample(10:100, size = N, replace = TRUE)
nom1_cf[[j]] <- rbinom(N, denom1_cf[[j]], theta1_cf[[j]])
nom2_cf[[j]] <- rbinom(N, denom2_cf[[j]], theta2_cf[[j]])

### fit the model
fit_cf <- sampling(model_cf,
data = list(
N = N,
denom1 = denom1_cf[[j]],
denom2 = denom2_cf[[j]],
nom1 = nom1_cf[[j]],
nom2 = nom2_cf[[j]],
x = x_cf[[j]]
))

# extracting/calculating metrics indicating convergence issues
Rhats[j] <- rhat(fit_cf, pars = "beta")
div_trans[j] <- sum(subset(nuts_params(fit_cf), Parameter == "divergent__")$Value) neffRatio[j] <- neff_ratio(fit_cf, pars = "beta") ## Slope parameter # posterior beta_post <- posterior_samples(fit_cf, pars = "beta") thinned.pos <- seq(1, nrow(beta_post), 4) # thin to remove autocorrelation sbc_rank[j] <- sum(beta_cf[j] < beta_post$beta[thinned.pos])
# sensitivity
post_mean_beta <- mean(beta_post$beta) post_sd_beta <- sd(beta_post$beta) # posterior sd
z_score[j] <- (post_mean_beta - beta_cf[j]) / post_sd_beta
contraction[j] <- 1 - (post_sd_beta^2 / 1^2)
## Do the same for the intercept parameter
alpha_post <- posterior_samples(fit_cf, pars = "alpha")
thinned.pos_alpha <- seq(1, nrow(alpha_post), 4)
sbc_rank_alpha[j] <- sum(alpha_cf[j] < alpha_post$alpha[thinned.pos_alpha]) post_mean_alpha <- mean(alpha_post$alpha)
post_sd_alpha <- sd(alpha_post\$alpha) # posterior sd
z_score_alpha[j] <- (post_mean_alpha - alpha_cf[j]) / post_sd_alpha
contraction_alpha[j] <- 1 - (post_sd_alpha^2 / 1^2)
}


Firstly everything but the SBC plot looks pretty good superficially. I think I’ve seen patterns like the negative bias that you see before but I wouldn’t be able to offer hypotheses for why you see this behavior here without spending much more time with your model.

That SBC, however, does suggest a problem. It could be computational but looking through your code I think that it may be due to how you are simulating the data.

In your R simulation you simulate the two probabilities theta1 and theta2 for each trial by sampling each from the unit interval independently and then adding deltaP while subtracting deltaP from the other. This has a few important consequences, in particular that the difference theta1 - theta2 won’t be distributed according to normal(alpha + beta * x[i], sigma).

Firstly you need to take into account the excess variation from the difference of two uniform distributions from which you sampled before applying the offsets, which would require convolving the normal with a triangle distribution which I may not be possible analytically. In other words the variation in the difference is larger than sigma and it’s not exactly shaped like a normal density.

Secondly the centrality of the difference distribution as you’ve defined it wouldn’t be mu_deltaP = alpha + beta * x[i] but rather - 2 * mu_deltaP! In the simulation you add the linear term to theta2 but subtract it from theta1. That means that theta1 - theta2 gets twice the difference on average, and with a minus sign.

Those differences between your simulated data generating process and the ones implies in your model are a strong candidate for the source of the problem.

Ultimately though a much more straightforward way to model the behavior would be something like

    nom1[i] ~ binomial(denom1[i], theta[i]);
nom2[i] ~ binomial_logit(denom2[i], logit(theta[i]) + deltaP[i]);


in other words theta models the first process while deltaP models the difference between the second and first process.

Thank you for having a look at my figures and code. I thought a large value at the beginning of the SBC plot might indicate a problem but couldn’t figure out what specifically it means and what is causing it.

Doesn’t the following part of the code only involve one uniform distribution for each observation in each trial due to the if-statement? The value of one theta is drawn from a uniform distribution, while the other theta is deterministic. What I’m trying to do here is to come up with the values of theta1 and theta2, where each of their values is between 0 and 1 and whose difference equals to deltaP (i.e., deltaP = theta1 - theta2). I’m sorry if my R code was confusing.

if (deltaP_cf[[j]][i] > 0) {
theta1_cf[[j]][i] <- runif(1, deltaP_cf[[j]][i], 1)
theta2_cf[[j]][i] <- theta1_cf[[j]][i] - deltaP_cf[[j]][i]
} else {
theta2_cf[[j]][i] <- runif(1, abs(deltaP_cf[[j]][i]), 1)
theta1_cf[[j]][i] <- theta2_cf[[j]][i] + deltaP_cf[[j]][i]
}


I did the following simulation just to make sure that variation approximately equals to sigma and that deltaP approximately equals to alpha + beta * x[i].

# number of simultations
nsim <- 10000
# sample size per simulation
N <- 100
alpha_cf <- beta_cf <- sigma_cf <- sigma_bias <- deltaP_center <- resids <- numeric(nsim)
x_cf <- deltaP_cf <- theta1_cf <- theta2_cf <- denom1_cf <- denom2_cf <- nom1_cf <- nom2_cf <- vector("list", nsim)
for (j in 1:nsim) {
set.seed(j)

while (TRUE) {
# predictor values
x_cf[[j]] <- runif(N, -1, 1)
# True parameter values
alpha_cf[j] <- rnorm(1, 0, 1)
beta_cf[j] <- rnorm(1, 0, 1)
sigma_cf[j] <- abs(rnorm(1, mean = 0, sd = 1))
deltaP_cf[[j]] <- alpha_cf[j] + beta_cf[j] * x_cf[[j]] + rnorm(N, sd = sigma_cf[j])
if (all(deltaP_cf[[j]] <= 1) & all(deltaP_cf[[j]] >= -1)) break
}

# theta values
theta1_cf[[j]] <- theta2_cf[[j]] <- numeric(N)
for (i in 1:N) {
if (deltaP_cf[[j]][i] > 0) {
theta1_cf[[j]][i] <- runif(1, deltaP_cf[[j]][i], 1)
theta2_cf[[j]][i] <- theta1_cf[[j]][i] - deltaP_cf[[j]][i]
} else {
theta2_cf[[j]][i] <- runif(1, abs(deltaP_cf[[j]][i]), 1)
theta1_cf[[j]][i] <- theta2_cf[[j]][i] + deltaP_cf[[j]][i]
}
}

# calculate delta P from theta1 and theta2
dp <- theta1_cf[[j]] - theta2_cf[[j]]
# bias in sigma
sigma_bias[j] <- sd(dp - (alpha_cf[j] + beta_cf[j] * x_cf[[j]])) - sigma_cf[j]
# centrality of the difference
deltaP_center[j] <- mean(dp - (alpha_cf[j] + beta_cf[j] * x_cf[[j]]))
}

# negligibly small
mean(abs(sigma_bias))
[1] 0.007235564
sd(sigma_bias)
[1] 0.01119515

# negligibly small
mean(abs(deltaP_center))
[1] 0.01038692
sd(deltaP_center)
[1] 0.01610849


They look OK to me.

Plesaet let me know if I’m missing something.

Sorry, I missed the logic of that conditional statement.

The challenge here is that your simulation imposes constraints that are not encoded in your Stan program (the Stan program assigns non-zero target density to configurations that could not be generated by your simulation; the Stan program has two theta parameters that influence each observation where as the simulation only uses on at a time), and those inconsistencies would technically render SBC invalid (i.e. you wouldn’t expect uniform rank histograms).

One thing you can do is run your Stan program with the two binomial lines commented out and compare to your R simulations of the model configurations. Because those should just be two ways of sampling from the same prior any differences would indicate that the priors aren’t actually consistent.

Ultimately I would encourage you to find a generative process that capture the constraints more naturally. An immediate option would be to model differences with with a logits to ensure that the constraint is always satisfied. This would implicit model the saturation that occurs when trying to model differences between two probabilities and one of the probabilities is really close to the boundary.

Potential differences between the simulation and the Stan code were indeed one of the worries I have had, but does the Stan program assign non-zero density to a case that can theoretically not be generated by my R simulation? Would you mind giving a specific case as I thought their theoretical space is identical?

Thank you for this suggestion. I have done it and found that the SD is much smaller in posterior distributions than in R simulations (e.g., 0.04 vs 0.36 for beta), while holding the predictor values (x) constant. I have not been able to identify the source of this difference yet, but it might be related to the difference between the use of truncated normal distribution in the Stan program and the heuristic specification of deltaP in R simulations. I will try to explore this further (though it may turn out to be unnecessary if I model deltaP in logits).

Thank you for suggesting this as well. I had a feeling it is odd to use a while-loop to keep generating deltaP until its values fall within the theoretically possible range (i.e., -1 \leq deltaP \leq 1).

I had not thought about modelling in logits, but it sounds like an excellent idea. Am I right in understanding that you are suggesting I should model deltaP in the logit scale (i.e., mu_deltaP = logit(theta1) – logit(theta2) = alpha + beta * x)? I need to think about this a little more but will definitely try it.

Correct. For example

parameters {
real logit_theta1;
real alpha;
real beta;
}

transformed parameters {
real theta1 = inv_logit(logit_theta1);
real theta2 = inv_logit(logit_theta1 + alpha + beta * x);
}


although for numerical stability you might want to use the binomial_logit density function which would take int logit_theta1 and logit_theta1 + alpha + beta * x directly.

Yes, I think I got the idea and will try it. Thank you so much for your help!