Thank you @jsocolar, that was incredibly helpful!
Even though it might not be the best choice to dichotomize the data, I'll still see the modeling exercise through to the end: one, for the fun of it, and two, for any other weary travelers who might want to go down the same path.
I went the route of modeling the transition between states as a smooth function, as suggested in the latter half of the comment. I chose a cumulative normal as the smoother, with its scale passed in through the data block. Visualizing the likelihood function before and after smoothing shows the discontinuities that caused the tree depth errors.
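Concretely, the smoothing replaces the hard indicator in the linear predictor with a normal CDF, which approaches the step function as the scale $s \to 0$ (here $c$ is the cut point and $s$ the smoothing scale):

$$
\Pr(y_i = 1) = \operatorname{logit}^{-1}\!\left(\beta_0 + \beta_1 \, \Phi\!\left(\frac{x_i - c}{s}\right)\right)
$$

First, simulating data from the hard-threshold version: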
inv_logit <- function(u) 1 / (1 + exp(-u))

# Simulate a latent variable and a binary outcome whose probability
# depends on whether the latent variable exceeds a hard cut point
n <- 1e3
x_latent <- rlnorm(n, 5, 0.8)
cut_point <- 80
beta_0 <- -1
beta_1 <- 2.5
prob <- inv_logit(beta_0 + as.numeric(x_latent > cut_point) * beta_1)
y <- rbinom(n, 1, prob = prob)
# Visualize the discontinuity in the likelihood
log_lik_bern <- function(cut_point){
  prob <- inv_logit(beta_0 + as.numeric(x_latent > cut_point) * beta_1)
  lik <- ifelse(y == 1, prob, 1 - prob)
  log(lik) |> sum()
}
log_lik_cut_cdf_norm <- function(cut_point, norm_sd){
  prob <- inv_logit(beta_0 + pnorm(x_latent, cut_point, norm_sd) * beta_1)
  lik <- ifelse(y == 1, prob, 1 - prob)
  log(lik) |> sum()
}
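As a quick usage check (my addition): with a small smoothing scale, the smoothed version should nearly reproduce the hard-threshold log likelihood when evaluated at the true cut point:

log_lik_bern(80)                        # hard threshold
log_lik_cut_cdf_norm(80, norm_sd = 0.1) # smooth, small scale: nearly identical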
par(mfrow = c(1, 2))
cut_seq <- seq(60, 100, length.out = 800)
plot(cut_seq, vapply(cut_seq, log_lik_bern, numeric(1)), type = 'l', lwd = 2,
     xlab = 'Cutpoint', ylab = 'Log Lik', main = 'Original Log Likelihood (Discontinuous)')
scales <- c(0.1, 0.5, 1, 5, 10, 50)
plot(cut_seq, vapply(cut_seq, \(x) log_lik_cut_cdf_norm(x, scales[1]), numeric(1)),
     type = 'l', lwd = 2, col = 1, xlab = 'Cutpoint',
     ylab = '', main = 'Log Likelihood with Smoothing')
for (i in 2:length(scales)){
  lines(cut_seq, vapply(cut_seq, \(x) log_lik_cut_cdf_norm(x, scales[i]), numeric(1)),
        lwd = 2, col = i)
}
legend('bottomright', legend = paste0('smoothing scale = ', scales), col = 1:6, lwd = 2)
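As a further sanity check (my addition, not strictly needed): since the smoothed log likelihood is continuous, it can be maximized directly with optimize(), and with a small smoothing scale the maximizer should land near the true cut point of 80:

opt <- optimize(\(cp) log_lik_cut_cdf_norm(cp, norm_sd = 0.5),
                interval = c(60, 100), maximum = TRUE)
opt$maximum  # should be close to 80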
The Stan code that uses the smooth transition:
data {
  int<lower=0> N;                  // Number of observations
  vector[N] x_latent;              // Latent variable
  array[N] int<lower=0,upper=1> y; // Binary outcome
  real<lower=0> cut_point_scale;   // Scale of the smooth transition
}
parameters {
  real cut_point; // Cut point for latent variable
  real beta_0;    // Intercept
  real beta_1;    // Coefficient for latent variable
}
model {
  vector[N] alpha;
  for (i in 1:N){
    alpha[i] = beta_0 + normal_cdf(x_latent[i] | cut_point, cut_point_scale) * beta_1;
  }
  // Priors
  cut_point ~ normal(80, 5);
  beta_0 ~ normal(0, 5);
  beta_1 ~ normal(0, 5);
  // Likelihood
  y ~ bernoulli_logit(alpha);
}
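As an aside, I believe the loop can be replaced with Stan's elementwise Phi() (the standard normal CDF), which avoids the scalar normal_cdf call; a minimal sketch of the equivalent line:

alpha = beta_0 + Phi((x_latent - cut_point) / cut_point_scale) * beta_1;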
And the final R code to compile and fit the model:
library(cmdstanr)

model <- cmdstan_model('dynamic_cutpoints.stan')
stan_data <- list(
  N = n,
  x_latent = x_latent,
  y = y,
  cut_point_scale = 0.5
)
fit <- model$sample(data = stan_data)
fit$summary()
It ran very fast and recovered the parameters nicely, with beautiful $\hat{R}$ values.
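For anyone replicating this, the recovery check can be restricted to the parameters of interest via cmdstanr's variables argument:

fit$summary(variables = c('cut_point', 'beta_0', 'beta_1'))
# posterior means should land near the simulated truth:
# cut_point near 80, beta_0 near -1, beta_1 near 2.5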