(First off, I apologize if this is a long post. Also, most parameters [unless specified] are assumed to have flat priors. Disclaimer: I do of course also have a more complicated dataset similar to the problem described below.)
So I am trying to solve this simple toy example in Stan, but I would also be very interested if anyone has another approach to this type of problem.
The idea is simple: you have collected some (X, Y) values that regress nicely but exhibit two trends (e.g., a shift in intercept; assumed for simplicity; example data at the end of the post). Now you would like to infer both the two categories/trends (or the probability of membership) and the regression coefficients, i.e., y_i\sim \mathcal{N}(\beta_0 + x_i\beta_x + c_i\beta_c,\sigma_y) where c_i\in\{0,1\} and the \beta s are unknown. A simple idea could be the following model:
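(Writing out what my Stan code below is meant to implement; the probit form for p_i via the \alpha parameters and \sigma_c is just the particular choice I made.)

c_i \sim \mathrm{Bernoulli}(p_i), \qquad p_i = \Phi\big((\alpha_0 + \alpha_x x_i + \alpha_y y_i)/\sigma_c\big),

y_i \mid c_i \sim \mathcal{N}(\beta_0 + x_i\beta_x + c_i\beta_c, \sigma_y),

so that, marginalizing over c_i, each y_i follows the two-component mixture

p(y_i) = (1 - p_i)\,\mathcal{N}(y_i \mid \beta_0 + x_i\beta_x, \sigma_y) + p_i\,\mathcal{N}(y_i \mid \beta_0 + x_i\beta_x + \beta_c, \sigma_y).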
This I am probably implementing incorrectly in some way, because I only get rejections of the initial value due to "Log probability evaluates to log(0)". I think I am specifying the model, or marginalizing over C, incorrectly. Here is my Stan code attempt:
model <- rstan::stan_model(
  model_code = "
  // The input data is a vector 'y' of length 'N'.
  data {
    int<lower=0> N;
    vector[N] y;
    vector[N] x;
  }
  // Find min/max indices
  transformed data {
    int y_min_idx;
    int y_max_idx;
    real y_min;
    real y_max;
    y_min = min(y);
    y_max = max(y);
    for (i in 1:N) {
      if (y[i] == y_min) {
        y_min_idx = i;
      } else if (y[i] == y_max) {
        y_max_idx = i;
      }
    }
  }
  parameters {
    real<lower=0> sigma;
    real<lower=0> sigma_c;
    real beta_0;
    real beta_x;
    real beta_c;
    real alpha_0;
    real alpha_x;
    real alpha_y;
  }
  transformed parameters {
    real<lower = 0, upper = 1> p[N];
    for (i in 1:N) {
      p[i] = normal_cdf(alpha_0 + alpha_y*y[i] + alpha_x*x[i], 0, sigma_c);
    }
    // Force two points to avoid bi-modality
    p[y_min_idx] = 0;
    p[y_max_idx] = 1;
  }
  model {
    sigma ~ gamma(1,1);
    sigma_c ~ gamma(1,1);
    real lps[2];
    lps[1] = bernoulli_lpmf(0 | p) + normal_lpdf(y | beta_0 + beta_x*x, sigma);
    lps[2] = bernoulli_lpmf(1 | p) + normal_lpdf(y | beta_0 + beta_c + beta_x*x, sigma);
    target += log_sum_exp(lps);
  }
  "
)
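For reference, my current (quite possibly wrong) understanding is that the marginalization over c_i has to happen per observation rather than once over the whole vector, so that each y_i contributes its own two-component mixture term. An untested sketch of what I mean, reusing the data / transformed data / parameters / transformed parameters blocks above and only changing the model block:

model {
  sigma ~ gamma(1,1);
  sigma_c ~ gamma(1,1);
  for (i in 1:N) {
    // marginalize c_i out separately for each observation
    real lps[2];
    lps[1] = bernoulli_lpmf(0 | p[i])
             + normal_lpdf(y[i] | beta_0 + beta_x*x[i], sigma);
    lps[2] = bernoulli_lpmf(1 | p[i])
             + normal_lpdf(y[i] | beta_0 + beta_c + beta_x*x[i], sigma);
    target += log_sum_exp(lps);
  }
}

Is that the right way to think about the marginalization, or is the whole-vector version above supposed to be equivalent?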
Either way, I believe I could re-write/mimic the above model in a slightly different form, which I tried to implement (again definitely wrongly, judging by all the warnings, the slow sampling, and essentially all p = 0) as follows:
// The input data is a vector 'y' of length 'N'.
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
transformed data {
  int y_min_idx;
  int y_max_idx;
  real y_min;
  real y_max;
  y_min = min(y);
  y_max = max(y);
  for (i in 1:N) {
    if (y[i] == y_min) {
      y_min_idx = i;
    } else if (y[i] == y_max) {
      y_max_idx = i;
    }
  }
}
parameters {
  real<lower=0> sigma;
  real<lower=0> sigma_c;
  real<lower=0> sigma_c2;
  real mu_c;
  real beta_0;
  real beta_x;
  real beta_c;
  real alpha_0;
  real alpha_x;
  real alpha_y;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] p;
  for (i in 1:N) {
    p[i] = normal_cdf(alpha_0 + alpha_y*y[i] + alpha_x*x[i], 0, sigma_c);
  }
  // Force two points to avoid bimodality
  p[y_min_idx] = 0;
  p[y_max_idx] = 1;
}
model {
  sigma ~ gamma(1,1);
  sigma_c ~ gamma(1,1);
  sigma_c2 ~ gamma(1,1);
  y ~ normal(beta_0 + beta_c + beta_x*x, sigma);
  real lps[2];
  lps[1] = bernoulli_lcdf(0 | p);
  lps[2] = bernoulli_lcdf(1 | p) + normal_lpdf(beta_c | mu_c, sigma_c2);
  target += log_sum_exp(lps);
}
And here is some dummy data in R:
# install.packages('tibble')
set.seed(1)
N <- 2000
data <- tibble::tibble(
  x = rnorm(N),
  c = sample(0:1, N, TRUE),
  y = 5 + 10*x + 5*c + rnorm(N)
)
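For completeness, the data would then be passed to the compiled model with something like the following (nothing special; the number of chains is just a default I would start from):

fit <- rstan::sampling(
  model,
  data = list(N = N, y = data$y, x = data$x),
  chains = 4
)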