Latent variable inference in a regression model

(First off, I apologize if this is a long post. Also, most parameters [unless specified otherwise] are assumed to have flat priors. Disclaimer: I do of course also have a more complicated dataset similar to the problem described below.)
So I am trying to solve this simple toy example in Stan, but I would also be very interested if anyone has another approach to solving this type of problem.
The idea is simple: you have collected some Y, X values which regress nicely but exhibit two trends (e.g., with a shift in intercept; assumed for simplicity; example data at the end of the post). Now you would like to infer both the two categories/trends (or their probability) and the regression coefficients, i.e., y_i\sim \mathcal{N}(\beta_0 + x_i\beta_x + c_i\beta_c,\sigma_y) where c_i\in\{0,1\} and the \beta's are unknown. A simple idea could be the following model:

Y\mid C=0\sim\mathcal{N}(\beta_0+\beta_xX,\sigma)\\ Y\mid C=1\sim\mathcal{N}(\beta_0+\beta_C+\beta_xX,\sigma)\\ C_i\sim \mathrm{Bern}(p_i)\\ p_i= \begin{cases} 1, &\text{ if } y_i=\max(Y)\\ 0, &\text{ if } y_i=\min(Y)\\ \Phi_{0,\sigma_p}(\alpha_0 + \alpha_y y_i+\alpha_x x_i), &\text{ otherwise} \end{cases}\\ \sigma\sim\Gamma(1,1)\\ \sigma_p\sim\Gamma(1,1)

Which I am probably implementing incorrectly in some way, because I only get rejections of the initial value due to "Log probability evaluates to log(0)". I think I am specifying the model / marginalizing out C incorrectly. Here is my Stan code attempt:

model <- rstan::stan_model(
model_code = "
// The input data is a vector 'y' of length 'N'.
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
// Find min/max indices
transformed data {
  int y_min_idx;
  int y_max_idx;
  real y_min;
  real y_max;
  y_min = min(y);
  y_max = max(y);
  for (i in 1:N) {
    if (y[i] == y_min) {
      y_min_idx = i;
    } else if (y[i] == y_max) {
      y_max_idx = i;
    }
  }
}

parameters {
  real<lower=0> sigma;
  real<lower=0> sigma_c;
  real beta_0;
  real beta_x;
  real beta_c;
  real alpha_0;
  real alpha_x;
  real alpha_y;
}

transformed parameters {
  real<lower = 0, upper = 1> p[N];
  for (i in 1:N) {
    p[i] = normal_cdf(alpha_0 + alpha_y*y[i] + alpha_x*x[i], 0, sigma_c);
  }
  // Force two points to avoid bi-modality
  p[y_min_idx] = 0;
  p[y_max_idx] = 1;
}

model {
  sigma   ~ gamma(1,1);
  sigma_c ~ gamma(1,1);
  real lps[2];
  lps[1] = bernoulli_lpmf(0 | p) + normal_lpdf(y | beta_0 + beta_x*x, sigma);
  lps[2] = bernoulli_lpmf(1 | p) + normal_lpdf(y | beta_0 + beta_c + beta_x*x, sigma);
  target += log_sum_exp(lps);
}
"
)

Either way, I believe I could re-write/mimic the above model as follows:

Y\sim\mathcal{N}(\beta_0+\beta_C+\beta_xX,\sigma)\\ \beta_C\mid C=0 \;=\; 0\\ \beta_C\mid C=1\sim\mathcal{N}(\mu_c,\sigma_c)\\ \sigma_c\sim\Gamma(1,1)\\ \text{otherwise the same}

Which I tried to implement (again definitely wrongly, judging by all the warnings, the slow sampling, and essentially all p = 0) as follows:

// The input data is a vector 'y' of length 'N'.
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}

transformed data {
  int y_min_idx;
  int y_max_idx;
  real y_min;
  real y_max;
  y_min = min(y);
  y_max = max(y);
  for (i in 1:N) {
    if (y[i] == y_min) {
      y_min_idx = i;
    } else if (y[i] == y_max) {
      y_max_idx = i;
    }
  }
}

parameters {
  real<lower=0> sigma;
  real<lower=0> sigma_c;
  real<lower=0> sigma_c2;
  real mu_c;
  real beta_0;
  real beta_x;
  real beta_c;
  real alpha_0;
  real alpha_x;
  real alpha_y;
}

transformed parameters {
  vector<lower = 0, upper = 1>[N] p;
  for (i in 1:N) {
    p[i] =  normal_cdf(alpha_0 + alpha_y*y[i] + alpha_x*x[i], 0, sigma_c);
  }
  // Force two points to avoid bimodality
  p[y_min_idx] = 0;
  p[y_max_idx] = 1;
}

model {
  sigma   ~ gamma(1,1);
  sigma_c ~ gamma(1,1);
  sigma_c2 ~ gamma(1,1);
  y ~ normal(beta_0 + beta_c + beta_x*x, sigma);
  real lps[2];
  lps[1] = bernoulli_lcdf(0 | p);
  lps[2] = bernoulli_lcdf(1 | p) + normal_lpdf(beta_c | mu_c, sigma_c2);
  target += log_sum_exp(lps);
}

And here is some dummy data in R:

# install.packages('tibble')
set.seed(1)
N <- 2000
data <- tibble::tibble(
  x = rnorm(N),
  c = sample(0:1, N, replace = TRUE),
  y = 5 + 10*x + 5*c + rnorm(N)
)

I’m not sure how this works generatively, for example to simulate values of Y, C given values of \alpha, \beta, \sigma, because the value of C depends on Y through all three cases for p_i and then Y depends on C through the regression. I also don’t understand why p_i is getting set to 1 or 0 if you hit the min or max.

If you removed the dependence on Y in the probit regression for p_i, you wouldn’t have a loop in the graphical model. But you might also just want to treat this as a simple mixture model and take p to be a univariate probability. Then the likelihood is just

p(y_n \mid \alpha, \beta, \gamma, \sigma) = p \cdot \textrm{normal}(y_n \mid \alpha + x_n \beta, \sigma) + (1 - p) \cdot \textrm{normal}(y_n \mid \alpha + x_n \beta + \gamma, \sigma).

and you can follow the examples in the User’s Guide to implement the mixture model and do posterior inference for C_n \sim \textrm{bernoulli}(p).
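
For concreteness, here is a minimal sketch of that mixture in Stan (my own take on the User's Guide pattern, not code from the original post; alpha, beta, and sigma match the likelihood above, gamma_c stands in for \gamma, and it is constrained positive so that the two components cannot simply swap roles):

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real alpha;                 // shared intercept
  real beta;                  // shared slope
  real<lower=0> gamma_c;      // intercept shift of the second component (positive to break label switching)
  real<lower=0, upper=1> p;   // mixture weight of the un-shifted component
  real<lower=0> sigma;
}
model {
  sigma ~ gamma(1, 1);
  for (n in 1:N)
    target += log_mix(p,
                      normal_lpdf(y[n] | alpha + beta * x[n], sigma),
                      normal_lpdf(y[n] | alpha + beta * x[n] + gamma_c, sigma));
}
generated quantities {
  // posterior probability that observation n belongs to the shifted component,
  // i.e., Pr[C_n = 1 | y], recovered without any discrete parameters
  vector[N] prob_shifted;
  for (n in 1:N) {
    real lp0 = log(p) + normal_lpdf(y[n] | alpha + beta * x[n], sigma);
    real lp1 = log1m(p) + normal_lpdf(y[n] | alpha + beta * x[n] + gamma_c, sigma);
    prob_shifted[n] = exp(lp1 - log_sum_exp(lp0, lp1));
  }
}

The log_mix call does the per-observation marginalization over C_n, and the generated quantities block recovers the per-observation membership probabilities afterwards.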

Thank you for your feedback. I think I forced p_i to try to avoid bi-modality: if I flipped C for all values and changed \gamma to -\gamma, I would get the same solution (i.e., instead of increasing the intercept for one cluster, I could decrease it for the other). I was trying to avoid this by fixing two values (I thought the largest value of Y is likely to be in the cluster with the increased intercept and the smallest in the other). The reason (in my mind) for having both x and y in the probit regression was that the clusters lie along the diagonal. In general I think your approach is way better since it does not need discrete parameters. Thank you again.
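
As a follow-up, the User's Guide device for the same bi-modality, instead of forcing p_i at the min/max or constraining the shift to be positive, is an ordered vector of intercepts; a sketch of just the relevant blocks, assuming the same data block as in the mixture above:

parameters {
  ordered[2] intercept;       // intercept[1] < intercept[2], so the clusters cannot swap labels
  real beta;
  real<lower=0, upper=1> p;   // weight of the low-intercept component
  real<lower=0> sigma;
}
model {
  sigma ~ gamma(1, 1);
  for (n in 1:N)
    target += log_mix(p,
                      normal_lpdf(y[n] | intercept[1] + beta * x[n], sigma),
                      normal_lpdf(y[n] | intercept[2] + beta * x[n], sigma));
}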