I am using Stan via RStan. I have a dataset, and I am unsure which likelihood to specify.
Imagine I am a company with a very polarizing product: People tend to hate it or love it. Now, imagine I do a randomized experiment for an experimental advertising campaign (control coded 0, experimental coded 1) to try to increase how much people like my brand (measured from -2 to 2). The dependent variable takes this distributional shape:
I try a typical regression to assess the effect of this advertising campaign. The Stan script is:
data {
int n_obs;
int x[n_obs];
vector[n_obs] y;
}
parameters {
vector[2] beta;
real<lower=0> sigma;
}
transformed parameters {
vector[n_obs] residual;
vector[n_obs] yhat;
for (i in 1:n_obs)
yhat[i] = beta[1] + beta[2] * x[i];
for (i in 1:n_obs)
residual[i] = y[i] - yhat[i];
}
model {
y ~ normal(yhat, sigma);
beta ~ normal(0, 3);
sigma ~ cauchy(0, 10);
}
The issue with the U-shaped distribution of the dependent-variable, along with me only including a dichotomous predictor (experimental campaign versus control), is that my residuals are not normally distributed:
set.seed(1839)
dat_stan <- list(n_obs = nrow(dat), x = dat$x, y = dat$y)
mod <- rstan::stan(file = "bimodal_issue.stan", data = dat_stan,
iter = 1000, chains = 4)
posterior_means <- rstan::get_posterior_mean(mod)
resids <- posterior_means[grepl("residual", rownames(posterior_means)), 5]
qqnorm(resids)
qqline(resids)
The residuals, just like my dependent variable, take a U-shaped distribution, where the tails are over-represented. This violates the assumption I set with my likelihood y ~ normal(yhat, sigma)
.
I have asked others, and some recommended rescaling and using a beta regression. I tried this using betareg::betareg
for R, and had the same problem with the residuals.
Which likelihood should I specify instead of normal
to address this pattern of residuals? I feel as if there should be a way to address this without artificially binning the dependent variable, but I have not found a way. It seems like a common enough problem (having a polarizing dependent variable) that should have a solution, but I haven’t found one I am satisfied with.
Sample data—generated based on characteristics another dataset I ran into a while back that had this problem—are attached here: bimodal_issue.csv (7.4 KB)