Selecting a likelihood for my model—dependent variable is distributed U-shaped

specification

#1

I am using Stan via RStan. I have a dataset, and I am unsure which likelihood to specify.

Imagine I am a company with a very polarizing product: People tend to hate it or love it. Now, imagine I do a randomized experiment for an experimental advertising campaign (control coded 0, experimental coded 1) to try to increase how much people like my brand (measured from -2 to 2). The dependent variable takes this distributional shape:

I try a typical regression to assess the effect of this advertising campaign. The Stan script is:

data {
  int n_obs;
  int x[n_obs]; 
  vector[n_obs] y; 
}
parameters {
  vector[2] beta; 
  real<lower=0> sigma;
}
transformed parameters {
  vector[n_obs] residual;
  vector[n_obs] yhat;
  for (i in 1:n_obs) 
    yhat[i] = beta[1] + beta[2] * x[i];
  for (i in 1:n_obs)
    residual[i] = y[i] - yhat[i];
}
model {
  y ~ normal(yhat, sigma);
  beta ~ normal(0, 3);
  sigma ~ cauchy(0, 10);
}

The issue with the U-shaped distribution of the dependent-variable, along with me only including a dichotomous predictor (experimental campaign versus control), is that my residuals are not normally distributed:

set.seed(1839)
dat_stan <- list(n_obs = nrow(dat), x = dat$x, y = dat$y)
mod <- rstan::stan(file = "bimodal_issue.stan", data = dat_stan, 
                   iter = 1000, chains = 4)

posterior_means <- rstan::get_posterior_mean(mod)
resids <- posterior_means[grepl("residual", rownames(posterior_means)), 5]
qqnorm(resids)
qqline(resids)

The residuals, just like my dependent variable, take a U-shaped distribution, where the tails are over-represented. This violates the assumption I set with my likelihood y ~ normal(yhat, sigma).

I have asked others, and some recommended rescaling and using a beta regression. I tried this using betareg::betareg for R, and had the same problem with the residuals.

Which likelihood should I specify instead of normal to address this pattern of residuals? I feel as if there should be a way to address this without artificially binning the dependent variable, but I have not found a way. It seems like a common enough problem (having a polarizing dependent variable) that should have a solution, but I haven’t found one I am satisfied with.

Sample data—generated based on characteristics another dataset I ran into a while back that had this problem—are attached here: bimodal_issue.csv (7.4 KB)


#2

Beta with both alpha < 1 and beta < 1. Or if you reparameterize as mu = alpha / (alpha + beta) and phi = alpha + beta + 1, then phi < 2.


#3

Thanks. Do you have example code for how I might accomplish this? (I am unfamiliar with the ins-and-outs of beta regression).


#4

Just search for beta distribution in the manual.


#5

I read the manual for the beta distribution, and it says that the sampling statement is theta ~ beta(alpha, beta), where 0 < theta < 1. I feel like it cannot be as simple as simply changing the code to say: yhat ~ beta(0.6, 0.6), because it doesn’t include the observed y. It seems like pages 286-7 touch on the reparameterization of the shape parameters for the beta distribution, but not in the context of beta regression.

I could not find model examples on the stan-dev/examples-models GitHub, either. Are there examples of beta regression in Stan you could point me toward? Thank you!


#6

You can also see the first few pages here for an introduction to beta regressions.

A beta regression implies
y ~ beta(alpha,beta).
As Ben wrote, for a regression you need to reparameterize the beta distribution in terms of mean and dispersion (instead of alpha and beta), e.g.:

mu = alpha/(alpha+beta)
phi = alpha + beta

Solving for alpha and beta gives

alpha = mu * phi
beta = phi - mu*phi

then, in your regression model mu = beta * X and phi is a dispersion parameter

adapting your model:

data {
  int n_obs;
  int x[n_obs]; 
  vector<lower = 0, upper = 1> [n_obs] y; 
}
parameters {
  vector[2] beta; 
  real<lower=0> phi;
}
transformed parameters {
  vector[n_obs] mu;
  vector[n_obs] alpha;
  vector[n_obs] beta;
  for (i in 1:n_obs) {
    mu[i] = inv_logit(beta[1] + beta[2] * x[i]); 
    alpha[i] = mu[i] * phi;
    beta[i] =  phi - mu[i] * phi;
  }
  model {
    y ~ beta(alpha, beta);
    beta ~ normal(0, 3);
    phi ~ cauchy(0, 10);
  }
}

I didn’t run the model, so I don’t guarantee that there is no typo in the model ;-).
If you rescale your outcome variable y for the beta regression, make sure that the largest and smallest values are not exactly 0 and 1, because beta_lpdf(0 | alpha, beta) and beta_lpdf(1 | alpha, beta) can be -Inf or +Inf.

Lastly, the histogram you attached shows that the most extreme values are not the most likely ones, which is what one would expect if the data would truly come from a beta distribution. The beta distribution is an approximation, and some domain knowledge would be required to determine if it is a reasonable approximation.


#7

Awesome, thank you! The only thing I have a question about are the priors. In that code, there are two separate vectors with the same name: beta (as in the regression coefficients), and beta (as in the shape parameter for the beta distribution). In my code, I renamed the vector of coefficients to be called vector[2] coef; (and updated the equation for mu[i] as such.

Now, it looks to me like I only need to specify priors for coef and phi, because alpha and beta are nodes determined by the coefs and phi, correct?

That is:

data {
  int n_obs;
  int x[n_obs]; 
  vector<lower=0, upper=1>[n_obs] y;
}
parameters {
  vector[2] coef; 
  real<lower=0> phi;
}
transformed parameters {
  vector[n_obs] mu;
  vector[n_obs] alpha;
  vector[n_obs] beta;
  vector[n_obs] residual;
  for (i in 1:n_obs) {
    mu[i] = inv_logit(coef[1] + coef[2] * x[i]); 
    alpha[i] = mu[i] * phi;
    beta[i] =  phi - mu[i] * phi;
  }
  for (i in 1:n_obs)
    residual[i] = y[i] - mu[i];
}
model {
  y ~ beta(alpha, beta);
  coef ~ normal(0, 10); // unsure about proper priors
  phi ~ cauchy(0, 10); // unsure about proper priors
}

Nevermind about if the propers are prior currently—I’m still doing some reading about it (and diagnostics because the residuals are still not normally distributed, but I think that’s alright because folks talk about beta regression being “naturally heteroskedastic”)—but, is my intuition correct that priors are set for coef and phi, and not alpha and beta?


#8

Yes, although it is not really a matter of intuition: always put priors on things declared in the parameters block not intermediate variables declared in the transformed parameters or model blocks. Otherwise, you have to deal with Jacobians.


#9

I should have paid better attention to the naming of the variables in the transformed parameters block.

Just rename the vector “beta” which is defined in the transformed parameters block to shape1 (and alpha to shape2).

...
transformed parameters {
  vector[n_obs] mu;
  vector[n_obs] shape1;
  vector[n_obs] shape2;
  for (i in 1:n_obs) {
    mu[i] = inv_logit(beta[1] + beta[2] * x[i]); 
    shape1[i] = mu[i] * phi;
    shape2[i] =  phi - mu[i] * phi;
  }
  model {
    y ~ beta(shape1, shape2);
    beta ~ normal(0, 3);
    phi ~ cauchy(0, 10);
  }

#10

Thanks! I got it up and running with:

data {
  int n_obs;
  int x[n_obs]; 
  vector<lower=0, upper=1>[n_obs] y;
}
parameters {
  vector[4] coef;
}
transformed parameters {
  vector[n_obs] mu;
  vector<lower=0>[n_obs] phi;
  vector[n_obs] alpha;
  vector[n_obs] beta;
  for (i in 1:n_obs) {
    mu[i] = inv_logit(coef[1] + coef[2] * x[i]); 
    phi[i] = exp(coef[3] + coef[4] * x[i]);
    alpha[i] = mu[i] * phi[i];
    beta[i] =  phi[i] - mu[i] * phi[i];
  }
}
model {
  y ~ beta(alpha, beta);
  coef ~ normal(0, 1);
}

The only main change being so that both the location and precision are predicted from x now.


#11

You also get this with a normal(0, sigma) on the log odds scale. You just do y_raw ~ normal(mu, sigma) and then if you transform using y = inv_logit(y_raw) back to a constrained representation, as sigma grows, the resulting y curve will be U-shaped. I’m only suggesting this because sometimes it’s easier to put priors and add predictors on the log odds scale.