Robit regression not robust

  • Operating System: macOS
  • brms Version: 2.14.4

I am trying to reproduce the robit regression example in Gelman et al. (2020, pp. 278–279; website), Figure 15.7. It does not appear the authors have made the data or code available. However, I was able to simulate data that look very close to theirs, and I have fit a robit model to those data using brms code based on two prior threads (here and here). The issue is that the resulting robit model does not appear to be robust to the unusual case, as advertised in the text. I’d appreciate any help explaining why.

Simulate the data:

library(tidyverse)
library(brms)

n <- 50

set.seed(2)

d <-
  tibble(x = runif(n, min = -9, max = 9)) %>% 
  mutate(y = rbinom(n, size = 1, prob = plogis(0 + 1 * x))) %>% 
  arrange(x) %>% 
  mutate(id = 1:n()) %>% 
  # make the contaminated version of y
  mutate(y_c = ifelse(id == 5, 1, y))

glimpse(d)
Rows: 50
Columns: 4
$ x   <dbl> -8.812539, -7.650370, -6.928487, -6.675138, -6.617304…
$ y   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ id  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ y_c <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Fit a robit model and a conventional logistic model to the contaminated y_c data with x as the sole predictor.

stan_inv_robit <- "
  real inv_robit(real y, real nu) {
    // CDF of a t with nu df, scaled so the latent errors have unit variance
    return(student_t_cdf(y, nu, 0, sqrt((nu - 2) / nu)));
  }
"
stanvar_inv_robit <- stanvar(scode = stan_inv_robit, block = "functions")

robit_formula <- 
  bf(y_c | trials(1) ~ inv_robit(eta, nu),
     nlf(eta ~ b0 + b1 * x),
     b0 + b1 ~ 1,
     nu ~ 1,
     nl = TRUE)

# robit
m1 <- 
  brm(data = d,
      family = binomial("identity"),
      formula = robit_formula, 
      prior = c(prior(normal(0, 1), nlpar = b0),
                prior(normal(0, 1), nlpar = b1),
                prior(gamma(2, 0.1), nlpar = nu, lb = 2)),
      stanvars = stanvar_inv_robit,
      control = list(adapt_delta = .95),
      seed = 15)

# logistic
m2 <- 
  brm(data = d,
      family = binomial,
      y_c | trials(1) ~ 1 + x, 
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(normal(0, 1), class = b)),
      seed = 15)
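
As an aside, here’s a quick simulation check of what the gamma(2, 0.1) prior (with the lower bound of 2) implies for nu; a gamma with shape 2 and rate 0.1 has mean 2 / 0.1 = 20.

# prior draws for nu under gamma(2, 0.1), truncated below at 2 as in the model
nu_prior <- rgamma(1e6, shape = 2, rate = 0.1)
nu_prior <- nu_prior[nu_prior > 2]

mean(nu_prior)                        # about 20
quantile(nu_prior, c(.05, .5, .95))   # roughly 4, 17, 47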

Plot the fitted lines of the two models.

expose_functions(m1, vectorize = TRUE)

# extract and wrangle the fitted draws
nd <- data.frame(x = seq(from = -9, to = 9, length.out = 100))

tibble(`robit regression`    = fitted(m1, newdata = nd)[, 1],
       `logistic regression` = fitted(m2, newdata = nd)[, 1]) %>% 
  bind_cols(nd) %>% 
  pivot_longer(-x, values_to = "y_c") %>% 
  
  # plot!
  ggplot(aes(x = x, y = y_c)) +
  geom_line(aes(linetype = name),
            size = 1/4) +
  geom_point(data = d,
             shape = 1, size = 2) +
  scale_linetype_manual(NULL, values = c(2, 1)) +
  scale_y_continuous("y", breaks = 0:1) +
  labs(subtitle = "Contaminated data") +
  theme_bw() +
  theme(legend.position = c(.75, .3),
        legend.box.background = element_rect(color = "black"),
        panel.grid = element_blank())

Compare that plot with the one from the text, Figure 15.7a:

Where am I going wrong?


Pinging @andrewgelman, as he made the plot years ago.

While waiting for his answer, I checked and found that the posterior mean for nu is about 20, so the t distribution is close to normal and this robit is actually closer to probit than to logistic. The logistic distribution is close to a Student-t with nu=7, so the plot in the book suggests nu<7. I tested fixing nu=4, which produces something more similar to what the book shows. I also tested with unknown nu, and optimizing estimated nu as about 7. But I’ll let Andrew tell how he actually did the computation for that one.
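
To see the logistic/Student-t connection numerically, here is a minimal check in R, with the t scale chosen to match the logistic standard deviation of pi/sqrt(3):

# the logistic CDF is closely approximated by a t CDF with nu = 7,
# rescaled so the t has the same sd as the standard logistic
nu <- 7
s  <- (pi / sqrt(3)) / sqrt(nu / (nu - 2))  # about 1.53

x <- seq(-9, 9, length.out = 200)
max(abs(plogis(x) - pt(x / s, df = nu)))    # well under 0.01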


I did this in 2003 using Bugs! I’ll redo now from scratch in Stan.

Aki: what prior should I use for the df?


Interesting stuff. I don’t have anything to say about the substance; I’ll just note that for efficiency and extensibility, you would likely be better off implementing the “robit” as a custom family (as in the vignette Define Custom Response Distributions with brms), as that would let you use the normal linear formula syntax.
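
For concreteness, here is a rough, untested sketch of what that could look like, adapted from the vignette’s beta_binomial2 example; the family name, the robit_lpmf function, and the unit-variance scale sqrt((nu - 2) / nu) are all my own choices here, not an established brms family:

library(brms)

# hypothetical "robit" custom family: mu is the linear predictor, nu the t df
robit <- custom_family(
  "robit",
  dpars = c("mu", "nu"),
  links = c("identity", "identity"),
  lb    = c(NA, 2),
  type  = "int"
)

stan_robit <- "
  real robit_lpmf(int y, real mu, real nu) {
    return bernoulli_lpmf(y | student_t_cdf(mu, nu, 0, sqrt((nu - 2) / nu)));
  }
"
stanvars_robit <- stanvar(scode = stan_robit, block = "functions")

# then the usual formula syntax applies, e.g.:
# fit <- brm(y_c ~ 1 + x, data = d, family = robit,
#            stanvars = stanvars_robit,
#            prior = prior(gamma(2, 0.1), class = nu))

You’d still need the usual helper functions for posterior_epred() and log_lik() to post-process such a fit, as the vignette describes.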

Best of luck with the model.

OK. I was able to reproduce the example . . . kinda. Aaand . . . I don’t understand these robust models as much as I thought I did!

I’ll tell the story, then share the graphs and the code.

For the robit model, I used Aki’s suggestion of a gamma(2, 0.1) prior on the degrees of freedom parameter, with a lower bound of 2. I first tried a lower bound of 1 but then Stan was having problems converging; weird things were happening at df near 1. I didn’t have a strong moral commitment to allowing such low values of the degrees of freedom so I just set 2 as the lower bound, and then the model fit fine.

I couldn’t exactly replicate the data from the graph in the book because they were random, but as you can see in the graphs, I was able to simulate something that looked very similar. To make the contaminated data, I switched the data point on the far left of the graph (the lowest value of x) from 0 to 1, and the robit does that robust thing. The degrees of freedom parameter had posterior median 4.4, posterior mean 8.2, and 90% posterior interval [2.1, 27.9]. For the plot I used the posterior median.

But the underlying story is not so clean. For one thing, if I used the posterior mean rather than the median, the fitted robit curve would look a lot more like the fitted logit. Also, if you look at the example in the book, you’ll see that the “contaminated” point was not the lowest value of x; it was something like the 5th-lowest. I guess I’d picked the 5th-lowest because it made the graph cleaner to read; the outlier showed up more clearly. Anyway, when trying to replicate the example, the first thing I tried was to contaminate by flipping the 5th-lowest point. But when I did that, the robit and logit fits were nearly identical, just as in Solomon’s example above. The posterior median of df was 9.1.

So, first, the robit is not as robust as I’d thought! Switch that 5th-lowest point and it’s not enough to get the model to discount the outlier.

In that case, why did the robit model I fit in 2003 do such a reasonable job? It’s because I’d been using a very strong prior on the degrees of freedom. I think my prior was uniform(0,1) on 1/df.
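
To see how strong that prior is, here’s a quick check of what uniform(0,1) on 1/df implies for df:

# implied prior on df when 1/df ~ uniform(0, 1): df = 1/u
u <- runif(1e6)
quantile(1 / u, c(.25, .5, .75, .9))
# roughly 1.3, 2, 4, 10: half the prior mass is below df = 2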

What are the lessons from this example?

  1. The robit does not pick up on outliers very well.
  2. The posterior uncertainty in the degrees of freedom is huge, so it’s a mistake to take too seriously the point fit from 50 data points.
  3. The discussion of robit and the example on page 278 of Regression and Other Stories is misleading.
  4. I think in this sort of example it would actually be better to fit a logistic model with probabilities going from 0.01 to 0.99, as this would catch contamination more cleanly (see the sketch after this list).
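
Here’s a minimal sketch of the idea in point 4 (just the bounded-probability likelihood; this is shorthand, not code from the book):

# logistic regression with success probabilities bounded to (0.01, 0.99)
invlogit_bounded <- function(eta) 0.01 + 0.98 * plogis(eta)

# negative log likelihood, e.g. for optim() or as a template for a Stan model
neg_log_lik <- function(par, x, y) {
  p <- invlogit_bounded(par[1] + par[2] * x)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# with x and y as in the simulation below:
# fit <- optim(c(0, 1), neg_log_lik, x = x, y = y)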

logistic2b.pdf (8.0 KB)

logistic2a.pdf (8.1 KB)

Here’s my R script:

library("cmdstanr")
set.seed(1234)

logit <- qlogis
invlogit <- plogis

N <- 50
x <- runif(N, -9, 9)
a <- 0
b <- 0.7
p <- invlogit(a + b*x)
y <- rbinom(N, 1, p)
data_1 <- list(N=N, x=x, y=y)

logit_model <- cmdstan_model("logit.stan")
robit_model <- cmdstan_model("robit.stan")

fit_logit_1 <- logit_model$sample(data=data_1, parallel_chains=4, refresh=0)
print(fit_logit_1)
a_hat_logit_1 <- median(fit_logit_1$draws("a"))
b_hat_logit_1 <- median(fit_logit_1$draws("b"))

fit_robit_1 <- robit_model$sample(data=data_1, parallel_chains=4, refresh=0)
print(fit_robit_1)
a_hat_robit_1 <- median(fit_robit_1$draws("a"))
b_hat_robit_1 <- median(fit_robit_1$draws("b"))
df_hat_robit_1 <- median(fit_robit_1$draws("df"))

pdf("logistic2b.pdf", height=4, width=6)
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
plot(data_1$x, data_1$y, yaxt="n", main="Data from a logistic regression", xlab="x", ylab="y")
axis(2, c(0,1))
curve(invlogit(a_hat_logit_1 + b_hat_logit_1*x), add=TRUE, lty=2)
curve(pt(a_hat_robit_1 + b_hat_robit_1*x, df_hat_robit_1), add=TRUE, lty=1)
legend(1, .3, c("fitted logistic regression", "fitted robit regression"), lty=c(2,1), cex=.8)
dev.off()

# index of the leftmost point (lowest x among the cases with y == 0)
low_value <- (1:N)[x==min(x[y==0])]
data_2 <- data_1
data_2$y[low_value] <- 1

fit_logit_2 <- logit_model$sample(data=data_2, parallel_chains=4, refresh=0)
print(fit_logit_2)
a_hat_logit_2 <- median(fit_logit_2$draws("a"))
b_hat_logit_2 <- median(fit_logit_2$draws("b"))

fit_robit_2 <- robit_model$sample(data=data_2, parallel_chains=4, refresh=0)
print(fit_robit_2)
a_hat_robit_2 <- median(fit_robit_2$draws("a"))
b_hat_robit_2 <- median(fit_robit_2$draws("b"))
df_hat_robit_2 <- median(fit_robit_2$draws("df"))

pdf("logistic2a.pdf", height=4, width=6)
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
plot(data_2$x, data_2$y, yaxt="n", main="Contaminated data", xlab="x", ylab="y")
axis(2, c(0,1))
curve(invlogit(a_hat_logit_2 + b_hat_logit_2*x), add=TRUE, lty=2)
curve(pt(a_hat_robit_2 + b_hat_robit_2*x, df_hat_robit_2), add=TRUE, lty=1)
legend(1, .3, c("fitted logistic regression", "fitted robit regression"), lty=c(2,1), cex=.8)
dev.off()

And my Stan program logit.stan:

data {
  int N;
  vector[N] x;
  int y[N];
}
parameters {
  real a;
  real b;
}
model {
  y ~ bernoulli_logit(a + b*x);
}

And robit.stan:

data {
  int N;
  vector[N] x;
  int y[N];
}
parameters {
  real a;
  real b;
  real df;
}
model {
  vector[N] p;
  df ~ gamma(2, 0.1);
  // note the scale is fixed at 1; no unit-variance correction here
  for (n in 1:N) p[n] = student_t_cdf(a + b*x[n], df, 0, 1);
  y ~ bernoulli(p);
}

Thank you @avehtari and @andrewgelman for working this through. Though it’s disappointing the robit isn’t as robust as we’d thought, I’m glad to have some clarity. However, I’m a little confused about the version of the robit model @andrewgelman experimented with, above. It was my understanding that we wanted the errors to follow \epsilon \sim t(\nu, 0, \sqrt{(\nu - 2) / \nu}). Based on the model block in the Stan code above, it looks like you used \epsilon \sim t(\nu, 0, 1). Am I misreading something?
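
To spell out why I’m asking, here’s a quick check of the latent variance implied by each scale; a t variate with scale s has variance s^2 * nu / (nu - 2):

# variance of a t variate with nu df and scale s
var_scaled_t <- function(s, nu) s^2 * nu / (nu - 2)

nu <- 4
var_scaled_t(sqrt((nu - 2) / nu), nu)  # 1: the unit-variance scaling
var_scaled_t(1, nu)                    # 2: what the scale-1 code implies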


Yes, that makes sense, although this would not work when nu is less than 2. We’ll have to think more about this.


Good deal.

Aki and I discussed, and we agreed that it’s just too hard to estimate the t degrees of freedom from just 50 data points. The inference depends very strongly on the prior, and it’s cleaner to just set the df at a fixed value. So I re-ran things setting df=4 and also did the degrees of freedom correction for the scale parameter.

Graphs and code are below.

logistic2a.pdf (8.1 KB)
logistic2b.pdf (8.0 KB)

robit.stan:

data {
  int N;
  vector[N] x;
  int y[N];
  real df;
}
parameters {
  real a;
  real b;
}
model {
  vector[N] p;
  // the scale sqrt((df - 2)/df) gives the latent t errors unit variance
  for (n in 1:N) p[n] = student_t_cdf(a + b*x[n], df, 0, sqrt((df - 2)/df));
  y ~ bernoulli(p);
}

R script:

library("cmdstanr")
library("ggdist")
set.seed(1234)

logit <- qlogis
invlogit <- plogis

N <- 50
x <- runif(N, -9, 9)
a <- 0
b <- 0.8
p <- invlogit(a + b*x)
y <- rbinom(N, 1, p)
df <- 4
data_1 <- list(N=N, x=x, y=y, df=df)

logit_model <- cmdstan_model("logit.stan")
robit_model <- cmdstan_model("robit.stan")

fit_logit_1 <- logit_model$sample(data=data_1, parallel_chains=4, refresh=0)
print(fit_logit_1)
a_hat_logit_1 <- median(fit_logit_1$draws("a"))
b_hat_logit_1 <- median(fit_logit_1$draws("b"))

fit_robit_1 <- robit_model$sample(data=data_1, parallel_chains=4, refresh=0)
print(fit_robit_1)
a_hat_robit_1 <- median(fit_robit_1$draws("a"))
b_hat_robit_1 <- median(fit_robit_1$draws("b"))

pdf("logistic2b.pdf", height=4, width=6)
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
plot(data_1$x, data_1$y, yaxt="n", main="Data from a logistic regression", xlab="x", ylab="y")
axis(2, c(0,1))
curve(invlogit(a_hat_logit_1 + b_hat_logit_1*x), add=TRUE, lty=2)
curve(pstudent_t(a_hat_robit_1 + b_hat_robit_1*x, data_1$df, 0, sqrt((data_1$df-2)/data_1$df)), add=TRUE, lty=1)
legend(1, .3, c("fitted logistic regression", "fitted robit regression"), lty=c(2,1), cex=.8)
dev.off()

# contaminate by flipping the point at the 4th-lowest value of x
low_value <- (1:N)[x==sort(x)[4]]
data_2 <- data_1
data_2$y[low_value] <- 1

fit_logit_2 <- logit_model$sample(data=data_2, parallel_chains=4, refresh=0)
print(fit_logit_2)
a_hat_logit_2 <- median(fit_logit_2$draws("a"))
b_hat_logit_2 <- median(fit_logit_2$draws("b"))

fit_robit_2 <- robit_model$sample(data=data_2, parallel_chains=4, refresh=0)
print(fit_robit_2)
a_hat_robit_2 <- median(fit_robit_2$draws("a"))
b_hat_robit_2 <- median(fit_robit_2$draws("b"))

pdf("logistic2a.pdf", height=4, width=6)
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
plot(data_2$x, data_2$y, yaxt="n", main="Contaminated data", xlab="x", ylab="y")
axis(2, c(0,1))
curve(invlogit(a_hat_logit_2 + b_hat_logit_2*x), add=TRUE, lty=2)
curve(pstudent_t(a_hat_robit_2 + b_hat_robit_2*x, data_2$df, 0, sqrt((data_2$df-2)/data_2$df)), add=TRUE, lty=1)
legend(1, .3, c("fitted logistic regression", "fitted robit regression"), lty=c(2,1), cex=.8)
dev.off()

Got it. I can confirm this works on the brms end. I understand why you’d want to manually set \nu with such a small sample, and I also get why you’d want a \nu value lower than 7, for the reasons stated above. But why use \nu = 4 rather than, say, \nu = 3?

Yes, df=4 is arbitrary; indeed, df=3 might be better! I’ve not thought hard about these issues. Also, our intuitions for the t distribution for continuous data might not apply so well to the robit model for binary data. Lots of open questions here.

Thanks, @andrewgelman. All you stats/methods grad students: looks like there’s a dissertation here.

In that vein, wouldn’t just using a “cauchit” link be, in some sense, the “robustest” solution? :-D Or is there some reason not to use \nu = 1? Asking for a friend :-)


I believe you want to use \nu \geq 2 for robit models. See this part from an earlier thread.

Cauchy could be fine too. We needed to restrict nu > 2 only because of issues with estimating nu under a weak prior. If you want to set it to a fixed value, then, sure, maybe nu=1 is better. I haven’t really thought about it much, except to realize that my intuition on such models is not so great!
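
For some quick intuition about why smaller nu discounts outliers more, compare how much probability each latent distribution puts beyond a far-out point like -9 (a rough check, ignoring the variance corrections):

# latent tail probability beyond -9: logistic vs t(4) vs Cauchy (t with nu = 1)
plogis(-9)      # about 1.2e-04
pt(-9, df = 4)  # about 4e-04
pcauchy(-9)     # about 0.035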


Thanks, @Solomon, for telling us about your observation. @andrewgelman made a new example, and I added it to the book website: Regression and Other Stories: Robit. The fixed example will also appear in future printings of the book (and we thank you in the preface).


I’m glad y’all were willing to help clarify this for me, and it’s great there is finally an author-supported code-through of the robit model. For those interested, here’s a complementary brms version of the robit workflow from my notes on ROS: Working-through-Regression-and-other-stories/15.md at main · ASKurz/Working-through-Regression-and-other-stories · GitHub
