Regression discontinuity with binomial logit model


#1

Hi All,
I am trying to predict whether the introduction of a paywall changes the anger level of comments on, say, a youtube chanel. We have already run a sentiment analysis and have data of the format:

y = # of angry comments
n = # of total comments
delta = 0 if post was published before paywall, 1 if post was poublished after
distPre = weeks before paywall (introduction is 0)
distPost = weeks after paywall (introduction is 0)

I am attempting to model this using a binomial_logit model and predicting the difference at the introduction of the paywall. I would like to account for possible pre- and post-treatment trends and thus included two variables that indicate how many weeks the current post is away from the introduction of the paywall. I then predict at the point where both distances are 0 and compare the predictive densities. Is this a reasonable approach for causal inference?
I’d be very greatful for any help/ suggestions (also other models?).

Thanks!

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

pewdat <- read.csv("./data.csv")
pewdat$post <- ifelse(as.Date(pewdat$created_at) > as.Date("2016-02-10"), 1, 0)
y <- pewdat$anger
n <- pewdat$total
T <- length(y)
delta <- pewdat$post
#distPost <- c(rep(0, T-sum(delta)),(1:sum(delta))/sum(delta))
#distPre <- c(((T-sum(delta)):1)/(T-sum(delta)), rep(0, sum(delta)))

distPost <- c(rep(0, T-sum(delta)),(1:sum(delta)))
distPre <- c(((T-sum(delta)):1), rep(0, sum(delta)))


mod <- stan(file = "./logbinom2.stan", thin = 5, iter = 20000)

predPre <- as.data.frame(mod, pars = "y_hat_pre")
predPost <- as.data.frame(mod, pars = "y_hat_post")
plot(density(as.matrix(predPre)), col = 'red', xlim = c(0, 1))
lines(density(as.matrix(predPost)), col = 'blue')
legend('topleft', legend = c("pre", "post"), col = c('red', 'blue'), lty = 1)

library(shinystan)
launch_shinystan(mod)
data {
  int<lower=1> T;
  // number of data points
  int y[T];
  // success
  int n[T];
  // total
  vector<lower=0, upper=1>[T] delta;
  // treatment
  vector[T] distPre;
  vector[T] distPost;
}
parameters {
  real beta;
  real alpha;
  real gamma1;
  real gamma2;
}
model {
  beta ~ normal(0, 100);
  alpha ~ normal(0, 100);
  gamma1 ~ normal(0, 100);
  gamma2 ~ normal(0, 100);
  y ~ binomial_logit(n, alpha + delta * beta + distPre * gamma1 + distPost * gamma2);
}
generated quantities {
  real y_hat_pre;
  real y_hat_post;
  y_hat_pre = inv_logit(alpha);
  y_hat_post = inv_logit(alpha + beta);
}

#2

Sorry for the slow response. That model’s reasonable other than for the priors. You’re rarely going to see effects larger than 4 or 5 on the log odds scale, so normal(0, 100) seems unrealistic. We usually want at least weakly informative priors to identify the scale of the result, say normal(0, 5) or thereabouts.

If your predictors aren’t standardized, that’s a big help, too.

I’d be trying to build a hierarchical model here for each of the items in order to estimate the angry comment rate and then I’d try to do some measurements directly on that quantity of interest.

I’m not really sure how to do the causal part of it, which is what you’re after—does the paywall treatment have an effect? I’d suggest the Gelman and Hill book chapters on that.


#3

No problem. Thank you for your answer!
Sounds good! I’ll tweak the priors.

As for the hierarchical model you mean a common distribution for anger and then separate intercept for each channel?

I’ve read the relevant chapters in Gelman & Hill and they are very informative. Thanks for the suggestion!!


#4

That’d be one approach. If you have predictors per channel, then you can also have varying slopes. There’s another example of that in Gelman and Hill’s discussion of the red-state/blue-state model (with a covariance prior on the multiple random effects).