There is a great difference.
Well, I guess this is a simple question, but I actually have about zero experience. Here is the situation as I understand it, though:
- Selectively leaving out data points is the path to tricking ourselves about our model fit.
- If there are outliers, our first instinct should be to think about why that point might be so different from the rest and model it.
- If we do not have the data to model this (extra covariates or whatever could be driving it), then we'll just need to model outliers by changing our likelihood to something else. A Student-t is the usual choice. Search for "outliers" here: http://sumsar.net/files/posts/2017-01-15-bayesian-computation-with-stan-and-farmer-jons/stan_exercise.html . I think that explanation is fair.
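To see why the Student-t helps, here is a tiny sketch of my own (using scipy, not part of the linked exercise): it compares the log-density that a standard normal and a Student-t with 3 degrees of freedom assign to a point 10 standard deviations out.

```python
# Illustration: a Student-t likelihood penalizes outliers far less than a
# normal, because its log-density falls off polynomially, not quadratically.
from scipy.stats import norm, t

outlier = 10.0  # a point 10 standard deviations from the mean

log_norm = norm.logpdf(outlier, loc=0, scale=1)
log_t = t.logpdf(outlier, df=3, loc=0, scale=1)

print(f"normal    log-density at {outlier}: {log_norm:.1f}")
print(f"student-t log-density at {outlier}: {log_t:.1f}")
# The t log-density is far less negative, so a single extreme point
# drags the fit around much less than under a normal likelihood.
```

Because the normal's log-density drops like -x^2/2 while the t's drops only logarithmically in x, the outlier contributes a huge penalty under the normal but a modest one under the t.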
Regarding outliers, there's a post on the Gelblog if you're bored: https://statmodeling.stat.columbia.edu/2020/05/01/hey-you-yeah-you-stop-what-youre-doing-right-now-and-read-this-stigler-article-on-the-history-of-robust-statistics/
It's on robustness, but whenever outliers come up, so does robust regression. The Gelblog can be super confusing to read, but there are links to previous discussions on robustness and whatnot. It can be interesting to skim just to see what people are vaguely talking about.
Thank you very much for your kind guidance.
I am overwhelmed with appreciation and inspired to continue learning, although the road is long.
Just to add a small thing for perspective: there is a relatively large literature trying to find a good response distribution for modelling citation counts, and it mostly agrees that the negative binomial is too concentrated and doesn't do a good job of capturing the rare papers with very large citation counts. This is possibly what is happening with the year 2018 in your data: some paper got cited hugely.
The sad part is that Stan doesn't have any of those less well-known, so-called "overdispersed" distributions implemented (I am actually slowly working on implementing them, but it turns out it is not trivial). So you can't easily improve your model much. The easiest way to model this "overdispersion" would be to add a varying intercept for each measurement (e.g., + (1 || year)).
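To show what such a varying intercept buys you, here is a small simulation sketch of my own (in Python rather than Stan/R, with made-up numbers, and per-observation rather than per-year for simplicity): a Poisson rate with a normal intercept added on the log scale produces counts far more dispersed than a plain Poisson with the same baseline rate.

```python
# Illustration: mixing a Poisson over a lognormal intercept (what a varying
# intercept on the log scale does) creates overdispersion: variance >> mean.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
mu_log = 2.0   # baseline log-rate (made up)
sigma = 1.0    # sd of the varying intercept (made up)

plain = rng.poisson(np.exp(mu_log), size=n)       # plain Poisson: var ~= mean
eps = rng.normal(0.0, sigma, size=n)              # the varying-intercept term
mixed = rng.poisson(np.exp(mu_log + eps))         # Poisson-lognormal mixture

print("plain Poisson:          mean", plain.mean(), "var", plain.var())
print("with varying intercept: mean", mixed.mean(), "var", mixed.var())
# The mixture's variance greatly exceeds its mean, and its right tail
# reaches much larger counts - the "rare hugely cited paper" behaviour.
```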
If you want to go deeper, a good reference is IMHO the paper on the generalized inverse-Gaussian Poisson distribution (also called the Sichel distribution), which references some previous work and is referenced by others. But this might be just a distraction: your model seems to work well enough for many purposes.
The only package I am aware of that allows fitting models with the Sichel distribution is gamlss.
Best of luck with your Stan adventures! :-)
@martinmodrak, thank you for your advice. I recoded the multilevel model and it seems to work well:
data {
  int<lower=1> N;               // number of observations
  int Y[N];                     // response variable
  real X[N];                    // predictor variable
  int<lower=1> M;               // number of grouping levels
  int<lower=1, upper=M> J[N];   // grouping indicator per observation
}
transformed data {
  real Xc[N];                   // centered predictor
  real means_X = mean(X);
  for (i in 1:N) {
    Xc[i] = X[i] - means_X;
  }
}
parameters {
  real beta;                    // fixed slope
  real Intercept;               // fixed intercept
  vector[M] u;                  // varying intercept
  real<lower=0> phi;            // shape parameter
}
model {
  real mu;
  target += student_t_lpdf(Intercept | 3, 5, 10);
  target += gamma_lpdf(phi | 0.01, 0.01);
  target += normal_lpdf(u | 0, 1);
  target += normal_lpdf(beta | 0, 10);
  for (n in 1:N) {
    mu = Intercept + u[J[n]] + Xc[n] * beta;
    Y[n] ~ neg_binomial_2_log(mu, phi);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - means_X * beta;
}
and then
stan_dat <- list(
  N = nrow(d_eng),
  Y = d_eng$n_cited,
  X = d_eng$year,
  M = nlevels(as.factor(d_eng$year)),
  J = as.integer(as.factor(d_eng$year))
)
fit <- stan(
  file = "nb_model.stan",
  data = stan_dat,
  iter = 4000,
  chains = 4,
  warmup = 2000,
  cores = 4,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 15
  )
)
fit
stan_hist(fit, pars = c("b_Intercept", "beta", "phi"))
stan_hist(fit, pars = c("u"))
But now I have two new problems with this Stan model:
- I don't know how to do posterior predictive checks. Would something like this work?
generated quantities {
  int y_rep[N];
  for (n in 1:N) {
    // use the centered predictor Xc (not X) to match mu in the model block
    y_rep[n] = neg_binomial_2_rng(exp(Intercept + u[J[n]] + Xc[n] * beta), phi);
  }
}
- In a multilevel model, how do I predict for a new level, for example the year 2020?
Yes, that looks like a good basis for PP checks.
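Once you have y_rep, the check itself happens outside Stan: compare a test statistic between the observed data and each replicated dataset. Here is a sketch with simulated stand-in draws (since I am not running your model), using the maximum count as the statistic, which is a natural choice for catching a too-thin right tail:

```python
# Illustration: a test-statistic posterior predictive check from y_rep draws.
# y_obs and y_rep are simulated stand-ins for the real data / extracted draws.
import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.negative_binomial(5, 0.1, size=50)          # stand-in for the data
y_rep = rng.negative_binomial(5, 0.1, size=(4000, 50))  # stand-in for y_rep

stat_obs = y_obs.max()
stat_rep = y_rep.max(axis=1)               # one statistic per posterior draw

p_value = (stat_rep >= stat_obs).mean()    # posterior predictive p-value
print(f"observed max: {stat_obs}, PPC p-value: {p_value:.2f}")
# Values near 0 or 1 suggest the model fails to reproduce that feature.
```

In R, `bayesplot::ppc_stat(y, y_rep, stat = "max")` does the same thing graphically from an extracted y_rep matrix.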
If year has a varying effect/intercept, the most common option is to sample from the hyperprior…
Nevertheless, your model doesn't really look like a multilevel model, since you don't have any "hyperprior". A classical multilevel model would have a parameter (say tau) that works like:
parameters {
  real<lower=0> tau;    // hyperprior scale; constrained positive
  vector[M] u_raw;
  // .... and the rest as you have it
}
transformed parameters {
  vector[M] u = u_raw * tau;    // non-centered parameterization
}
model {
  target += normal_lpdf(tau | 0, 1);    // half-normal, given the lower bound
  target += normal_lpdf(u_raw | 0, 1);
  // .... and the rest as you have it
}
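With such a tau in place, predicting for an unseen year amounts to drawing a fresh intercept from the hyperprior for each posterior draw, then simulating the outcome. Here is a Python sketch with made-up stand-ins for the posterior arrays (the names intercept, beta, tau, phi mirror the model but are simulated here, and the centering value is assumed; use your real draws and means_X):

```python
# Illustration: posterior predictive draws for a NEW group level.
# Per posterior draw: sample u_new ~ normal(0, tau), then simulate from the
# negative binomial with mean exp(Intercept + u_new + Xc_new * beta).
import numpy as np

rng = np.random.default_rng(7)
n_draws = 4000
# made-up stand-ins for posterior draws (replace with rstan::extract output):
intercept = rng.normal(2.0, 0.1, n_draws)
beta = rng.normal(0.05, 0.01, n_draws)
tau = np.abs(rng.normal(0.5, 0.05, n_draws))
phi = np.abs(rng.normal(10.0, 1.0, n_draws))

x_new = 2020 - 2010                # new year minus an assumed means_X
u_new = rng.normal(0.0, tau)       # one fresh group effect per posterior draw
mu = np.exp(intercept + u_new + x_new * beta)

# numpy parameterizes NB by (n, p); with mean mu and shape phi, p = phi/(phi+mu)
y_new = rng.negative_binomial(phi, phi / (phi + mu))
print("posterior predictive mean for the new year:", y_new.mean())
```

The key point is that u_new is redrawn per posterior draw, so the prediction carries both the posterior uncertainty in tau and the between-year variation.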
Anyway, I am slightly puzzled why you opted to move away from brms, as the model you are using is well within its capabilities. Obviously, if your goal is to get better at Stan, then this is a good approach :-)
I think Stan can help me understand Bayes better, although I only know a little.
I am curious how this works.
It might be more straightforward in the end, but Stan code is fun!
Not sure what you're trying to say. Is there a mistake in my code/explanation? That is well possible, but I can't see it, so I'd need more details :-)
I am just curious to see the difference. I agree with everything you said!