As a beginner, I want to ask a question about Bayesian prediction. There is a data about the number of papers published and cited at Engineering discipline, which have four variables:
- discipline
- year (2000-2019)
- number of published papers
- citation frequency
``` r
library(tidyverse)
library(brms)
library(tidybayes)
d_eng <- tibble::tribble(
~discipline, ~year, ~n_paper, ~n_cited,
"Engineering", 2000L, 3L, 10L,
"Engineering", 2001L, 4L, 38L,
"Engineering", 2002L, 5L, 84L,
"Engineering", 2003L, 10L, 175L,
"Engineering", 2004L, 5L, 96L,
"Engineering", 2005L, 12L, 76L,
"Engineering", 2006L, 7L, 53L,
"Engineering", 2007L, 11L, 79L,
"Engineering", 2008L, 7L, 105L,
"Engineering", 2009L, 11L, 109L,
"Engineering", 2010L, 15L, 143L,
"Engineering", 2011L, 14L, 207L,
"Engineering", 2012L, 16L, 93L,
"Engineering", 2013L, 15L, 95L,
"Engineering", 2014L, 15L, 102L,
"Engineering", 2015L, 13L, 208L,
"Engineering", 2016L, 13L, 103L,
"Engineering", 2017L, 20L, 431L,
"Engineering", 2018L, 49L, 1166L,
"Engineering", 2019L, 56L, 231L
)
If the total number cited for 10 years exceeds a certain threshold (suppose the threshold is 2843), then this discipline is a first-class subject.
my idea, assuming that the threshold line remains unchanged or changes little, can I predict how likely the engineering discipline will exceed the threshold line in 2020?
I do like this:
-
It is simply assumed that the cited quantity is a linear relationship with time
-
modeling
brms_eng <- brms::brm(
data = d_eng,
family = gaussian,
n_cited ~ 1 + year,
iter = 41000, warmup = 40000, chains = 4, cores = 4,
seed = 1024,
control = list(
adapt_delta = 0.99,
max_treedepth = 13
)
)
#> Compiling the C++ model
#> Start sampling
- predict the quantity to be cited in 2020, i.e. the increment in 2020
tb <- tibble(
year = c(2020)
)
post_draw <- tb %>% add_predicted_draws(brms_eng)
post_draw
#> # A tibble: 4,000 x 6
#> # Groups: year, .row [1]
#> year .row .chain .iteration .draw .prediction
#> <dbl> <int> <int> <int> <int> <dbl>
#> 1 2020 1 NA NA 1 489.
#> 2 2020 1 NA NA 2 672.
#> 3 2020 1 NA NA 3 277.
#> 4 2020 1 NA NA 4 346.
#> 5 2020 1 NA NA 5 201.
#> 6 2020 1 NA NA 6 439.
#> 7 2020 1 NA NA 7 90.0
#> 8 2020 1 NA NA 8 -328.
#> 9 2020 1 NA NA 9 258.
#> 10 2020 1 NA NA 10 425.
#> # ... with 3,990 more rows
- Calculate the accumulated quantity in the previous 9 years
exist_nine_year_cum_cited <- d_eng %>%
filter(between(year, 2011, 2019)) %>%
summarise(
exist_cum_cited = sum(n_cited)
) %>%
pull()
exist_nine_year_cum_cited
#> [1] 2636
- The predict value of increment plus the accumulated quantity in the previous 9 years, thus constitutes the predict value of the accumulated value in 2020
post_pred <- post_draw %>%
mutate(
.prediction = .prediction + exist_nine_year_cum_cited
)
post_pred
#> # A tibble: 4,000 x 6
#> # Groups: year, .row [1]
#> year .row .chain .iteration .draw .prediction
#> <dbl> <int> <int> <int> <int> <dbl>
#> 1 2020 1 NA NA 1 3125.
#> 2 2020 1 NA NA 2 3308.
#> 3 2020 1 NA NA 3 2913.
#> 4 2020 1 NA NA 4 2982.
#> 5 2020 1 NA NA 5 2837.
#> 6 2020 1 NA NA 6 3075.
#> 7 2020 1 NA NA 7 2726.
#> 8 2020 1 NA NA 8 2308.
#> 9 2020 1 NA NA 9 2894.
#> 10 2020 1 NA NA 10 3061.
#> # ... with 3,990 more rows
- finally calculate the probability of exceeding the threshold
exceed_prob <- post_pred %>%
group_by(year) %>%
summarise(
pred_mean = mean(.prediction),
prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
)
exceed_prob
#> # A tibble: 1 x 3
#> year pred_mean prob_above_line
#> <dbl> <dbl> <dbl>
#> 1 2020 3026. 0.786
Created on 2020-04-20 by the reprex package (v0.3.0)
Question: now can we say that 78.6% probability of engineering disciplines will enter first-class disciplines in 2020?