 # Can Bayesian predict cumulative-quantity?

As a beginner, I want to ask a question about Bayesian prediction. There is a data about the number of papers published and cited at Engineering discipline, which have four variables:

• discipline
• year (2000-2019)
• number of published papers
• citation frequency
````````` r
library(tidyverse)
library(brms)
library(tidybayes)

d_eng <- tibble::tribble(
~discipline, ~year, ~n_paper, ~n_cited,
"Engineering", 2000L, 3L, 10L,
"Engineering", 2001L, 4L, 38L,
"Engineering", 2002L, 5L, 84L,
"Engineering", 2003L, 10L, 175L,
"Engineering", 2004L, 5L, 96L,
"Engineering", 2005L, 12L, 76L,
"Engineering", 2006L, 7L, 53L,
"Engineering", 2007L, 11L, 79L,
"Engineering", 2008L, 7L, 105L,
"Engineering", 2009L, 11L, 109L,
"Engineering", 2010L, 15L, 143L,
"Engineering", 2011L, 14L, 207L,
"Engineering", 2012L, 16L, 93L,
"Engineering", 2013L, 15L, 95L,
"Engineering", 2014L, 15L, 102L,
"Engineering", 2015L, 13L, 208L,
"Engineering", 2016L, 13L, 103L,
"Engineering", 2017L, 20L, 431L,
"Engineering", 2018L, 49L, 1166L,
"Engineering", 2019L, 56L, 231L
)
``````

If the total number cited for 10 years exceeds a certain threshold (suppose the threshold is 2843), then this discipline is a first-class subject.

my idea, assuming that the threshold line remains unchanged or changes little, can I predict how likely the engineering discipline will exceed the threshold line in 2020?

# I do like this:

1. It is simply assumed that the cited quantity is a linear relationship with time

2. modeling

``````brms_eng <- brms::brm(
data = d_eng,
family = gaussian,
n_cited ~ 1 + year,
iter = 41000, warmup = 40000, chains = 4, cores = 4,
seed = 1024,
control = list(
max_treedepth = 13
)
)
#> Compiling the C++ model
#> Start sampling
``````
1. predict the quantity to be cited in 2020, i.e. the increment in 2020
``````tb <- tibble(
year = c(2020)
)

post_draw

#> # A tibble: 4,000 x 6
#> # Groups:   year, .row 
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       489.
#>  2  2020     1     NA         NA     2       672.
#>  3  2020     1     NA         NA     3       277.
#>  4  2020     1     NA         NA     4       346.
#>  5  2020     1     NA         NA     5       201.
#>  6  2020     1     NA         NA     6       439.
#>  7  2020     1     NA         NA     7        90.0
#>  8  2020     1     NA         NA     8      -328.
#>  9  2020     1     NA         NA     9       258.
#> 10  2020     1     NA         NA    10       425.
#> # ... with 3,990 more rows
``````
1. Calculate the accumulated quantity in the previous 9 years
``````exist_nine_year_cum_cited <- d_eng %>%
filter(between(year, 2011, 2019)) %>%
summarise(
exist_cum_cited = sum(n_cited)
) %>%
pull()

exist_nine_year_cum_cited
#>  2636
``````
1. The predict value of increment plus the accumulated quantity in the previous 9 years, thus constitutes the predict value of the accumulated value in 2020
``````post_pred <- post_draw %>%
mutate(
.prediction = .prediction + exist_nine_year_cum_cited
)
post_pred
#> # A tibble: 4,000 x 6
#> # Groups:   year, .row 
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       3125.
#>  2  2020     1     NA         NA     2       3308.
#>  3  2020     1     NA         NA     3       2913.
#>  4  2020     1     NA         NA     4       2982.
#>  5  2020     1     NA         NA     5       2837.
#>  6  2020     1     NA         NA     6       3075.
#>  7  2020     1     NA         NA     7       2726.
#>  8  2020     1     NA         NA     8       2308.
#>  9  2020     1     NA         NA     9       2894.
#> 10  2020     1     NA         NA    10       3061.
#> # ... with 3,990 more rows
``````
1. finally calculate the probability of exceeding the threshold
``````exceed_prob <- post_pred %>%
group_by(year) %>%
summarise(
pred_mean = mean(.prediction),
prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
)

exceed_prob
#> # A tibble: 1 x 3
#>    year pred_mean prob_above_line
#>   <dbl>     <dbl>           <dbl>
#> 1  2020     3026.           0.786

``````

Created on 2020-04-20 by the reprex package (v0.3.0)

Question: now can we say that 78.6% probability of engineering disciplines will enter first-class disciplines in 2020?

1 Like

This is a good start. I like to start with the simplest model and then build out a more complex one from there.

I would show the upper and lower bounds not just the point estimate (which is what I think I am looking at). I need to read back through all this to see if anything else stands out.

That’s what the model predicts, at least. So conditioning on your model and data, that’s the prediction.

It seems like predicting the year quantities might be more informative than doing a threshold.

In the context of publishing papers, I guess this negative prediction is weird:

``````#> 8 2020 1 NA NA 8 -328.
``````

which suggests the model isn’t perfect, which means the prediction probably isn’t perfect. It’s not clear what having the wrong model here means though in terms of the 78.6% prediction (Does it go up? Does it go down? How much? etc.)

2 Likes

@bbbales2 Thank you for pointing out my mistake。According to your suggestion, two places have been modified：

• the cited quantity must be greater than 0，so the formula adjust as `n_cited | trunc(lb = 0) ~ 1 + year`
• I am optimistic that the discipline is developing in a positive way, so add prior as `set_prior("normal(22, 10)", lb = 0, class = "b")`
``````brms_eng <- brms::brm(
data = d_eng,
family = gaussian,

n_cited | trunc(lb = 0) ~ 1 + year,
set_prior("normal(22, 10)", lb = 0, class = "b"),

iter = 41000, warmup = 40000, chains = 4, cores = 4,
seed = 1024,
control = list(
max_treedepth = 13
)
)
``````

thus, we can calculate the posterior summary as follows

``````brms_eng %>% posterior_summary()

#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979
``````
1. finally calculate the probability of exceeding the specific threshold
``````exceed_prob <- post_pred %>%
group_by(year) %>%
summarise(
pred_mean = mean(.prediction),
prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
)

exceed_prob
# A tibble: 1 x 3
year pred_mean prob_above_line
<dbl>     <dbl>           <dbl>
1  2020     3052.           0.815
``````

the result changes to 81.5%.

1 Like

Not checking the code specifically, but that sounds like you’re doing the right thing to make a posterior prediction.

``````iter = 41000, warmup = 40000
``````

That’s a huge amount of warmup. Does the default warmup = 1000, iter = 2000 not work?

``````seed = 1024
``````

In general I think it’s a bad idea to set seeds. The random chain to chain variation is helpful for finding problems. This should be off except for very specific debugging cases.

1 Like

I would say seeds are good, when you publish results.

1 Like

``````brms_eng <- brms::brm(
data = d_eng,
family = gaussian,
n_cited | trunc(lb = 0) ~ 1 + year,
set_prior("normal(22, 10)", lb = 0, class = "b"),
iter = 4000, warmup = 2000, chains = 4, cores = 4,
max_treedepth = 13)
)
``````

which give a quite close result

``````brms_eng %>% posterior_summary()
#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979
``````

and

``````exceed_prob

# A tibble: 1 x 3
year pred_mean prob_above_line
<dbl>     <dbl>           <dbl>
1  2020     3050.           0.822
``````

Thanks again.

@bbbales2 I try to convert the brms code about truncated data to Stan

``````excode <- '
data {
int<lower=1> N;
vector[N] year;
vector<lower=0>[N] cited;
}

parameters {
real alpha;
real beta;
real<lower=0> sigma;
}

model {
// Priors
beta ~ normal(22, 10) T[0, ];

// Likelihood
for (n in 1:N){
cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];
}

}
'

stan_dat <- list(
N = nrow(d_eng),
year = d_eng\$year,
cited = d_eng\$n_cited
)

stan(model_code = excode,
data = stan_dat,
iter = 4000,
chains = 4,
warmup = 2000,
cores = 4,
control = list(
max_treedepth = 13
))
``````

However, I get back some warning messages and the final result is obviously incorrect.

``````Warning messages:
1: There were 6216 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

3: The largest R-hat is 1.13, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
``````

Can you tell me how to write this Stan code？

thanks

``````cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];
``````

This truncation makes sense – it’s saying there won’t be negative citations in a year.

Another distribution that might work would be the negative binomial: https://mc-stan.org/docs/2_23/functions-reference/nbalt.html

``````beta ~ normal(22, 10) T[0, ];
``````

I don’t think you’d want to truncate your slope (which is where the error is coming from).

You could make the error go away by:

1. Removing the truncation (this is what I think you should do)
2. Adding a lower = 0 constraint to beta (`real<lower = 0.0> beta;`)
2 Likes

Thank you kindly.
you are right. The negative binomial distribution is more suited to citations in a year.

``````excode <- '
data {
int<lower=1> N;
int Y[N];
int<lower = 1> K;
matrix[N, K] X;
}

transformed data {
int Kc = K - 1;
matrix[N, Kc] Xc;
vector[Kc] means_X;
for (i in 2:K) {
means_X[i - 1] = mean(X[, i]);
Xc[, i - 1] = X[, i] - means_X[i - 1];
}
}

parameters {
vector[Kc] beta;
real alpha;
real<lower=0> shape;
}

model {
// Priors
target += student_t_lpdf(alpha | 3, 5, 10);
target += gamma_lpdf(shape | 0.01, 0.01);

// Likelihood
target += neg_binomial_2_log_glm_lpmf(Y | Xc, alpha, beta, shape);
}

generated quantities {
real b_alpha = alpha - dot_product(means_X, beta);

int y_rep[N];
for (n in 1:N) {
y_rep[n] = neg_binomial_2_log_rng(b_alpha + X[n, 2:] * beta, phi);
}

}
'
``````

then

``````X <- unname(model.matrix(~ 1 + year, d_eng))
attr(X, "assign") <- NULL
ncol(X)

stan_dat <- list(
N = nrow(d_eng),
Y = d_eng\$n_cited,
K = ncol(X),
X = X
)

fit <- stan(
model_code = excode,
data = stan_dat,
iter = 4000,
chains = 4,
warmup = 2000,
cores = 4,
control = list(
max_treedepth = 13
))
``````

The result is good for me

``````fit
mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta    0.12    0.00  0.03    0.07    0.10    0.12    0.14    0.17  4761    1
alpha      4.96    0.00  0.15    4.68    4.86    4.96    5.06    5.28  4682    1
shape      2.45    0.01  0.78    1.18    1.90    2.36    2.90    4.21  5034    1
b_alpha -233.89    0.75 52.05 -339.20 -266.54 -232.76 -200.05 -130.52  4758    1
lp__    -124.39    0.03  1.31 -127.79 -125.00 -124.04 -123.43 -122.91  2692    1
``````

Posterior Predictive Checks

``````library(bayesplot)
y_rep <- as.matrix(fit, pars = "y_rep")
y_rep

ppc_dens_overlay(y = d_eng\$n_cited, y_rep[1:100,])
``````

2 Likes

I like the posterior predictive plot! Looks like the predictions are larger than the data suggests.

I don’t know if this would help, but can you add priors on beta?

Also can you just go with normal priors on everything? It’s much easier to interpret the parameters (if there isn’t a particular reason to use a gamma/student_t here)

as a beginner i do not know how to set priors on parameters, so i choose the easy way borrowing from

``````brms::make_stancode(
n_cited ~ 1 + year,
data = d_eng,
family = "negbinomial"
)
``````

which give me

``````model {
target += student_t_lpdf(Intercept | 3, 5, 10);
target += gamma_lpdf(shape | 0.01, 0.01);
}
``````

however, it seems hard to explain although it turns out to be good result.

Following @bbbales2 advices, i change the priors to

``````target += normal_lpdf(alpha | 0, 100);
target += cauchy_lpdf(phi | 0, 2);
``````

it also work well.

However the `prior on beta` looks sensitive, and any setting will cause the model to fail, for example

``````target += normal_lpdf(beta | 0, 10);
``````
``````Error in unserialize(socklist[[n]]) : error reading from connection
Error in serialize(data, node\$con, xdr = FALSE) : error writing to connection
``````
1 Like

according here, set

``````library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = "-march=corei7 -mtune=corei7")
``````

the error goes away. So rewrite priors as

``````target += normal_lpdf(alpha | 0, 100);
target += normal_lpdf(beta | 0, 10);
target += cauchy_lpdf(phi | 0, 2);
``````

model works again and now looks near perfect.

1 Like

Good find! I’ve seen this error before and not known what to do. Do you still get the error if the only thing you set is:

``````rstan_options(auto_write = TRUE)
``````

By any chance?

On the posterior predictives?

@bbbales2 Thank you for your patience, you are the best teacher.

1. if set only
``````rstan_options(auto_write = TRUE)
``````

The Stan model is also working very well.

1. posterior predictive is not very good, which are still larger than the data suggests.
i am sorry that i have misunderstand it.

Cool!

Hmm, are there other posterior prediction plots you could make? Like per year? Or per journal, or something?

fit at each year，like this?

``````rep_draw <- fit %>% tidybayes::spread_draws(y_rep[condition])
rep_draw
# A tibble: 160,000 x 5
# Groups:   condition 
condition y_rep .chain .iteration .draw
<int> <dbl>  <int>      <int> <int>
1         1    94      1          1     1
2         1    93      1          2     2
3         1    45      1          3     3
4         1    72      1          4     4
5         1    43      1          5     5
6         1    96      1          6     6
7         1   193      1          7     7
8         1    56      1          8     8
9         1    33      1          9     9
10         1    34      1         10    10
# ... with 159,990 more rows

d_rep <- rep_draw %>%
group_by(condition) %>%
summarise(
pred_mean = mean(y_rep, na.rm = T),
Q2.5 = quantile(y_rep, probs = c(0.025)),
Q97.5 = quantile(y_rep, probs = c(0.975))
) %>%
mutate(year = c(2000:2019))

d_rep
# A tibble: 20 x 5
condition pred_mean  Q2.5 Q97.5  year
<int>     <dbl> <dbl> <dbl> <int>
1         1      48.3   5    149.  2000
2         2      53.8   5    159.  2001
3         3      59.9   7    180   2002
4         4      68.2   7    196   2003
5         5      75.3   9    216   2004
6         6      85.0  10    240   2005
7         7      95.6  10    275   2006
8         8     106.   13    294   2007
9         9     120.   13    331.  2008
10        10     137.   17    383.  2009
11        11     154.   18    441.  2010
12        12     175.   22    481   2011
13        13     193.   25.0  538.  2012
14        14     221.   27    624   2013
15        15     247.   26    702.  2014
16        16     280.   33    793.  2015
17        17     314.   37    917.  2016
18        18     362.   41   1052   2017
19        19     405.   46   1186.  2018
20        20     452.   54   1345.  2019

d_rep %>%
ggplot(aes(x = year)) +
geom_pointrange(
aes(y = pred_mean,
ymin = Q2.5,
ymax = Q97.5),
color = "red",  shape = 20
) +
geom_point(
data = d_eng,
aes(y = n_cited), color = "black"
)
``````

Yeah, yeah, this is it.

I wonder why 2018 is so extreme? It really looks like a model fit without this datapoint would predict is very poorly.

After deleting this point, rerun the model, and the posterior result shows that it is much better and seems more reasonable.

So, if i want to predict the situation in 2020, which data should we use? can i choose the data without the extreme point?