Can Bayesian models predict a cumulative quantity?

As a beginner, I want to ask a question about Bayesian prediction. I have data on the number of papers published and cited in the Engineering discipline, with four variables:

  • discipline
  • year (2000-2019)
  • number of published papers
  • citation frequency
``` r
library(tidyverse)
library(brms)
library(tidybayes)

d_eng <- tibble::tribble(
  ~discipline, ~year, ~n_paper, ~n_cited,
  "Engineering", 2000L, 3L, 10L,
  "Engineering", 2001L, 4L, 38L,
  "Engineering", 2002L, 5L, 84L,
  "Engineering", 2003L, 10L, 175L,
  "Engineering", 2004L, 5L, 96L,
  "Engineering", 2005L, 12L, 76L,
  "Engineering", 2006L, 7L, 53L,
  "Engineering", 2007L, 11L, 79L,
  "Engineering", 2008L, 7L, 105L,
  "Engineering", 2009L, 11L, 109L,
  "Engineering", 2010L, 15L, 143L,
  "Engineering", 2011L, 14L, 207L,
  "Engineering", 2012L, 16L, 93L,
  "Engineering", 2013L, 15L, 95L,
  "Engineering", 2014L, 15L, 102L,
  "Engineering", 2015L, 13L, 208L,
  "Engineering", 2016L, 13L, 103L,
  "Engineering", 2017L, 20L, 431L,
  "Engineering", 2018L, 49L, 1166L,
  "Engineering", 2019L, 56L, 231L
)
```

If the total number of citations over 10 years exceeds a certain threshold (suppose the threshold is 2843), then the discipline counts as a first-class subject.

My idea: assuming that the threshold line remains unchanged or changes little, can I predict how likely the engineering discipline is to exceed the threshold line in 2020?
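For context, the most recent 10-year total can be computed directly from the data above:

``` r
# 10-year cumulative citations for 2010-2019 (sums to 2779,
# just below the 2843 threshold)
d_eng %>%
  filter(year >= 2010) %>%
  summarise(total_cited = sum(n_cited))
```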

I proceed like this:

  1. Simply assume that the cited quantity has a linear relationship with time.

  2. Model the data:

brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,
  n_cited ~ 1 + year,
  iter = 41000, warmup = 40000, chains = 4, cores = 4,
  seed = 1024,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)
#> Compiling the C++ model
#> Start sampling
  3. Predict the quantity to be cited in 2020, i.e. the increment for 2020:
tb <- tibble(
  year = c(2020)
)

post_draw <- tb %>% add_predicted_draws(brms_eng)
post_draw

#> # A tibble: 4,000 x 6
#> # Groups:   year, .row [1]
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       489. 
#>  2  2020     1     NA         NA     2       672. 
#>  3  2020     1     NA         NA     3       277. 
#>  4  2020     1     NA         NA     4       346. 
#>  5  2020     1     NA         NA     5       201. 
#>  6  2020     1     NA         NA     6       439. 
#>  7  2020     1     NA         NA     7        90.0
#>  8  2020     1     NA         NA     8      -328. 
#>  9  2020     1     NA         NA     9       258. 
#> 10  2020     1     NA         NA    10       425. 
#> # ... with 3,990 more rows
  4. Calculate the accumulated quantity over the previous 9 years:
exist_nine_year_cum_cited <- d_eng %>%
  filter(between(year, 2011, 2019)) %>%
  summarise(
    exist_cum_cited = sum(n_cited)
  ) %>%
  pull()

exist_nine_year_cum_cited
#> [1] 2636
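Equivalently, the 2020 increment alone must exceed 2843 − 2636 = 207 citations for the threshold to be crossed:

``` r
# minimum 2020 increment needed to cross the threshold
2843 - exist_nine_year_cum_cited
#> [1] 207
```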
  5. Add the predicted increment to the accumulated quantity from the previous 9 years; this gives the predicted cumulative value for 2020:
post_pred <- post_draw %>%
  mutate(
    .prediction = .prediction + exist_nine_year_cum_cited
  )
post_pred
#> # A tibble: 4,000 x 6
#> # Groups:   year, .row [1]
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       3125.
#>  2  2020     1     NA         NA     2       3308.
#>  3  2020     1     NA         NA     3       2913.
#>  4  2020     1     NA         NA     4       2982.
#>  5  2020     1     NA         NA     5       2837.
#>  6  2020     1     NA         NA     6       3075.
#>  7  2020     1     NA         NA     7       2726.
#>  8  2020     1     NA         NA     8       2308.
#>  9  2020     1     NA         NA     9       2894.
#> 10  2020     1     NA         NA    10       3061.
#> # ... with 3,990 more rows
  6. Finally, calculate the probability of exceeding the threshold:
exceed_prob <- post_pred %>%
  group_by(year) %>%
  summarise(
    pred_mean = mean(.prediction),
    prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
  )

exceed_prob
#> # A tibble: 1 x 3
#>    year pred_mean prob_above_line
#>   <dbl>     <dbl>           <dbl>
#> 1  2020     3026.           0.786

Created on 2020-04-20 by the reprex package (v0.3.0)

Question: can we now say that there is a 78.6% probability that the engineering discipline will become a first-class discipline in 2020?


This is a good start. I like to start with the simplest model and then build out a more complex one from there.

I would show the upper and lower bounds, not just the point estimate (which is what I think I am looking at). I need to read back through all this to see if anything else stands out.
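For instance, something like this with tidybayes, reusing the post_pred draws from above (median_qi gives the median plus a 95% interval):

``` r
# summarise the predictive draws with bounds, not just a point estimate
post_pred %>%
  group_by(year) %>%
  tidybayes::median_qi(.prediction, .width = 0.95)
```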

That’s what the model predicts, at least. So conditioning on your model and data, that’s the prediction.

It seems like predicting the year quantities might be more informative than doing a threshold.

In the context of publishing papers, I guess this negative prediction is weird:

#> 8 2020 1 NA NA 8 -328.

which suggests the model isn’t perfect, which means the prediction probably isn’t perfect. It’s not clear what having the wrong model here means though in terms of the 78.6% prediction (Does it go up? Does it go down? How much? etc.)


@bbbales2 Thank you for pointing out my mistake. Following your suggestion, I have modified two places:

  • The cited quantity must be greater than 0, so the formula becomes n_cited | trunc(lb = 0) ~ 1 + year
  • I am optimistic that the discipline is developing in a positive direction, so I add the prior set_prior("normal(22, 10)", lb = 0, class = "b")
brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,

  n_cited | trunc(lb = 0) ~ 1 + year,
  set_prior("normal(22, 10)", lb = 0, class = "b"),

  iter = 41000, warmup = 40000, chains = 4, cores = 4,
  seed = 1024,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)

Thus, we can calculate the posterior summary as follows:

brms_eng %>% posterior_summary()

#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979
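Note that post_pred then needs to be re-drawn from the refitted model; presumably the earlier prediction step is just repeated, e.g.:

``` r
# re-draw 2020 predictions from the refitted model and add the 9-year total
post_pred <- tibble(year = 2020) %>%
  add_predicted_draws(brms_eng) %>%
  mutate(.prediction = .prediction + exist_nine_year_cum_cited)
```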
Then recalculate the probability of exceeding the threshold:
exceed_prob <- post_pred %>%
  group_by(year) %>%
  summarise(
    pred_mean = mean(.prediction),
    prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
  )

exceed_prob
# A tibble: 1 x 3
   year pred_mean prob_above_line
  <dbl>     <dbl>           <dbl>
1  2020     3052.           0.815

The result changes to 81.5%.


Not checking the code specifically, but that sounds like you’re doing the right thing to make a posterior prediction.

A couple comments:

iter = 41000, warmup = 40000

That’s a huge amount of warmup. Does the default warmup = 1000, iter = 2000 not work?

seed = 1024

In general I think it’s a bad idea to set seeds. The random chain to chain variation is helpful for finding problems. This should be off except for very specific debugging cases.


I would say seeds are good when you publish results.


Thank you for your suggestion; I have adjusted accordingly:

brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,
  n_cited | trunc(lb = 0) ~ 1 + year,
  set_prior("normal(22, 10)", lb = 0, class = "b"),
  iter = 4000, warmup = 2000, chains = 4, cores = 4,
  control = list(adapt_delta = 0.99,
                 max_treedepth = 13)
)

which gives quite a close result:

brms_eng %>% posterior_summary()
#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979

and

exceed_prob

# A tibble: 1 x 3
   year pred_mean prob_above_line
  <dbl>     <dbl>           <dbl>
1  2020     3050.           0.822

Thanks again.

@bbbales2 I tried to convert the brms code for the truncated data to Stan:

excode <- '
data {
  int<lower=1> N; 
  vector[N] year;
  vector<lower=0>[N] cited;
}

parameters {
  real alpha;   
  real beta;   
  real<lower=0> sigma;  
}

model {
  // Priors
  beta ~ normal(22, 10) T[0, ];

  // Likelihood
  for (n in 1:N){
    cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];
  }

}
'


stan_dat <- list(
  N = nrow(d_eng),
  year = d_eng$year,
  cited = d_eng$n_cited
)

stan(model_code = excode,
     data = stan_dat,
     iter = 4000,
     chains = 4,
     warmup = 2000,
     cores = 4,
     control = list(
       adapt_delta = 0.99,
       max_treedepth = 13
     ))

However, I get back some warning messages and the final result is obviously incorrect.

Warning messages:
1: There were 6216 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems
 
3: The largest R-hat is 1.13, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess 

Can you tell me how to write this Stan code?

thanks

cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];

This truncation makes sense – it’s saying there won’t be negative citations in a year.

Another distribution that might work would be the negative binomial: https://mc-stan.org/docs/2_23/functions-reference/nbalt.html
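As a rough sketch of what that might look like in brms (assuming the same formula; negbinomial() is the brms family):

``` r
# sketch: negative binomial model for yearly citation counts
brms_eng_nb <- brms::brm(
  n_cited ~ 1 + year,
  data = d_eng,
  family = negbinomial()
)
```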

beta ~ normal(22, 10) T[0, ];

I don’t think you’d want to truncate your slope (which is where the error is coming from).

You could make the error go away by:

  1. Removing the truncation (this is what I think you should do)
  2. Adding a lower = 0 constraint to beta (real<lower = 0.0> beta;)

Thank you kindly, you are right: the negative binomial distribution is better suited to citation counts in a year. After a lot of googling:

excode <- '
data {
  int<lower=1> N; 
  int Y[N];
  int<lower = 1> K;
  matrix[N, K] X; 
}

transformed data {
 int Kc = K - 1;
 matrix[N, Kc] Xc;
 vector[Kc] means_X;
 for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
 }

parameters {
  vector[Kc] beta;   
  real alpha;   
  real<lower=0> shape;  
}

model {
  // Priors
  target += student_t_lpdf(alpha | 3, 5, 10);
  target += gamma_lpdf(shape | 0.01, 0.01);
  
  // Likelihood
  target += neg_binomial_2_log_glm_lpmf(Y | Xc, alpha, beta, shape);
}

generated quantities {
 real b_alpha = alpha - dot_product(means_X, beta);

  int y_rep[N]; 
   for (n in 1:N) {
    y_rep[n] = neg_binomial_2_log_rng(b_alpha + X[n, 2:] * beta, shape);
  }
  
}
'

then

X <- unname(model.matrix(~ 1 + year, d_eng))
attr(X, "assign") <- NULL
ncol(X)


stan_dat <- list(
  N = nrow(d_eng),
  Y = d_eng$n_cited,
  K = ncol(X),
  X = X
)


fit <- stan(
  model_code = excode,
  data = stan_dat,
  iter = 4000,
  chains = 4,
  warmup = 2000,
  cores = 4,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)

The result looks good to me:

fit
           mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]    0.12    0.00  0.03    0.07    0.10    0.12    0.14    0.17  4761    1
alpha      4.96    0.00  0.15    4.68    4.86    4.96    5.06    5.28  4682    1
shape      2.45    0.01  0.78    1.18    1.90    2.36    2.90    4.21  5034    1
b_alpha -233.89    0.75 52.05 -339.20 -266.54 -232.76 -200.05 -130.52  4758    1
lp__    -124.39    0.03  1.31 -127.79 -125.00 -124.04 -123.43 -122.91  2692    1

Posterior Predictive Checks

library(bayesplot)
y_rep <- as.matrix(fit, pars = "y_rep")
y_rep

ppc_dens_overlay(y = d_eng$n_cited, y_rep[1:100,])
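Other checks follow the same pattern; for instance, a test-statistic PPC (a sketch using bayesplot's ppc_stat) shows how the replicates compare on extremes:

``` r
# compare the maximum yearly citation count between data and replicates
ppc_stat(y = d_eng$n_cited, yrep = y_rep, stat = "max")
```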


I like the posterior predictive plot! Looks like the predictions are larger than the data suggests.

I don’t know if this would help, but can you add priors on beta?

Also can you just go with normal priors on everything? It’s much easier to interpret the parameters (if there isn’t a particular reason to use a gamma/student_t here)

As a beginner I do not know how to set priors on the parameters, so I chose the easy way, borrowing from

brms::make_stancode(
  n_cited ~ 1 + year,
  data = d_eng,
  family = "negbinomial"
)

which gives me

model {
  target += student_t_lpdf(Intercept | 3, 5, 10);
  target += gamma_lpdf(shape | 0.01, 0.01);
}

However, these priors seem hard to interpret, even though they produce a good result.
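For what it's worth, brms can also list its default priors directly, which may be easier than reading the generated Stan code (a sketch):

``` r
# inspect the default priors brms would use for this model
brms::get_prior(n_cited ~ 1 + year, data = d_eng, family = "negbinomial")
```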

Following @bbbales2's advice, I changed the priors to

target += normal_lpdf(alpha | 0, 100);
target += cauchy_lpdf(shape | 0, 2);

It also works well.

However, the prior on beta looks sensitive; any setting causes the model to fail, for example:

target += normal_lpdf(beta | 0, 10);
Error in unserialize(socklist[[n]]) : error reading from connection
Error in serialize(data, node$con, xdr = FALSE) : error writing to connection

Following a suggestion I found, I set

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = "-march=corei7 -mtune=corei7")

and the error goes away. So I rewrote the priors as

target += normal_lpdf(alpha | 0, 100);
target += normal_lpdf(beta | 0, 10);
target += cauchy_lpdf(shape | 0, 2);

The model works again and now looks nearly perfect.


Good find! I’ve seen this error before and not known what to do. Do you still get the error if the only thing you set is:

rstan_options(auto_write = TRUE)

By any chance?

On the posterior predictives?

@bbbales2 Thank you for your patience; you are the best teacher.

  1. If I set only

rstan_options(auto_write = TRUE)

the Stan model also works very well.

  2. The posterior predictives are not very good; they are still larger than the data suggest. I am sorry that I misunderstood this earlier.

Cool!

Hmm, are there other posterior prediction plots you could make? Like per year? Or per journal, or something?

Fit at each year, like this?

rep_draw <- fit %>% tidybayes::spread_draws(y_rep[condition])
rep_draw
# A tibble: 160,000 x 5
# Groups:   condition [20]
   condition y_rep .chain .iteration .draw
       <int> <dbl>  <int>      <int> <int>
 1         1    94      1          1     1
 2         1    93      1          2     2
 3         1    45      1          3     3
 4         1    72      1          4     4
 5         1    43      1          5     5
 6         1    96      1          6     6
 7         1   193      1          7     7
 8         1    56      1          8     8
 9         1    33      1          9     9
10         1    34      1         10    10
# ... with 159,990 more rows



d_rep <- rep_draw %>%
  group_by(condition) %>%
  summarise(
    pred_mean = mean(y_rep, na.rm = T),
    Q2.5 = quantile(y_rep, probs = c(0.025)),
    Q97.5 = quantile(y_rep, probs = c(0.975))
  ) %>%
  mutate(year = c(2000:2019))

d_rep
# A tibble: 20 x 5
   condition pred_mean  Q2.5 Q97.5  year
       <int>     <dbl> <dbl> <dbl> <int>
 1         1      48.3   5    149.  2000
 2         2      53.8   5    159.  2001
 3         3      59.9   7    180   2002
 4         4      68.2   7    196   2003
 5         5      75.3   9    216   2004
 6         6      85.0  10    240   2005
 7         7      95.6  10    275   2006
 8         8     106.   13    294   2007
 9         9     120.   13    331.  2008
10        10     137.   17    383.  2009
11        11     154.   18    441.  2010
12        12     175.   22    481   2011
13        13     193.   25.0  538.  2012
14        14     221.   27    624   2013
15        15     247.   26    702.  2014
16        16     280.   33    793.  2015
17        17     314.   37    917.  2016
18        18     362.   41   1052   2017
19        19     405.   46   1186.  2018
20        20     452.   54   1345.  2019



d_rep %>%
  ggplot(aes(x = year)) +
  geom_pointrange(
   aes(y = pred_mean,
       ymin = Q2.5,
       ymax = Q97.5),
    color = "red",  shape = 20
  ) +
  geom_point(
    data = d_eng,
    aes(y = n_cited), color = "black"
  )
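bayesplot also has a shortcut for this kind of per-observation plot; a rough equivalent (a sketch, assuming the y_rep matrix from above):

``` r
# per-year predictive intervals against the observed counts
ppc_intervals(y = d_eng$n_cited, yrep = y_rep, x = d_eng$year,
              prob_outer = 0.95)
```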

Yeah, yeah, this is it.

I wonder why 2018 is so extreme? It really looks like a model fit without this data point would predict it very poorly.

After deleting this point and rerunning the model, the posterior results are much better and seem more reasonable.
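A minimal sketch of that refit, assuming the same excode model and data setup as above (d_eng_sub, X_sub, etc. are just illustrative names):

``` r
# drop the extreme 2018 observation and rebuild the data list
d_eng_sub <- d_eng %>% filter(year != 2018)
X_sub <- unname(model.matrix(~ 1 + year, d_eng_sub))
stan_dat_sub <- list(N = nrow(d_eng_sub), Y = d_eng_sub$n_cited,
                     K = ncol(X_sub), X = X_sub)
fit_sub <- stan(model_code = excode, data = stan_dat_sub,
                iter = 4000, warmup = 2000, chains = 4, cores = 4)
```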

So, if I want to predict the situation in 2020, which data should I use? Can I choose the data without the extreme point?