Can Bayesian models predict a cumulative quantity?

As a beginner, I want to ask a question about Bayesian prediction. I have data on the number of papers published and cited in the Engineering discipline, with four variables:

  • discipline
  • year (2000-2019)
  • number of published papers
  • citation frequency
``` r
library(tidyverse)
library(brms)
library(tidybayes)

d_eng <- tibble::tribble(
  ~discipline, ~year, ~n_paper, ~n_cited,
  "Engineering", 2000L, 3L, 10L,
  "Engineering", 2001L, 4L, 38L,
  "Engineering", 2002L, 5L, 84L,
  "Engineering", 2003L, 10L, 175L,
  "Engineering", 2004L, 5L, 96L,
  "Engineering", 2005L, 12L, 76L,
  "Engineering", 2006L, 7L, 53L,
  "Engineering", 2007L, 11L, 79L,
  "Engineering", 2008L, 7L, 105L,
  "Engineering", 2009L, 11L, 109L,
  "Engineering", 2010L, 15L, 143L,
  "Engineering", 2011L, 14L, 207L,
  "Engineering", 2012L, 16L, 93L,
  "Engineering", 2013L, 15L, 95L,
  "Engineering", 2014L, 15L, 102L,
  "Engineering", 2015L, 13L, 208L,
  "Engineering", 2016L, 13L, 103L,
  "Engineering", 2017L, 20L, 431L,
  "Engineering", 2018L, 49L, 1166L,
  "Engineering", 2019L, 56L, 231L
)
```

If the total number of citations over 10 years exceeds a certain threshold (suppose the threshold is 2843), then the discipline counts as a first-class subject.

My idea: assuming that the threshold line remains unchanged or changes little, can I predict how likely the engineering discipline is to exceed the threshold line in 2020?
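For context, the most recent 10-year total can be computed directly from the data above:

``` r
# 10-year cumulative citations for 2010-2019 (sums to 2779,
# just below the 2843 threshold)
d_eng %>%
  filter(year >= 2010) %>%
  summarise(total_cited = sum(n_cited))
```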

I proceed like this:

  1. Simply assume that the cited quantity has a linear relationship with time.

  2. Model the data:

brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,
  n_cited ~ 1 + year,
  iter = 41000, warmup = 40000, chains = 4, cores = 4,
  seed = 1024,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)
#> Compiling the C++ model
#> Start sampling
  3. Predict the quantity to be cited in 2020, i.e. the increment for 2020:
tb <- tibble(
  year = c(2020)
)

post_draw <- tb %>% add_predicted_draws(brms_eng)
post_draw

#> # A tibble: 4,000 x 6
#> # Groups:   year, .row [1]
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       489. 
#>  2  2020     1     NA         NA     2       672. 
#>  3  2020     1     NA         NA     3       277. 
#>  4  2020     1     NA         NA     4       346. 
#>  5  2020     1     NA         NA     5       201. 
#>  6  2020     1     NA         NA     6       439. 
#>  7  2020     1     NA         NA     7        90.0
#>  8  2020     1     NA         NA     8      -328. 
#>  9  2020     1     NA         NA     9       258. 
#> 10  2020     1     NA         NA    10       425. 
#> # ... with 3,990 more rows
  4. Calculate the accumulated quantity over the previous 9 years:
exist_nine_year_cum_cited <- d_eng %>%
  filter(between(year, 2011, 2019)) %>%
  summarise(
    exist_cum_cited = sum(n_cited)
  ) %>%
  pull()

exist_nine_year_cum_cited
#> [1] 2636
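Equivalently, the 2020 increment alone must exceed 2843 − 2636 = 207 citations for the threshold to be crossed:

``` r
# minimum 2020 increment needed to cross the threshold
2843 - exist_nine_year_cum_cited
#> [1] 207
```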
  5. Add the predicted increment to the accumulated quantity from the previous 9 years; this gives the predicted cumulative value for 2020:
post_pred <- post_draw %>%
  mutate(
    .prediction = .prediction + exist_nine_year_cum_cited
  )
post_pred
#> # A tibble: 4,000 x 6
#> # Groups:   year, .row [1]
#>     year  .row .chain .iteration .draw .prediction
#>    <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1  2020     1     NA         NA     1       3125.
#>  2  2020     1     NA         NA     2       3308.
#>  3  2020     1     NA         NA     3       2913.
#>  4  2020     1     NA         NA     4       2982.
#>  5  2020     1     NA         NA     5       2837.
#>  6  2020     1     NA         NA     6       3075.
#>  7  2020     1     NA         NA     7       2726.
#>  8  2020     1     NA         NA     8       2308.
#>  9  2020     1     NA         NA     9       2894.
#> 10  2020     1     NA         NA    10       3061.
#> # ... with 3,990 more rows
  6. Finally, calculate the probability of exceeding the threshold:
exceed_prob <- post_pred %>%
  group_by(year) %>%
  summarise(
    pred_mean = mean(.prediction),
    prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
  )

exceed_prob
#> # A tibble: 1 x 3
#>    year pred_mean prob_above_line
#>   <dbl>     <dbl>           <dbl>
#> 1  2020     3026.           0.786

Created on 2020-04-20 by the reprex package (v0.3.0)

Question: can we now say that there is a 78.6% probability that the engineering discipline will become a first-class discipline in 2020?


This is a good start. I like to start with the simplest model and then build out a more complex one from there.

I would show the upper and lower bounds, not just the point estimate (which is what I think I am looking at). I need to read back through all this to see if anything else stands out.
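For instance, something like this with tidybayes, reusing the post_pred draws from above (median_qi gives the median plus a 95% interval):

``` r
# summarise the predictive draws with bounds, not just a point estimate
post_pred %>%
  group_by(year) %>%
  tidybayes::median_qi(.prediction, .width = 0.95)
```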

That’s what the model predicts, at least. So conditioning on your model and data, that’s the prediction.

It seems like predicting the year quantities might be more informative than doing a threshold.

In the context of publishing papers, I guess this negative prediction is weird:

#> 8 2020 1 NA NA 8 -328.

which suggests the model isn’t perfect, which means the prediction probably isn’t perfect. It’s not clear what having the wrong model here means though in terms of the 78.6% prediction (Does it go up? Does it go down? How much? etc.)


@bbbales2 Thank you for pointing out my mistake. Following your suggestion, I have modified two places:

  • The cited quantity must be greater than 0, so the formula becomes n_cited | trunc(lb = 0) ~ 1 + year
  • I am optimistic that the discipline is developing in a positive direction, so I add the prior set_prior("normal(22, 10)", lb = 0, class = "b")
brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,

  n_cited | trunc(lb = 0) ~ 1 + year,
  set_prior("normal(22, 10)", lb = 0, class = "b"),

  iter = 41000, warmup = 40000, chains = 4, cores = 4,
  seed = 1024,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)

Thus, we can calculate the posterior summary as follows:

brms_eng %>% posterior_summary()

#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979
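Note that post_pred then needs to be re-drawn from the refitted model; presumably the earlier prediction step is just repeated, e.g.:

``` r
# re-draw 2020 predictions from the refitted model and add the 9-year total
post_pred <- tibble(year = 2020) %>%
  add_predicted_draws(brms_eng) %>%
  mutate(.prediction = .prediction + exist_nine_year_cum_cited)
```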
Then recalculate the probability of exceeding the threshold:
exceed_prob <- post_pred %>%
  group_by(year) %>%
  summarise(
    pred_mean = mean(.prediction),
    prob_above_line = mean(.prediction >= 2843) # 2843 is threshold
  )

exceed_prob
# A tibble: 1 x 3
   year pred_mean prob_above_line
  <dbl>     <dbl>           <dbl>
1  2020     3052.           0.815

The result changes to 81.5%.


Not checking the code specifically, but that sounds like you’re doing the right thing to make a posterior prediction.

A couple comments:

iter = 41000, warmup = 40000

That’s a huge amount of warmup. Does the default warmup = 1000, iter = 2000 not work?

seed = 1024

In general I think it’s a bad idea to set seeds. The random chain to chain variation is helpful for finding problems. This should be off except for very specific debugging cases.


I would say seeds are good when you publish results.


Thank you for your suggestion; I have adjusted accordingly:

brms_eng <- brms::brm(
  data = d_eng,
  family = gaussian,
  n_cited | trunc(lb = 0) ~ 1 + year,
  set_prior("normal(22, 10)", lb = 0, class = "b"),
  iter = 4000, warmup = 2000, chains = 4, cores = 4,
  control = list(adapt_delta = 0.99,
                 max_treedepth = 13)
)

which gives quite a close result:

brms_eng %>% posterior_summary()
#                Estimate    Est.Error         Q2.5        Q97.5
# b_Intercept -65211.19719 16630.772336 -96572.83551 -30663.08546
# b_year          32.46937     8.272460     15.31237     48.05529
# sigma          227.62141    41.051703    165.97844    328.18123
# lp__          -134.79629     1.508954   -139.12025   -133.12979

and

exceed_prob

# A tibble: 1 x 3
   year pred_mean prob_above_line
  <dbl>     <dbl>           <dbl>
1  2020     3050.           0.822

Thanks again.

@bbbales2 I tried to convert the brms code for the truncated data to Stan:

excode <- '
data {
  int<lower=1> N; 
  vector[N] year;
  vector<lower=0>[N] cited;
}

parameters {
  real alpha;   
  real beta;   
  real<lower=0> sigma;  
}

model {
  // Priors
  beta ~ normal(22, 10) T[0, ];

  // Likelihood
  for (n in 1:N){
    cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];
  }

}
'


stan_dat <- list(
  N = nrow(d_eng),
  year = d_eng$year,
  cited = d_eng$n_cited
)

stan(model_code = excode,
     data = stan_dat,
     iter = 4000,
     chains = 4,
     warmup = 2000,
     cores = 4,
     control = list(
       adapt_delta = 0.99,
       max_treedepth = 13
     ))

However, I get back some warning messages and the final result is obviously incorrect.

Warning messages:
1: There were 6216 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems
 
3: The largest R-hat is 1.13, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess 

Can you tell me how to write this Stan code?

thanks

cited[n] ~ normal(alpha + beta * year[n], sigma) T[0, ];

This truncation makes sense – it’s saying there won’t be negative citations in a year.

Another distribution that might work would be the negative binomial: https://mc-stan.org/docs/2_23/functions-reference/nbalt.html
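As a rough sketch of what that might look like in brms (assuming the same formula; negbinomial() is the brms family):

``` r
# sketch: negative binomial model for yearly citation counts
brms_eng_nb <- brms::brm(
  n_cited ~ 1 + year,
  data = d_eng,
  family = negbinomial()
)
```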

beta ~ normal(22, 10) T[0, ];

I don’t think you’d want to truncate your slope (which is where the error is coming from).

You could make the error go away by:

  1. Removing the truncation (this is what I think you should do)
  2. Adding a lower = 0 constraint to beta (real<lower = 0.0> beta;)

Thank you kindly, you are right: the negative binomial distribution is better suited to citation counts in a year. After a lot of googling:

excode <- '
data {
  int<lower=1> N; 
  int Y[N];
  int<lower = 1> K;
  matrix[N, K] X; 
}

transformed data {
 int Kc = K - 1;
 matrix[N, Kc] Xc;
 vector[Kc] means_X;
 for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
 }

parameters {
  vector[Kc] beta;   
  real alpha;   
  real<lower=0> shape;  
}

model {
  // Priors
  target += student_t_lpdf(alpha | 3, 5, 10);
  target += gamma_lpdf(shape | 0.01, 0.01);
  
  // Likelihood
  target += neg_binomial_2_log_glm_lpmf(Y | Xc, alpha, beta, shape);
}

generated quantities {
 real b_alpha = alpha - dot_product(means_X, beta);

  int y_rep[N]; 
   for (n in 1:N) {
    y_rep[n] = neg_binomial_2_log_rng(b_alpha + X[n, 2:] * beta, shape);
  }
  
}
'

then

X <- unname(model.matrix(~ 1 + year, d_eng))
attr(X, "assign") <- NULL
ncol(X)


stan_dat <- list(
  N = nrow(d_eng),
  Y = d_eng$n_cited,
  K = ncol(X),
  X = X
)


fit <- stan(
  model_code = excode,
  data = stan_dat,
  iter = 4000,
  chains = 4,
  warmup = 2000,
  cores = 4,
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 13
  )
)

The result looks good to me:

fit
           mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]    0.12    0.00  0.03    0.07    0.10    0.12    0.14    0.17  4761    1
alpha      4.96    0.00  0.15    4.68    4.86    4.96    5.06    5.28  4682    1
shape      2.45    0.01  0.78    1.18    1.90    2.36    2.90    4.21  5034    1
b_alpha -233.89    0.75 52.05 -339.20 -266.54 -232.76 -200.05 -130.52  4758    1
lp__    -124.39    0.03  1.31 -127.79 -125.00 -124.04 -123.43 -122.91  2692    1

Posterior Predictive Checks

library(bayesplot)
y_rep <- as.matrix(fit, pars = "y_rep")
y_rep

ppc_dens_overlay(y = d_eng$n_cited, y_rep[1:100,])
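Other checks follow the same pattern; for instance, a test-statistic PPC (a sketch using bayesplot's ppc_stat) shows how the replicates compare on extremes:

``` r
# compare the maximum yearly citation count between data and replicates
ppc_stat(y = d_eng$n_cited, yrep = y_rep, stat = "max")
```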


I like the posterior predictive plot! Looks like the predictions are larger than the data suggests.

I don’t know if this would help, but can you add priors on beta?

Also can you just go with normal priors on everything? It’s much easier to interpret the parameters (if there isn’t a particular reason to use a gamma/student_t here)

As a beginner I do not know how to set priors on the parameters, so I chose the easy way, borrowing from

brms::make_stancode(
  n_cited ~ 1 + year,
  data = d_eng,
  family = "negbinomial"
)

which gives me

model {
  target += student_t_lpdf(Intercept | 3, 5, 10);
  target += gamma_lpdf(shape | 0.01, 0.01);
}

However, these priors seem hard to interpret, even though they produce a good result.
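For what it's worth, brms can also list its default priors directly, which may be easier than reading the generated Stan code (a sketch):

``` r
# inspect the default priors brms would use for this model
brms::get_prior(n_cited ~ 1 + year, data = d_eng, family = "negbinomial")
```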

Following @bbbales2's advice, I changed the priors to

target += normal_lpdf(alpha | 0, 100);
target += cauchy_lpdf(shape | 0, 2);

It also works well.

However, the prior on beta looks sensitive; any setting causes the model to fail, for example:

target += normal_lpdf(beta | 0, 10);
Error in unserialize(socklist[[n]]) : error reading from connection
Error in serialize(data, node$con, xdr = FALSE) : error writing to connection

Following a suggestion I found, I set

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = "-march=corei7 -mtune=corei7")

and the error goes away. So I rewrote the priors as

target += normal_lpdf(alpha | 0, 100);
target += normal_lpdf(beta | 0, 10);
target += cauchy_lpdf(shape | 0, 2);

The model works again and now looks nearly perfect.


Good find! I’ve seen this error before and not known what to do. Do you still get the error if the only thing you set is:

rstan_options(auto_write = TRUE)

By any chance?

On the posterior predictives?

@bbbales2 Thank you for your patience; you are the best teacher.

  1. If I set only

rstan_options(auto_write = TRUE)

the Stan model also works very well.

  2. The posterior predictives are not very good; they are still larger than the data suggest. I am sorry that I misunderstood this earlier.

Cool!

Hmm, are there other posterior prediction plots you could make? Like per year? Or per journal, or something?

Fit at each year, like this?

rep_draw <- fit %>% tidybayes::spread_draws(y_rep[condition])
rep_draw
# A tibble: 160,000 x 5
# Groups:   condition [20]
   condition y_rep .chain .iteration .draw
       <int> <dbl>  <int>      <int> <int>
 1         1    94      1          1     1
 2         1    93      1          2     2
 3         1    45      1          3     3
 4         1    72      1          4     4
 5         1    43      1          5     5
 6         1    96      1          6     6
 7         1   193      1          7     7
 8         1    56      1          8     8
 9         1    33      1          9     9
10         1    34      1         10    10
# ... with 159,990 more rows



d_rep <- rep_draw %>%
  group_by(condition) %>%
  summarise(
    pred_mean = mean(y_rep, na.rm = T),
    Q2.5 = quantile(y_rep, probs = c(0.025)),
    Q97.5 = quantile(y_rep, probs = c(0.975))
  ) %>%
  mutate(year = c(2000:2019))

d_rep
# A tibble: 20 x 5
   condition pred_mean  Q2.5 Q97.5  year
       <int>     <dbl> <dbl> <dbl> <int>
 1         1      48.3   5    149.  2000
 2         2      53.8   5    159.  2001
 3         3      59.9   7    180   2002
 4         4      68.2   7    196   2003
 5         5      75.3   9    216   2004
 6         6      85.0  10    240   2005
 7         7      95.6  10    275   2006
 8         8     106.   13    294   2007
 9         9     120.   13    331.  2008
10        10     137.   17    383.  2009
11        11     154.   18    441.  2010
12        12     175.   22    481   2011
13        13     193.   25.0  538.  2012
14        14     221.   27    624   2013
15        15     247.   26    702.  2014
16        16     280.   33    793.  2015
17        17     314.   37    917.  2016
18        18     362.   41   1052   2017
19        19     405.   46   1186.  2018
20        20     452.   54   1345.  2019



d_rep %>%
  ggplot(aes(x = year)) +
  geom_pointrange(
   aes(y = pred_mean,
       ymin = Q2.5,
       ymax = Q97.5),
    color = "red",  shape = 20
  ) +
  geom_point(
    data = d_eng,
    aes(y = n_cited), color = "black"
  )
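bayesplot also has a shortcut for this kind of per-observation plot; a rough equivalent (a sketch, assuming the y_rep matrix from above):

``` r
# per-year predictive intervals against the observed counts
ppc_intervals(y = d_eng$n_cited, yrep = y_rep, x = d_eng$year,
              prob_outer = 0.95)
```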

Yeah, yeah, this is it.

I wonder why 2018 is so extreme? It really looks like a model fit without this data point would predict it very poorly.

After deleting this point and rerunning the model, the posterior results are much better and seem more reasonable.
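A minimal sketch of that refit, assuming the same excode model and data setup as above (d_eng_sub, X_sub, etc. are just illustrative names):

``` r
# drop the extreme 2018 observation and rebuild the data list
d_eng_sub <- d_eng %>% filter(year != 2018)
X_sub <- unname(model.matrix(~ 1 + year, d_eng_sub))
stan_dat_sub <- list(N = nrow(d_eng_sub), Y = d_eng_sub$n_cited,
                     K = ncol(X_sub), X = X_sub)
fit_sub <- stan(model_code = excode, data = stan_dat_sub,
                iter = 4000, warmup = 2000, chains = 4, cores = 4)
```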

So, if I want to predict the situation in 2020, which data should I use? Can I choose the data without the extreme point?