# Using the posterior predictive distribution of my model to estimate the probability that a future observation is greater than 15

I am trying to use the posterior predictive distribution of my model to estimate the probability that a future observation is greater than 15 for each of (1) x_new and (2) y_new.

My Stan code is as follows:

  data {
int<lower=1> n;
vector[n] x;
vector[n] y;
}

parameters {
real x_mu;
real y_mu;
real<lower = 0> sigma;
}

transformed parameters {
real prob_positive_diff;

prob_positive_diff = y_mu >= x_mu;
}

model {
x ~ normal(x_mu, sigma);
y ~ normal(y_mu, sigma);

sigma ~ normal(0, 10);
}

generated quantities{
real x_new;
real y_new;

x_new = normal_rng(x_mu, sigma);
y_new = normal_rng(y_mu, sigma);
}


My understanding is that what I’ve done in the generated quantities block is sample from the posterior predictive distribution, right? I’m not 100% sure I’ve done this correctly, so I’d appreciate it if people could please provide feedback.

What I’m unsure of now is how to estimate the probability that each of (1) x_new and (2) y_new are greater than 15? I tried something like the following:

real<lower = 0, upper = 1> x_prob;
real<lower = 0, upper = 1> y_prob;

x_prob = x_new > 15;
y_prob = y_new > 15;


But this doesn’t work.

Should I perhaps be using some cumulative distribution function here?

I would greatly appreciate it if people could please assist me with this.

Do you mean something like this?

N <- 100
sigma <- 10
x <- rnorm(N, 3, sigma)
y <- rnorm(N, 5, sigma)

true_p = pnorm(15, 3, sigma, lower.tail = FALSE)*pnorm(15, 5, sigma, lower.tail = FALSE)
true_p # should give 0.01825641

library(rstan)
options(mc.cores = parallel::detectCores())

stan_code <- "


This goes into the stan_code string, but is presented separately here for highlighting…

data{
int N;
vector[N] x;
vector[N] y;
}
parameters{
real mu_x;
real mu_y;
real<lower=0> sigma;
}
transformed parameters{
real<lower=0,upper=1> p;

p = exp(log1m(normal_cdf(15, mu_x, sigma)) + log1m(normal_cdf(15, mu_y, sigma)));

}
model {
x ~ normal(mu_x, sigma);
y ~ normal(mu_y, sigma);

sigma ~ normal(0, 10);
}


…and continue R code:

"

stan_data <- list(N=N,x=x,y=y)

samples <- stan(model_code = stan_code, data = stan_data)

Inference for Stan model: e3891da5dbf66dc9436e641c2902a7d5.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
mu_x     2.75    0.01 0.94    0.92    2.12    2.74    3.35    4.60  3913    1
mu_y     5.93    0.02 0.97    4.04    5.28    5.93    6.59    7.84  4000    1
sigma    9.42    0.01 0.48    8.54    9.08    9.40    9.74   10.45  4000    1
p        0.02    0.00 0.01    0.01    0.01    0.02    0.02    0.03  4000    1
lp__  -546.65    0.03 1.29 -550.05 -547.20 -546.28 -545.71 -545.21  2045    1

Samples were drawn using NUTS(diag_e) at Fri May 18 14:27:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1). Thanks for the response.

I mean that we want the two separate probabilities of x_new > 15 and y_new > 15, where x_new and y_new are samples of the posterior predictive distribution. It would be similar to prob_positive_diff = y_mu >= x_mu; in my code, except there would be 2 of them, one for each of x_new > 15 and y_new > 15.

I don’t think the code you posted samples from the posterior predictive distribution? Or perhaps I am misunderstanding something? I was under the impression that we get the posterior predictive distribution from the code I posted in the generated quantities block? I used the following examples to guide me with regards to the posterior predictive distribution:

https://magesblog.com/post/2015-05-19-posterior-predictive-output-with-stan/

(see Box 1, page 4)

And it seems here that you would be multiplying the probabilities:

p = exp(log1m(normal_cdf(15, mu_x, sigma)) + log1m(normal_cdf(15, mu_y, sigma)));

?

It is possible that I am thinking about this implementation of the posterior predictive distribution incorrectly, which is why I’m struggling to find out how to calculate the aforementioned probabilities that I am seeking.

Sorry, I misread your post and thought that you were looking for the probability that BOTH y_{new}>15 AND x_{new}>15, so P(y_{new}>15)P(x_{new}>15).

You probably meant this:

stancode_alt:

data{
int N;
vector[N] x;
vector[N] y;
int N_rep;
}
parameters{
real mu_x;
real mu_y;
real<lower=0> sigma;
}
transformed parameters{
real<lower=0,upper=1> p_x = exp(log1m(normal_cdf(15, mu_x, sigma)));
real<lower=0,upper=1> p_y = exp(log1m(normal_cdf(15, mu_y, sigma)));
}
model {
x ~ normal(mu_x, sigma);
y ~ normal(mu_y, sigma);

sigma ~ normal(0, 10);
}
generated quantities{
real p_x_new;
real p_y_new;

{
vector[N_rep] x_new_ind;
vector[N_rep] y_new_ind;
for(n in 1:N_rep){
x_new_ind[n] = normal_rng(mu_x, sigma) > 15 ? 1.0 : 0.0;
y_new_ind[n] = normal_rng(mu_y, sigma) > 15 ? 1.0 : 0.0;
}
p_x_new = sum(x_new_ind)/N_rep;
p_y_new = sum(y_new_ind)/N_rep;
}

}

samples_alt <- stan(model_code = stan_code_alt, data = list(N=N,x=x,y=y,N_rep=1e5))
Inference for Stan model: 9f1eca9ece6b6dc4ba656cac787e03b5.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
mu_x       2.79    0.02 0.96    0.94    2.15    2.79    3.43    4.68  3366    1
mu_y       5.95    0.02 0.94    4.15    5.30    5.93    6.59    7.75  3886    1
sigma      9.42    0.01 0.47    8.54    9.09    9.41    9.73   10.37  4000    1
p_x        0.10    0.00 0.02    0.06    0.08    0.10    0.11    0.14  3594    1
p_y        0.17    0.00 0.03    0.12    0.15    0.17    0.19    0.23  3890    1
p_x_new    0.10    0.00 0.02    0.06    0.08    0.10    0.11    0.14  3581    1
p_y_new    0.17    0.00 0.03    0.12    0.15    0.17    0.19    0.23  3903    1
lp__    -546.62    0.03 1.23 -549.75 -547.18 -546.31 -545.72 -545.21  2053    1

Samples were drawn using NUTS(diag_e) at Fri May 18 16:20:20 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).


You can play around with the values for N_rep. The higher N_rep the closer p_x_new will be to p_x, because the posterior predictive is equal to the posterior here. Right?

I think you just need to calculate something like mean(x_new > 15) in R after extracting the draws of x_new from the stanfit object.

1 Like

Thank you for the revision.

The higher N_rep the closer p_x_new will be to p_x, because the posterior predictive is equal to the posterior here. Right?

Yes. Assuming my understanding of the theory is correct, then, if the model (and, therefore, the posterior distribution) is accurate, then the posterior predictive should (roughly) estimate the posterior distribution itself. In this (mock/example) case, since our observations are realisations from a normal distribution, and since our simulated replicated data are also realisations from a normal distribution, we would expect them to be approximately the same. Does this sound right?

  real<lower=0,upper=1> p_x = exp(log1m(normal_cdf(15, mu_x, sigma)));
real<lower=0,upper=1> p_y = exp(log1m(normal_cdf(15, mu_y, sigma)));


Stan’s user manual says the following: So, given this definition, it seems like normal_cdf(15, mu_y, sigma) would be the probability that it is less than or equal to 15? Or am I misunderstanding this?

Thanks for the response.

This was my initial thought. Would there be any difference between this and what @Max_Mantei did in his post? In other words, both methods would be correct, right? His solution seems like it would be more elegant.

Note that log1m(x) is \log(1-x), so that exp(log1m(normal_cdf(15, mu_x, sigma))) effectively becomes 1-\int_0^{15}\text{Normal}(\mu_x,\sigma), which is probably also ok to put in the model as 1 - normal_cdf(15, mu_x, sigma). (I used the exp(log(...)), because I initially thought [wrongly] that you wanted to multiply the two probability and it’s usually better to add them on a log-scale.)

Well, actually you pointed at this (using the cdf) in your first post. I think it depends on the specifics of the application. In this case using a transformed parameter made sense. But If you have new data, that doesn’t enter your model, then perhaps the R or generated quantities approach is better, where better means ease of implementation.

Also, it depends if you want to have a distribution over possible p_x or just a number: mean(x_new > 15) will only give you one number, but sometimes that’s all you need and then this would be the easiest way to get what you want.

Max

1 Like

Thanks again for the clarification.

Yes, I’m only interested in the probability, not the distribution. In that case, I’ll leave it as is, since both implementations would get me the same information.

Thanks again for the assistance.