Why are my predicted values wrong although I'm able to recover the parameters?

This is a follow-up to my previous question about AR(1) models.

I can recover the parameters of the following simple model:

data {
  int<lower = 0, upper = 1> run_estimation;  // switch: 1 = evaluate the likelihood, 0 = sample from the priors
  int<lower = 0> N;
  vector[N] y;
}

parameters {
  real alpha;
  real<lower = -1, upper = 1> beta;          // constrained for stationarity
  real<lower = 0> sigma;
}

model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // AR(1) likelihood, conditioning on the first observation
  if (run_estimation == 1)
    y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}

generated quantities {
  vector[N] y_sim;
  y_sim[1] = y[1];
  for (n in 2:N) {
    // simulate forward recursively: each step conditions on the
    // previously simulated value, not on the observed y[n - 1]
    y_sim[n] = normal_rng(alpha + beta * y_sim[n - 1], sigma);
  }
}

[Posterior plots showing the recovered values of alpha, beta, and sigma]

Alas, when I plot y against y_sim, I get this:

[Scatter plot of y vs y_sim]

This is the R code that I use to produce the plot:

library(tidyverse)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = 4)
set.seed(9782)

# compile the Stan program (assuming the model above is saved as "ar1.stan")
model <- stan_model("ar1.stan")

# generate fake data from the priors; with run_estimation = 0 the likelihood
# is skipped, so y is only a placeholder here
y <- rnorm(1000)

sim_out <- sampling(model,
                    data = list(run_estimation = 0,
                                N = length(y),
                                y = y),
                    seed = 42118)

fake_data_matrix <- sim_out %>%
  as.data.frame() %>%
  select(contains("y_sim"))  # each row is one prior draw of a simulated series

# Recapture: fit the model to one draw of the fake data

draw <- 82
y_sim <- as.numeric(fake_data_matrix[draw, ])

recapture_fit <- sampling(model,
                          data = list(run_estimation = 1,
                                      N = length(y_sim),
                                      y = y_sim),
                          seed = 42118)

# Plot the fake data against the posterior mean of y_sim

y <- as.numeric(fake_data_matrix[draw, ])[2:length(y_sim)]
y_sim <- as.data.frame(recapture_fit, pars = "y_sim") %>%
  summarise_all(mean) %>%
  as.numeric()

df.plot <- data.frame(y = y, y_sim = y_sim[2:length(y_sim)])

ggplot(data = df.plot, aes(x = y, y = y_sim)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dotdash") +
  xlim(min(df.plot$y, df.plot$y_sim), max(df.plot$y, df.plot$y_sim)) +
  ylim(min(df.plot$y, df.plot$y_sim), max(df.plot$y, df.plot$y_sim))

What am I doing wrong? Thanks!

PS: In case it is easier, I also saved this code as a gist.

@ignacio: why would the predictions be the same, when in the generated quantities block you’re conditioning on lagged random numbers rather than on the lagged values of y? That is, you have y_sim[t-1] on the right-hand side.

I think you’re mixing up the idea of simulating fake data for model/code testing purposes versus generating predictions. If you want to generate s-step ahead predictions, you should be “jumping off” at some y_t and simulating s steps ahead. Does that make sense?
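
Concretely, for one-step-ahead predictions the generated quantities block would condition each step on the observed lag rather than the simulated one. A minimal sketch (y_pred is a name I’m introducing here, and this variant only makes sense when real data are passed in, i.e. run_estimation = 1):

generated quantities {
  vector[N] y_pred;
  y_pred[1] = y[1];
  for (n in 2:N) {
    // condition on the observed y[n - 1], not on y_pred[n - 1]
    y_pred[n] = normal_rng(alpha + beta * y[n - 1], sigma);
  }
}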

This is how I was thinking about this:

  1. The first chunk of my gist generates some fake data.

  2. The second chunk tries to recapture the parameters given y <- as.numeric(fake_data_matrix[82,]). My thought was that if I can recover the parameters, I should also be able to predict y. Is that not right?

That makes sense. But I thought that because y_sim[1] = y[1], I should be able to predict y_sim[2:N] given that first value. Moreover, since I was able to recover the parameters of the model, I expected y_sim[2:N] to be very similar to “the real data” (y). Does that make sense?

Ignacio, this is a stationary AR(1). After a few iterations you’ll just be sitting on the unconditional mean, intercept / (1 - slope), which is precisely what you have.
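
To spell that out: iterating $y_t = \alpha + \beta y_{t-1} + \epsilon_t$ forward from the jumping-off point $y_1$ gives

$$
\mathrm{E}[\,y_t \mid y_1\,] = \alpha\,\frac{1 - \beta^{\,t-1}}{1 - \beta} + \beta^{\,t-1} y_1 \;\longrightarrow\; \frac{\alpha}{1 - \beta} \quad (t \to \infty),
$$

so with $|\beta| < 1$ the dependence on $y_1$ decays geometrically after a few steps. And since the plotted y_sim is averaged over posterior draws, the normal_rng noise washes out as well, leaving a nearly flat trajectory at $\alpha / (1 - \beta)$.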


Thanks, that makes sense.