Fake data - _rng where is the error from?

Hi,

I’m trying to use the gamma_rng in the simplest possible way but I don’t understand why the (fake) responses have the same sd and in overall their distribution is very similar.
I tried to replicate the work by using R but the distributions are more variable.

My question is:
Why the gamma_rng returns so consistent distributions? I would expect something (more variable) closer to what R code results.


Here is the Stan code

N = 100
d_sim <- list(N = N)

pm <- stan_model(
  model_code = "
  data {
  int<lower=0> N;
  }
  
  generated quantities {
  real mu; // conditional mean of the dependent variable
  real rate; //beta parameter for the gamma distribution
  real<lower=0> inv_phi;
  
  mu = normal_rng(0.5, 0.05); 
 // mu = normal_rng(0.5, 0.005);  UPDATE by mistake I added one more decimal. 
  inv_phi = normal_rng(20, 2);
   
  // The response
  vector<lower=0>[N] value;
  
  // Simulate data from observational model
  for(n in 1:N){
    
    rate = inv_phi/mu; // Rate
    
    // The likelihood p(value|shape, rate)
    value[n] = gamma_rng(inv_phi, rate);
  
  }
}")


pfit <- sampling(pm,
                 data = d_sim,
                 iter=2000, warmup= 0,
                 chains=1, 
                 cores = 1,
                 seed = 1,
                 algorithm="Fixed_param"
                 )

I don’t see why I get sd = 0.13 for all the values

Then, generated responses have consistent distributions

Example in R

I tried to replicate this example by using R

inv_ph <- 20; 
mu <- 0.5;
alpha <- inv_ph
beta <- inv_ph/mu

curve(dgamma(x, shape = alpha, rate = beta), 
      xlim = c(0, 2), ylim = c(0,4))

for(i in 1:20){
  inv_ph <- rtnorm(1, 20, 2, lower = 0); 
  mu <- rnorm(1, 0.5, .05);
  alpha <- inv_ph
  beta <- inv_ph/mu
  dat <- rgamma(1e5, shape = alpha, rate = beta) 

curve(dgamma(x, shape = alpha, rate = beta),
      xlim = c(0, 2), add = T, col = 'red')
}

Which result a different outcome compared to rstan

Those are averages of the draws. You really only need N = 1 and then set the samples to however many draws you want. The draws themselves will vary. The average of the draws will converge to the average of your given gamma distribution. Just to verify the estimate:

Since you have

\alpha = \phi^{-1} \sim \mathcal{N}(20, 2) \\ \beta = \frac{ \phi^{-1} }{\mu} \sim \frac{\mathcal{N}(20, 2)}{\mathcal{N}(0.5, 0.005)} \approx \mathcal{N}(40, 4.02)

Where the approximation comes from the ratio of two uncorrelated noncentral normal random variables follows an approximate normal distribution with mean \frac{\mu_x}{\mu_y} and standard deviation \sqrt{\frac{\mu_x}{\mu_y} \bigg( \frac{\sigma^2_x}{\mu^2_x} + \frac{\sigma^2_y}{\mu^2_y} \bigg)} when the values from this distribution, y, are unlikely to be negative or \mu_y > 3 \sigma y which is the case here.

The gamma distribution you have is then \Gamma(x \mid \mathbb{E}(\alpha), \mathbb{E}(\beta)) = \Gamma(x \mid 20, 40). The mean and standard deviation of this gamma distribution is

\mathbb{E}(x) = \frac{\alpha}{\beta} = \frac{20}{40} = 0.5 \\ \sqrt{\mathbb{V}(x)} = \sqrt{\frac{\alpha}{\beta^2}} = \sqrt{\frac{20}{40^2}} = \sqrt{0.0125} = 0.1118034

The fact that the sd you have is between 0.13 and 0.12 is a bit weird to me when we would expect it to be 0.11. The mean is right on. I thought maybe this is because the approximation is a bit off for the mean of the ratio distribution so I checked in R with

expected_value_of_ratio_pdf <- function(z, mu_x, mu_y, sigma_x, sigma_y) {
  a <- sqrt( (z^2/sigma_x^2) + 1 / sigma_y^2 )
  b <- z * mu_x / (sigma_x^2) + mu_y / sigma_y^2
  c <- (mu_x / sigma_x)^2 + (mu_y / sigma_y)^2
  d <- exp((b^2 - c * a^2) / (2 * a^2))
  
  p <- b * d / a^3
  n <- 1 / (sqrt(2 * pi) * sigma_x * sigma_y)
  
  v1 <- p * n * (pnorm(b / a) - pnorm(-b / a))
  return( z * (v1 + exp(-c * 0.5) / (a^2 * pi * sigma_x * sigma_y)))
}

integrate(expected_value_of_ratio_pdf, lower=-Inf, upper=Inf, 
          mu_x = 20,
          mu_y = 0.5,
          sigma_x = 2,
          sigma_y = 0.005)

# answer is 40.004 with absolute error < 0.0045

@andrjohns @stevebronder I wonder if we have a small issue with the variance of the generates from the gamma_rng function?

1 Like

Also your simulation in R does not match the simulation in Stan. Specifying a bound in the generated quantities block does not ensure that you are sampling from a truncated normal distribution. I’m looking at real<lower=0> inv_phi;. Take a look at this thread on why and how to write the truncated normal in the generated quantities block Generated quantites block - #4 by bgoodri.

This still doesn’t explain the 0.13 though.

1 Like

Thanks for looking into this!! I just realize that in rstan code I had mu = normal_rng(0.5, 0.005) instead of mu = normal_rng(0.5, 0.05). The sd =0.13 is when mu = normal_rng(0.5, 0.05); .
When mu = normal_rng(0.5, 0.005) the sd in indeed 0.11

1 Like

Oh good!!

Apologies for keeping this open but I’m trying to figure reasonable priors and I don’t understand why with gamma_rng (although I’m using wide priors) I get so consistent results.

Indeed by reducing the sample of the draws (iter), as suggested, to a smaller number I get distributions with high variance but with normal_rng I don’t need to reduce the draws . The prior predictive with normal_rng will result distributions with high variability even with lot’s of draws

Eg, with \sigma \sim N(mean\_sigma = 0.8, sd\_sigma = 0.05) I get a wide band of distributions. I have two different groups but due to the high \sigma the two groups are well overlapped.


N = 100
d_sim_p1 <- list(N = N,
              n_level = 2, 
              level = rep(1:2, length.out = N),
              sd_level = 0.01,
              mean_sigma = .8,
              sd_sigma = .05) 

df_sim_p1 <- bind_cols(n = 1:N,
                    d_sim_p1) %>%
  mutate(level = factor(level))


pm2 <- stan_model(
  model_code = "
/*
1-way ANOVA based on prior
value ~ level
*/

data {
  int<lower=0> N;
  //vector[n] value;
  int<lower=0> n_level;
  int level[N];
  real sd_level;
  real mean_sigma;
  real sd_sigma;
}

generated quantities {
  
  // Simulate model configuration 
  // Index variable for the level 1 and  2
  vector[n_level] b_level;
    b_level[1] = normal_rng(.5, sd_level);
    b_level[2] = normal_rng(.6, sd_level);
  
  // The model  
  vector[N] mu = b_level[level];
  
  // The error
  real<lower=0> sigma = fabs(normal_rng(mean_sigma, sd_sigma));
  
  // The response
  vector[N] value;
  
  // Simulate data from observational model
  for(i in 1:N)
  value[i] = normal_rng(mu[i], sigma);

}
")

pfit2 <- sampling(pm2,
                 data = d_sim_p1 ,
                 iter=2000, warmup= 0,
                 chains=1, 
                 cores = 1,
                 seed = 1,
                 algorithm="Fixed_param"
                 )

Instead with gamma_rng I get too consistent results no mater how wide (or tight) prior I’ll give to the inverse phi

The plots were genarated by using

## values distribution
pfit2 %>%
  spread_draws(value[n]) %>% 
  left_join(df_sim_p1 ) %>%
  ggplot(aes(x = value, group = n, colour = analyst )) +
  geom_density() +
  xlim(0, 2)

Any idea why is this happening would be really helpful
Thanks

How are you getting the final data? You should be plotting from the draws and not the summary statistics.

I forgot to add the script but now I added it. I’m plotting the simulated data (here spread_draws(value[n])). Aren’t these the draws?
I’m not summarizing somehow the simulated outcome.