Generate quantities using bernoulli_rng gives real value not int

Hi,

I’m confused about what bernoulli_rng function does. Based on Stan manual (p.306), it takes a real value (probability) and returns an integer (0 or 1).

int bernoulli_rng(real theta) Generate a Bernoulli variate with chance of success theta; may only be used
ingenerated quantities block

So I assume if I use it as they do here:

y_pred[i, t] = bernoulli_rng(pGamble);

I should feed it with a probability and get 0 or 1 (nothing or succes). However, I get real numbers \in (0,1). Why is that? What am I missing?

How can I predict the behaviour based on those generated quantities? Ie., how do I get the prediction of the model (which are discrete, in this case accept/reject)?

Thank you.

I just tested it on my computer and I get integer values as expected. Could you post a minimal example where you get the real numbers?

I’ve used expose_stan_functions to be able to work with the function directly in R. Here’s what I get:

library(rstan)
cc <- "
functions {
  int wrap_bernoulli_rng(real x) {
    return bernoulli_rng(x);
  }
}
"

expose_stan_functions(stan_model(model_code = cc))

x <- wrap_bernoulli_rng(0.3)
x
typeof(x)

The output is as expected:

> x
[1] 0

> typeof(x)
[1] "integer"

Does your output differ?

Yes, I can. I’m using pystan, I have a ntb for that (just do not know how to share it here, so I will post the code).

The STAN code is the one I posted before, ra_noLA.

The python code goes as follow:

# define conditions and path
import os, sys
import pystan
import pickle

import numpy as np
import pandas as pd

import time
from datetime import timedelta


model_folder = "C:/Users/user/stan_files/"
data_path = 'C:/Users/user/data/'


#prepare model
model = 'ra_noLA.stan'
model_path = os.path.join(model_folder, model)

try:
    with open(os.path.join(data_path, 'data_LA_exp1_testSTAN.txt'), "rb") as fp:   # Unpickling
        LA_data_prep_test = pickle.load(fp)
    print('loading data from folder')

except IOError:
    print("File not accessible")

# Time the process
start = time.time()

# Compile the model
start_compile = time.time()
stan_model_noLA = pystan.StanModel(file=model_path)
end_compile = time.time()
print('Duration of compilation: ', timedelta(seconds=end_compile - start_compile))


# run the model
start_model = time.time()
fit_noLA = stan_model_noLA.sampling(data=LA_data_prep_test, 
                    chains=4, iter=2000, warmup=1000, thin=1, init='random', verbose=True, 
                    control = {"adapt_delta":0.95, "stepsize":1, "max_treedepth":10}, n_jobs=-1)
end_model = time.time()
print('Duration of the model run is: ', timedelta(seconds=end_model - start_model))

# Create an ordered dictionary
summary_dict = fit_noLA.summary()
#print(summary_dict)

# Create a data frame out of the output
df_noLA = pd.DataFrame(summary_dict['summary'], 
                  columns=summary_dict['summary_colnames'], 
                  index=summary_dict['summary_rownames'])
    
end = time.time()

print('Duration of the entire process is: ', timedelta(seconds=end - start))

My data example is attached. data_LA_exp1_testSTAN.txt (7.4 KB)

Printing df_noLA.loc['y_pred[1,1]':'y_pred[1,256]', 'mean'] gives the values and type:

y_pred[1,1] 0.00025
y_pred[1,2] 0.00000
y_pred[1,3] 0.00050
y_pred[1,4] 0.00025
y_pred[1,5] 0.00075

y_pred[1,252] 0.81200
y_pred[1,253] 0.86200
y_pred[1,254] 0.90425
y_pred[1,255] 0.93550
y_pred[1,256] 0.94850
Name: mean, Length: 256, dtype: float64

I do not use PyStan, but it appears, you are printing the summary of the posterior draws of y_pred, particularly the mean. The mean of bernoulli variables is expected to be real number. If you want the reaw 0,1 values you would need to access the draws and not the summary.

Does that make sense?

Yes, it does make sense. I’m taking it from the fit_noLA.summary() where they are stored as well.
I checked the fit_noLA.extract('y_pred') and indeed, you are right, those are integers. So that is solved, thank you very much.

So let me clarify it for myself again. y_pred are stored for each trial, ie. y_pred[1,1], ..., y_pred[1,256] and what I get in the summary is an average across all the (n_of_chains)*(iters_without_warm_up)? That’s the mean I’m seeing?

So can I run a simulation within the code (that’ what I thought it is doing), where it takes the estimated parameters and regenerates the behaviour, ie. returns only a vector of 0 and 1 of the same length as the input?

1 Like

Correct.

You have one such simulation for each chain x iteration. The usual workflow is to compute any quantity of interest for each chain x iteration separately, giving you chain x iteration samples of the quantity letting you judge the uncertainty you have.

In addition to the posterior mean vs. draws, y_pred is defined as a real value in the program:

  real y_pred[N, T];

Make that int if you want y_pred[i, t] to hold integers.

Thanks for that tip. I was just confused by it for a while, now I understand what is going on.