I want to model dice rolls. People recommend generating synthetic data to test the model. To do that I gave the data file {"N": 3, "y": [null, null, null]} but Stan said variable: y, error: null values not allowed. How can I generate data from this model for testing?
Is this impossible? How do you suggest testing a model with discrete data?
data {
int N;
int y[N];
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
You have to actually simulate realistic synthetic data which you pass as y. For your problem, and R script to do the simulation would be
p <- rep(1, 6) # true unnormalized probabilities
N <- 10 # number of trials
# simulate the data
sim_data <- rmultinom(10,1,p)
# for a more consise data structure
sim_data2 <- apply(sim_data, 2, function(x){which(x == 1)})
To thoroughly test the computational accuracy of a model, it would be wise to test multiple runs for a variety of choices of p. The cleanest way to do this testing is for a procedure known as simulation-based calibration (SBC), for which some of the Stan developers maintain a dedicated R package; see here GitHub - hyunjimoon/SBC
The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions are also prohibited.
so I don’t see how it’s possible to simulate data in that block. Can you show how?
The linked page isn’t correct, and has been updated. Click “view current version” at the top of the page to see the updated version of the page, current as of Stan 2.28 (the linked version from your post is 2.18).
I must be misunderstanding. The current version of transformed data says
The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used.
I don’t see any examples of how to use that block for generating data so I’m just guessing, but this fails to compile.
data {
int N;
}
transformed data {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
column 8 to column 34:
-------------------------------------------------
5: transformed data {
6: for (n in 1:N) {
7: y[n] ~ categorical(theta);
^
8: }
9: }
-------------------------------------------------
Target can only be accessed in the model block or in definitions of functions with the suffix _lp.
It is a common misunderstanding that ~, a sampling statement, means generating random numbers from a distribution (using ~ for incrementing log prob probably hasn’t been the best design decision in Stan). You need categorical_rng to draw from categorical distribution.
Here’s the “loaded dice” model you want which will work for any number of dice sides K (use K = 6 for standard six-sided dice). The Dirichlet isn’t doing anything because Dirichlet(1) is uniform and the sampling statement turns into a no-op (just adds a constant to the target log density which doesn’t affect sampling). Also, you want to vectorize sampling statements for efficiency and include lower/upper bounds on data variables to indicate constraints.
data {
int<lower=0> K;
int<lower=0> N;
array[N] int<lower=1, upper=K> y;
}
parameters {
simplex[K] theta;
}
model {
y ~ categorical(theta);
}
If you want to generate data, you can do this and run it for a single iteration:
data {
int<lower=0> K;
simplex[K] theta;
int<lower=0> N;
}
generated quantities {
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}
It needs the dice size and simplex, but can then simulate. If you want to simulate the simplex, too, that’d be:
data {
int<lower=0> K;
int<lower=0> N;
}
generated quantities {
simplex[K] theta = dirichlet_rng(rep_vector(1, K));
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}
Each posterior draw from this model contains a K-simplex and a replicated data set y[1:N].
Well, I had a similar issue once when I was trying to model dice rolls. Instead of overthinking it, I just used dice.onl to quickly generate some random rolls and formatted them to fit my setup. It was way easier than writing a script from scratch, and it gave me solid test data to work with right away. Sometimes keeping it simple saves a ton of time.
This one lets you specify the dice in D&D-style notation (e.g., 3d6 for rolling 3 six sided dice and summing them) and then generates histograms of outcome probabilities for you. These sites are very popular for those of us who play and design role playing games.
Sometimes we need a bit more, though.
In the above, I provided code to estimate the dice biases given observations, which you can’t do with the online sites. And sometimes you’re playing a game where the dice mechanic is complicated, like my favorite, Blades in the Dark, which uses dice pools and selects the largest value, then quantizes unevenly into 1–3, 4–5, 6, and double 6. If you really want something mind bending, look at the Star Wars game, Edge of Empire. Four different dice types (two for the GM, two for the players) and none of them have numbers. And then there are the exploding dice, where if you roll the highest number on a die, you roll it again and add the result ad infinitum. Many of these dice sites can account for these kinds of dice.