# Rolling dice

I want to model dice rolls. People recommend generating synthetic data to test the model. To do that I gave the data file `{"N": 3, "y": [null, null, null]}` but Stan said `variable: y, error: null values not allowed`. How can I generate data from this model for testing?

Is this impossible? How do you suggest testing a model with discrete data?

``````data {
int N;
int y[N];
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
``````

You have to actually simulate realistic synthetic data which you pass as `y`. For your problem, and R script to do the simulation would be

``````p <- rep(1, 6) # true unnormalized probabilities
N <- 10 # number of trials

# simulate the data
sim_data <- rmultinom(10,1,p)

# for a more consise data structure
sim_data2 <- apply(sim_data, 2, function(x){which(x == 1)})
``````

To thoroughly test the computational accuracy of a model, it would be wise to test multiple runs for a variety of choices of `p`. The cleanest way to do this testing is for a procedure known as simulation-based calibration (SBC), for which some of the Stan developers maintain a dedicated R package; see here GitHub - hyunjimoon/SBC

Is it possible to make Stan simulate synthetic data, instead of duplicating the model in R?

Sure; you need to do it in the `transformed data` block.

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions are also prohibited.

so I donâ€™t see how itâ€™s possible to simulate data in that block. Can you show how?

The linked page isnâ€™t correct, and has been updated. Click â€śview current versionâ€ť at the top of the page to see the updated version of the page, current as of Stan 2.28 (the linked version from your post is 2.18).

I must be misunderstanding. The current version of `transformed data` says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used.

I donâ€™t see any examples of how to use that block for generating data so Iâ€™m just guessing, but this fails to compile.

``````data {
int N;
}
transformed data {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}

``````
`````` column 8 to column 34:
-------------------------------------------------
5:  transformed data {
6:      for (n in 1:N) {
7:          y[n] ~ categorical(theta);
^
8:      }
9:  }
-------------------------------------------------

Target can only be accessed in the model block or in definitions of functions with the suffix _lp.

``````

`transformed data` cannot handle sampling statements, but it can handle `_rng` functions.

1 Like

It is a common misunderstanding that `~`, a sampling statement, means generating random numbers from a distribution (using `~` for incrementing log prob probably hasnâ€™t been the best design decision in Stan). You need `categorical_rng` to draw from categorical distribution.

1 Like

Hereâ€™s the â€śloaded diceâ€ť model you want which will work for any number of dice sides `K` (use `K = 6` for standard six-sided dice). The Dirichlet isnâ€™t doing anything because `Dirichlet(1)` is uniform and the sampling statement turns into a no-op (just adds a constant to the target log density which doesnâ€™t affect sampling). Also, you want to vectorize sampling statements for efficiency and include lower/upper bounds on data variables to indicate constraints.

``````data {
int<lower=0> K;
int<lower=0> N;
array[N] int<lower=1, upper=K> y;
}
parameters {
simplex[K] theta;
}
model {
y ~ categorical(theta);
}
``````

If you want to generate data, you can do this and run it for a single iteration:

``````data {
int<lower=0> K;
simplex[K] theta;
int<lower=0> N;
}
generated quantities {
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}
``````

It needs the dice size and simplex, but can then simulate. If you want to simulate the simplex, too, thatâ€™d be:

``````data {
int<lower=0> K;
int<lower=0> N;
}
generated quantities {
simplex[K] theta = dirichlet_rng(rep_vector(1, K));
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}
``````

Each posterior draw from this model contains a `K`-simplex and a replicated data set `y[1:N]`.

1 Like