I want to model dice rolls. People recommend generating synthetic data to test the model. To do that I gave the data file {"N": 3, "y": [null, null, null]} but Stan said variable: y, error: null values not allowed. How can I generate data from this model for testing?

Is this impossible? How do you suggest testing a model with discrete data?

data {
int N;
int y[N];
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}

You have to actually simulate realistic synthetic data which you pass as y. For your problem, and R script to do the simulation would be

p <- rep(1, 6) # true unnormalized probabilities
N <- 10 # number of trials
# simulate the data
sim_data <- rmultinom(10,1,p)
# for a more consise data structure
sim_data2 <- apply(sim_data, 2, function(x){which(x == 1)})

To thoroughly test the computational accuracy of a model, it would be wise to test multiple runs for a variety of choices of p. The cleanest way to do this testing is for a procedure known as simulation-based calibration (SBC), for which some of the Stan developers maintain a dedicated R package; see here GitHub - hyunjimoon/SBC

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions are also prohibited.

so I donâ€™t see how itâ€™s possible to simulate data in that block. Can you show how?

The linked page isnâ€™t correct, and has been updated. Click â€śview current versionâ€ť at the top of the page to see the updated version of the page, current as of Stan 2.28 (the linked version from your post is 2.18).

I must be misunderstanding. The current version of transformed data says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used.

I donâ€™t see any examples of how to use that block for generating data so Iâ€™m just guessing, but this fails to compile.

data {
int N;
}
transformed data {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}
parameters {
simplex[6] theta;
}
model {
theta ~ dirichlet(rep_vector(1,6));
for (n in 1:N) {
y[n] ~ categorical(theta);
}
}

column 8 to column 34:
-------------------------------------------------
5: transformed data {
6: for (n in 1:N) {
7: y[n] ~ categorical(theta);
^
8: }
9: }
-------------------------------------------------
Target can only be accessed in the model block or in definitions of functions with the suffix _lp.

It is a common misunderstanding that ~, a sampling statement, means generating random numbers from a distribution (using ~ for incrementing log prob probably hasnâ€™t been the best design decision in Stan). You need categorical_rng to draw from categorical distribution.

Hereâ€™s the â€śloaded diceâ€ť model you want which will work for any number of dice sides K (use K = 6 for standard six-sided dice). The Dirichlet isnâ€™t doing anything because Dirichlet(1) is uniform and the sampling statement turns into a no-op (just adds a constant to the target log density which doesnâ€™t affect sampling). Also, you want to vectorize sampling statements for efficiency and include lower/upper bounds on data variables to indicate constraints.

data {
int<lower=0> K;
int<lower=0> N;
array[N] int<lower=1, upper=K> y;
}
parameters {
simplex[K] theta;
}
model {
y ~ categorical(theta);
}

If you want to generate data, you can do this and run it for a single iteration:

data {
int<lower=0> K;
simplex[K] theta;
int<lower=0> N;
}
generated quantities {
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}

It needs the dice size and simplex, but can then simulate. If you want to simulate the simplex, too, thatâ€™d be:

data {
int<lower=0> K;
int<lower=0> N;
}
generated quantities {
simplex[K] theta = dirichlet_rng(rep_vector(1, K));
array[N] y;
for (n in 1:N) {
y[n] = categorical_rng(theta);
}
}

Each posterior draw from this model contains a K-simplex and a replicated data set y[1:N].