Rolling dice

I want to model dice rolls. People recommend generating synthetic data to test the model. To do that I gave the data file {"N": 3, "y": [null, null, null]} but Stan said variable: y, error: null values not allowed. How can I generate data from this model for testing?

Is this impossible? How do you suggest testing a model with discrete data?

data {
    int N;
    int y[N];
}
parameters {
    simplex[6] theta;
}
model {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}

You have to actually simulate realistic synthetic data which you pass as y. For your problem, and R script to do the simulation would be

p <- rep(1, 6) # true unnormalized probabilities
N <- 10 # number of trials

# simulate the data
sim_data <- rmultinom(10,1,p)

# for a more consise data structure
sim_data2 <- apply(sim_data, 2, function(x){which(x == 1)})

To thoroughly test the computational accuracy of a model, it would be wise to test multiple runs for a variety of choices of p. The cleanest way to do this testing is for a procedure known as simulation-based calibration (SBC), for which some of the Stan developers maintain a dedicated R package; see here GitHub - hyunjimoon/SBC

Is it possible to make Stan simulate synthetic data, instead of duplicating the model in R?

Sure; you need to do it in the transformed data block.

transformed data says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions are also prohibited.

so I don’t see how it’s possible to simulate data in that block. Can you show how?

The linked page isn’t correct, and has been updated. Click “view current version” at the top of the page to see the updated version of the page, current as of Stan 2.28 (the linked version from your post is 2.18).

I must be misunderstanding. The current version of transformed data says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used.

I don’t see any examples of how to use that block for generating data so I’m just guessing, but this fails to compile.

data {
    int N;
}
transformed data {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}
parameters {
    simplex[6] theta;
}
model {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}

 column 8 to column 34:
   -------------------------------------------------
     5:  transformed data {
     6:      for (n in 1:N) {
     7:          y[n] ~ categorical(theta);
                 ^
     8:      }
     9:  }
   -------------------------------------------------

Target can only be accessed in the model block or in definitions of functions with the suffix _lp.

transformed data cannot handle sampling statements, but it can handle _rng functions.

1 Like

It is a common misunderstanding that ~, a sampling statement, means generating random numbers from a distribution (using ~ for incrementing log prob probably hasn’t been the best design decision in Stan). You need categorical_rng to draw from categorical distribution.

1 Like

Here’s the “loaded dice” model you want which will work for any number of dice sides K (use K = 6 for standard six-sided dice). The Dirichlet isn’t doing anything because Dirichlet(1) is uniform and the sampling statement turns into a no-op (just adds a constant to the target log density which doesn’t affect sampling). Also, you want to vectorize sampling statements for efficiency and include lower/upper bounds on data variables to indicate constraints.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] int<lower=1, upper=K> y;
}
parameters {
  simplex[K] theta;
}
model {
  y ~ categorical(theta);
}

If you want to generate data, you can do this and run it for a single iteration:

data {
  int<lower=0> K;
  simplex[K] theta;
  int<lower=0> N;
}
generated quantities {
  array[N] y; 
  for (n in 1:N) {
    y[n] = categorical_rng(theta);
  }
}

It needs the dice size and simplex, but can then simulate. If you want to simulate the simplex, too, that’d be:

data {
  int<lower=0> K;
  int<lower=0> N;
}
generated quantities {
  simplex[K] theta = dirichlet_rng(rep_vector(1, K));
  array[N] y; 
  for (n in 1:N) {
    y[n] = categorical_rng(theta);
  }
}

Each posterior draw from this model contains a K-simplex and a replicated data set y[1:N].

1 Like