Rolling dice

abs · January 31, 2022, 7:15pm

I want to model dice rolls. People recommend generating synthetic data to test the model. To do that I gave the data file {"N": 3, "y": [null, null, null]} but Stan said variable: y, error: null values not allowed. How can I generate data from this model for testing?

Is this impossible? How do you suggest testing a model with discrete data?

data {
    int N;
    int y[N];
}
parameters {
    simplex[6] theta;
}
model {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}

jsocolar · January 31, 2022, 7:39pm

You have to actually simulate realistic synthetic data which you pass as y. For your problem, and R script to do the simulation would be

p <- rep(1, 6) # true unnormalized probabilities
N <- 10 # number of trials

# simulate the data
sim_data <- rmultinom(10,1,p)

# for a more consise data structure
sim_data2 <- apply(sim_data, 2, function(x){which(x == 1)})

To thoroughly test the computational accuracy of a model, it would be wise to test multiple runs for a variety of choices of p. The cleanest way to do this testing is for a procedure known as simulation-based calibration (SBC), for which some of the Stan developers maintain a dedicated R package; see here GitHub - hyunjimoon/SBC

abs · January 31, 2022, 7:45pm

Is it possible to make Stan simulate synthetic data, instead of duplicating the model in R?

jsocolar · January 31, 2022, 7:59pm

Sure; you need to do it in the transformed data block.

abs · January 31, 2022, 8:01pm

transformed data says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions are also prohibited.

so I don’t see how it’s possible to simulate data in that block. Can you show how?

jsocolar · January 31, 2022, 8:08pm

The linked page isn’t correct, and has been updated. Click “view current version” at the top of the page to see the updated version of the page, current as of Stan 2.28 (the linked version from your post is 2.18).

abs · January 31, 2022, 8:16pm

I must be misunderstanding. The current version of transformed data says

The statements in the transformed data block are designed to be executed once and have a deterministic result. Therefore, log probability is not accumulated and sampling statements may not be used.

I don’t see any examples of how to use that block for generating data so I’m just guessing, but this fails to compile.

data {
    int N;
}
transformed data {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}
parameters {
    simplex[6] theta;
}
model {
    theta ~ dirichlet(rep_vector(1,6));
    for (n in 1:N) {
        y[n] ~ categorical(theta);
    }
}

 column 8 to column 34:
   -------------------------------------------------
     5:  transformed data {
     6:      for (n in 1:N) {
     7:          y[n] ~ categorical(theta);
                 ^
     8:      }
     9:  }
   -------------------------------------------------

Target can only be accessed in the model block or in definitions of functions with the suffix _lp.

jsocolar · January 31, 2022, 10:43pm

transformed data cannot handle sampling statements, but it can handle _rng functions.

jtimonen · January 31, 2022, 11:08pm

It is a common misunderstanding that ~, a sampling statement, means generating random numbers from a distribution (using ~ for incrementing log prob probably hasn’t been the best design decision in Stan). You need categorical_rng to draw from categorical distribution.

Bob_Carpenter · February 1, 2022, 7:44pm

Here’s the “loaded dice” model you want which will work for any number of dice sides K (use K = 6 for standard six-sided dice). The Dirichlet isn’t doing anything because Dirichlet(1) is uniform and the sampling statement turns into a no-op (just adds a constant to the target log density which doesn’t affect sampling). Also, you want to vectorize sampling statements for efficiency and include lower/upper bounds on data variables to indicate constraints.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] int<lower=1, upper=K> y;
}
parameters {
  simplex[K] theta;
}
model {
  y ~ categorical(theta);
}

If you want to generate data, you can do this and run it for a single iteration:

data {
  int<lower=0> K;
  simplex[K] theta;
  int<lower=0> N;
}
generated quantities {
  array[N] y; 
  for (n in 1:N) {
    y[n] = categorical_rng(theta);
  }
}

It needs the dice size and simplex, but can then simulate. If you want to simulate the simplex, too, that’d be:

data {
  int<lower=0> K;
  int<lower=0> N;
}
generated quantities {
  simplex[K] theta = dirichlet_rng(rep_vector(1, K));
  array[N] y; 
  for (n in 1:N) {
    y[n] = categorical_rng(theta);
  }
}

Each posterior draw from this model contains a K-simplex and a replicated data set y[1:N].

Wormanza · December 17, 2024, 6:39am

Well, I had a similar issue once when I was trying to model dice rolls. Instead of overthinking it, I just used dice.onl to quickly generate some random rolls and formatted them to fit my setup. It was way easier than writing a script from scratch, and it gave me solid test data to work with right away. Sometimes keeping it simple saves a ton of time.

Bob_Carpenter · December 17, 2024, 11:02pm

Hi, @Wormanza and welcome to the Stan forums!

If you’re trying to understand dice, you might want to check out one of the sites that collect statistics, like

https://dice.clockworkmod.com.

This one lets you specify the dice in D&D-style notation (e.g., 3d6 for rolling 3 six sided dice and summing them) and then generates histograms of outcome probabilities for you. These sites are very popular for those of us who play and design role playing games.

Sometimes we need a bit more, though.

In the above, I provided code to estimate the dice biases given observations, which you can’t do with the online sites. And sometimes you’re playing a game where the dice mechanic is complicated, like my favorite, Blades in the Dark, which uses dice pools and selects the largest value, then quantizes unevenly into 1–3, 4–5, 6, and double 6. If you really want something mind bending, look at the Star Wars game, Edge of Empire. Four different dice types (two for the GM, two for the players) and none of them have numbers. And then there are the exploding dice, where if you roll the highest number on a die, you roll it again and add the result ad infinitum. Many of these dice sites can account for these kinds of dice.

Topic		Replies	Views
Sampling from a matrix in STAN Modeling rstan , techniques , performance	28	1293	September 1, 2021
Generating random numbers in the model Modeling	20	5038	July 23, 2021
Problem Simulating Data with Generated Quantities (Dimension mismatch in assignment; type = real; right-hand side type = real[ ]) RStan rstan , techniques , specification	3	853	April 29, 2021
Statistics on random numbers in model block Modeling	4	797	August 8, 2017
Simulating fake data for regression in Stan Modeling techniques , specification , performance	2	1296	October 9, 2021

Rolling dice

Related topics