Trouble writing a hypergeometric family function: how to supply only integers to hypergeometric_lpmf?

I’m puzzling over how to supply hypergeometric_lpmf with integer parameters. Maybe it requires a tparameters block?

The data I’m modeling is the number of guesses it takes to get the next character (or word) right in a cloze completion task. So, you’ve got a text snippet “…nd he went to the store to buy milk, but was waylaid by a _”, and you’re meant to guess what letter goes in the blank.

There are 26 letters in the English alphabet, and you guess until you get it right - sampling without replacement. So the hypergeometric distribution seemed like the thing that was closest to modeling the underlying data-generating process.

The hypergeometric requires combinations of integers; how can I make sure it gets integers?
Or/also, am I missing something here?

Here’s code for a simple model (I’ve tried many different things at this point, this is just a starting point for discussion):


# Load necessary libraries
library(brms)
library(tidyverse)
library(tidybayes)
library(patchwork)

# Define constants for hypergeometric distribution
N <- 26  # Total number of letters in the alphabet
m <- 1   # Number of correct letters
nincorrect <- 25  # Number of incorrect letters

# Define a custom family function for the hypergeometric distribution
# Note: The hypergeometric distribution isn't natively supported in brms, so we define it as a custom family
hypergeo_family <- custom_family(
  "hypergeo",
  dpars = c("mu", "nincorrect", "k"),
  links = c("identity", "identity", "identity"),
  lb = c(0, 0, 0),
  type = "int"
)

# Define the Stan log-PMF for the custom family
dparse <- stanvar(
  scode =     "
    real hypergeo_lpmf(int y, int mu, int nincorrect, int k) {
      return hypergeometric_lpmf(y | mu, nincorrect, k); // every argument is declared int
    }",
  block = "functions"
)

# Fit a Bayesian model with brms
m_hypergeo <-
  brm(
    formula = num_guesses ~ GPT_num_guesses + GPT_ans_prob + OANC_mean_guesses + prop_left + (1|ppt.code),
    family = hypergeo_family,
    data = model_data,
    stanvars = dparse,
    cores = 4,
    seed = 1,
    iter = 2000
  )

# Summary of the model
summary(m_hypergeo)

# Plot model diagnostics
plot(m_hypergeo)

The error I see with this particular implementation is:

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  0
Semantic error in 'string', line 57, column 16 to column 58:
   -------------------------------------------------
    55:      }
    56:      for (n in 1:N) {
    57:        target += hypergeo_lpmf(Y[n] | mu[n], nincorrect, k);
                         ^
    58:      }
    59:    }
   -------------------------------------------------

Ill-typed arguments supplied to function 'hypergeo_lpmf':
(int, real, real, real)
Available signatures:
(int, int, int, int) => real
  The second argument must be int but got real

EDIT: it has since become clear that what I really need is the negative hypergeometric:

It appears this is implemented in the extraDistr package in R (see its “Negative hypergeometric distribution” documentation).

I have little idea where to begin.
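
As a possible starting point, the probability that the first correct guess lands on draw k can be written with base R's dhyper(). This is a minimal sketch, assuming guesses are made uniformly at random without replacement; the helper name dnhyper_first is hypothetical and is not the extraDistr implementation.

```r
# Hypothetical helper (not from extraDistr): probability that the first
# correct guess occurs on draw k, sampling without replacement from
# n_correct correct and n_incorrect incorrect letters.
dnhyper_first <- function(k, n_correct = 1, n_incorrect = 25) {
  total <- n_correct + n_incorrect
  # P(no correct letter in the first k - 1 draws), via the hypergeometric pmf
  p_no_hit <- dhyper(0, m = n_correct, n = n_incorrect, k = k - 1)
  # P(draw k is correct | none of the first k - 1 draws were)
  p_hit <- n_correct / (total - (k - 1))
  p_no_hit * p_hit
}

p <- dnhyper_first(1:26)
```

With one correct letter out of 26, every value of k from 1 to 26 has probability 1/26, so under purely random guessing the constants alone leave nothing to estimate.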

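I’m afraid Stan doesn’t support integer parameters. That’s because HMC only samples from continuous densities. Every argument of hypergeometric_lpmf is an integer, so none of them can be a parameter to estimate; they can only be supplied as data. So there’s not much utility to this hypergeometric distribution in a model like this.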

I don’t know anything about brms specifics.

Thanks for the info Bob!