Error in a binomial model

Hello!

I am quite new to Stan, and I am trying to self-train with some exercises (that might apply to my research in general).

As an example, I have created a list in R with:

  • N = Number of sites
  • X = Number of pig bones found in a site
  • T = Total bones (of any animal) found in a site
  • radiocarbon = The date in years to which this context is dated (ranging from 6000 BC to 3000 BC, coded as -6000 to -3000).
  • radiocarbon_sigma = A measurement error for the date.
# Number of sites
N <- 500

# Generate fake radiocarbon dates (years) with measurement errors
radiocarbon <- floor(runif(N, -6000, -3000))
sigma <- floor(runif(N, min = 0, max = 100))  # Measurement errors

# Generate fake total number of bones for each site
total_bones <- round(runif(N, min = 100, max = 1000))

# Generate fake number of pig bones for each site (assuming proportion varies)
p_true <- runif(N, min = 0, max = 1)  # True proportion of pig bones
pig_bones <- rbinom(N, total_bones, p_true)

# Putting all this together
data_list <- list(
  N = N,
  X = pig_bones,
  T = total_bones,
  radiocarbon = radiocarbon,
  sigma_radiocarbon = sigma
)


Now, ideally I would like a binomial model (or, even better, a beta-binomial model) to estimate the probability of finding pig bones at a given calendar date, so that I can then plot the posterior distribution and see the temporal variation in proportions along with a credible interval. I have tried the following code, but it is full of issues:

data {
  int<lower=0> N;                     // Number of sites
  int<lower=0> X[N];                  // Number of pig bones
  int<lower=0> T[N];                  // Total number of bones
  vector[N] radiocarbon;              // Radiocarbon dates
  vector<lower=0>[N] sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

parameters {
  real alpha;                         // Intercept
  real beta;                          // Slope for radiocarbon dates
  real date; // The predictor, i.e. the radiocarbon date
}

model {
  vector[N] pbar;
  for (i in 1:N) {
    pbar[i] = alpha[i] + beta[i]*date[i];
    pbar[i] = inv_logit(pbar[i]);
  }
  // Priors
  alpha ~ normal(0, 1.5);              // Prior for intercept
  beta ~ normal(0, 1.5);               // Prior for slope

  // Likelihood
  for (i in 1:N) {
    X[i] ~ binomial(T[i], pbar[i]);      // Binomial likelihood
    date[i] ~ normal(radiocarbon[i], sigma_radiocarbon[i]);
  }
}

Errors:

# Setting chains to 1 just for diagnostics
fit <- sampling(stan_model, data = data_list, chains = 1, iter = 1000, warmup = 500, cores = 2)

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: normal_lpdf: Scale parameter is 0, but must be > 0!  (in 'model70c11e48b364_c86e30e492361fb52409da14f1e4b216' at line 31)
...

I apologise; I know the code might be very messy, but I am trying to learn.

Thank you in advance to whoever is willing to help me improve this!

P.S. As an extra step if this model works I would like to add a varying effect because I might have more samples from the same site, so it would be good to take into account that source of variability as well. Would that be too difficult to add?

The call `sigma <- floor(runif(N, min = 0, max = 100))` can generate values of 0, which would produce the error you shared: a normal scale parameter must be strictly positive. There are several ways to fix this, but I would just test the model without rounding down to the nearest integer.
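For example, a minimal tweak (one of several possible fixes) is to drop the `floor()` and bound the errors away from zero:

```r
N <- 500

# Measurement errors: continuous and strictly positive
sigma <- runif(N, min = 1, max = 100)
```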

And no worries. Your example was very easy to follow, and you documented everything well.

Hi Simon,

Thank you so much for your kindness. The error is now gone, but I get another one:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Too many indexes, expression dimensions=0, indexes found=1
 error in 'model70c14b1baed8_83d8533b2f4f65c13a7f863a26132a04' at line 19, column 26
  -------------------------------------------------
    17:   vector[N] pbar;
    18:     for ( i in 1:N ) {
    19:         pbar[i] = alpha[i] + beta[i]*date[i];
                                 ^
    20:         pbar[i] = inv_logit(pbar[i]);
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model '83d8533b2f4f65c13a7f863a26132a04' due to the above error.

I have tried changing the beginning of the stan code to the following, but I get a message giving me a syntax error:

data {
  int<lower=0> N;                     // Number of sites
  array[N] int X;                  // Number of pig bones
  array[N] int T;                  // Total number of bones
  array[N] int radiocarbon;              // Radiocarbon dates
  array[N] int sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

I am not sure exactly what I am doing wrong, but the error must be in the Stan code now.

1 Like

Thanks for sharing the updated code. There are a few things going on.

First, you’ve updated some of your code to the new array syntax (which is good). The old syntax (which is no longer valid in the most recent version of Stan) followed your first example:

int X[N];

The new syntax follows your second:

array[N] int X;

So glad you caught that.

The new error occurs because alpha only has one element (it’s just a real, not a vector or array of reals), so you can’t use brackets to index into it. The same is true for beta and date. In all three cases, I would use either a vector or an array of reals.

As a side note, I would go back to treating radiocarbon and sigma_radiocarbon as vectors.

Thank you for the answer. I have tried changing the code now to the following.

data {
  int<lower=0> N;          // Number of sites
  array[N] int X;         // Number of pig bones
  array[N] int T;         // Total number of bones
  vector[N] radiocarbon;       // Radiocarbon dates
  vector[N] sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

parameters {
  array[N] real alpha;            // Intercept
  array[N] real beta;             // Slope for radiocarbon dates
  array[N] real date;
}

model {
  vector[N] pbar;
  for ( i in 1:N ) {
    pbar[i] = alpha[i] + beta[i]*date[i];
    pbar[i] = inv_logit(pbar[i]);
  }
  // Priors
  alpha ~ normal(0, 1.5);       // Prior for intercept
  beta ~ normal(0, 1.5);       // Prior for slope

  // Likelihood
  for (i in 1:N) {
    X[i] ~ binomial(T[i], pbar[i]);   // Binomial likelihood
    date[i] ~ normal(radiocarbon[i], sigma_radiocarbon[i]);
  }
}

I have tried both arrays of reals and vectors in the parameters section, but the parser seems to be stopping earlier, in the data block.

 SYNTAX ERROR, MESSAGE(S) FROM PARSER:
 error in 'model854b5bb1e70b_079be793b49218b6bb574e21fef2675e' at line 4, column 2
  -------------------------------------------------
     2: data {
     3:   int<lower=0> N;          // Number of sites
     4:   array[N] int X;         // Number of pig bones
         ^
     5:   array[N] int T;         // Total number of bones
  -------------------------------------------------

Tomorrow I will also try saving the Stan code in a separate file and loading it into R; maybe the tab characters are giving me this issue?

The problem was indeed the whitespace in the formula. The model now runs, but it does not sample efficiently at all: most Rhats are very high, etc.


I have tried starting all over using the rethinking package, to see if I can understand the basics of this model and then implement it with Stan if it works:

m1 <- 
  ulam(
    alist(
      X ~ dbinom(T, p),
      logit(p) <- alpha + beta*date,
      alpha ~ dnorm(0, 1.5),
      beta ~ dnorm(0,1.5),
      date ~ dnorm(radiocarbon,sigma_radiocarbon)
    ), 
    data = data_list, 
    chains = 2,
    cores=detectCores()
  )

Somehow the sampled dates are all very similar (SD = 0.28). I know I probably shouldn’t sample dates with a normal distribution, but I want to treat time as a continuous variable. Also, the betas are all around 0.

posterior = extract.samples(m1)
> precis( posterior , depth=2 )
ulam posterior: 1000 samples from object
         mean   sd    5.5%   94.5%      histogram
alpha    0.24 0.25   -0.02    0.49    ▇▁▁▁▁▁▁▁▁▁▇
beta     0.00 0.00    0.00    0.00 ▁▇▁▁▁▁▁▁▁▁▁▁▇▁
date  4361.49 0.28 4361.03 4361.95      ▁▁▃▅▇▇▂▁▁

The aim was to get a temporal model that could show moments of increase and decrease in the probabilities of animal remains over the years, but I am doing something conceptually wrong.

My logic was to create an intercept (alpha) model with a slope (beta) that is conditioned by the independent time variable (date). The time has a measurement error so I sample it in the model with a normal distribution.

I apologise, I do not have formal training in statistics, so I may be doing something completely wrong.

It’s been a while since I’ve used the rethinking package so I’m not sure how you’d define this there, but the problem is in your parameter declarations. You want to define a single parameter value each for alpha and beta, rather than a vector of parameters of length N (one intercept and slope for each observation). However, as you have one observed date per data point, and want to predict the “true” date for each, you do want to declare date as a vector:

parameters {
  real alpha;            // Intercept
  real beta;             // Slope for radiocarbon dates
  vector[N] date;        // "true" dates
}
1 Like

The code runs with these adjustments, and I now get a single alpha and a single beta, while I get as many dates as observations. However, the estimated dates do not look realistic.


Stan code:

data {
  int<lower=0> N;          // Number of sites
  array[N] int X;         // Number of pig bones
  array[N] int T;         // Total number of bones
  vector[N] radiocarbon;       // Radiocarbon dates
  vector[N] sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

parameters {
  real alpha;            // Intercept
  real beta;             // Slope for radiocarbon dates
  vector[N] date;        // "true" dates
}

model {
  vector[N] pbar;
  for ( i in 1:N ) {
    pbar[i] = alpha + beta*date[i];
    pbar[i] = inv_logit(pbar[i]);
  }
  // Priors
  alpha ~ normal(0, 1.5);       // Prior for intercept
  beta ~ normal(0, 1.5);       // Prior for slope

  // Likelihood
  for (i in 1:N) {
    X[i] ~ binomial(T[i], pbar[i]);   // Binomial likelihood
    date[i] ~ normal(radiocarbon[i], sigma_radiocarbon[i]);
  }
}
m3 <- stan(
  file = "m1.stan",       # Stan program
  data = data_list,       # named list of data
  chains = 2,             # number of Markov chains
  warmup = 1000,          # number of warmup iterations per chain
  iter = 2000,            # total number of iterations per chain
  cores = 2
)

> precis(m3,2)
             mean     sd    5.5%   94.5% n_eff Rhat4
alpha       -0.03   0.00   -0.03   -0.02   144  1.03
beta         0.00   0.00    0.00    0.00     2  1.41
date[1]     -0.70   0.69   -1.57    0.86     4  1.56
date[2]      1.60   0.73    0.46    2.56     3  2.45
date[3]      0.04   1.48   -1.92    1.62     1 12.30
...

I don’t understand why I get these results; it seems almost as if I am getting the product of beta*date or something else. Also, beta continues to be 0, and all the other issues (n_eff, Rhat, etc.) remain.

If I try to generate the dates with a normal distribution straight in R with the same assumptions:

with(data_list,
rnorm(10, radiocarbon, sigma_radiocarbon)
)

I get:
[1] 5051.292 4106.767 5174.575 5714.055 3274.404 5842.854 3662.921 3263.604 3863.440 5241.465

Maybe I have to do this by incrementing `target` with the binomial log density directly rather than putting the binomial sampling statement in the for loop.

It looks like the true values of both alpha and beta are zero, so that part of the model looks fine to me.

It’s not clear to me why the dates are so small and seemingly unrelated to the values of radiocarbon and sigma_radiocarbon (though a plot comparing radiocarbon to date might be helpful).

You might consider breaking your model into two and making sure each works on its own. Write one model where you pass the true dates and estimate alpha and beta alone. Then write a separate model where you pass radiocarbon and sigma_radiocarbon and estimate the dates. Then, once the two models work separately, write a third model that estimates everything simultaneously. To be clear, I think your current combined model looks good on both fronts separately, so this might be just added tedium in your case. But this step-by-step approach can be really helpful when trying to understand why a complex model goes off the rails.
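For instance, the first of those sub-models might look something like this (a sketch in which the true dates are passed in as data rather than estimated):

```stan
data {
  int<lower=0> N;
  array[N] int X;   // Number of pig bones
  array[N] int T;   // Total number of bones
  vector[N] date;   // True dates, treated as known
}

parameters {
  real alpha;       // Intercept
  real beta;        // Slope for date
}

model {
  alpha ~ normal(0, 1.5);
  beta ~ normal(0, 1.5);
  // binomial_logit applies inv_logit internally
  X ~ binomial_logit(T, alpha + beta * date);
}
```

If this model recovers alpha and beta from simulated data, you can move on to the dates-only model and then recombine the two.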

1 Like

I think the measurement-error model is misspecified. At the moment, each true date gets a prior mean and SD from the measurement, but as there is no relationship between date and the number of bones, the only information available to constrain date is in that prior.

You need to model the observed dates as a function of the “true” dates and the known measurement error. Then, you need to supply some sort of model for the true value of the dates. There’s a detailed section on it in the users’ guide: Measurement Error and Meta-Analysis. Something like this works (ignoring the bones part), although it doesn’t sample well:

data {
  int N;
  vector[N] radiocarbon;       // Radiocarbon dates
  vector[N] sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

parameters { 
  real mu_date;           
  real<lower=0> sigma_date;  
  vector[N] date;
}

model {
  date ~ normal(mu_date, sigma_date);
  radiocarbon ~ normal(date, sigma_radiocarbon);
}

What might be useful is simulating the generative model directly: sample a set of known true dates under some prior model, and then simulate a set of observations from those true dates. You can then check and see how well your model picks up the true values.
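As a sketch of that simulation in R (with the same prior values as the model below, chosen just for illustration):

```r
set.seed(1)
N <- 100

# True dates drawn from an assumed prior
date_true <- rnorm(N, mean = -4500, sd = 750)

# Known measurement errors, bounded away from zero
sigma_radiocarbon <- runif(N, min = 1, max = 100)

# Observed dates: true dates plus measurement noise
radiocarbon <- rnorm(N, mean = date_true, sd = sigma_radiocarbon)
```

Fitting the measurement-error model to radiocarbon and sigma_radiocarbon should then recover estimates of date close to date_true.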

Edit: I reminded myself a bit more about how measurement-error models work. The model below samples very well:

data {
    int N;
    vector[N] radiocarbon;       // Radiocarbon dates
    vector[N] sigma_radiocarbon; // Measurement errors for radiocarbon dates
}

parameters {
    vector[N] date;
}

model {
    date ~ normal(-4500, 750);
    radiocarbon ~ normal(date, sigma_radiocarbon);
}
2 Likes