Exception error in STAN program

Hello,

I recently posted a question on a “mark-recapture-like” scenario with an accompanying STAN program. However, I am unable to run it as it stands.

I receive the error (in part)

Exception: vector[multi] indexing: accessing element out of range.

Here is some data along with the program.

### R script ###

library(rstan)

N <- 76
Hstar <- 4
probs <- c(53, 13, 9, 1) / N
perms <- 10000

fit <- stan("model.stan", data = list(N = N, Hstar = Hstar, prob = probs, perms = perms))

STAN program

data {
  int<lower=1> N;
  int<lower=1> Hstar;
  simplex[Hstar] probs; // elements sum to 1
  int<lower=2> perms;
}

transformed data {
  matrix[perms, N] pop;
  simplex[N] distrib = rep_vector(1.0 / N, N);
  int<lower = 1, upper = N> resample_idxs[N];
  vector[perms] res;
  for (i in 1:N)
    resample_idxs[i] = categorical_rng(distrib);
  for (j in 1:perms)
    res = N * probs[resample_idxs];
 pop = append_col(res, pop);
}

transformed parameters {
  real<lower=N> Nstar;
  vector[perms] H = col(pop, N);
  Nstar = (N * Hstar) / mean(H);
}

model {
  // empty since this is taken care of by categorical_rng
}

Essentially the program fills a matrix (pop) with integers between 1-4 and takes the mean across of the final column. This is followed by some substitutions into a formula to compute the quantity of interest (Nstar).

Can someone point me to a solution regarding the Exception error above and anything else that may be funky about the STAN program, which appears to be syntactically correct (since I am a newcomer)?

The issues are caused by this section:

  for (j in 1:perms)
    res = N * probs[resample_idxs];

You’ve declared probs as having 4 elements (i.e., a simplex of length Hstar, where Hstar is 4), but you’re passing indexes that can range from 1 to 76 ( int<lower = 1, upper = N> resample_idxs[N]). So the index that you’re passing is larger than the vector.

Also, you’ve declared this as a loop, but you’re not actually looping over anything (i.e., nothing is being indexed by j).

There are further issues with this line:

 pop = append_col(res, pop);

You’ve initialised an empty matrix, and then you’re trying to append the vector res to it. This will fail because because you’re creating a matrix of dimension matrix[perms, N+1] and assigning it to a matrix of size matrix[perms, N+1]. Can you clarify what you’re trying to achieve here? There might be a more efficient way to do this

Thanks for the quick reply.

I see the issue now regarding not looping over anything. I would need to write ‘res[j]’ instead of just ‘res’. This was a silly mistake. But this would still not fix things completely as a STAN check reveals a dimension mismatch.

I will try to explain more clearly.

In the section

for (j in 1:perms)
    res = N * probs[resample_idxs];

‘res’ should be a (column) vector of length ‘N’ consisting of randomly placed integer values between 1-4. The ‘for’ loop should repeat this ‘perms’ times, each time filling the ‘pop’ matrix by column so that I now have a filled ‘pop’ matrix of dimensions 10000 rows x 76 columns. ‘resample_idxs’ refers to indices of the ‘probs’ simplex. This is just in place for easy retrieval of integer values. I took the idea from the STAN implementation of the bootstrap. Multiplication of ‘probs[resample_idxs]’ by ‘N’ converts probabilities to counts.

Once the ‘pop’ matrix is filled, I take the mean of the last column (in the example, corresponding to the 76th column) this corresponds to the vector ‘H’. This is why I use
pop = append_col(res, pop)

I then substitute N, Hstar and mean(H) into a formula to compute the quantity of interest (Nstar).

Let me know if this explanation is sufficient. I would be happy to try and explain things further.

To clarify, should pop be a 10000 x 76 matrix of randomly-generated integers between 1 and 4, or should it be a matrix randomly consisting of the values: 53, 13, 9, and 1?

Also, can I ask why you’re implementing this in Stan? There don’t seem to be any parameters being sampled, and you would likely get the same results in R.

For example, you could express this like:

get_nstar = function(N,Hstar,counts,perms) {
    pop = structure(sample(counts,N*perms,replace = TRUE),dim=c(perms,N))
    Nstar = (N * Hstar) / mean(pop[,N])
    return(Nstar)
}

> get_nstar(76,4,c(53,13,9,1),10000)
[1] 15.87302

pop Is 10000 x 76 and should contain randomly sampled integers between 1-4.

The values 53, 13, 9, and 1 are first standardized to range between 0-1. This is a achieved by dividing by 76 (I.e, the sum of the 4 values). Then the values are relabeled with the values 1-4 — 1 for the value 53/76, 2 for the value 13/76, etc.

Also, the reason I am implementing this is STAN is twofold:

  1. I want to better hone my STAN skills and overall understanding by implementing a “nontraditional” model (as compared to a simple linear regression for instance)

  2. Despite this initial model not containing any priors for Nstar, eventually I want to expand the model to include prior information. The problem I face with that is I don’t know what a “good” prior would be. I suppose one could just as easily set a prior for the mean(H), where H consists of N values.

For my PhD work, I have implemented this in R as an R package called HACSim. It’s essentially a local search algorithm that can probably be best compared to a genetic algorithm.

In fact for the example I give (N = 76, Hstar = 4, probs = c(53, 13, 9, 1), perms = 10000), my algorithm finds that Nstar is likely around 124 individuals.

This means that if we were to sample randomly in the wild and collect new individuals, we would likely need a total of 124 specimens to capture the majority of genetic diversity that exists for the Siamese fighting fish, the species on which the example is based.

Essentially, the question of interest is: is N individuals enough? That is N = 76 individuals enough? Here N = 76 is our initial guess of the number of individuals needed to be sampled to capture the majority of genetic diversity that exists for that species. The key use here is that we don’t actually know the full diversity of genetic information for this species, only a snapshot provided by probs. The assumption is that the current sample is representative of ground truth.

Results of HACSim tell me that the answer is no and that we likely need around Nstar = 124 Individuals total.

However, the value of Nstar will differ depending on the number of perms used (the larger, the better, as this reduces Monte Carlo error).

The STAN model is a work in progress: I will be expanding it to more closely mimic my pure R implementation and also expand it in new ways (by setting priors). Right now, it’s basically a buggy minimally reproducible example.

Sorry for the ramble. I thought providing background and motivation would help clarify things.

1 Like

Hello Andrew,

So i have been working on fixing my STAN program and this is what I have come up with based on your valuable input. However, the program is still causing an out of range error.

There is much more I will add to the program (such as adding prior information), but I want to build it piece-by-piece and run it often to debug as I go along.

After consulting the STAN user guide and reference manual, I am lost as to what seems to be the issue here.

I have annotated the program. Essentially the program right now populates ‘pop’ with perms x N values in the range 1-Hstar (Hstar = 4 in my above example).

Can you weigh in here as to eliminating the error?

data {
  int<lower=1> N;
  int<lower=1> Hstar;
  simplex[Hstar] probs; // elements sum to 1
  int<lower=2> perms;
}

transformed data {
  matrix[perms, N] pop; // 'perms' rows, 'N' columns
  simplex[Hstar] distrib = rep_vector(1.0 / Hstar, Hstar); 
  int<lower=1,upper=Hstar> resample_idxs[Hstar]; 
  vector[perms] res; 
  for (i in 1:perms) {
  // should resample indices in resample_idxs 'perms' times
    resample_idxs[i] = categorical_rng(distrib); 
 }
 // should be a column vector containing values from 1:Hstar with 'perms' elements
  res = N * probs[resample_idxs]; 
  for (k in 1:N) {
  // should fill all 'N' columns of 'pop' matrix according to res
    pop = append_col(res, pop[, k]); 
  }
}

transformed parameters {
  real<lower=N> Nstar;
  vector[perms] H;
  H = col(pop, N);
  Nstar = (N * Hstar) / mean(H);
}

model {
// empty since this is taken care of by categorical_rng
}

Many thanks!

The issue that is resample_idxs is length Hstar, but you’re looping over perms:

  for (i in 1:perms) {
  // should resample indices in resample_idxs 'perms' times
    resample_idxs[i] = categorical_rng(distrib); 

But you still have the issue that this isn’t set up as a Bayesian problem - there are no random variables (i.e., everything here is known). This means that the use of transformed parameters is unnecessary (this can all be done in the transformed data block), because you have no parameters (random variables) that are being manipulated/transformed

1 Like

Thanks for the speedy reply.

I realized I had included the wrong for loop for the index i above. I have been maintaining several files representing my different attempts to logically assess why I keep getting the error I get.

Index i should range from 1:Hstar, correct?. I believe the loop for k is correct, but now I fail to see where ‘perms’ comes in, since ‘pop’ is supposed to have ‘perms’ filled rows across the N columns.

Once the bugs are worked out (including moving things from the transformed parameters block to the transformed data block), shouldn’t STAN still run (but not show any randomness)?

Regarding prior information, I will eventually model both the unknown amount of genetic diversity and the frequency distribution, call it ‘freq’ representing the total amount of genetic diversity as random variables. Placing a Dirichlet prior on ‘freq’ seems plausible. Modelling the unknown amount of diversity will require some thought as this will also be the number of elements in the ‘freq’ vector. These will be included in a parameters block.

Thanks again!

To simplify the process a little, what you’re doing is:

  • Sampling an integer from 1 to 4 (categorical_rng)
  • That integer then selects one of the probabilities passed as input data probs[resample_idxs]
  • The input probability is then multiplied by the total number to give the resulting count: N * probs[resample_idxs];
  • This should be done for every element of the matrix pop, so that the result is a perms x N matrix of the numbers 53, 13, 9, and 1

You can do this in a much simpler way by passing in the raw counts as data (i.e., don’t divide them by N before passing them to Stan), then use categorical_rng for each element of the pop matrix:

data {
  int<lower=1> N;
  int<lower=1> Hstar;
  int[Hstar] counts; // elements sum to N
  int<lower=2> perms;
}

transformed data {
  matrix[perms, N] pop; // 'perms' rows, 'N' columns
  simplex[Hstar] distrib = rep_vector(1.0 / Hstar, Hstar); 
  for(i in 1:perms)
    for(j in 1:N)
      pop[i,j] = counts[categorical_rng(distrib)];

}

Once the bugs are worked out (including moving things from the transformed parameters block to the transformed data block), shouldn’t STAN still run (but not show any randomness)?

I think for a model without parameters, you’ll need to run it in fixed_param mode, rather than regular HMC/NUTS. You can request this using algorithm="Fixed_param" in your RStan call

1 Like

Oh I see now, thanks a lot!

Reading through this new program and running a STAN check reveals that counts requires an identifier, so I changed it to

int[Hstar] counts;

I will consider this a solution.

Thanks!