Error "categorical_lpmf: Probabilities parameter is not a valid simplex. "

Hello, I am brand new to Stan and probabilistic modeling in general, but am collaborating with some more knowledgeable colleagues who graciously wrote the code for me. At the moment I am the only one who has time to work on this, though, so I am the one seeking help for our current problem.

We have experimental data (attached exp1_all.csv (223.5 KB) ) from a search task in which participants chose one out of five spatial locations (cups arranged in an approximate pentagon on a rotating table). There were four types of rotations. All participants completed three trials in each of four rotations. Participants were assigned to one of two conditions; there were also three age groups. Different strategies will lead to different kinds of errors across rotations (specified in models in our R code, attached Rcode_to_share.R (8.0 KB) ). The goal of our model is to see which mix of strategies best fits/predicts the performance across rotations, conditions, and age groups.

The code below yields the following error:

Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_lpmf: Probabilities parameter is not a valid simplex. sum(Probabilities parameter) = 0, but should be 1 (in 'model3c546d0f5fec_modComb_2020Aug' at line 47)Chain 1:
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1: Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
[2] "In addition: Warning message:"
[3] "In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :"
[4] " '-E' not found"
[1] "error occurred during calling the sampler; sampling not done"

The most relevant prior post I found was this one (Error in model for ordinal data - probabilities parameter is not a valid simplex) but the recommended changes there did not fix this error.

data {
  int<lower=1> K; // #number of search locations (5)
  int<lower=1> Ntotal; // #number of trials (12 per participant)
  int<lower=1> Nage; // # of age groups (3)
  int<lower=1> Ncond; // # of conditions (2 between subjects)
  int<lower=1> Nrot; // # of rotations (4 within subjects)
  int <lower=1> Nmod; // #of strategies/models for mixture (not including random response)

  int <lower=1,upper=3> ageGrp[Ntotal]; // #age group for each trial
  int <lower=1,upper=2> condInd[Ntotal]; //#cond for each trial
  int <lower=1,upper=4> rotInd[Ntotal]; // #rotation for each trial

  int <lower=1,upper=K> y[Ntotal]; // #responses from participant

  vector <lower=0, upper=1>[K] modPred[Ntotal,Nmod]; // #model predictions
  vector<lower=0>[K] alphaT; //#dirichlet param for rand response model
  vector<lower=0>[Nmod] alphaW; //#dirichlet params for mixing models
  real<lower=0> a; // #prior parameter for model pred weight
  real<lower=0> b; //#prior parameter for rand model weight
}

parameters {

  simplex[K] theta[Nage,Ncond,Nrot];
  simplex[Nmod] w[Nage,Ncond,Nrot];
}


model {
  int NmodIt=Nmod-1;
  
   for (n in 1:Ntotal) {
    real lps_ws[Nmod];

    for (m in 1:NmodIt) {
    	if (m == 1) {
    		lps_ws[m] = 0.;
    	}

      lps_ws[m] += categorical_lpmf(y[n] | modPred[n,m,1:]);
      lps_ws[m] += log(w[ageGrp[n],condInd[n],rotInd[n],m]);
    }
    lps_ws[Nmod] += categorical_lpmf(y[n] | theta[ageGrp[n], condInd[n],rotInd[n]]);

    target += log_sum_exp(lps_ws);
  }
  for (i in 1:Nage) {
    for (c in 1:Ncond) {
      for (r in 1:Nrot) {
        theta[i,c,r] ~ dirichlet(alphaT);
        w[i,c,r] ~ dirichlet(alphaW);
      }
    }
  }
}

Any ideas about what we are missing/doing wrong?
Thanks in advance.

I suspect it is this line:

lps_ws[m] += categorical_lpmf(y[n] | modPred[n,m,1:]);

Make sure sum(modPred[n, m]) == 1.0 for all n/m.

If it turns out to be a numerical thing (like maybe one row is 1.000000000001), then we can do something else, but I suspect there’s probably an issue with reading in data.

Thanks for the suggestion. I checked the row sums in modPred and no values exceeded 1. Does that sufficiently address this concern?

They need to sum to exactly one. If they sum to anything other than one it’d be a problem.

I see, thanks, sorry for my confusion. They do not sum to 1, I think because of how we are combining stratedies/models. In a given model, the probability across search locations sums to 1, but then when we combine there could be multiple strategies indicating the same location, or different strategies could indicate different locations. In either case, summing across models would be more than 1 then.

So it seems this is something to address in the way the strategies/models are first combined, yes? This is done in the R code rather than stan file.

Yeah, I’m not sure what the solution is, but probably the place to start is:

lps_ws[m] += categorical_lpmf(y[n] | modPred[n,m,1:]);

What does that term mean in the model and what is the data?

Does the model need to change or should there be a data transformation?

Stuff like that. Maybe check in with the person who wrote it. Maybe it’s an accident or something.