Issue with model: Gradient evaluated at the initial value is not finite

Please share your Stan program and accompanying data if possible.


Hello all,

I have been trying to implement a hierarchical Dirichlet-Multinomial model for analyzing spatial variation in fish diet composition. My parameter of interest is the proportion of a given prey at a given site (theta in the model string). The softmax component in the hierarchical structure accounts for the influence of fish size on the probability of observing a prey. My model runs OK if I estimate the general diet composition across all locations. However, when I include the categorical covariate (site) in the model and do the necessary adaptations, I’m getting the following:

Rejecting initial value:
Chain 1: Gradient evaluated at the initial value is not finite.
Chain 1: Stan can’t start sampling from this initial value.
Chain 1:
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1: Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

I have included the model specification, as well as some code to generate dummy data:

data {

int N_obs;              // number of observations
int N_prey;             // number of categories
int site[N_obs] ;       // grouping factor
int N_sites;            // number of sites
int y[N_obs];           // outcomes
int TL[N_obs];

}

parameters {

simplex[N_prey] theta[N_sites];
vector[N_prey] a;
vector[N_prey] b;

}

transformed parameters {

vector[N_prey] alpha;
vector[N_prey] s;

for (i in 1:N_obs) { 
  for (k in 1:N_prey) {
  
    s[k] = a[k] + b[k]*TL[i]; 
    alpha = softmax(s); 

  }
}

}

model {

a ~ normal(0,0.001);
b ~ normal(0,0.001);

for (k in 1:N_prey) {

    theta[k] ~ dirichlet(alpha);

}

for (i in 1:N_obs) {

  y[i] ~ categorical(theta[site[i]]);

}

}

Simulating data and running the model

require(rstan)
require(rethinking)

N <-500
TL <-round(runif(N),1)*100
site <- c(rep(1,100), rep(2,100), rep(3,100), rep(4,100), rep(5,100))
b <-c(-0.5,0.4,1)
y <-rep(NA,N)
for (i in 1:N){
  score <-0.5*(1:3)+b*TL[i]
  p <-softmax(score[1],score[2],score[3])
  y[i] <-sample(1:3,size=1,prob=p)
}

stan_data <- list(N_obs = N, N_prey = length(unique(y)), y = y,
                  TL = TL, site = site, N_sites = length(unique(site)))

stan.fit <- rstan::stan(file = "DMM.stan", data = stan_data,
                 warmup = 100,iter = 500, chains = 2)

I’d very much appreciate some help on that! Thanks in advance.

Edited by @jsocolar for R syntax highlighting.

1 Like

Are the priors supposed to be this tight? Stan parametrizes with mean and sd, not mean and precision.

2 Likes

Thanks for noting that. I come from jags so I was using precision. The model still does not work when I put broader priors, though.

Can you share the code for the model that is working? For what it’s worth, the suspicious pair of lines for me are

for (k in 1:N_prey) {
    theta[k] ~ dirichlet(alpha);
}

which suggests that the kth element of theta represents a prey, and

y[i] ~ categorical(theta[site[i]]);

which suggests that elements of theta represent sites.

3 Likes

This part is a problem:

vector[N_prey] alpha;
vector[N_prey] s;

for (i in 1:N_obs) { 
  for (k in 1:N_prey) {
  
    s[k] = a[k] + b[k]*TL[i]; 
    alpha = softmax(s); 

  }
}

My guess is the gradients are breaking because you call softmax(s) inside the innermost loop, before s has been filled completely. You should only call softmax() after the loop has completed.

vector[N_prey] alpha;
vector[N_prey] s;

for (i in 1:N_obs) { 
  for (k in 1:N_prey) {
    s[k] = a[k] + b[k]*TL[i]; 
  }
  alpha = softmax(s); 
}

But even this is likely wrong: each iteration of the outer loop overwrites alpha. Was alpha supposed to be a N_{obs}\times N_{prey} matrix instead of a vector? (i.e, the same shape as theta)

Also I note that alpha is not the probability parameter for categorical distribution (that’s theta) but the rate parameter for dirichlet distribution, and probably should not be normalized. The sum-to-one constraint implied by softmax() makes the dirichlet very dispersed. I’d rather use alpha = exp(s);.

With only one observation per theta priors are going to be important.

4 Likes

Hey! Putting alpha outside the loop actually did solve the problem. It turns out it was a small, logical mistake. Now the model runs pretty nicely.

An additional question: why would exp() be preferrable over softmax() if I’m trying to scale a glm to probability scale?

Thanks very much!

Some additional information I figured out:

In fact, I should not use the softmax link for the “alpha” parameter of the Dirichlet distribution. According to this very good explanation of the parameters of the Dirichlet distribution linked below:

Assuming my understanding is correct, If I use the softmax link for “alpha”, it will restrict the rate parameters to more extreme values.

1 Like

The model running does not mean the results are correct. As originally written the model ignores all TLs except the last one. I doubt that’s what you meant to do.

That’s correct.
exp() is preferrable to softmax() because the input to dirichlet is not on a probability scale; the output is on probability scale.

1 Like

Thanks for your response. I’m unsure what exactly you mean by the model ignoring the continuous covariate, though.

To give a better explanation of what I want to do, my parameter of interest would be theta, and I would obtain one for each level of N_prey and N_site. At some level of the model, I want to incorporate the generalized linear model to account for the continuous covariate TL (fish size). What would the correct specification be?

Thanks again for your help!

But theta depends on alpha which in turn depends on TL and TL is per-observation, not per-site. The simulation code in the opening post does not use site at all.
So what does “site-specific diet composition” mean when the diet composition depends on the fish size and each site contains many fish of different sizes?

Assuming you want to predict the diet composition of a fish of some reference size, I think the model should be

data {
  int N_obs;              // number of observations
  int N_prey;             // number of categories
  int site[N_obs] ;       // grouping factor
  int N_sites;            // number of sites
  int y[N_obs];           // outcomes
  real TL[N_obs];
  real reference_TL;
}

transformed data {
  real sd_b = 5.0 / (max(TL) - min(TL));
}

parameters {
  simplex[N_prey] theta[N_obs];
  vector[N_prey] a[N_sites];
  vector[N_prey] b;
  vector[N_prey] mean_a;
}

model {
  // mean_a ~ implicit uniform
  for (i in 1:N_sites) {
    a[i] ~ normal(mean_a, 1);
  }
  b ~ normal(0, sd_b);
  for (i in 1:N_obs) {
    theta[i] ~ dirichlet(exp(a[site[i]] + b*(TL[i] - reference_TL)));
    y[i] ~ categorical(theta[i]);
  }
}

generated quantities {
  simplex[N_prey] diet_composition[N_sites];
  for (i in 1:N_sites) {
    diet_composition[i] = softmax(a[i]);
  }
}

Now the quantity of interest is diet_composition, and theta has become a fish-specific latent variable. Integrating out theta should improve sampling efficiency:

parameters {
  vector[N_prey] a[N_sites];
  vector[N_prey] b;
  vector[N_prey] mean_a;
}

model {
  // mean_a ~ implicit uniform
  for (i in 1:N_sites) {
    a[i] ~ normal(mean_a, 1);
  }
  b ~ normal(0, sd_b);
  for (i in 1:N_obs) {
    vector[N_prey] alpha = a[site[i]] + b*(TL[i] - reference_TL);
    target += alpha[y[i]] - log_sum_exp(alpha); // y[i] ~ dirichlet_categorical(exp(alpha))
  }
}
1 Like