Hierachical categoríes on a finite mixture model

Hi, I am working on a model to estimate the proportion (Theta) of food types on freshwater insects using stable isotope data. I am interested in estimate the diet proportions for at least 10 different species for three types of streams. I have started with a finite mixture model. However, I am getting troubles with how I should write the code to estimate Theta for each species. Would someone give me any advice on how to code the species factor on my model?

sink("mod_exp5.stan")
cat("

    data {
    int N; // number of datapoints (bugs)
    vector[N] y_C; // d13C signature for bugs
    vector[N] y_N; // d15N signature for bugs
    int n_groups;  // number of food sources
    vector [n_groups] Mu_C; // d13C mean value of each source
    vector [n_groups] Mu_C_sd; // d13C sd value of each source
    vector [n_groups] Mu_N; // d15N mean value of each source
    vector [n_groups] Mu_N_sd; // d15N sd value of each source
    int  <lower=0> Taxa [N]; // Species of Bugs
    int  <lower=0> Taxa_no;  // number of species
    }

    parameters {
    ordered [n_groups] mu_C; // Estimation of mean d13C for the mix source food
    ordered [n_groups] mu_N; // Estimation of mean d15N for the mix source food
    
    vector [n_groups] Theta; // Proportion of food sources on diet.
    vector <lower=0> [Taxa_no] Delta_C; //  Trophic factor of enrichment for Carbon
    vector <lower=0> [Taxa_no] Delta_N; // Trophic factor of enrichment for Nitrogen
    vector <lower=0> [Taxa_no] L;  // trophic level
    
    vector<lower = 0> [n_groups] sigma_C; // Variation on the model for Carbon
    vector<lower = 0> [n_groups] sigma_N;  // Variation on the model for Nitrogen
      
    }

model {
  vector[n_groups] contributions;
  // priors
  for (k in 1: n_groups){
  mu_C[k] ~ normal( Mu_C[k], Mu_C_sd[k]); //values from field Data samples of sources
  }
  
  for (k in 1: n_groups){
  mu_N[k] ~ normal( Mu_N[k], Mu_N_sd[k]); //values from field Data samples of sources
  }
  
  Theta ~ dirichlet(rep_vector(1.0, n_groups)); // Uniform prior
  
  
  Delta_C ~ normal (0.39, 1.14); // from Post (2002)
  Delta_N ~ normal (3.4, 0.99); // from Post(2002)
  //L ~ uniform (1,10);
  
  sigma_C ~ normal(0, 100);
  sigma_N ~ normal(0, 100);
  
  // likelihood
  
  for(i in 1:N) {
    for(k in 1:n_groups, Taxa_no) {
      
      contributions[k] = log(Theta[k]) + normal_lpdf(y_C[i] | (mu_C[k]+(Delta_C[Taxa[i]]*L[Taxa[i]])), sigma_C[k]);
      
    }
    target += log_sum_exp(contributions);
    }
  
  for(i in 1:N) {
    for(k in 1:n_groups, Taxa_no) {
     
     contributions[k] = log(Theta[k]) + normal_lpdf(y_N[i] | (mu_N[k]+(Delta_N[Taxa[i]]*L[Taxa[i]])), sigma_N[k]);
     
    }
  target += log_sum_exp(contributions);
  }

  }

  
  "
,fill=TRUE)
sink()

What error are you getting? I assume first this should be:

simplex[n_groups] theta; // Proportion of food sources on diet.

since theta needs to sum to 1.0 to work for a mixture.

Thank you bbbales2.

I have implemented your suggestion before:

simplex[n_groups] Theta;

With it, I just can get the proportion for one specie, but I am looking to estimate Theta for 10 species at the same time.

If you have one mixture model now, you could do:

simplex[n_groups] Theta[S]; // S being the number of species

And then you could fit S different mixture models, but my feeling is I’m missing the problem.

Can you describe in more detail what you are trying to do? Is it S different mixture models over n_groups elements? Or is it something else?

I haven try that.

My goal is estimate how much leaf litter, biofilm and pure algae are eating more or less 10 species of aquatic insects on three types of streams. To estimate the what they are eating I am using stable isotope data. On the model I am trying to implement the following equations:

\delta y_{i, t, s} \sim Normal (\sum_{k,s} ^{t} P_{k, t, s} \times (\delta X_{k, s} + \Delta X_{t} * L_{t}), \sigma _{k,t} )

where \delta y_{i, t, s} is the signature for \delta ^{13}C or \delta ^{15}N of each insect species (t), at each stream type (s); \delta X_{k, s} is the stable isotope signature for the food sources (k) at each stream type (s); P_{k, t} are the proportions of each type of food source (k) for each species (t) at each stream type (s). \Delta X_t is a trophic enrichemnt factor for each stable isotope, and L_{t} is the trophic level for each species (t).

So, as I understand I should combine a hierarchical model and a finite mixture model to do it. I have to start with a model where I have just a stream and just a species. Now, I am trying to integrate on the code the estimations of P for all the species that I am working on. Then I will try to incorporate the last level of stream types.

I have code n_groups as the different elements on diet.
I have ran your suggestion of

simplex [n_groups] Theta [Taxa_no] // where n_groups are the elements of diet to get proportions and Taxa_no the different species for I am estimating Theta. However, I got  
an error message on the model:

```stan
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 

  vector[ ] ~ dirichlet(vector)

Available argument signatures for dirichlet:

  vector ~ dirichlet(vector)

Real return type required for probability function.
 error in 'model2ba03e2f34f8_mod_exp5' at line 44, column 47
  -------------------------------------------------
    42:   }
    43:   
    44:   Theta ~ dirichlet(rep_vector(1.0, n_groups)); // Uniform prior
                                                      ^
    45:   
  -------------------------------------------------

After following the advice of bbbales2,

The model is the following:


sink("mod_exp5.stan")
cat("

    data {
    int N; // number of datapoints (bugs)
    vector[N] y_C; // d13C signature for bugs
    vector[N] y_N; // d15N signature for bugs
    int n_groups;  // number of food sources
    vector [n_groups] Mu_C; // d13C mean value of each source
    vector [n_groups] Mu_C_sd; // d13C sd value of each source
    vector [n_groups] Mu_N; // d15N mean value of each source
    vector [n_groups] Mu_N_sd; // d15N sd value of each source
    int  <lower=0> Taxa [N]; // Species of Bugs
    int  <lower=0> Taxa_no;  // number of species
    }

    parameters {
    ordered [n_groups] mu_C; // Estimation of mean d13C for the mix source food
    ordered [n_groups] mu_N; // Estimation of mean d15N for the mix source food
    
    simplex [n_groups] Theta [Taxa_no]; // Proportion of food sources on diet.
    vector <lower=0> [Taxa_no] Delta_C; //  Trophic factor of enrichment for Carbon
    vector <lower=0> [Taxa_no] Delta_N; // Trophic factor of enrichment for Nitrogen
    vector <lower=0> [Taxa_no] L;  // trophic level
    
    vector<lower = 0> [n_groups] sigma_C ; // Variation on the model for Carbon
    vector<lower = 0> [n_groups] sigma_N ;  // Variation on the model for Nitrogen
      
    }

model {
  vector[n_groups] contributions;
  // priors
  for (k in 1: n_groups){
  mu_C[k] ~ normal( Mu_C[k], Mu_C_sd[k]); //values from field Data samples of sources
  }
  
  for (k in 1: n_groups){
  mu_N[k] ~ normal( Mu_N[k], Mu_N_sd[k]); //values from field Data samples of sources
  }
  
  Theta[Taxa_no] ~ dirichlet(rep_vector(1.0, n_groups)); // Uniform prior
  
  
  Delta_C ~ normal (0.39, 1.14); // from Post (2002)
  Delta_N ~ normal (3.4, 0.99); // from Post(2002)
  //L ~ uniform (1,10);
  
  sigma_C ~ normal(0, 100);
  sigma_N ~ normal(0, 100);
  
  // likelihood
  
  for(i in 1:N) {
         for (t in 1:Taxa_no){
                  for(k in 1:n_groups, Taxa_no) {
            contributions[t,k] = log(Theta[t,k]) + normal_lpdf(y_C[i] | (mu_C[k]+(Delta_C[t]*L[t], sigma_C[k]);
              }
          }
    target += log_sum_exp(contributions);
    }
  
  for(i in 1:N) {
        for (t in 1:Taxa_no){
              for(k in 1:n_groups) {
          contributions[t,k] = log(Theta[k]) + normal_lpdf(y_N[i] | (mu_N[k]+(Delta_N[t]*L[t])), sigma_N[k]);
                 }
           }
  target += log_sum_exp(contributions);
  }

  }

  
  "
,fill=TRUE)
sink()

However, it is not running. I am getting a syntax error of too many indexes on the likelihood part.

I don’t think the twiddle notation does very well with mixture models. Also I think the indices need a bit of work (\sum_{k, s}^t seems confusing – for measurement y_{i, t, s} sum to the t th species?).

Let’s also assume capital letters are the total number of things. So there are K food sources and T species and S stream types.

So let’s start with something like this:

p(y_{i, t, s} | X, L, \sigma) = \sum_k^K w_{k, t, s} p_k(y_{i, t, s} | X_k, L, \sigma_k)\\

In this case I am assuming each component p_k(y_{i, t, s} | X_k, L, \sigma_k) is a normal likelihood with mean X_{k, s} + \Delta X_t L_t and standard deviation \sigma_{k, t}. The mixture weights w_{k, t, s} are modeled hierarchically, and represent TS different simplexes of length K.

Anyway, start with that notation and push it in the direction you want it to go.

Thank you bbbales2,

I will follow your instructions. I am wondering, if w_{kt,s} is a matrix or a cov_matrix?

I’d represent it in Stan like:

simplex[K] w[T, S];

That is a 2d array (dimension T by S) of K-dimensional simplexes.

Anyway I do think it’s valuable to try to write down your model in equations. Once you have everything clear there then it’s easier to translate to code. If you get stuck ask away.