Advice about linking models / two-stage modeling structure

Hi there, I’m looking for advice about the best way to link together, in Stan, two models that predict the same response. Specifically, for a given location, I’m trying to simultaneously model the total biomass of a population as well as the counts of the component members of the population by size category. I can model each separately without trouble, but would like to combine them so that information from the total model informs the count model, if possible. I think there must be a way to do this via either the ODE solvers or the algebra solver, but I’ve no experience with them and need some friendly advice to get me going down the right path.

Here is a simplified example of the total model:

// total model

data {
  int<lower=1> N;       // observations
  int<lower=1> P;       // predictors
  real<lower=0> bio[N]; // observed total biomass
  matrix[N, P] X;       // biomass predictors
}

parameters {
  real kappa0; // intercept
  vector[P] kappa; // biomass predictors
  real<lower=0> inv_beta; // rate
}

transformed parameters {
  vector<lower=0>[N] alpha; // shape for each observation
  real<lower=0> beta;       // common rate
  alpha = exp(kappa0 + X * kappa);
  beta = 1 / inv_beta;
}

model {    
  target += normal_lpdf(kappa0 | 0, 0.5); 
  target += normal_lpdf(kappa | 0, 0.1); 
  target += normal_lpdf(inv_beta | 0, 0.5);

  target += gamma_lpdf(bio | alpha, beta);
}

And here is a simplified example of the count Poisson hurdle model:

// count model

data {
  int<lower=1> N;                      // observations
  int<lower=1> n_sizes;                // number of size classes
  int<lower=1, upper=n_sizes> size[N]; // size class of each observation
  int<lower=0> y[N];                   // count in each size class
}

parameters {
  real<lower=0, upper=1> theta[n_sizes]; // probability of a zero in each size class
  real log_lambda[n_sizes];              // log Poisson rate for each size class
}

model {
  target += beta_lpdf(theta | 5, 1);
  target += normal_lpdf(log_lambda | 0, 1);
  
  for (n in 1:N) {
    if (y[n] == 0) {
      target += bernoulli_lpmf(1 | theta[size[n]]);
    } else {
      target += bernoulli_lpmf(0 | theta[size[n]]);
      target += poisson_log_lpmf(y[n] | log_lambda[size[n]]);
    }
  }
}


Note that I’ve tried using the biomass predictors directly in the count model with some success, but for various reasons the predictors work better on the total than on counts by size class. However, counts by size class are really the true prediction problem, which is why I’d like to link the two.

So with the two models I have two different kinds of predictions for total biomass at a given location:

\textrm{biomass}_{total} \sim \textrm{Gamma}(\alpha, \beta)
\textrm{biomass}_{total} = \sum_{s=1}^S \textrm{size}_s * \textrm{count}_s
\textrm{count}_s \sim \textrm{Poisson Hurdle}(\theta_s, \lambda_s)

I guess what I’d really like to do is to condition the count model such that, when summed over sizes, its totals are within a certain tolerance of the totals from the Gamma model. Any advice as to how to go about this would be appreciated.
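For what it’s worth, the vague picture in my head is a soft constraint inside a combined model block, something like this fragment (everything here is a placeholder: lambda[n] would be a vector of expected counts by size class at location n, sizes the vector of class sizes, and tol a tolerance to be chosen):

// model-block fragment of a hypothetical combined model
for (n in 1:N) {
  // total implied by the count side of the model
  real total_hat = dot_product(lambda[n], sizes);
  // pull it toward the Gamma-model mean alpha[n] / beta, within tolerance tol
  target += normal_lpdf(total_hat | alpha[n] / beta, tol);
}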


I think there’s an issue here that you have two models with different sets of parameters for one set of data (biomass). I think you could try to figure out which of those models is more likely correct, but I don’t think having two models is very advantageous here.

This is in comparison to the situation where you might have two measurements and two generating processes driven by a single set of parameters. In that case it makes sense to have only one model and do only one fit, but to bring both measurements together (cause in that case you use two sets of data to inform one set of parameters).

In this case we’re taking one set of data and informing a couple of sets of parameters. Does that make sense? I coulda missed something though, so if it doesn’t sound right lemme know.


Thanks for your help, @bbbales2!

Maybe I’ll try to clarify a little more, though it might not change your conclusion.

In this case, I don’t actually care that much about the total biomass (or only care tangentially). What I’m really interested in is counts of individuals of various size classes in different locations across a landscape. I can get a baseline posterior for each \lambda_s without an issue, but do less well predicting variation in counts across the landscape with the measurement data at hand. I am continuing to spend effort improving the basic count model, for sure.

In the meantime, however, I do have measurements that do a reasonable job of predicting variation in total biomass! So I was hoping to use the predictions from the total biomass model to inform and/or constrain the predictions from the count model.

The simplest way I can think of to approach it is to just add predicted total biomass as a predictor to the count model (sketched below). But because for any given location there is the direct functional dependence \textrm{Total} = \sum_s^{S}\textrm{count}_s*\textrm{size}_s, I was wondering whether there is a technique to more explicitly link and estimate both models simultaneously, in a way that conditions the count model appropriately on the information in the total model. If there is, I’d love to learn about it and try it out!
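By ‘add as a predictor’ I mean replacing the Poisson term in the count model’s loop with something like the following (hypothetical: log_bio_hat[n] would be the log predicted total biomass at the location of observation n, and gamma_bio a new coefficient):

target += poisson_log_lpmf(y[n] | log_lambda[size[n]] + gamma_bio * log_bio_hat[n]);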


I’m imagining we have some area and we’ve broken it up into a checkerboard pattern of cells.

So this represents the proportion of sizes you expect to find in each cell, regardless of position?

Do you get this number by only making measurements in some cells? Or have you made measurements in all cells and are pooling that information together?

By variation you mean that, cell to cell, your proportions are off? (The other way to interpret that would be that you’re unsatisfied with the variance of your predictions in a single cell)

And you can make measurements that look like:

y_i \sim N(\sum_j^J N_j s_j, \sigma)

Where y_i is the mass in the ith cell and N_j is the number of individuals of size s_j in each cell (there are J different sizes)? Apologies for changing notation, I just didn’t want to type \text{...} :D.

And the connection we’d like to exploit is that \lambda_j (beware my notation swap yaay!) should inform N_j?


Yes, exactly. Each cell has equal area so I’m omitting the exposure term at this point.

Yeah the response variable for the hurdle Poisson model is the number of individuals in a given size class j (to match your notation) within a cell of a given area. Not all size classes appear in all cells, hence the hurdle portion, each size class with its own \theta_j.

Only some cells have received measurements, and the fundamental problem is predicting counts by size class in unmeasured cells. I’m interested in introducing additional hierarchical structure eventually (by stratifying cells), but for now I am pooling counts by size class across all measured cells to get the posterior for each \lambda_j.

Indeed. The pooled estimates are fine, but adding predictors to the count model by itself doesn’t get me as much as I would like in terms of decreasing within-cell variance.

Exactly, with the caveat that y_i is obviously constrained to be positive (hence the gamma model in the original post), and also noting that I have predictors measured for each cell that can predict total biomass reasonably well in the gamma model.

And now that you write it like this I think I see that I could use the count model to predict the mean for the total model…

Given my current framework, the idea would be to reparameterize the gamma to get:

y_i \sim \textrm{Gamma}(\alpha_i, \beta_i) \\ \alpha_i/\beta_i = \sum_j^J N_{ij} s_{ij} \\ N_{ij} \sim \textrm{Poisson}(\lambda_{ij}) \\ \lambda_{ij} = X_{ij}\kappa_j

Is this reasonable? I haven’t had much luck specifying gamma models that way, but probably I’m just making some ignorant mistake - I can’t figure out how to reasonably decompose the mean into \alpha and \beta. I guess I could build a similar model with a Gaussian and see how I do, but it bothers me (possibly unreasonably - I come to learn!) to model positive-constrained responses like that, when \sigma can push predictions to be negative.
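For what it’s worth, one decomposition that at least seems natural is to keep a single shape parameter and set the rate from the mean, so that E[y_i] = \mu_i. A minimal sketch, treating the \mu_i as given rather than built from the count side:

data {
  int<lower=1> N;
  vector<lower=0>[N] y;  // observed totals
  vector<lower=0>[N] mu; // placeholder means, e.g. from the count side
}

parameters {
  real<lower=0> shape; // single gamma shape parameter
}

model {
  target += normal_lpdf(shape | 0, 1); // half-normal via the lower bound
  // rate = shape / mean, so E[y[i]] = mu[i] and Var[y[i]] = mu[i]^2 / shape
  for (i in 1:N)
    target += gamma_lpdf(y[i] | shape, shape / mu[i]);
}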

Thanks again for your help.

[Edited to fix some indexing]


Though thinking about it more, I guess I don’t really care that much about whether or not the total is modeled by a Gaussian if I’m just extracting the information about \lambda_j to make predictions…?


Cool beans.

The issue with that model is that N_{ij} is data, which makes it so that the top two lines and the bottom two lines of the model aren’t informing each other.

Does this mean that we basically expect the same number of individuals in every cell? And this is why we expect the total biomass numbers to be meaningful – if the total biomass is very small, then that implies that we must not have many of the large individuals?

Without assuming the same number of individuals in every cell, then I’m not sure total biomass is informative of what the distribution of individuals in each cell is (there’s an identifiability problem between lots of small things or a couple big ones).

Don’t worry, it’s much easier for me to recommend Gaussians and I will encourage you to do that even if it’s a bad idea. So stay skeptical :D. The Discourse Industrial Complex recommends normals for all ailments.

But in that theme, do you mind making a couple plots?

Like what does your data look like for a single cell? What do a few cells look like lumped together?

In the case that these two models stay separate, maybe there’s something we can do to make just the \lambda_{ij} part work better.


Hello!

I think I have an idea that can be adapted to your problem.

Imagine your cell i resulting from the weighted mean of your class sizes s_j, with \phi_{ij} representing the weight (between 0 and 1) of class j in cell i. The vector \phi_i is a simplex (\boldsymbol does not render, sorry).

y_{i} \sim Gamma(\sum_{k=1}^{K} \phi_{ij}s_{j}, \sigma)\\ \phi_{ij} = \frac{\exp(\eta_{ij})}{\sum_{j = 1}^{K} \exp(\eta_{ij})}\\ \eta_{ij} = X\beta_{j}

X could be the predictors that you like for biomass.

Does this seem interesting or helpful?

Have a good day!
Lucas


Unfortunately, the idea above does not deal with the potential zero inflation, but I guess it could be managed in a second step… It could be possible to provide information about when weights are zero.

To exploit the measurements, you could add the observed proportions p_{ij}, with \theta being the precision parameter of the Dirichlet:

y_{i} \sim Gamma(\sum_{k=1}^{K} \phi_{ij}s_{j}, \sigma)\\ p_{ij} \sim Dirichlet(\phi_{ij}\theta)\\ \phi_{ij} = \frac{\exp(\eta_{ij})}{\sum_{j = 1}^{K} \exp(\eta_{ij})}\\ \eta_{ij} = X\beta_{j}

Yeah, as I started to write up a sketch of the above I quickly realized this problem - how to represent something as both data and a transformed parameter.

You’ve described exactly the heart of my problem. The number of individuals changes in each cell, so there are multiple possible distributions of counts by size class that can in principle produce high or low total biomass. There is not a uniform probability of such combinations, though. The combinations would seem to be limited by the proportions of each size class present in the population, such that for a given population certain combinations of size-class counts are more or less likely to produce high- or low-biomass cells.

So I can build a reasonable model for the total biomass in a given cell, and I have information about the distribution of counts by size class in the population; the task then becomes how to build a model that can connect the two and exploit the structure I’ve described to produce reasonable predictions of counts by size class in new cells. I definitely see the identifiability issue, though - I am just hoping that the structure of a given population is enough to overcome it.

LOL

Sure! I’ll throw something together. Thanks again for your help.

Hello, and thanks for the help! Is this the model you are suggesting? I’m just trying to understand the indexing, and I think there was maybe some J/K mixup?

y_{i} \sim Gamma(\sum_{j=1}^{J} \phi_{ij}s_{j}, \sigma)\\ p_{ij} \sim Dirichlet(\phi_{ij}\theta)\\ \phi_{ij} = \frac{\exp(\eta_{ij})}{\sum_{j = 1}^{J} \exp(\eta_{ij})}\\ \eta_{ij} = X\beta_{j}
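In Stan I imagine the above as something like this sketch (names hypothetical; since Stan’s gamma is shape/rate, I’ve mean-parameterized it with a single shape parameter, which may not be what you intended by \sigma):

data {
  int<lower=1> N;       // cells
  int<lower=1> J;       // size classes
  int<lower=1> P;       // predictors
  vector<lower=0>[N] y; // total biomass per cell
  matrix[N, P] X;       // biomass predictors
  vector<lower=0>[J] s; // class sizes
  simplex[J] p[N];      // observed proportions per cell
}

parameters {
  matrix[P, J] beta;   // per-class coefficients
  real<lower=0> theta; // Dirichlet precision
  real<lower=0> shape; // gamma shape
}

model {
  target += normal_lpdf(to_vector(beta) | 0, 1);
  target += normal_lpdf(theta | 0, 5);
  target += normal_lpdf(shape | 0, 1);
  for (i in 1:N) {
    // softmax is invariant to adding a constant, so one column of beta
    // could be pinned to zero for identifiability
    vector[J] phi = softmax((X[i] * beta)');
    real mu = dot_product(phi, s); // weighted mean of sizes
    // note: Dirichlet support excludes exact zeros in p, so the zero
    // inflation issue you mentioned remains
    target += dirichlet_lpdf(p[i] | theta * phi);
    target += gamma_lpdf(y[i] | shape, shape / mu);
  }
}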

This is definitely interesting, and could be useful. Maybe I’m misinterpreting, but in this case, if \phi_i is a simplex representing proportional weightings for each size class, I’m not sure how I then extract information about the absolute number of individuals in s_j for a new cell. Don’t I just have information about the relative contribution of each s_j to the total? Please correct me if I am wrong about this.

If that is the correct interpretation, though, I guess for new cells I could produce predictions of relative contributions by size class, and then pair those with results from a second model that predicts the total and derive a solution (sketched below). I can try it out, but I guess I’m still hoping for a solution that can model both simultaneously (possibly too much to hope for) - or at least do a better job modeling counts as responses, as @bbbales2 has suggested.
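To spell out the ‘derive a solution’ part: assuming \hat{\phi}_{ij} are predicted proportions of individuals and \hat{y}_i is a predicted total biomass, \sum_k \hat{\phi}_{ik}s_k is the mean individual biomass, so the implied counts would be something like

\hat{n}_{ij} = \hat{\phi}_{ij} \, \hat{y}_i / \sum_{k=1}^{J} \hat{\phi}_{ik}s_k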


Hi!

Your interpretation is perfectly right! Sorry for mixing up the j and k, I copy-pasted from an old code and did not see that some k were remaining :p

I guess you could try to use the same approach with counts directly: the mean of the gamma describing biomass being equal to the sum of a given number of trees, each attributed an individual biomass.

The weights would not be the \phi simplex anymore, but the \lambda from a Poisson distribution. However, I am not sure how to write down the math for the gamma mean in that case (I am on my cell phone and it’s Sunday, which does not help :p).

I have an intuition though: I am wondering whether proportions might be easier to predict than direct counts per category. In fact, I guess the same total biomass might result from different combinations of total number of trees and allocation across classes. There may be smarter analyses that I don’t know of to deal with that, because I am not really versed in analyses dealing with category attribution.

Have a good day!
Lucas

EDIT: sorry, I did not see that you talked about the problem of dealing with the biomass and total number of individuals in another message!


Got it - thanks again for engaging.

No problem! Just wanted to be sure I was interpreting things correctly.

So if I understand correctly you are suggesting modifying the model I initially articulated above as something like:

y_i \sim Gamma(\sum_j^Jn_{ij}s_{j}, \sigma) \\ n_{ij} \sim \textrm{Poisson}(\lambda_{ij}) \\ \lambda_{ij} = \exp(\log(N_j) + X_i\kappa_j) \\ \kappa_j \sim \textrm{MN}(\kappa, \Omega)

where y_i is the total biomass in cell i, n_{ij} is the predicted count for class j in cell i, N_j is the population mean count for size class j from the data, X_i is the predictor matrix for cell i, and \kappa_j is the vector of predictor coefficients for size class j. This leaves aside the question of the zero inflation for now, but in principle looks promising, and I think avoids the split-model problem @bbbales2 identified in my initial formulation.

My current approach deals with proportions, but I could admittedly be more sophisticated about it. The model with the simplex could be useful in that regard, and it’s definitely something I will play around with.

[Edits to correct specification]

Hi!

What I had in mind was more:

y_i \sim Gamma(\sum_j^J\lambda_{ij}s_{j}, \sigma) \\ n_{ij} \sim \textrm{Poisson}(\lambda_{ij}) \\ \lambda_{ij} = \exp(\log(N_j) + X_i\kappa_j) \\ \kappa_j \sim \textrm{MN}(\kappa, \Omega)

Given that \lambda_{ij} is the mean count (a parameter, not data).
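In Stan, a rough sketch of this could look like the following (hypothetical names; the hierarchical prior on \kappa_j and the hurdle part are dropped for brevity). The key point is that the gamma mean is built from the \lambda_{ij}, which are parameters shared with the count likelihood, so both data sources inform the same \kappa_j:

data {
  int<lower=1> N;           // cells
  int<lower=1> J;           // size classes
  int<lower=1> P;           // predictors
  vector<lower=0>[N] y;     // total biomass per cell
  int<lower=0> n[N, J];     // counts by size class
  matrix[N, P] X;           // cell-level predictors
  vector<lower=0>[J] s;     // representative size per class
  vector<lower=0>[J] N_bar; // population mean count per class
}

parameters {
  matrix[P, J] kappa;  // coefficients per size class
  real<lower=0> shape; // gamma shape
}

model {
  target += normal_lpdf(to_vector(kappa) | 0, 0.5);
  target += normal_lpdf(shape | 0, 1);
  for (i in 1:N) {
    vector[J] lambda = exp(log(N_bar) + (X[i] * kappa)');
    real mu = dot_product(lambda, s);           // E[total biomass]
    target += poisson_lpmf(n[i] | lambda);      // counts inform lambda
    target += gamma_lpdf(y[i] | shape, shape / mu); // and so does biomass
  }
}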

Lucas
