How to use unobserved counts as response variable in joint (or second) model

Hello - this is my first time posting on the Stan Forums. I am trying to write a model that jointly (1) estimates the Poisson rate for some observed count data, (2) uses that estimate to calculate some unobserved counts, and (3) models those unobserved counts with respect to some predictor variable.

Here is a simplified version of what I am working with: I have some data on the number of seeds per flower (y_2) and the number of flowers per plant (y_1) (note: y_2 is only measured for a subset of plants and flowers). These data are intended to be used together to represent the total number of seeds per plant (Y). I would then like to model Y using some predictor variable (x). However, considering that Y is not directly observed and the number of seeds per flower can be variable, I was thinking I could treat Y as an unobserved/latent variable in my model.

So far, I’ve been able to write a model that is getting close to what I want, but not fully there:

data {
  int<lower=0> n2; // number of observations for y2
  array[n2] int<lower=0> y2; // number of seeds per flower
  int<lower=0> n1; // number of observations for y1, x
  array[n1] int<lower=0> y1; // number of flowers
  vector[n1] x; // predictor
}

parameters {
  real<lower=0> log_lambda; // log Poisson rate
  real a; // intercept
  real b; // slope
}

transformed parameters {
  real<lower=0> lambda; // Poisson rate
  lambda = exp(log_lambda);
}

model {
  // model number of seeds per flower (y2)
  log_lambda ~ normal(2, 1);
  y2 ~ poisson(lambda);
  // model number of flowers (y1)
  // **I would like to model the total number of seeds per plant (Y) instead**
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  y1 ~ poisson(exp(a + b*x));
}

generated quantities {
  array[n1] int Y; // total number of seeds per plant
  for (i in 1:n1) {
    Y[i] = poisson_rng(lambda)*y1[i];
  }
}

I am estimating the Poisson rate \lambda for y_2 and the relationship between y_1 and x in the same model, then generating Y in the generated quantities block. However, for various reasons, I would like to model the relationship between Y (not y_1) and x. To do this, I think I need to make Y a parameter (essentially missing data) and have something like Y ~ poisson(lambda) in the model block, but I also understand that Stan does not accept integer parameters.

I was hoping someone could help me figure out what could be done here. My main goal of trying to write the model like this is to carry forward uncertainty in Y, as opposed to deterministically calculating Y as mean(y2)*y1 (especially since I don’t think this uncertainty will be the same across observations), so I’m open to other approaches to handling the data/model that would allow me to do this. Thank you!

1 Like

Minor correction: I realized that the model I shared above constrains log_lambda to <lower=0>, but this parameter should be unconstrained. This shouldn’t impact my issue/question though.

1 Like

Howdy! Cool model!
I don’t have an answer to your question about how to model Y with predictor x, but I was wondering if in the generated quantities block you wanted Y[i] = poisson_rng(lambda) * poisson_rng(exp(a + b*x[i])); instead of multiplying by data y1, so that you include the uncertainty in the number of flowers. In this way, x would also contribute to the total number of seeds per plant Y estimate.

1 Like

Two clarifying questions (really the same question stated in two ways) that I think will also answer @jd_c 's question:

  1. Do you care about predicting seed-set at new plants not in your dataset for which you do not have observations of y_1?
  2. When evaluating the performance of your model, do you want to evaluate it’s ability to predict y conditional on y_1, or do you want to evaluate its unconditional ability to predict y conditional only on covariates not including y_1?
1 Like

Thanks for the clarifying questions! I’m actually a little stumped on this (which might be part of the problem), but I will try my best to elaborate.

In an ideal world, I would have directly measured total seed set (Y) so that I could model Y conditional on just the covariates (x). So in that sense, I suppose what I would like to be able to do is predict seed set on any new plant. I take it that this might be where the suggestion from @jd_c is coming from, which I agree would make sense in the context of the generated quantities block (but unfortunately this still doesn’t let me model the unobserved Y for each case of data with respect to x).

But practically, I’m having to use the number of flowers (y_1) and the number seeds per flower (y_2) as a means to get to Y, which is really the response of interest. And since y_1 is directly observed, while I only have an average (and associated uncertainty) for the number of seeds per flower, I was hoping to take that uncertainty into account while modeling the relationship between Y and x. Your question is making me realize that the model I’m envisioning makes Y conditional on y_1, which I would say is an undesired side effect considering that Y would ideally be observed and modeled directly if possible. For some added context, y_1 is not necessarily comparable across different species (which I omitted from the model above for simplicity) due to their different biology, whereas I would say that seed set (Y) is a more comparable response across species.

Does this make sense? I’d be happy to clarify further if needed.

I’m not sure how to get exactly what you’re looking for, unless you want to do some imputation prior to fitting the model. The key would be to come up with a reasonable imputation model that is congruent with the important features of the data.

However, I wonder if you could get more out of modelling both number of flowers and number of seeds simultaneously. If I can take a (possibly very naïve) guess, because a) seeds are clustered within flowers and b) the number of seeds per flower is within some comparatively narrow range means that the distribution of total seeds would be very clumpy. A plant with 4 flowers might have 38-42 seeds while a plant with 5 flowers might have 48-52 seeds. Using a Poisson distribution to predict the total seeds would smooth over that clumpiness.

Obviously, if you model the two separately (but simultaneously), you wouldn’t end up with a single coefficient for the effect of x on total number of seeds. So such a model wouldn’t be nearly as simple to interpret. But you could still examine the relationship by integrating over x's effects on flower and seeds, accounting for the correlation between the two.

3 Likes

Would you mind sharing some details about what x is? I’m not a botanist, but it seems to me that the total number of seeds Y would always depend on the number of flowers y_1 and seeds per flower y_2, so how does x affect Y without going through y_1 or y_2?

For example, a generative process like this hypothetical example where the amount of phosphate fertilizer affects the number of flowers per plant but not seeds per flower. In any case, the total number of seeds per plant always depends on both the number of flowers per plant and number of seeds per flower, with phosphate only acting through either of those two. (again, disclaimer, I am not botanist/ag specialist, so just trying to help you think about your model)

n_plant <- 10
phosphate <- rnorm(n_plant, 0, 1)
a_lfp <- 3
b_lfp <- 0.75
lambda_flower_plant <- exp(a_lfp + b_lfp*phosphate)
n_flower_plant <- rpois(n=n_plant, lambda=lambda_flower_plant)
id_plant <- rep(1:n_plant, n_flower_plant)
lambda_seeds_flower <- 50
n_seeds_flower <- rpois(n=length(id_plant), lambda=lambda_seeds_flower)
n_tot_seeds_plant <- tapply(n_seeds_flower, id_plant, sum)

Edit - I thought of a scenario, where x could affect Y outside of y_1 or y_2 - the wind could carry the seeds away, leaving only observed seeds. In this case, you could add the following to the above generative process:

a_prb <- 1
b_prb <- -0.25
windspeed <- rnorm(n_plant, 0, 1)
prob <- plogis(a_prb + b_prb*windspeed)
n_tot_seeds_plant_observed <- rbinom(n_plant, size=n_tot_seeds_plant, p=prob)

For a model, this would require missing data imputation for n_tot_seeds_plant, the true number of total seeds, though.

2 Likes

Thanks @simonbrauer - these are both intriguing suggestions!

As for the imputation you speak of, do you mean doing it outside of the Stan model? I would be open to that especially if there were a way to carry forward the uncertainty in the imputation in the Stan model, but I wasn’t sure how to do that without doing the imputation inside the model (which is what got me stumped in the first place).

For your second suggestion, that sounds to me a lot like the model in my original post (especially if I were to incorporate @jd_c’s suggestion for the generated quantities block). Is that sort of what you were envisioning, or am I missing something here? Modeling y_1 as opposed to Y might necessitate fitting separate models for different species (given the issues noted in my last post), but definitely sounds like something worth pursuing.

Of course - x is a suite of soil characteristics (including moisture, texture, nutrients, etc.). So your example with P is actually spot on. Your generative example also makes a lot of sense to me.

It’s increasingly sounding like (in the absence of a straightforward way to have latent counts within my model) adjusting which response variable(s) I am trying to model might be a reasonable path forward. My initial instinct was to make total seed production(Y) the response because it is more of a common currency across species, but I suppose that it is still possible to model the number of flowers (or whatever reproductive unit is counted, y_1) for each species separately and later predict Y from the model.

So in that case, I think what you actually want to do is model y_1 and y_2, similar to what you have already done, and then generate predictions for Y in generated quantities. A “suite of soil characteristics” would affect Y only via y_1 and y_2, so you will want to use x as a predictor for one or both of those models. If you have multiple species then you will want to include an offset for each species (either no pooling or partial pooling, whatever makes sense; you may also want to include different slopes for each species for the affect of x on y_1 and y_2). If you want to make predictions for new plants, then you want to include both the y_1 and y_2 models in the generated quantities code for Y.

2 Likes

Great, thank you so much for the input! If I may ask a follow-up question about the multiple species part: I would almost certainly include a species-specific slope and intercept, but considering that y_1 can have a different meaning for each species (e.g. for one species it might be a direct count of seeds in which case y_2 = 1, while for another species it might be the number of flower heads, and so on), I would think I should avoid partial pooling. Does that sound reasonable? Or on the flip side, depending on whether I proceed with no vs. partial pooling, is it reasonable to include multiple species in the same model despite y_1 not being equivalent across species?

If I am understanding what you mean, that sounds reasonable to me. To use partial pooling you would want the same outcome… you wouldn’t want to mix number of actual seeds with number of seed heads that contain multiple seeds. It would be ok to use partial pooling for the case where you have the same outcome across species. Maybe giving a concrete example would help, but I think what you say makes sense.

What does your raw data look like? I realize that I’ve been assuming it was at the flower level, which you aggregated to the plant level. Something like the table below, where the first plant has three rows of data, one for each flower; the second plant has two rows… etc.

plant id flower id num seeds
1 1 56
1 2 34
1 3 42
2 1 60
2 2 MISSING

I basically have two sets of data, one for counts of some reproductive unit (such as flower; might vary by species) for each plant, and another for counts of the number of seeds per reproductive unit (taken from a subset of flowers from across all individuals). Here’s what they look like (omitting species and site for simplicity):

First data set:

plant id num flowers (y_1)
1 2
2 4
3 2
\vdots \vdots
n_1 y_{1, n_1}

Second data set:

plant id unit id num seeds (y_2)
1 1 42
5 1 20
5 2 36
8 1 53
\vdots \vdots \vdots
n_2 1 y_{2, n_2}

Technically it would be possible to join these two data sets by plant id, in which case there would be a vast number of missing y_2 relative to the number of observed y_2 (n_1 >> n_2, where n is the number of observations in each data set). I also have soil covariates x sampled from the site at which each individual plant was grown.

1 Like

This is what I was hoping to confirm, although this might pose a problem for me since I will need to include the site where the plants were grown as a random effect in the model.

That makes sense. I apologize, as I misunderstood a few details and what you had originally described.

So you have data on…

  • …the number of flowers on each plant
  • …the number of seeds on each flower, but only for a subset of flowers
  • …the total number of seeds on each plant, but only for those plants in which you know the number of seeds on every flower

The generated quantities approach is the imputation approach I was suggesting and that @jd_c clarified. I just didn’t put the pieces together. I think the original model is missing the fact that the seed data (y_2) is nested within the flower data (y_1). So the imputation process is ignoring the fact that you already know the number of seeds on n_2 flowers. A slight improvement would be to only impute the unknown counts. In the model below, we skip over the unknown seed counts when calculating the log-likelihood, but then generate those seed counts afterwards.

data {
  // Plant-level data
  int<lower = 0> n; // Number of plants
  array[n] int<lower = 0> y1; // Number of flowers per plant
  vector[n] x; // predictor

  // Flower-level data
  int<lower = n> m; // Number of flowers
  array[m] int<lower = 0> y2; // Number of seeds per flower
  array[m] int<lower = 1, upper = n> j; // which plant is each flower on?
  array[m] int<lower = 0, upper = 1> z; // Is seed count m observed?
}
parameters {
  real<lower=0> log_lambda; // log Poisson rate
  real a; // intercept
  real b; // slope
}
transformed parameters {
  real<lower=0> lambda; // Poisson rate
  lambda = exp(log_lambda);
}
model {
  // model number of seeds per flower (y2)
  log_lambda ~ normal(2, 1);
  for(i in 1:m){
    // Only model number of seeds per flower if truly observed
    if(z[i] == 1){
      y2[i] ~ poisson(lambda);
    }
  }

  // model number of flowers (y1)
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  y1 ~ poisson(exp(a + b*x));
}

generated quantities {
  vector[n] Y = rep_vector(0, n); // total number of seeds per plant

  for (i in 1:m) {
    // If we know the number of seeds on the flower, add to the total
    // If not, add a poisson-distributed value
    if(z[i] == 1){
      Y[j[i]] += y2[i];
    } else{
      Y[j[i]] += poisson_rng(lambda);
    }
  }
}

Note that in your original model, the predictor doesn’t predict the number of seeds (lambda is constant across flowers). So the only way that the predictors are affecting the total number of seeds per plant is through the number of flowers. Since you know the total number of flowers per plant (I think), the predictor isn’t actually influencing the imputations here.

1 Like

This is a great insight that I hadn’t fully considered. Thank you!