Specifying a model with sum of latent variables

maria_belotti · March 3, 2023, 8:32pm

Hey all!

Problem Description

I have a dataset of measurements of biomass within an aggregation (of birds, but this doesn’t matter all that much) at multiple occasions of such aggregations throughout a season (a couple of months every year). I want to model the seasonal pattern of biomass change through time (measured in days) using an hierarchical Bayesian model.

The real biomass x_{ij} of each aggregation i will be a latent variable obtained from the measurement y_i (this conversion is done by using priors from a log-normal calibration model). I use j to index the days.

I need to obtain the daily sum, say X_{j} of real biomass (the latent x_{ij}), in order to fit a model (I’m going to fit the derivative of a logistic curve) to the daily sum of biomass and get the seasonal curve.

The first issue I’m facing is that I have a varying number of number of aggregations each day. On some days I might have one aggregation, and on others I might have ten.

The second issue is that my calibration model is a log-normal. I’m fitting a linear regression between \log(x_i) and \log(y_i). This means that I need to take the power before making the sum.

I tried to implement this using the code below (see end of post), but it returns the following error:

>  DIAGNOSTIC(S) FROM PARSER
> Info:
> Let-hand side of sampling statement may contain a non-linear 
> transform of a parameter or local variable. If it does, you need 
> to include a target += statement with the log of absolute 
> determinant of the Jacobian of the transform. 
> Left-hand-side of sampling statement:
> X[j] ~ normal(...)

The Questions

I think this has something to do with a change of variables, and needing correct it with a Jacobian. But is this the correct way to do model the sum of latent variables? Can anyone think of another solution to this? If I’m on the right track, would you have a suggestion of where I can learn more about how to implement the variable change? I tried reading the User Guide, but I might need something a little more in-depth.

The Code


data {
  
  int<lower=0> n_obs;
  int<lower=0> n_days;
  vector[n_obs] log_y;
  
 // Vector containing the julian day of each observation:
  int<lower=0> julian_day[n_obs];

 // This vector has the label. If the label is 1, it will use the calibration model to fix the 
 // measurements. If the label is 0, no correction is needed.
  int<lower=0> label[n_obs];
  
  real alpha0_prior_mean;
  real<lower=0> alpha0_prior_sd;
  real alpha1_prior_mean;
  real<lower=0> alpha1_prior_sd;
  real<lower=0> sigma_prior_a;
  real<lower=0> sigma_prior_b;

} // end data block

parameters {
  
 // Parameters for correction of measurements:
  real <lower=0> sigma_obs;
  real beta11;
  real beta12;

  real<lower=0> sigma_process;
  
 // Log-scale latent variable:
  vector[n_obs] log_x;
  
  // Logistic parameters:
  real L;
  real k;
  real mid_point;
  
} // end parameters block

transformed parameters{

 vector[n_obs] x;
 vector[n_days] X;
 
 // Transform latent variable to linear scale: 
 for(i in 1:n_obs){
   x[i] = 10^log_x[i];
 }

// This is my attempt at summing the biomasses on each day. 
// It doesn't work (see text).
 for(j in 1:n_days){
   for(i in 1:n_obs){
     if(julian_day[i] == j){
       
       X[j] = X[j] + x[i];
     
     }
   }
 }
}

model {
  
  matrix[2,2] betas;
  
 // Inherit priors from calibration model:
  sigma_obs ~ gamma(sigma_prior_a, sigma_prior_b);
  beta11 ~ normal(alpha0_prior_mean, alpha0_prior_sd);
  beta12 ~ normal(alpha1_prior_mean, alpha1_prior_sd);

  // Dummy prior for now, will change to something better at some point: 
  sigma_process ~ uniform(0.001, 100000);

  betas[1,1] = beta11;
  betas[1,2] = beta12;
  betas[2,1] = 0;
  betas[2,2] = 1;
  
  for(i in 1:n_obs){
    log_y[i] ~ normal(betas[label[i],]*log_x[i], sigma_obs);
  }
  
  L ~ normal(10^6, 10^4);
  mid_point ~ normal(75, 20);
  k ~ normal(1, 10);
  
  //Likelihood
  for(j in 1:n_days){
    X[j] ~ normal( L*k * exp(k*(julian_day[j] - mid_point)) / (1 + exp(k*(julian_day[j] - mid_point)))^2, sigma_process);
  }  
} // end model block

caesoma · March 10, 2023, 10:16am

This is not as much an error as a warning, it is saying that by sampling from a distribution for a nonlinear transform of a parameter implies sampling from a different distribution for the parameter itself, up to the (log absolute) determinant of the Jacobian. There’s more discussion about this here, here, and here, as well as in the Stan documentation.

If I understand correctly at each time point you have a variable number of observations, so one solution is to use implicit ragged data structures, since Stan doesn’t really support ragged arrays. Nevertheless, in your case since you are summing over biomass levels, an absence of observation is equivalent to zero, so you can also create a matrix-like, 2D-array and pad the absent observations with zeros, so you can just sum over the columns (assuming each day is a column, or rows if you have them transposed).

maria_belotti · March 16, 2023, 12:30pm

Hi!

Thank you so much for your suggestions! I really appreciate it! I think that the ragged data structure will fix some of my problems!

I’m going to look into the links about variable transformations. After posting this, I had second thoughts about using the lognormal in the calibration, so my model might change a bit.

Once I have a full working solution to the problems in this question, I’ll post it here for reference.

Topic		Replies	Views
Modeling observed sums of positive latent random variables Modeling techniques , specification , state-space	0	342	June 13, 2023
Building a multivariate nonlinear hierarchical model Modeling	0	418	June 2, 2019
Dependent variable: Can it be a derived quantity? General	4	766	June 23, 2022
Sum of log-normal distributions and Quadrature Integration Modeling	7	1526	July 21, 2017
Constraining the sum of parameter vector Modeling techniques	1	540	May 30, 2020

Specifying a model with sum of latent variables

Related topics