State-space model selection

Hi, I am struggling with model selection for state-space models coded in rstan.

In my state-space models, the states (longitudinal positions of 10 fish individuals in a river) are assumed to follow a Cauchy random walk. I use the inverse function method to express the Cauchy random walk, specifying the state using the following formula:
state[i] ~ Cauchy(state[i-1], s_w)

I want to examine how the shape parameter s_w varies according to survey month and individual fish. To do this, I implement the following regression equation in the model block:
s_w[m, k] ~ lognormal(a[k] + b_TL*TL[m], sigma)
where m is the survey month, and k is the individual.
a is the individual intercept and b_TL is the coefficient expressing the effect of TL.

a is specified by the following formula:
a[m] = a0 + a_sig * a_raw
a_raw ~ normal(0, 1) (non-centered reparameterization)

Generally, leave-one-out cross-validation (LOOCV) can be used for model comparison, and this can be done using loo::loo(). However, in my case, the data come from different individuals and different survey months, and I am wondering whether I can still compare models using the same method.
Or should I instead use leave-one-individual-out cross-validation in my case?
Additionally, is leave-one-individual-out cross-validation still valid when the number of survey months differs between individuals (e.g., individual A has data from 2 survey months, while individual B has data from 5 months)?

My model is something like follows:

data {
  int<lower=0> N; // Number of observations per individual per survey month
  int<lower=0> K; // Number of individuals
  int<lower=0> M; // Number of months
  array[N, K, M] real Y;  // Observed values
  matrix[K, M] zero_Dist; // Initial position for each individual on the first day of each survey month
  vector[M] TL; // Total length (TL) of fish per survey month
}

parameters {
  real<lower=0> s_v; // Observation error standard deviation
  matrix[K, M]<lower=0> s_w; // State error standard deviation
  matrix[K, M] state_zero_raw; // Standardized state deviation (unconstrained)
  array[N, K, M] real state_raw; // Used for inverse function method

  // Hierarchical model parameters
  real a0; // Intercept
  real<lower=0> a_sig;
  vector[K] a_raw;

  real b_TL;
  real<lower=0> sigma;
}

transformed parameters {
  array[N, K, M] real state; // Longitudinal position of fish
  
  for (m in 1:M) {
    for (k in 1:K) {
      // Initial value for the first day of each month
      state[1, k, m] = zero_Dist[k, m] + 3.04 * state_zero_raw[k, m];

      // For subsequent days, update state using state error
      for (n in 2:N) {
        state[n, k, m] = state[n - 1, k, m] + s_w[k, m] * tan(state_raw[n, k, m]); // Inverse function method
      }
    }
  }

  // 個体効果のreparametarization
  vector[K] a;
  a = a0 + a_sig * a_raw;
}

model {
  // Priors
  s_v ~ normal(3, 1); 
  state_zero_raw ~ normal(0, 1);
  a_raw ~ normal(0, 1);
  sigma ~ normal(0, 10);

  // Observation model
  for (m in 1:M) {
    for (k in 1:K) {
      for (n in 1:N) {
        Y[n, k, m] ~ normal(state[n, k, m], s_v);
      }
    }
  }

  // Regression model
  for (m in 1:M) {
    for (k in 1:K) {
      s_w[k, m] ~ lognormal(a[k] + b_TL * TL[m], sigma);
    }
  }  
}

Thank you!

1 Like

I would stop right there and rethink whether this model makes sense. Cauchy distributions have such wide tails they don’t even have means or variances. If you take a Cauchy(0, 1). For example, here’s a summary of 10K standard Cauchy draws:

> summary(rcauchy(1e4))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5720.499    -1.051    -0.008    -0.432     0.983  1329.406 

So while the central 50% interval is reasonable, the tails are huge and consistent with fish teleporting home in one step or jumping into other rivers. As you can imagine, this makes it hard to fit the location and shape parameter of a Cauchy.

The implementation

for (i in 2:N) { state[i] ~ Cauchy(state[i-1], s_w); }

can be vectorized to

state[2:N] ~ cauchy(state[1:N-1], s_w);

You need to group the data into conditional i.i.d. blocks. Then you can apply loo. That’s often not possible with time series or spatial data, which often has autoregressive structure. There’s a new package aimed at that situation:

And you can see the FAQ, which discusses in more detail how to divide the data if you have temporal or spatial structure:

2 Likes

@Bob_Carpenter already pointed to CV-FAQ. The relevant questions are 8 an 9. If you do use leave-one-individual-out you can also use any weighting you like to adjust the fact that individuals have different number of observations.