Gaussian Process with measurement uncertainty and a large number of observations

Hello all,

I am trying to fit a GP model where I cannot observe time directly: I only know that each event happened at some point between a start date and an end date. At this stage there is no way to know the duration of the event or when it most likely happened. For each time range I have an observed variable, and the goal is to obtain a measurement error for that value that accounts for the uncertainty in these observations. I also want to stratify by four different categories (animal species in this example) so that I get a curve for each.

I do not come from a mathematical/CS background (but from the Social Sciences), so I apologise in advance for the trivial question and for resorting to other automated forms of help before asking here, as I never received formal training in Statistics.

With a subset of the data the sampling works fine, and the results on a simulated dataset are not far off. However, with my real dataset (5000+ observations) the warmup does not even start after 48 h. Googling a bit, I found the Hilbert-space GP (HSGP) approximation as an alternative to decomposing the full covariance matrix (which might explain why it takes so long), but, as far as I can tell, it would not be straightforward to keep the measurement error in the predictor (time) propagated into the observed variable?

For convenience I have also uploaded the scripts for the simulation, which work because they use a small number of observations. If you prefer, I can also upload my real dataset (it comes from an open-access publication). I have also uploaded a screenshot of what the data look like in terms of time ranges and the values they represent.

Very sorry for the long post, but I really don't know what steps to take to fit this model (or another model with the same goal). I have also thought of something similar to spatial modelling, where points that are close in space with similar values are condensed into one, but I wouldn't know how to do that in Stan, nor how to keep the measurement errors that way.

I would appreciate any help or lead into how to manage a large dataset for GP, or alternatives, or anything I can try to study to solve this problem.

Thank you so much in advance!

Model Script

data {
  int<lower=1> N;                      // Total number of observations
  int<lower=1> N_taxa;                 // Number of taxa
  array[N] int<lower=1, upper=N_taxa> taxon_id;  // Taxon indicator for each obs
  array[N] real y;                     // Observed LSI values
  array[N] real start_date;            // Lower bound of date range
  array[N] real end_date;              // Upper bound of date range
  
  // Prediction grid
  int<lower=1> N_pred;                 // Number of prediction points
  array[N_pred] real x_pred;           // Prediction time points
}

transformed data {
  // Normalize time to [0, 1] for better numerical behavior
  real time_min = min(start_date);
  real time_max = max(end_date);
  real time_range = time_max - time_min;
  
  // Normalized start/end dates
  array[N] real start_norm;
  array[N] real end_norm;
  
  // Normalized prediction points
  array[N_pred] real x_pred_norm;
  
  for (n in 1:N) {
    start_norm[n] = (start_date[n] - time_min) / time_range;
    end_norm[n] = (end_date[n] - time_min) / time_range;
  }
  
  for (p in 1:N_pred) {
    x_pred_norm[p] = (x_pred[p] - time_min) / time_range;
  }
}

parameters {
  // Latent true dates (on unit interval, will be transformed)
  array[N] real<lower=0, upper=1> true_date_raw;
  
  // GP hyperparameters (one set per taxon)
  array[N_taxa] real<lower=0> alpha;   // Marginal standard deviation
  array[N_taxa] real<lower=0> rho;     // Length-scale
  
  // Observation noise
  real<lower=0> sigma;
  
  // Mean LSI for each taxon
  array[N_taxa] real mu;
}

transformed parameters {
  // Transform raw dates to actual date range [start, end]
  array[N] real true_date_norm;
  
  for (n in 1:N) {
    // Linear interpolation between normalized start and end dates
    true_date_norm[n] = start_norm[n] + true_date_raw[n] * (end_norm[n] - start_norm[n]);
  }
}

model {
  // Priors on GP hyperparameters
  alpha ~ normal(0, 0.5);              // Prior on marginal SD
  rho ~ inv_gamma(5, 1);               // Prior on length-scale (prevents too short)
  sigma ~ exponential(10);             // Prior on observation noise
  mu ~ normal(0, 0.5);                 // Prior on mean LSI
  
  // Uniform prior on true dates (implicit via bounds on true_date_raw)
  
  // Model each taxon with its own GP
  for (t in 1:N_taxa) {
    // Get indices for this taxon
    int n_t = 0;
    for (n in 1:N) {
      if (taxon_id[n] == t) n_t += 1;
    }
    
    // Skip if no observations for this taxon
    if (n_t > 0) {
      // Extract data for this taxon
      array[n_t] real x_t;
      vector[n_t] y_t;
      {
        int idx = 1;
        for (n in 1:N) {
          if (taxon_id[n] == t) {
            x_t[idx] = true_date_norm[n];
            y_t[idx] = y[n];
            idx += 1;
          }
        }
      }
      
      // Build covariance matrix using Stan's built-in function
      matrix[n_t, n_t] K = gp_exp_quad_cov(x_t, alpha[t], rho[t]);
      
      // Add jitter to diagonal for numerical stability
      for (i in 1:n_t) {
        K[i, i] += 1e-9;
      }
      
      // Add observation noise
      for (i in 1:n_t) {
        K[i, i] += square(sigma);
      }
      
      // Likelihood (multivariate normal)
      y_t ~ multi_normal(rep_vector(mu[t], n_t), K);
    }
  }
}

generated quantities {
  // Posterior predictions at grid points for each taxon
  array[N_taxa, N_pred] real f_pred;
  
  for (t in 1:N_taxa) {
    // Get data for this taxon
    int n_t = 0;
    for (n in 1:N) {
      if (taxon_id[n] == t) n_t += 1;
    }
    
    if (n_t > 0) {
      // Extract data
      array[n_t] real x_t;
      vector[n_t] y_t;
      {
        int idx = 1;
        for (n in 1:N) {
          if (taxon_id[n] == t) {
            x_t[idx] = true_date_norm[n];
            y_t[idx] = y[n];
            idx += 1;
          }
        }
      }
      
      // Covariance matrices for prediction
      matrix[n_t, n_t] K = gp_exp_quad_cov(x_t, alpha[t], rho[t]);
      
      // Add jitter for numerical stability
      for (i in 1:n_t) {
        K[i, i] += 1e-9;
      }
      
      for (i in 1:n_t) {
        K[i, i] += square(sigma);
      }
      
      // Cross-covariance between training and prediction points
      matrix[n_t, N_pred] K_x_xpred;
      for (i in 1:n_t) {
        for (j in 1:N_pred) {
          K_x_xpred[i, j] = square(alpha[t]) * 
            exp(-0.5 * square(x_t[i] - x_pred_norm[j]) / square(rho[t]));
        }
      }
      
      // Predictive covariance at prediction points
      matrix[N_pred, N_pred] K_pred = gp_exp_quad_cov(x_pred_norm, alpha[t], rho[t]);
      
      // Add jitter to K_pred for numerical stability
      for (i in 1:N_pred) {
        K_pred[i, i] += 1e-9;
      }
      
      // GP prediction (mean)
      matrix[n_t, n_t] L_K = cholesky_decompose(K);
      vector[n_t] y_centered = y_t - mu[t];
      vector[n_t] L_K_div_y = mdivide_left_tri_low(L_K, y_centered);
      vector[n_t] K_inv_y = mdivide_right_tri_low(L_K_div_y', L_K)';
      vector[N_pred] f_mean = mu[t] + K_x_xpred' * K_inv_y;
      
      // Predictive variance
      matrix[n_t, N_pred] v = mdivide_left_tri_low(L_K, K_x_xpred);
      matrix[N_pred, N_pred] f_cov = K_pred - v' * v;
      
      // Sample from predictive distribution
      f_pred[t] = to_array_1d(multi_normal_rng(f_mean, f_cov + diag_matrix(rep_vector(1e-9, N_pred))));
    } else {
      // No data for this taxon, sample from prior
      for (p in 1:N_pred) {
        f_pred[t, p] = normal_rng(0, alpha[t]);
      }
    }
  }
  
  // Output the actual true dates (back-transformed)
  array[N] real true_date_actual;
  for (n in 1:N) {
    true_date_actual[n] = time_min + true_date_norm[n] * time_range;
  }
}

01_Script_GP_Simulated.R (5.3 KB)
02_Visualise_Simulated_Data.R (5.1 KB)
03_Fit_and_Visualise.R (7.6 KB)

Okay, so just trying to restate your question to see if I understand it:

You have observations y that you know happened in some interval [t_i, t_f] (with the assumption that the latent true time t \sim \mathrm{Uniform}(t_i, t_f)). Your observational model is y \mid f, t, \sigma \sim \mathcal{N}(f(t), \sigma), where f has a Gaussian Process prior over it. You want to use this to infer the latent t?

If so, the two main computational issues here are

  1. The scale of your data: attempting to evaluate an exact GP at 5000 points is quite large and will take a long time under the best of conditions.
  2. You sample the latent t and then use the sampled values as the points at which to evaluate the GP. This is the core issue here: by resampling t at every iteration, you are changing the input points of the GP each time, which adds significant computational complexity and geometric challenges to your model.

I assume you are doing 2 because you want posterior inferences on that latent t? The right way to do this would be to marginalize out the latent t, fit the model, and then use the posterior over f to construct samples of t after the fact.

Even if you didn't have the latent t sampling, your input space is still quite large, so a basis-function approximation would help a good bit. The good news is that you can keep your uncertainty in t by applying the marginalization over the basis functions.

I think the combination of using something like HSGP and marginalizing over the latent t will be practical; then you should be able to use the posterior samples to generate valid posterior approximations of your t.
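To make the marginalization concrete, here is a minimal numpy sketch (not Stan, and with f and sigma held fixed purely for illustration) of the uniform-interval quadrature; in a Stan model the same average would be a `log_sum_exp` over quadrature nodes in the model block. The function name and node count are my own placeholders.

```python
import numpy as np

def loglik_marginal(y_n, t_lo, t_hi, f, sigma, n_quad=15):
    """Marginal log-likelihood of one observation whose true time is
    Uniform(t_lo, t_hi):  p(y) = average over t of N(y | f(t), sigma)."""
    t = np.linspace(t_lo, t_hi, n_quad)                 # quadrature nodes
    lp = (-0.5 * np.log(2 * np.pi * sigma**2)
          - 0.5 * ((y_n - f(t)) / sigma)**2)            # log N(y | f(t), sigma)
    m = lp.max()                                        # stable log-mean-exp
    return m + np.log(np.mean(np.exp(lp - m)))

# toy check: with a flat f the interval does not matter at all
ll = loglik_marginal(0.0, 0.0, 10.0, lambda t: np.zeros_like(t), sigma=1.0)
```

The posterior over each latent t can then be recovered afterwards by weighting the nodes by their likelihood under the fitted f draws.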

Exactly. I am interested in y of course, and the GP should in the end give me changes in y over time (per species), but I do want to infer the latent t and get a measurement error for it; for instance, I am interested in seeing what my uncertainty is at a given time point t_x.

I should probably scale my data as you suggest, but I wanted to keep the information about the sample size somehow (i.e., show larger CIs around periods in which I have less data).

Re: 2. This is what I was worried about. As I said, I am not a statistician by any means, so I apologise, but yes, that's the main reason: I wanted the posterior inference on the latent t. Can what you're saying be done in a single model, or would it be two models, one for the latent t and one that feeds the GP?

Thank you so much for your thorough response! I really appreciate it.

It is possible. Performance drops because the basis-function terms PHI for covariate values that are parameters must be recomputed at every leapfrog step, while with fixed covariate values PHI can be computed once in transformed data. In addition, adding as many unknowns as observations increases the sampling time. HSGP should still scale better than covariance-matrix-based GP computation.

My case study Gaussian process demonstration with Stan shows code for HSGP, and below are the parts I changed to test interval-observed x's. Just for demonstration purposes, I replaced each original x with the interval observation [x-5, x+5].

standata_gpbff_mis <- list(x_lower=mcycle$times-5,
                           x_upper=mcycle$times+5,
                           y=mcycle$accel,
                           N=length(mcycle$times),
                           c_f=1.5, # factor c of basis functions for GP for f1
                           M_f=40,  # number of basis functions for GP for f1
                           c_g=1.5, # factor c of basis functions for GP for g3
                           M_g=40)  # number of basis functions for GP for g3

model_gpbff_mis <- cmdstan_model(stan_file = "gpbff_mis.stan", include_paths = ".")

fit_gpbff_mis <- model_gpbff_mis$sample(data=standata_gpbff_mis,
                                iter_warmup=500, iter_sampling=500, refresh=100,
                                chains=4, parallel_chains=4, adapt_delta=0.9)
functions {
#include gpbasisfun_functions.stan
}
data {
  int<lower=1> N;      // number of observations
  vector[N] x_lower;   // lower bound of the covariate interval
  vector[N] x_upper;   // upper bound of the covariate interval
  vector[N] y;         // target variable
        
  real<lower=0> c_f;   // factor c to determine the boundary value L
  int<lower=1> M_f;    // number of basis functions
}
transformed data {
  // Normalize data
  real ymean = mean(y);
  real ysd = sd(y);
  vector[N] yn = (y - ymean)/ysd;
  // Normalize covariate
  real xmean = mean(append_row(x_lower, x_upper));
  real xsd = sd(append_row(x_lower, x_upper));
  // Scaling factor for basis functions
  real L_f = c_f*max((x_upper - xmean)/xsd);
}
parameters {
  real intercept_f;
  vector[M_f] beta_f;          // the basis functions coefficients
  real<lower=0> lengthscale_f; // lengthscale of f
  real<lower=0> sigma_f;       // scale of f
  real<lower=0> sigman;         // noise sigma
  vector<lower=x_lower,upper=x_upper>[N] x; // inferred x
}
transformed parameters {
  vector[N] xn = (x - xmean)/xsd;
  // Basis functions
  matrix[N,M_f] PHI_f = PHI(N, M_f, L_f, xn);
}
model {
  // spectral densities for f and g
  vector[M_f] diagSPD_f = diagSPD_EQ(sigma_f, lengthscale_f, L_f, M_f);
  // priors
  intercept_f ~ normal(0, 0.1);
  beta_f ~ normal(0, 1);
  lengthscale_f ~ normal(0, 1);
  sigma_f ~ normal(0, 1);
  sigman ~ normal(0, 1);
  // model
  yn ~ normal(intercept_f + PHI_f * (diagSPD_f .* beta_f), sigman);
}

I tested with quite small data (N=133), but I have used (non-GP) models with thousands of interval-observed x's, so I think 5000+ should work. In my example the function is highly non-linear and I've used 40 basis functions. If your function is not as wiggly, you can use fewer basis functions for additional speedup.
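As a sanity check on the basis-function approximation itself, here is a small numpy sketch mirroring the `diagSPD_EQ` and `PHI` helpers above (the hyperparameter values alpha, rho, L, M are arbitrary choices of mine) that compares the reduced-rank covariance Phi diag(SPD^2) Phi' against the exact exponentiated-quadratic kernel:

```python
import numpy as np

def diag_spd_eq(alpha, rho, L, M):
    # sqrt of the EQ-kernel spectral density at frequencies omega_j = pi*j/(2L)
    j = np.arange(1, M + 1)
    return (alpha * np.sqrt(np.sqrt(2 * np.pi) * rho)
            * np.exp(-0.25 * (rho * np.pi / (2 * L))**2 * j**2))

def phi(x, M, L):
    # sine eigenfunctions of the Laplacian on [-L, L]
    j = np.arange(1, M + 1)
    return np.sin(np.pi / (2 * L) * np.outer(x + L, j)) / np.sqrt(L)

alpha, rho, L, M = 1.0, 0.3, 1.5, 40
x = np.linspace(-1, 1, 50)                  # data lives inside [-L, L]
spd = diag_spd_eq(alpha, rho, L, M)
P = phi(x, M, L)
K_approx = (P * spd**2) @ P.T               # Phi diag(S) Phi'
d = x[:, None] - x[None, :]
K_exact = alpha**2 * np.exp(-0.5 * d**2 / rho**2)
err = np.abs(K_approx - K_exact).max()      # small for interior points
```

With the boundary factor c = 1.5 and M = 40 the elementwise error stays well below 1e-2; it grows if the data approach the boundary L or if M is too small for a short lengthscale.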


Thank you so much for this.

I am going to test this as soon as I understand everything going on in my code (and after I read your case study today). From what I gather from your code, the latent date (between x_lower and x_upper) is not back-transformed into years, but I can just do that in my generated quantities?

Yes, you can transform back to your desired scale. In addition, if you want to make predictions for certain time points, you can make these predictions with precisely known times.


It worked on a simulated dataset with the code below; I am now testing it on the large dataset. Hope to get back to you in not so many hours!

# Prediction grid (100 points across the time range)
time_min <- min(observed_data$START_DATE)
time_max <- max(observed_data$END_DATE)
x_pred <- seq(time_min, time_max, length.out = 100)

stan_data <- list(
  N          = as.integer(nrow(observed_data)),
  N_taxa     = as.integer(length(taxa)),
  taxon_id   = as.integer(taxon_lookup[observed_data$TAXON]),
  y          = as.numeric(observed_data$VALUE),
  start_date = as.numeric(observed_data$START_DATE),
  end_date   = as.numeric(observed_data$END_DATE),
  N_pred     = as.integer(length(x_pred)),
  x_pred     = as.numeric(x_pred),
  M          = 40L,
  c          = 1.5
)

# Compile model
model <- stan_model(here("Simulation_2", "stan", "lsi_gp_hsgp_v2.stan"))

# Fit the model
init_fn <- function() list(
  alpha = rep(0.2, length(taxa)),
  rho   = rep(0.3, length(taxa)),
  sigma = 0.1,
  mu    = rep(0, length(taxa)),
  z     = matrix(0, nrow = length(taxa), ncol = stan_data$M)
)

fit <- sampling(
  model,
  data    = stan_data,
  init    = init_fn,       # add this line
  chains  = 4,
  iter    = 4000,
  warmup  = 1000,
  seed    = 42,
  control = list(adapt_delta = 0.95, max_treedepth = 12),
  refresh = 100
)
functions {
  // Vehtari's vectorised helpers (from gpbasisfun_functions.stan)
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return alpha * sqrt(sqrt(2*pi()) * rho)
           * exp(-0.25 * (rho * pi() / 2 / L)^2
                 * linspaced_vector(M, 1, M)^2);
  }
  matrix PHI(int N, int M, real L, vector x) {
    return sin(diag_post_multiply(
             rep_matrix(pi() / (2*L) * (x + L), M),
             linspaced_vector(M, 1, M)
           )) / sqrt(L);
  }
}

data {
  int<lower=1> N;
  int<lower=1> N_taxa;
  array[N] int<lower=1, upper=N_taxa> taxon_id;
  array[N] real y;
  array[N] real start_date;
  array[N] real end_date;
  int<lower=1> N_pred;
  array[N_pred] real x_pred;
  int<lower=1> M;
  real<lower=1> c;
}

transformed data {
  real time_min   = min(start_date);
  real time_max   = max(end_date);
  real time_range = time_max - time_min;

  vector[N]      start_norm;
  vector[N]      end_norm;
  vector[N_pred] x_pred_norm;

  for (n in 1:N) {
    start_norm[n] = (start_date[n] - time_min) / time_range;
    end_norm[n]   = (end_date[n]   - time_min) / time_range;
  }
  for (p in 1:N_pred)
    x_pred_norm[p] = (x_pred[p] - time_min) / time_range;

  real L = c;

  // PHI at prediction grid computed once
  matrix[N_pred, M] PHI_pred = PHI(N_pred, M, L, x_pred_norm);
}

parameters {
  array[N] real<lower=0, upper=1> true_date_raw;
  array[N_taxa] real<lower=0> alpha;
  array[N_taxa] real<lower=0> rho;
  array[N_taxa] vector[M] z;       // non-centred: z ~ normal(0,1)
  real<lower=0> sigma;
  array[N_taxa] real mu;
}

transformed parameters {
  vector[N] true_date_norm;
  for (n in 1:N)
    true_date_norm[n] = start_norm[n]
                        + true_date_raw[n] * (end_norm[n] - start_norm[n]);

  // beta[t] = spd .* z[t]  (non-centred)
  array[N_taxa] vector[M] beta;
  for (t in 1:N_taxa)
    beta[t] = diagSPD_EQ(alpha[t], rho[t], L, M) .* z[t];
}

model {
  alpha ~ normal(0, 0.5);
  rho   ~ inv_gamma(5, 1);
  sigma ~ exponential(1);
  mu    ~ normal(0, 0.5);
  for (t in 1:N_taxa)
    z[t] ~ normal(0, 1);

  // PHI at the latent dates: recomputed at every model evaluation
  // because it depends on the sampled true_date_norm
  matrix[N, M] PHI_obs = PHI(N, M, L, true_date_norm);
  for (n in 1:N) {
    int t = taxon_id[n];
    y[n] ~ normal(mu[t] + PHI_obs[n] * beta[t], sigma);
  }
}

generated quantities {
  array[N_taxa, N_pred] real f_pred;
  for (t in 1:N_taxa) {
    vector[N_pred] f_mean = mu[t] + PHI_pred * beta[t];
    for (p in 1:N_pred)
      f_pred[t, p] = f_mean[p];
  }

  array[N] real true_date_actual;
  for (n in 1:N)
    true_date_actual[n] = time_min + true_date_norm[n] * time_range;
}


I ran the code and it works (much faster!). However, I am not convinced by certain things (surely because I am not very confident with GPs and do not fully understand them):

  1. I am not sure whether the GP is accounting for the variability of my response variable across time. The CIs look very narrow, but if you plot the samples on top of the GP (see below), you can see that the model should be more skeptical of the mean/trend around that area.

  2. I think the GP is doing what I want in the sense that if the value (let's say Y) is not so similar to the others around it (y-1, y+1, etc.), it 'fails' to assign a date that is not so flat. See the example below:

  3. Am I somehow accounting for the fact that I should trust samples with a very wide chronology less? E.g. a sample that could be dated anywhere within a 600-year window surely must be treated with more uncertainty?

  4. Is the CI flat because I should have assumed a varying sigma for each taxon (e.g. vector<lower=0>[N_taxa] sigma;) or a sigma varying across the entire chronological scale (no idea how to do that, however, even though it is possibly the most scientifically honest approach)? In a sense, I feel I failed to assume heteroscedasticity as in the link you provided.

P.S. For the wiggliness I assume I have to work more on the rho parameter.

I apologise about the million questions, really!

functions {
  // Vehtari's vectorised helpers (from gpbasisfun_functions.stan)
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return alpha * sqrt(sqrt(2*pi()) * rho)
           * exp(-0.25 * (rho * pi() / 2 / L)^2
                 * linspaced_vector(M, 1, M)^2);
  }
  matrix PHI(int N, int M, real L, vector x) {
    return sin(diag_post_multiply(
             rep_matrix(pi() / (2*L) * (x + L), M),
             linspaced_vector(M, 1, M)
           )) / sqrt(L);
  }
}

data {
  int<lower=1> N;
  int<lower=1> N_taxa;
  array[N] int<lower=1, upper=N_taxa> taxon_id;
  array[N] real y;
  array[N] real start_date;
  array[N] real end_date;
  int<lower=1> N_pred;
  array[N_pred] real x_pred;
  int<lower=1> M;
  real<lower=1> c;
}

transformed data {
  real time_min   = min(start_date);
  real time_max   = max(end_date);
  real time_range = time_max - time_min;

  vector[N]      start_norm;
  vector[N]      end_norm;
  vector[N_pred] x_pred_norm;

  for (n in 1:N) {
    start_norm[n] = (start_date[n] - time_min) / time_range;
    end_norm[n]   = (end_date[n]   - time_min) / time_range;
  }
  for (p in 1:N_pred)
    x_pred_norm[p] = (x_pred[p] - time_min) / time_range;

  real L = c;

  // PHI at prediction grid computed once
  matrix[N_pred, M] PHI_pred = PHI(N_pred, M, L, x_pred_norm);
}

parameters {
  array[N] real<lower=0, upper=1> true_date_raw;
  array[N_taxa] real<lower=0> alpha;
  array[N_taxa] real<lower=0> rho;
  array[N_taxa] vector[M] z;       // non-centred: z ~ normal(0,1)
  real<lower=0> sigma;
  array[N_taxa] real mu;
}

transformed parameters {
  vector[N] true_date_norm;
  for (n in 1:N)
    true_date_norm[n] = start_norm[n]
                        + true_date_raw[n] * (end_norm[n] - start_norm[n]);

  // beta[t] = spd .* z[t]  (non-centred)
  array[N_taxa] vector[M] beta;
  for (t in 1:N_taxa)
    beta[t] = diagSPD_EQ(alpha[t], rho[t], L, M) .* z[t];
}

model {
  alpha ~ normal(0, 0.5);
  rho   ~ inv_gamma(5, 2.5);
  sigma ~ exponential(1);
  mu    ~ normal(0, 0.5);
  for (t in 1:N_taxa)
    z[t] ~ normal(0, 1);

  // PHI at the latent dates: recomputed at every model evaluation
  // because it depends on the sampled true_date_norm
  matrix[N, M] PHI_obs = PHI(N, M, L, true_date_norm);
  for (n in 1:N) {
    int t = taxon_id[n];
    y[n] ~ normal(mu[t] + PHI_obs[n] * beta[t], sigma);
  }
}

generated quantities {
  array[N_taxa, N_pred] real f_pred;
  array[N_taxa, N_pred] real y_pred;   // added: posterior predictive with sigma noise
  for (t in 1:N_taxa) {
    vector[N_pred] f_mean = mu[t] + PHI_pred * beta[t];
    for (p in 1:N_pred) {
      f_pred[t, p] = f_mean[p];
      y_pred[t, p] = normal_rng(f_mean[p], sigma);  // add this
    }
  }
  array[N] real true_date_actual;
  for (n in 1:N)
    true_date_actual[n] = time_min + true_date_norm[n] * time_range;
}

I think they look good. You have quite many observations and the intervals are not that long.

Currently you are assuming they all have the same uncertainty in the LSI value, but longer intervals lead to smoother estimates.

Before implementing a more complex model, examine the residuals per taxon and in chronological scale to see if there are clear patterns.

I think the wiggliness for the Cattle is due to some clustering in the data, which I think is visible in the plot; that is, there are groups of lines that together are higher or lower. Do you know the reason for this clustering?

No need to apologise. The forum is meant for questions, and I'm volunteering to answer.

For the residuals: the model with the varying sigma is still running, though I should probably stop it, because I realize it is still assuming the same variance, just differing by animal. Also, I think I need to add something to the Stan code to examine residuals, as I don't think my generated quantities block helps with that. But I might be wrong?

But in the meanwhile, I tried some tests for the clustering reason. These values come from a dataset with these variables:

> colnames(real_data)
 [1] "City"                     "Archaeological.site.name"
 [3] "Site"                     "US"                      
 [5] "T.P.Q."                   "T.A.Q."                  
 [7] "meantpqtaq"               "Taxa"                    
 [9] "Bone"                     "Variable"                
[11] "Individual.number"        "Measurement"             
[13] "Dimension"                "LSI" 

where 'LSI' is my response; it is a standardized measure used to 'increase' the number of samples by providing a unified dimensional measure of ancient animal bones. I tried to see if (for Cattle) there was a particular cluster in the individual number or in the site of provenance, and I found that one particular unit within an archaeological site has 207 observations (for reference, the second highest has 59). These observations seem to have little variance (see graph below), so might that be the culprit?

Since you were able to make plot 1 showing the data and the GP curve, you can compute the residuals from the same quantities.
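For concreteness, one way to assemble those residuals outside Stan, a hedged numpy sketch assuming you have already extracted posterior means from the fit (`f_hat` for the fitted values mu[taxon] + PHI * beta, `t_hat` for the inferred dates; the names are placeholders):

```python
import numpy as np

def residual_table(y, f_hat, t_hat, taxon_id):
    """Residuals y - f_hat, sorted chronologically, keeping the taxon label
    so they can be plotted per taxon and against time."""
    y, f_hat, t_hat, taxon_id = map(np.asarray, (y, f_hat, t_hat, taxon_id))
    order = np.argsort(t_hat)                 # chronological order
    return {"t": t_hat[order],
            "resid": (y - f_hat)[order],
            "taxon": taxon_id[order]}

# tiny example with two observations
tab = residual_table([1.0, 0.5], [0.8, 0.6], [200.0, 100.0], [1, 2])
```

A systematic trend in these residuals over time, or within one taxon, is what would motivate the extra model structure discussed below.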

Probably. You could add additional intercept terms for the units, with a common population prior whose scale is a parameter (the usual hierarchical intercept model). This would then explain the variation between units.

I approximated at the midpoint (the graph was really unreadable). Probably this doesn’t look that good.

Regarding the hierarchical model, I have ~1013 units coming from different sites. I am afraid this would blow up the fit time; do you think it is feasible? In the meanwhile I was trying the hierarchical sigma by taxon, and it already didn't converge. Although, as I said earlier, I don't think this fixes my problem that the variance is not fixed across the plot. Ideally my credible interval would take into account:

  • Sample size at time point X_i
  • If the value Y_i comes from a very large time range
  • The chronological variance (if certain periods have a less stable Y, that's interesting to me). Although to fix this I think I do need, as you suggest, a grouping of some sort: either at the site level (which would contain the unit in the plot in the previous message, but then maybe you lose that variability) or at the unit level itself.

So the code would be now:

observed_data <- real_data %>%
  transmute(
    SAMPLE     = row_number(),
    TAXON      = Taxa,
    VALUE      = LSI,
    START_DATE = T.P.Q.,
    END_DATE   = T.A.Q.,
    US_unique  = paste(Archaeological.site.name, US, sep = "__")
  ) %>%
  filter(!is.na(VALUE), !is.na(START_DATE), !is.na(END_DATE))

# --- Stan data ---
taxa <- unique(observed_data$TAXON)
taxon_lookup <- setNames(seq_along(taxa), taxa)

us_levels <- unique(observed_data$US_unique)
us_lookup <- setNames(seq_along(us_levels), us_levels)

time_min <- min(observed_data$START_DATE)
time_max <- max(observed_data$END_DATE)
x_pred   <- seq(time_min, time_max, length.out = 100)

stan_data <- list(
  N          = as.integer(nrow(observed_data)),
  N_taxa     = as.integer(length(taxa)),
  taxon_id   = as.integer(taxon_lookup[observed_data$TAXON]),
  y          = as.numeric(observed_data$VALUE),
  start_date = as.numeric(observed_data$START_DATE),
  end_date   = as.numeric(observed_data$END_DATE),
  N_pred     = as.integer(length(x_pred)),
  x_pred     = as.numeric(x_pred),
  M          = 40L,
  c          = 1.5,
  N_us       = as.integer(length(us_levels)),
  us_id      = as.integer(us_lookup[observed_data$US_unique])
)

# Compile + Fit
model <- stan_model(here("Real_Data", "stan", "lsi_gp_hsgp_v4.stan"))

init_fn <- function() list(
  alpha        = rep(0.2, length(taxa)),
  rho          = rep(0.3, length(taxa)),
  sigma        = 0.1,
  mu           = rep(0, length(taxa)),
  z            = matrix(0, nrow = length(taxa), ncol = stan_data$M),
  gamma_us_raw = rep(0, stan_data$N_us),
  sigma_us     = 0.05
)

fit <- sampling(
  model,
  data    = stan_data,
  init    = init_fn,
  chains  = 4,
  iter    = 2000,
  warmup  = 1000,
  seed    = 42,
  control = list(adapt_delta = 0.95, max_treedepth = 12),
  refresh = 100
)
functions {
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return alpha * sqrt(sqrt(2*pi()) * rho)
    * exp(-0.25 * (rho * pi() / 2 / L)^2
          * linspaced_vector(M, 1, M)^2);
  }
  matrix PHI(int N, int M, real L, vector x) {
    return sin(diag_post_multiply(
      rep_matrix(pi() / (2*L) * (x + L), M),
      linspaced_vector(M, 1, M)
    )) / sqrt(L);
  }
}
data {
  int<lower=1> N;
  int<lower=1> N_taxa;
  array[N] int<lower=1, upper=N_taxa> taxon_id;
  array[N] real y;
  array[N] real start_date;
  array[N] real end_date;
  int<lower=1> N_pred;
  array[N_pred] real x_pred;
  int<lower=1> M;
  real<lower=1> c;
  // --- NEW ---
  int<lower=1> N_us;
  array[N] int<lower=1, upper=N_us> us_id;
}
transformed data {
  real time_min   = min(start_date);
  real time_max   = max(end_date);
  real time_range = time_max - time_min;
  vector[N]      start_norm;
  vector[N]      end_norm;
  vector[N_pred] x_pred_norm;
  for (n in 1:N) {
    start_norm[n] = (start_date[n] - time_min) / time_range;
    end_norm[n]   = (end_date[n]   - time_min) / time_range;
  }
  for (p in 1:N_pred)
    x_pred_norm[p] = (x_pred[p] - time_min) / time_range;
  real L = c;
  matrix[N_pred, M] PHI_pred = PHI(N_pred, M, L, x_pred_norm);
}
parameters {
  array[N] real<lower=0, upper=1> true_date_raw;
  array[N_taxa] real<lower=0> alpha;
  array[N_taxa] real<lower=0> rho;
  array[N_taxa] vector[M] z;
  real<lower=0> sigma;
  array[N_taxa] real mu;
  // Added the non-centred hierarchical intercept
  vector[N_us] gamma_us_raw;
  real<lower=0> sigma_us; // between-context SD: how much LSI varies across stratigraphic units
}
transformed parameters {
  vector[N] true_date_norm;
  for (n in 1:N)
    true_date_norm[n] = start_norm[n]
                        + true_date_raw[n] * (end_norm[n] - start_norm[n]);
  array[N_taxa] vector[M] beta;
  for (t in 1:N_taxa)
    beta[t] = diagSPD_EQ(alpha[t], rho[t], L, M) .* z[t];
  // Non-centred hierarchical intercept per stratigraphic unit:
  // gamma_us_raw ~ N(0,1), so gamma_us ~ N(0, sigma_us)
  vector[N_us] gamma_us = sigma_us * gamma_us_raw;
}
model {
  alpha ~ normal(0, 0.5);
  rho   ~ inv_gamma(5, 2.5);
  sigma ~ exponential(1);
  mu    ~ normal(0, 0.5);
  for (t in 1:N_taxa)
    z[t] ~ normal(0, 1);
  // New parameters for hierarchical intercepts
  sigma_us ~ exponential(1);       // weakly informative prior on the between-unit SD
  gamma_us_raw ~ std_normal();     // implies gamma_us ~ normal(0, sigma_us)
  
  matrix[N, M] PHI_obs = PHI(N, M, L, true_date_norm);
  for (n in 1:N) {
    int t = taxon_id[n];
    y[n] ~ normal(mu[t] + PHI_obs[n] * beta[t] + gamma_us[us_id[n]], sigma);  // <- added gamma_us
  }
}
generated quantities {
  array[N_taxa, N_pred] real f_pred;
  array[N_taxa, N_pred] real y_pred;
  for (t in 1:N_taxa) {
    vector[N_pred] f_mean = mu[t] + PHI_pred * beta[t];
    for (p in 1:N_pred) {
      f_pred[t, p] = f_mean[p];
      // OLD: y_pred[t, p] = normal_rng(f_mean[p], sigma);
      // New: sigma_us added to the predictive uncertainty to get better PPCs
      y_pred[t, p] = normal_rng(f_mean[p], sqrt(sigma^2 + sigma_us^2));
    }
  }
  array[N] real true_date_actual;
  for (n in 1:N)
    true_date_actual[n] = time_min + true_date_norm[n] * time_range;
}

Let’s see how long it takes (and if it converges). Thank you!!

I don't think you need the hierarchical sigma, just the hierarchical intercept, which should have an easier posterior.

That doesn't sound big, but it is sometimes surprising how the computation scales. You may also need to test whether the centered or non-centered parameterization is better for the taxon and stratigraphic-unit effects. The model would probably also benefit from sum-to-zero constraints for mu and gamma_us (see The Sum-to-Zero Constraint in Stan). Centering the data beforehand is likely to help, too.
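On why the sum-to-zero constraint helps: with a global mu plus unit intercepts, the likelihood cannot distinguish shifting mu up from shifting all gamma_us down by the same amount, which leaves a flat, redundant direction in the posterior. A tiny numpy demonstration of that redundancy (the values are arbitrary, and mu is a single scalar here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3                                     # global intercept
gamma = rng.normal(0, 0.1, size=8)           # unit effects
us_id = rng.integers(0, 8, size=50)          # unit membership of 50 obs
eta1 = mu + gamma[us_id]                     # linear predictor

c = 0.7                                      # arbitrary shift
eta2 = (mu + c) + (gamma - c)[us_id]         # shifted parameterization
same = np.allclose(eta1, eta2)               # predictions are identical

# centering (a soft sum-to-zero) removes the redundant direction:
gamma_c = gamma - gamma.mean()
```

In Stan the hard version of this is a sum-to-zero constraint on the gamma_us vector, as discussed in the linked post.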

I think you are getting these.

I have tried both the hierarchical sigma (on a simulation; still running) and the hierarchical intercept as you suggested, and the latter got rid of the wiggliness immediately (although the computation wasn't the fastest). I am currently trying a third option, which sounds risky because I don't know if I fully understand it from a theoretical standpoint, but correct me if I am wrong:
by pooling everything by stratigraphic unit, does whatever is in that stratigraphic unit learn from the observed values of the other rows? For instance, if in that stratigraphic unit there are goat bones and cattle bones, would they inform each other (as in partial pooling)?

Re: this

I think you are getting these.

The second model I am trying is a heteroscedastic one, because the GP CIs look awfully small. I can see there is a lot of variability, but from this I cannot really tell whether at time X_i my measurements for animal K are more variable than in an earlier period. This is interesting to me because it would tell me when certain practices were changing historically.

Regarding the varying sigma, what I am trying is a second HSGP on the log scale (for sigma).
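For what it's worth, this is the same structure as the g function in the case-study code quoted earlier (the c_g/M_g settings): a second basis-function GP whose exponential gives a positive, time-varying sigma. A generative numpy sketch under assumed hyperparameters (all values here are placeholders):

```python
import numpy as np

def hsgp_draw(x, alpha, rho, L, M, rng):
    """One random function from the HSGP prior at points x
    (same diagSPD_EQ / PHI construction as in the Stan code above)."""
    j = np.arange(1, M + 1)
    spd = (alpha * np.sqrt(np.sqrt(2 * np.pi) * rho)
           * np.exp(-0.25 * (rho * np.pi / (2 * L))**2 * j**2))
    P = np.sin(np.pi / (2 * L) * np.outer(x + L, j)) / np.sqrt(L)
    return P @ (spd * rng.normal(size=M))    # Phi * (spd .* beta)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
f = hsgp_draw(x, 0.5, 0.3, 1.5, 30, rng)     # mean function
g = hsgp_draw(x, 0.3, 0.5, 1.5, 30, rng)     # log noise function
sigma_t = np.exp(-2.0 + g)                   # time-varying sigma, always > 0
y = f + sigma_t * rng.normal(size=x.size)    # heteroscedastic observations
```

In the Stan model this would mean a second set of basis coefficients and `yn ~ normal(f, exp(intercept_g + PHI * (diagSPD_g .* beta_g)))`, which, as noted below, can have a difficult posterior.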

Apologies if my answer is very confusing (it shows a bit how confused I am about this; I know what I ideally want, it's just difficult to get right). But I am definitely going to try your suggestion tomorrow (the sum-to-zero constraint).

Again, thank you, this is really really helpful to me!

I’m not sure I understand. Can you clarify?

Considering how much data you have, I think it’s not awfully small.

Note that it will probably have a difficult posterior.