Hi Stan community,
TL;DR I have a model that performs two sets of regressions on data, then uses the slopes of one set of regressions in a regression against the slopes from the other set of regressions (the sets of slopes are paired by their species). The model retrodicts all of the slopes, but fails in all of the standard deviations, massively inflating them, and ultimately ruining the slope-by-slope regression’s ability to retrodict by keeping every posterior incredibly vague.
I’ve been writing and re-writing a model for the past few days, and I’ve finally got it to (a) converge, (b) not produce divergent transitions, and (c) run in under an hour (sometimes under a minute now!). I was pretty thrilled, until I realized that my most important effect is not reproduced in retrodiction.
Model
I am looking at a system, where a change in a variable C in a species produces a change in a second variable D in that species (think of a change in their size impacting how much they eat, or a change in leg length impacting how fast they can run), and I know that both C and D are evolving over time (e.g. they are changing size over time, or e.g. they are changing leg length over time). I don’t have paired C-D knowledge in any of the species that I have C or D data for (i.e. “this individual has legs this long, and runs this fast” for a sufficient number of individuals to make a within-species regression of the impact of C on D for that species), but I know that C and D are changing in various ways for various species. My solution is to look at a regression: how does a change in C correspond to a change in D (controlling for relatedness between species). So I do a linear regression between the C-by-year slope and the D-by-year slope.
The model performs two sets of regressions on data. Let’s call them C-by-year and D-by-year.
y_C is the year vector for C, y_D is the year vector for D, and the regressions are structured by-species, these look like
C_i\sim\textrm{normal}(\beta_{0,C,\textrm{species & site}_C[i]}+\beta_{1,C,\textrm{species}_C[i]}\cdot y_{C,\textrm{species}[i]},\ \sigma_{C,\textrm{species}_C[i]})
D_j\sim\textrm{normal}(\beta_{0,D,\textrm{species}_D[j]}+\beta_{1,D,\textrm{species}_D[j]}\cdot y_{D,\textrm{species}_D[j]},\ \sigma_{D,\textrm{species}_D[j]})
For i=1,\cdots,n_\textrm{C} and j=1,\cdots,n_\textrm{D}. It’s worth noting that C has a species-site intercept for \beta_{0,C} because there is geographical variation in C that we have to control for (e.g. lots of measurements at site 1—which has inherently larger values of C—one year, but very few in the following years, leading to an inaccurate decrease in C over time if we don’t control for this).
This is all pretty standard so far—just slopes and intercepts. We then do a regression of the rates of change of C and the rates of change of D. This is,
\beta_{1,D,k}\sim\textrm{normal}(B_0+B_1\cdot \beta_{1,D,k}, \Sigma)
Where k=1,\cdots,n_\textrm{species}. So far, \Sigma isn’t a covariance matrix to perform phylogenetic control, it’s just a standard deviation. (One day it will become a covariance matrix, and perhaps you’ll hear from me again when I’m fighting to get that to work!)
Hopefully this all lines up with the verbal model I described earlier: regression on C over time, regression on D over time, regression on the slopes of D by the slopes of C, paired by species. With the example from earlier, this would be “All species are getting longer legs (at various rates). We believe that longer legs make faster species. Are we seeing that the species with legs that are getting longer at the quickest rate are also getting faster at the quickest rate?” This is a slightly silly example, but hopefully it illustrates the model I’m trying to construct.
(Aside: A quick note is that I send in C and D measurements pre-scaled per species. That is, each measurement is in terms of within-species Z-scores, so that the regressions don’t have to hunt around to figure out the scale. I also send in y_C and y_D pre-scaled per species for the same reason. I’ve resolved some biasing and errors in the pairs plots by doing this.)
Data
I have tons of data: I have 30,000 measurements of C, and probably about 3000 measurements of D (I haven’t finished processing that dataset yet), 1200 species-sites, and 41 species. Realistically, there’s definitely enough data for clear, sharp posteriors on \beta_{1,C} and decently clear posteriors on \beta_{1,D}.
I’ve been simulating data (using Stan, code included later, dput below) by taking my real dataset for C, scraping away all of the measurements of C to get the base structure (n_C, n_D, number of species, species, species-site, and so on), giving it arbitrary values for \beta_{0,\star}, \beta_{1,C}, values for B_\star, \sigma_\star, and \Sigma. I then simulate the data, and get simulated measurements for C, D, etc. that follow the structure of the model.
The number of species in the real dataset, and thus the simulated dataset, was originally 41. I tried this and got terrible results and it was taking a long time, so I’ve filtered to just 5 species—I got identical issues, so that’s what I’ll show here. So for the actual simulated data you’ll see here, I have n_C=789, n_D=250, and the number of species-sites is 108 (and number of species is 5).
The following attachment [dput.txt] (47.2 KB) is the dput of a list which can be used in the data= argument of sampling in Stan of this 5-species simulated data.
The Issue
Since I’ve simulated the data, I know that the model and the data fit well together. I used sufficiently small values of \sigma_\star and \Sigma in my simulation that there should be good signal.
All of the \beta_{\star,\star} values are being perfectly retrodicted really well—for all but maybe 1 or 2, the mode of the posterior sits exactly on the true value used to generate the data. The \sigma_\star values, on the other hand, leave a lot to be desired.
The \sigma_\star values are highly inflated. Here’s the prior for \sigma_C (with the real value in vertical dashed line):
And here’s the posterior for \sigma_C (with the real value in dashed line again):
Prior for \sigma_D:
Posterior for \sigma_D:
I’m baffled, I have no clue what’s happening to cause this.
The impact (I believe these two problems are connected) is that the posteriors for the other parameters are still incredibly vague. Here’s the posterior for \beta_{1,C}:
And the posterior for \beta_{1,D}:
Notice that the posteriors for \sigma_D are relatively close to 0 for 3, 4, and 5, and that the posteriors for \beta_{1,D} are relatively sharp for 3, 4, and 5. So these issues feel connected.
Questions
How do I fix these inflated \sigma_{\star} estimates? Why are they happening? Are the vague posteriors in \beta_{\star,\star} connected tot he inflated \sigma_\star estimates?
As always, thank you so much for being an amazing community, I’m excited to see what you have to say about this conundrum!
Here’s all of the extra information you might need:
Model Code
I use 6000 iter with 3000 warmup and adapt_delta=0.999 to remove the last 5-30 divergences that sometimes pop up. I use seed=42.
data {
int nSpecies; // Number of species
int nSpeciesSites; // Number of species-site combinations
int nC; // Number of C measurements
vector[nC] scaled_C_measure; // Scaled C measurements
vector[nC] scaled_year_C; // Years of C measurements
array[nC] int<lower=1,upper=nSpecies> species_C; // Integer indicating species and site of each measurement
array[nC] int<lower=1,upper=nSpeciesSites> species_site_C; // Integer indicating species and site of each measurement
int nD; // Number of D measurements
vector[nD] scaled_D_measure; // D measurements
vector[nD] scaled_year_D; // Years of D measurements
array[nD] int<lower=1,upper=nSpecies> species_D; // Species of each segment
}
parameters {
real BETA_0;
real BETA_1;
real<lower=0.0001> SIGMA;
vector[nSpeciesSites] beta_0_C;
vector[nSpecies] beta_1_C;
vector<lower=0.0001>[nSpecies] sigma_C;
vector[nSpecies] beta_0_D;
vector[nSpecies] beta_1_D;
vector<lower=0.0001>[nSpecies] sigma_D;
}
model {
// Priors
BETA_0 ~ normal(0,2);
BETA_1 ~ normal(0,2);
SIGMA ~ exponential(1);
beta_0_C ~ normal(0,1);
beta_1_C ~ normal(0,1);
sigma_C ~ exponential(1);
beta_0_D ~ normal(0,1);
// beta_1_D absent because used in BETA regression
sigma_D ~ exponential(1);
// C regressions
vector[nC] C_mu = beta_0_C[species_site_C] + beta_1_C[species_C] .* scaled_year_C;
vector[nC] C_sd = sigma_C[species_C];
vector[nC] Z_scaled_C_measure = (scaled_C_measure - C_mu) ./ C_sd;
Z_scaled_C_measure ~ std_normal();
// D regressions
vector[nD] aco_mu = beta_0_D[species_D] + beta_1_D[species_D] .* scaled_year_D;
vector[nD] aco_sd = sigma_D[species_D];
vector[nD] Z_scaled_D_measure = (scaled_D_measure - aco_mu) ./ aco_sd;
Z_scaled_D_measure ~ std_normal();
// Final regression
vector[nSpecies] MU = BETA_0 + BETA_1 * beta_1_C;
vector[nSpecies] Z_beta_1_D = (beta_1_D - MU)/SIGMA;
Z_beta_1_D ~ std_normal();
}
Generate Data Code
I use seed=42
data {
int nSpecies; // Number of species
int nSpeciesSites; // Number of species-site combinations
int nC; // Number of C measurements
vector[nC] scaled_year_C; // Years of C measurements
array[nC] int<lower=1,upper=nSpecies> species_C; // Species of each segment
array[nC] int<lower=1,upper=nSpeciesSites> species_site_C; // Integer indicating species and site of each segment
vector[nSpeciesSites] beta_0_C;
vector[nSpecies] beta_1_C;
vector<lower=0.0001>[nSpecies] sigma_C;
int nD; // Number of D measurements
vector[nD] scaled_year_D; // Years of D measurements
array[nD] int<lower=1,upper=nSpecies> species_D; // Species of each segment
vector[nSpecies] beta_0_D;
vector<lower=0.0001>[nSpecies] sigma_D;
real BETA_0;
real BETA_1;
real<lower=0.0001> SIGMA;
}
generated quantities {
// Generate beta_1_D via the BETAs, SIGMA, and beta_1_C
vector[nSpecies] beta_1_D = to_vector(normal_rng(BETA_0 + BETA_1*beta_1_C, SIGMA));
// Generate scaled_C_measure from scaled_year_C, beta_*_C, sigma_C and species_site_intercept (plus the categorical species_C and species_site_C)
vector[nC] scaled_C_measure = to_vector(normal_rng(beta_0_C[species_site_C] + beta_1_C[species_C] .* scaled_year_C, sigma_C[species_C]));
// Generate scaled_D_measure from scaled_year_D, beta_*_D, and sigma_D (plus the categorical species_D)
vector[nD] scaled_D_measure = to_vector(normal_rng(beta_0_D[species_D] + beta_1_D[species_D] .* scaled_year_D, sigma_D[species_D]));
}
Prior Code
I use seed=42. (There are lots of errors when I run this, but I imagine it’s just because it’s sliding all over the “posterior” geometry. In any case, it produces the prior results I expect, so I haven’t been too worried. Do let me know if I’ve been doing this wrong, though!)
data {
int nSpecies; // Number of species
int nSpeciesSites; // Number of species-site combinations
}
parameters {
real BETA_0;
real BETA_1;
real<lower=0.0001> SIGMA;
vector[nSpeciesSites] beta_0_C;
vector[nSpecies] beta_1_C;
vector<lower=0.0001>[nSpecies] sigma_C;
vector[nSpecies] beta_0_D;
vector[nSpecies] beta_1_D;
vector<lower=0.0001>[nSpecies] sigma_D;
}
model {
// Priors
BETA_0 ~ normal(0,2);
BETA_1 ~ normal(0,2);
SIGMA ~ exponential(1);
beta_0_C ~ normal(0,1);
beta_1_C ~ normal(0,1);
sigma_C ~ exponential(1);
beta_0_D ~ normal(0,1);
sigma_D ~ exponential(1);
}





