Hello,
I am working on a Bayesian hierarchical model in Stan to predict the shell brightness of snail species based on several factors: habitat type, vegetation zone, island area, and island age. Additionally, I want to include a phylogenetic random effect to account for shared evolutionary history among species.
My dataset has some species for which I lack complete data on habitat, vegetation, or brightness. The phylogenetic information should help improve predictions for these species by leveraging the phylogenetic covariance matrix, which represents the relatedness among species.
I am not sure if the approach I am using to add this phylogenetic information is correct or not. I havent found examples and perhaps I am missing common methods that are used to do this.
Hereâs the core of my Stan model:
-
Microhabitat and Vegetation Zone Models: I model the probability of a species being in specific habitats (e.g., arboreal or arid) using binomial logistic regression with hierarchical priors.
-
Brightness Model: Brightness is modeled as a log-normal variable, with species-level and island-level random effects. I also include a phylogenetic random effect that uses a multivariate normal distribution with a covariance structure informed by the phylogenetic covariance matrix.
-
Phylogenetic Information: The phylogenetic effect (
sp_phylo_effect
) is included in the model as a random effect with a covariance structure based on the phylogenetic covariance matrix (phylo_cov_matrix
).
\text{true_ln_bright}_i = \mu_{\ln(\text{bright})} + \beta_{\text{species}, i} + \beta_{\text{phylo}, i} + \beta_{\text{island}, \text{index}(i)} + \gamma_{\text{arbor}} \cdot \text{arboreal_prob}_i +
\gamma_{\text{arid}} \cdot \text{arid_prob}_i + \gamma_{\text{age}} \cdot \text{island_age}_i + \gamma_{\text{area}} \cdot \ln(\text{island_area}_i)
Parameters:
- l true_ln_bright,iâ: represents the modeled log-transformed brightness for each species i.
- Îźln(bright)â: Overall mean log brightness.
- βspecies,iâ: Random effect for each species.
*βphylo,iâ: Phylogenetic random effect, modeled as a multivariate normal with a covariance matrix proportional to phylo_cov_matrix. - βisland,index(i)â: Random effect for islands.
- Îłarborâ, Îłaridâ, Îłageâ, Îłareaâ: Coefficients for habitat, vegetation, island age, and island area.
Inclusion of Phylogenetic Information
The phylogenetic information is incorporated into the model using a random effect that accounts for the shared evolutionary history among species. This is done by adding a phylogenetic random effect (sp_phylo_effect
) for each species, which follows a multivariate normal distribution with a covariance structure derived from a phylogenetic covariance matrix.
Detailed Explanation:
- Phylogenetic Covariance Matrix (
phylo_cov_matrix
):
- This matrix captures the phylogenetic relationships among species. Each element (i,j) of the matrix represents the expected covariance between the traits (in this case, brightness) of species i and species j due to their shared evolutionary history.
- The matrix is computed from a phylogenetic tree, where the distance between nodes reflects the evolutionary distance between species. The closer two species are on the tree, the higher their covariance.
- Phylogenetic Random Effect (
sp_phylo_effect
):
- A vector of random effects, βphylo,iâ, is defined for each species i. This vector is modeled to follow a multivariate normal distribution:
\beta_{\text{phylo}} \sim \text{MVN}\left(0, \sigma_{\text{phylo}}^2 \cdot \text{phylo_cov_matrix}\right)
where:- βphyloâ is the vector of phylogenetic random effects for all species.
- MVN(0,Ďphylo2â phylo_cov_matrix) denotes a multivariate normal distribution with a mean vector of 0 and a covariance matrix given by the phylogenetic covariance matrix scaled by Ďphylo2.
- Ďphyloâ is the scaling factor (standard deviation) for the phylogenetic effect, controlling the overall magnitude of the phylogenetic influence on brightness.
Is using a multivariate normal distribution with a covariance matrix derived from the phylogenetic tree the best way to model phylogenetic random effects in this context?
Other questions:
- Are there specific considerations or alternative priors for the scaling factor (
phylo_sd
) or the random effects (sp_phylo_effect
) that could improve model performance?
Thank you very much for your time!
Stan code:
data {
int N_spp; // Number of species (47)
int N_island; // Number of islands
array[N_spp] int<lower=1, upper=N_spp> sp_id;
vector[N_spp] island_area;
vector[N_spp] island_age;
array[N_spp] int<lower=1,upper=N_island> island_index_spp;
// Habitat data
int N_habitat;
array[N_habitat] int habitat_arboreal;
array[N_habitat] int total_hab;
array[N_habitat] int<lower=1,upper=N_spp> sp_index_hab;
// Vegetation zone data
int N_veg;
array[N_veg] int veg_arid;
array[N_veg] int total_veg;
array[N_veg] int<lower=1,upper=N_spp> sp_index_veg;
// Brightness data
int N_bright;
vector[N_bright] brightness_obs;
array[N_bright] int<lower=1,upper=N_spp> sp_index_bright;
// Phylogenetic covariance matrix
matrix[N_spp, N_spp] phylo_cov_matrix; // Phylogenetic covariance matrix
}
transformed data {
vector[N_bright] log_brightness_obs = log(brightness_obs);
vector[N_spp] log_island_area = log(island_area);
}
parameters {
// Microhabitat
real mu_arboreal;
real<lower=0> sd_arboreal;
vector[N_spp] arboreal_prob_logit;
// Vegetation zone
real mu_arid;
real<lower=0> sd_arid;
vector[N_spp] arid_prob_logit;
// Brightness
real mu_ln_bright;
real<lower=0> sd_ln_bright_sp;
real<lower=0> sd_ln_bright_island;
vector[N_spp] sp_effect_bright;
vector[N_island] island_effect_bright;
real slope_arbor;
real slope_arid;
real slope_age;
real slope_area;
real<lower=0> sigma_bright;
// Phylogenetic random effects
vector[N_spp] sp_phylo_effect; // Phylogenetic random effects
real<lower=0> phylo_sd; // Scaling factor for phylogenetic effect
}
transformed parameters {
vector[N_spp] arboreal_prob = inv_logit(arboreal_prob_logit);
vector[N_spp] arid_prob = inv_logit(arid_prob_logit);
}
model {
// Hierarchical priors for the probability of microhabitat
mu_arboreal ~ std_normal();
sd_arboreal ~ exponential(1);
arboreal_prob_logit ~ normal(mu_arboreal, sd_arboreal);
habitat_arboreal ~ binomial_logit(total_hab, arboreal_prob_logit[sp_index_hab]);
// Hierarchical priors for vegetation zone
mu_arid ~ std_normal();
sd_arid ~ exponential(1);
arid_prob_logit ~ normal(mu_arid, sd_arid);
veg_arid ~ binomial_logit(total_veg, arid_prob_logit[sp_index_veg]);
// Priors on brightness parameters
mu_ln_bright ~ normal(8, 2);
sd_ln_bright_sp ~ exponential(.5);
sd_ln_bright_island ~ exponential(.5);
slope_arbor ~ normal(0, .5);
slope_arid ~ normal(0, .5);
slope_age ~ normal(0, .5);
slope_area ~ normal(0, .5);
sigma_bright ~ exponential(.5);
// Phylogenetic effect prior using phylogenetic covariance matrix
sp_phylo_effect ~ multi_normal(rep_vector(0, N_spp), phylo_sd^2 * phylo_cov_matrix);
// Hierarchical priors for species and island effects on brightness
sp_effect_bright ~ normal(0, sd_ln_bright_sp);
island_effect_bright ~ normal(0, sd_ln_bright_island);
// Model for true brightness (species-level trait)
vector[N_spp] true_ln_bright = mu_ln_bright +
sp_effect_bright +
sp_phylo_effect + // Include phylogenetic effect
island_effect_bright[island_index_spp] +
slope_arbor * arboreal_prob +
slope_arid * arid_prob +
slope_age * island_age +
slope_area * log_island_area;
// Likelihood for observed brightness
log_brightness_obs ~ normal(true_ln_bright[sp_index_bright], sigma_bright);
}
generated quantities {
vector[N_spp] true_ln_bright = mu_ln_bright +
sp_effect_bright +
sp_phylo_effect + // Include phylogenetic effect
island_effect_bright[island_index_spp] +
slope_arbor * arboreal_prob +
slope_arid * arid_prob +
slope_age * island_age +
slope_area * log_island_area;
}