Hello,
I am working on a Bayesian hierarchical model in Stan to predict the shell brightness of snail species based on several factors: habitat type, vegetation zone, island area, and island age. Additionally, I want to include a phylogenetic random effect to account for shared evolutionary history among species.
My dataset has some species for which I lack complete data on habitat, vegetation, or brightness. The phylogenetic information should help improve predictions for these species by leveraging the phylogenetic covariance matrix, which represents the relatedness among species.
I am not sure if the approach I am using to add this phylogenetic information is correct or not. I havent found examples and perhaps I am missing common methods that are used to do this.
Hereâs the core of my Stan model:

Microhabitat and Vegetation Zone Models: I model the probability of a species being in specific habitats (e.g., arboreal or arid) using binomial logistic regression with hierarchical priors.

Brightness Model: Brightness is modeled as a lognormal variable, with specieslevel and islandlevel random effects. I also include a phylogenetic random effect that uses a multivariate normal distribution with a covariance structure informed by the phylogenetic covariance matrix.

Phylogenetic Information: The phylogenetic effect (
sp_phylo_effect
) is included in the model as a random effect with a covariance structure based on the phylogenetic covariance matrix (phylo_cov_matrix
).
\text{true_ln_bright}_i = \mu_{\ln(\text{bright})} + \beta_{\text{species}, i} + \beta_{\text{phylo}, i} + \beta_{\text{island}, \text{index}(i)} + \gamma_{\text{arbor}} \cdot \text{arboreal_prob}_i +
\gamma_{\text{arid}} \cdot \text{arid_prob}_i + \gamma_{\text{age}} \cdot \text{island_age}_i + \gamma_{\text{area}} \cdot \ln(\text{island_area}_i)
Parameters:
 l true_ln_bright,iâ: represents the modeled logtransformed brightness for each species i.
 Îźln(bright)â: Overall mean log brightness.
 Î˛species,iâ: Random effect for each species.
*Î˛phylo,iâ: Phylogenetic random effect, modeled as a multivariate normal with a covariance matrix proportional to phylo_cov_matrix.  Î˛island,index(i)â: Random effect for islands.
 Îłarborâ, Îłaridâ, Îłageâ, Îłareaâ: Coefficients for habitat, vegetation, island age, and island area.
Inclusion of Phylogenetic Information
The phylogenetic information is incorporated into the model using a random effect that accounts for the shared evolutionary history among species. This is done by adding a phylogenetic random effect (sp_phylo_effect
) for each species, which follows a multivariate normal distribution with a covariance structure derived from a phylogenetic covariance matrix.
Detailed Explanation:
 Phylogenetic Covariance Matrix (
phylo_cov_matrix
):
 This matrix captures the phylogenetic relationships among species. Each element (i,j) of the matrix represents the expected covariance between the traits (in this case, brightness) of species i and species j due to their shared evolutionary history.
 The matrix is computed from a phylogenetic tree, where the distance between nodes reflects the evolutionary distance between species. The closer two species are on the tree, the higher their covariance.
 Phylogenetic Random Effect (
sp_phylo_effect
):
 A vector of random effects, Î˛phylo,iâ, is defined for each species i. This vector is modeled to follow a multivariate normal distribution:
\beta_{\text{phylo}} \sim \text{MVN}\left(0, \sigma_{\text{phylo}}^2 \cdot \text{phylo_cov_matrix}\right)
where: Î˛phyloâ is the vector of phylogenetic random effects for all species.
 MVN(0,Ďphylo2â phylo_cov_matrix) denotes a multivariate normal distribution with a mean vector of 0 and a covariance matrix given by the phylogenetic covariance matrix scaled by Ďphylo2.
 Ďphyloâ is the scaling factor (standard deviation) for the phylogenetic effect, controlling the overall magnitude of the phylogenetic influence on brightness.
Is using a multivariate normal distribution with a covariance matrix derived from the phylogenetic tree the best way to model phylogenetic random effects in this context?
Other questions:
 Are there specific considerations or alternative priors for the scaling factor (
phylo_sd
) or the random effects (sp_phylo_effect
) that could improve model performance?
Thank you very much for your time!
Stan code:
data {
int N_spp; // Number of species (47)
int N_island; // Number of islands
array[N_spp] int<lower=1, upper=N_spp> sp_id;
vector[N_spp] island_area;
vector[N_spp] island_age;
array[N_spp] int<lower=1,upper=N_island> island_index_spp;
// Habitat data
int N_habitat;
array[N_habitat] int habitat_arboreal;
array[N_habitat] int total_hab;
array[N_habitat] int<lower=1,upper=N_spp> sp_index_hab;
// Vegetation zone data
int N_veg;
array[N_veg] int veg_arid;
array[N_veg] int total_veg;
array[N_veg] int<lower=1,upper=N_spp> sp_index_veg;
// Brightness data
int N_bright;
vector[N_bright] brightness_obs;
array[N_bright] int<lower=1,upper=N_spp> sp_index_bright;
// Phylogenetic covariance matrix
matrix[N_spp, N_spp] phylo_cov_matrix; // Phylogenetic covariance matrix
}
transformed data {
vector[N_bright] log_brightness_obs = log(brightness_obs);
vector[N_spp] log_island_area = log(island_area);
}
parameters {
// Microhabitat
real mu_arboreal;
real<lower=0> sd_arboreal;
vector[N_spp] arboreal_prob_logit;
// Vegetation zone
real mu_arid;
real<lower=0> sd_arid;
vector[N_spp] arid_prob_logit;
// Brightness
real mu_ln_bright;
real<lower=0> sd_ln_bright_sp;
real<lower=0> sd_ln_bright_island;
vector[N_spp] sp_effect_bright;
vector[N_island] island_effect_bright;
real slope_arbor;
real slope_arid;
real slope_age;
real slope_area;
real<lower=0> sigma_bright;
// Phylogenetic random effects
vector[N_spp] sp_phylo_effect; // Phylogenetic random effects
real<lower=0> phylo_sd; // Scaling factor for phylogenetic effect
}
transformed parameters {
vector[N_spp] arboreal_prob = inv_logit(arboreal_prob_logit);
vector[N_spp] arid_prob = inv_logit(arid_prob_logit);
}
model {
// Hierarchical priors for the probability of microhabitat
mu_arboreal ~ std_normal();
sd_arboreal ~ exponential(1);
arboreal_prob_logit ~ normal(mu_arboreal, sd_arboreal);
habitat_arboreal ~ binomial_logit(total_hab, arboreal_prob_logit[sp_index_hab]);
// Hierarchical priors for vegetation zone
mu_arid ~ std_normal();
sd_arid ~ exponential(1);
arid_prob_logit ~ normal(mu_arid, sd_arid);
veg_arid ~ binomial_logit(total_veg, arid_prob_logit[sp_index_veg]);
// Priors on brightness parameters
mu_ln_bright ~ normal(8, 2);
sd_ln_bright_sp ~ exponential(.5);
sd_ln_bright_island ~ exponential(.5);
slope_arbor ~ normal(0, .5);
slope_arid ~ normal(0, .5);
slope_age ~ normal(0, .5);
slope_area ~ normal(0, .5);
sigma_bright ~ exponential(.5);
// Phylogenetic effect prior using phylogenetic covariance matrix
sp_phylo_effect ~ multi_normal(rep_vector(0, N_spp), phylo_sd^2 * phylo_cov_matrix);
// Hierarchical priors for species and island effects on brightness
sp_effect_bright ~ normal(0, sd_ln_bright_sp);
island_effect_bright ~ normal(0, sd_ln_bright_island);
// Model for true brightness (specieslevel trait)
vector[N_spp] true_ln_bright = mu_ln_bright +
sp_effect_bright +
sp_phylo_effect + // Include phylogenetic effect
island_effect_bright[island_index_spp] +
slope_arbor * arboreal_prob +
slope_arid * arid_prob +
slope_age * island_age +
slope_area * log_island_area;
// Likelihood for observed brightness
log_brightness_obs ~ normal(true_ln_bright[sp_index_bright], sigma_bright);
}
generated quantities {
vector[N_spp] true_ln_bright = mu_ln_bright +
sp_effect_bright +
sp_phylo_effect + // Include phylogenetic effect
island_effect_bright[island_index_spp] +
slope_arbor * arboreal_prob +
slope_arid * arid_prob +
slope_age * island_age +
slope_area * log_island_area;
}