Phylogenetic regression in Stan: can this model be improved?

Hello everyone!
I’m trying to practice phylogenetic regressions and I’d like to fit them in raw Stan, rather than brms or another package.

I’m working with R simulation code from Ives’ book on correlated data, but I’ve written the Stan model myself. I’m looking for feedback on the way I wrote this model. Should I be using a Cholesky decomposition somehow? Is there a more flexible or general approach?

Simulation code:

## simulate data
set.seed(1618)
n <- 20
b0 <- 0
b1 <- .5
lam.x <- .98
lam.e <- .5
phy <- ape::compute.brlen(
  ape::rtree(n=n),
  method = "Grafen",
  power = 1)

plot(phy)


phy.x <- phylolm::transf.branch.lengths(
  phy=phy, model="lambda",
  parameters=list(lambda = lam.x))$tree

phy.e <- phylolm::transf.branch.lengths(
  phy=phy, model="lambda",
  parameters=list(lambda = lam.e))$tree

x <- ape::rTraitCont(phy.x, model = "BM", sigma = 1)
e <- ape::rTraitCont(phy.e, model = "BM", sigma = .3)
x <- x[match(names(e), names(x))]
Y <- b0 + b1 * x + e
Y <- array(Y)
rownames(Y) <- phy$tip.label

plot(x, Y)

Created on 2024-09-27 with reprex v2.1.1

Stan code:

data {
  int n;
  int s;
  vector[n] x;
  vector[n] y;
  matrix[s, s] phyvcv;
}
parameters {
  real b0;
  real b1;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
  real logit_lambda_x;
  real logit_lambda_y;
}
transformed parameters {
  real<lower=0,upper=1> lambda_x;
  lambda_x = inv_logit(logit_lambda_x);
  // y
  real<lower=0,upper=1> lambda_y;
  lambda_y = inv_logit(logit_lambda_y);
}
model {
  b0 ~ std_normal();
  b1 ~ normal(.5, .5);
  sigma_x ~ exponential(1);
  sigma_y ~ exponential(1);
  logit_lambda_x ~ normal(3, .2);
  logit_lambda_y ~ normal(0, .2);

  matrix[s, s] vcv_x;
  vcv_x = add_diag(sigma_x^2*lambda_x*phyvcv, sigma_x^2*(1 - lambda_x));


  matrix[s, s] vcv_y;
  vcv_y = add_diag(sigma_y^2*lambda_y*phyvcv, sigma_y^2*(1 - lambda_y));


  x ~ multi_normal(rep_vector(0, n), vcv_x);
  y ~ multi_normal(b0 + b1*x, vcv_y);
}

Results:

Parameter recovery is pretty good! Very encouraging, but once again I would love any constructive feedback. Thanks :)

> phylo_sample
       variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 lp__           23.73  24.08 1.90 1.71 20.07 26.12 1.00     1771     2248
 b0             -0.08  -0.08 0.12 0.12 -0.28  0.12 1.00     2671     2639
 b1              0.55   0.55 0.09 0.08  0.41  0.69 1.00     2482     2482
 sigma_x         0.97   0.94 0.16 0.15  0.74  1.27 1.00     3658     2083
 sigma_y         0.22   0.21 0.04 0.04  0.16  0.29 1.00     2989     1939
 logit_lambda_x  3.01   3.01 0.20 0.19  2.68  3.33 1.00     4703     3031
 logit_lambda_y  0.00   0.00 0.20 0.20 -0.33  0.34 1.00     4438     2770
 lambda_x        0.95   0.95 0.01 0.01  0.94  0.97 1.00     4703     3031
 lambda_y        0.50   0.50 0.05 0.05  0.42  0.58 1.00     4438     2770

Hi, @aammd. Thanks for sharing everything here. I don’t see how this code could run without s and n being the same value, because x is an n-vector and it is being drawn multi-normal with a covariance matrix vcv_x that is s x s. I also see that the data simulation did not involve the variable s.

It’s also unusual to have only a single observation for a multi-normal, as it’s hard to infer things like variance. Are you going to use this in a situation where there would be more x and y values to sample? Otherwise, your priors are going to have a huge effect on the posterior with just a single data point. Imagine taking y ~ normal(mu, sigma), where y is only a single data point and mu and sigma have priors: it’s not enough data to get a good handle on either mu or sigma, so the prior has a huge effect.

The code was fine as you wrote it. I pulled out a factor in the covariance and added offsets and multipliers to move the normal distributions back closer to standard normal on the unconstrained scale. And I used compound declare and define to shorten things. I would’ve added the constraint cov_matrix to phyvcv, but I didn’t have the patience to work out whether it’s required to be symmetric positive definite in order for the result to be; if you add enough to the diagonal, anything becomes positive definite.

data {
  int<lower=0> n;
  int<lower=0> s;
  vector[n] x;
  vector[n] y;
  matrix[s, s] phyvcv;
}
transformed data {
  vector[n] zero_vec = rep_vector(0, n);
}
parameters {
  real b0;
  real<offset=0.5, multiplier=0.5> b1;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
  real<offset=3, multiplier=0.2> logit_lambda_x;
  real<multiplier=0.2> logit_lambda_y;
}
transformed parameters {
  real<lower=0, upper=1> lambda_x = inv_logit(logit_lambda_x);
  real<lower=0, upper=1> lambda_y = inv_logit(logit_lambda_y);
}
model {
  b0 ~ std_normal();
  b1 ~ normal(0.5, 0.5);
  sigma_x ~ exponential(1);
  sigma_y ~ exponential(1);
  logit_lambda_x ~ normal(3, 0.2);
  logit_lambda_y ~ normal(0, 0.2);
  matrix[s, s] vcv_x 
    = sigma_x^2 * add_diag(lambda_x * phyvcv, 1 - lambda_x);
  matrix[s, s] vcv_y
    = sigma_y^2 * add_diag(lambda_y * phyvcv, 1 - lambda_y);
  x ~ multi_normal(zero_vec, vcv_x);
  y ~ multi_normal(b0 + b1 * x, vcv_y);
}

A way you could make this faster is to parameterize the covariance matrix as a Cholesky factor. The multiplication is easy, but I’m not sure how to do the equivalent of add_diag.
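One possible workaround, sketched here under the assumption that phyvcv is symmetric positive definite (and n == s, as noted above): since phyvcv is fixed data, its eigendecomposition can be computed once in transformed data. Then lambda * phyvcv + (1 - lambda) * I has the same eigenvectors as phyvcv, so rotating by Q' diagonalizes the covariance and the multi-normal reduces to independent univariate normals:

transformed data {
  // one-time eigendecomposition: phyvcv = Q * diag_matrix(evals) * Q'
  matrix[s, s] Q = eigenvectors_sym(phyvcv);
  vector[s] evals = eigenvalues_sym(phyvcv);
}
model {
  // lambda_x * phyvcv + (1 - lambda_x) * I
  //   = Q * diag_matrix(lambda_x * evals + (1 - lambda_x)) * Q',
  // and Q is orthogonal, so Q' * x has diagonal covariance
  vector[n] z_x = Q' * x;
  z_x ~ normal(0, sigma_x * sqrt(lambda_x * evals + (1 - lambda_x)));
  // y works the same way after subtracting its mean
  vector[n] z_y = Q' * (y - (b0 + b1 * x));
  z_y ~ normal(0, sigma_y * sqrt(lambda_y * evals + (1 - lambda_y)));
}

That replaces the O(s^3) factorization hidden inside each multi_normal evaluation with matrix-vector products.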

For simplicity, I’d be inclined to just define lambda_x and lambda_y as parameters and give them beta priors that closely mimic your normal(3, 0.2) and normal(0, 0.2), both of which are unimodal once transformed with inverse logit.
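To pick the beta parameters, one rough approach (an illustrative R sketch) is to simulate the prior that the logit-scale normal implies for lambda and moment-match a beta to it:

# moment-match a beta(a, b) to the prior implied on lambda
# by logit(lambda) ~ normal(3, 0.2); illustrative only
draws <- plogis(rnorm(1e6, mean = 3, sd = 0.2))
m <- mean(draws)
v <- var(draws)
# beta(a, b) has mean a / (a + b) and variance m * (1 - m) / (a + b + 1)
k <- m * (1 - m) / v - 1   # implied a + b
c(a = m * k, b = (1 - m) * k)

An exact moment match to normal(3, 0.2) comes out quite concentrated; a beta like beta(9, 1) centers in the same high-lambda region but is far more diffuse.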


Hi @Bob_Carpenter! Thanks so much for this feedback! I had a few comments/questions:

  1. When we use offset and multiplier on an unconstrained parameter, should the prior on that parameter then be std_normal()? In your edits you kept the original values in the prior as well.
  2. Is it somehow more efficient to sample a constrained value on e.g. the logit scale (the lambda parameters above)? I was also thinking vaguely of adding hyperparameters on lambda to partially pool across multiple traits: what is the average “phylogenetic signal” for traits in this group?
  3. Yes indeed, there will be more X and Y values to sample! In my first example I was sticking very close to the textbook example I was following. In real life, we have one set of individuals measured for Trait X and another set of individuals measured for Trait Y. Of course, there are also missing values and uneven sampling effort.

The model below includes @Bob_Carpenter’s changes and is getting closer to my real-world application. Posting it here for the next person who wants to do phylogenetic regression in Stan:

data {
  int<lower=0> s;
  // x trait
  int<lower=0> n_x;
  vector[n_x] x_obs;
  array[n_x] int<lower=1,upper=s> sp_id_x;
  // y trait
  int<lower=0> n_y;
  vector[n_y] y_obs;
  array[n_y] int<lower=1,upper=s> sp_id_y;
  matrix[s, s] phyvcv;
}
transformed data {
  vector[s] zero_vec = rep_vector(0, s);
}
parameters {
  real<offset=0.5, multiplier=0.8> b0_y;
  real<offset=0.5, multiplier=0.5> b1;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
  real<lower=0, upper=1> lambda_x;
  real<lower=0, upper=1> lambda_y;
  vector[s] x_avg;
  vector[s] y_avg;
  real<lower=0> sigma_x_obs;
  real<lower=0> sigma_y_obs;
}
model {
  matrix[s, s] vcv_x
    = sigma_x^2 * add_diag(lambda_x * phyvcv, 1 - lambda_x);
  matrix[s, s] vcv_y
    = sigma_y^2 * add_diag(lambda_y * phyvcv, 1 - lambda_y);

  b0_y ~ std_normal();
  b1 ~ std_normal();
  sigma_x ~ exponential(1);
  sigma_y ~ exponential(1);
  lambda_x ~ beta(9, 1);
  lambda_y ~ beta(5, 5);
  sigma_x_obs ~ exponential(1);
  sigma_y_obs ~ exponential(1);
  // species averages
  x_avg ~ multi_normal(zero_vec, vcv_x);
  y_avg ~ multi_normal(b0_y + b1 * x_avg, vcv_y);
  // observations of these
  x_obs ~ normal(x_avg[sp_id_x], sigma_x_obs);
  y_obs ~ normal(y_avg[sp_id_y], sigma_y_obs);

}

And here is code to simulate data and fit this model:

library(tidyverse)
library(ape)

set.seed(1618)

# set true parameter values
n <- 20
b0_y <- .5
b1 <- .5
lam.x <- .98
lam.e <- .5
sigma_x <- 1
sigma_y <- .3

# simulate phylogeny
phy <- ape::compute.brlen(
  ape::rtree(n=n),
  method = "Grafen",
  power = 1)

plot(phy)


# get names from this matrix! needs to line up perfectly
phyvcv <- ape::vcv(phy)

distmat_names <- dimnames(phyvcv)[[1]]

# observations per species
n_obs <- 15


phy.x <- phylolm::transf.branch.lengths(
  phy=phy, model="lambda",
  parameters=list(lambda = lam.x))$tree

phy.e <- phylolm::transf.branch.lengths(
  phy=phy, model="lambda",
  parameters=list(lambda = lam.e))$tree

x <- ape::rTraitCont(phy.x, model = "BM", sigma = sigma_x)
e <- ape::rTraitCont(phy.e, model = "BM", sigma = sigma_y)
x <- x[match(names(e), names(x))]

## calculate Y
Y <- b0_y + b1 * x + e

# Y <- array(Y)
names(Y) <- phy$tip.label

plot(x, Y)


obs_xy_df <- tibble(x, Y, sp_name = names(x)) |> 
  mutate(
    sp_id = as.numeric(
      factor(sp_name, 
             levels = distmat_names))) |> 
  rowwise() |> 
  mutate(obs_x = list(
    rnorm(n_obs, mean = x, sd = .3)),
    obs_y = list(rnorm(n_obs, mean = Y, sd = .3)))


x_obs_df <- obs_xy_df |> 
  select(sp_id, obs_x) |> unnest(obs_x)

y_obs_df <- obs_xy_df |> 
  select(sp_id, obs_y) |> unnest(obs_y)

# compile the Stan model above with cmdstanr
# (the .stan file path is assumed; adjust to wherever you saved the model)
phylo_obs <- cmdstanr::cmdstan_model("phylo_obs.stan")

phylo_obs_sample <- phylo_obs$sample(data = list(
  s = n,
  # trait x
  n_x = nrow(x_obs_df),
  x_obs = x_obs_df$obs_x,
  sp_id_x = x_obs_df$sp_id,
  # trait y
  n_y = nrow(y_obs_df),
  y_obs = y_obs_df$obs_y,
  sp_id_y = y_obs_df$sp_id,
  # phylogeny
  phyvcv = phyvcv
), parallel_chains = 4, refresh = 1000)

phylo_obs_sample$summary(variables = c(
  "b0_y", "b1", "sigma_x", "sigma_y", "lambda_x", "lambda_y", "sigma_x_obs", "sigma_y_obs"
))

That was intentional (this answers your question 1). The right way to code a non-centered parameterization is:

parameters {
  real mu;
  real<lower=0> sigma;
  vector<offset=mu, multiplier=sigma>[N] alpha;   // NON-CENTERED
  // vector[N] alpha;  // CENTERED
}
model {
  alpha ~ normal(mu, sigma);
}

The only difference is whether to include the affine transform in the declaration—the distribution statement stays the same.

Alternatively, you can code the non-centered parameterization explicitly as follows (and if you’re using a multivariate transform, the explicit approach is the only one available). Using offset and multiplier leads to the same posterior density and geometry as this explicit definition:

parameters {
  real mu;
  real<lower=0> sigma;
  vector[N] alpha_std; 
}
transformed parameters {
  vector[N] alpha = mu + sigma * alpha_std;
}
model {
  alpha_std ~ normal(0, 1);
}
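For the multivariate case mentioned above, here is a minimal sketch; L_Sigma stands in for a Cholesky factor of the covariance (e.g., cholesky_decompose(vcv_x)) and is treated as data just to keep the fragment short:

data {
  int<lower=0> s;
  cholesky_factor_cov[s] L_Sigma;  // assumed known for this sketch
}
parameters {
  real mu;
  vector[s] alpha_std;
}
transformed parameters {
  // the matrix analogue of offset/multiplier has to be written out by hand
  vector[s] alpha = mu + L_Sigma * alpha_std;
}
model {
  alpha_std ~ std_normal();  // implies alpha ~ multi_normal(rep_vector(mu, s), L_Sigma * L_Sigma')
}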

It all depends on the posterior geometry. Ideally, you will get a posterior with well-conditioned Hessians everywhere that is as close to standard normal as possible (independent, unit-scale, log concave).

Stan requires all parameters to be mapped to the unconstrained scale. So you can provide priors on the logit scale and then transform. For example, a normal-logistic looks like this:

real logit_alpha;
...
logit_alpha ~ logistic(0, 1);
...
real alpha = inv_logit(logit_alpha);

In this code, alpha is in (0, 1) and has a uniform distribution. You can use unconstrained priors other than the standard logistic to get other distributions over alpha. This works out to exactly the same thing as doing this:

real<lower=0, upper=1> alpha;

with no prior: it gives alpha a uniform distribution on (0, 1). It does this by adjusting for the Jacobian of the inverse logit, which is what we use under the hood to transform unconstrained values to constrained values. If you look at the Jacobian of that transform, it’s the standard logistic density. That is, the Jacobian correction is the absolute value of the derivative of the transform, which is

\left| \frac{d}{d u} \textrm{logit}^{-1}(u) \right| = \textrm{logit}^{-1}(u) \cdot (1 - \textrm{logit}^{-1}(u)) = \textrm{logistic}(u \mid 0, 1).
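A quick numerical check of that identity in R:

# derivative of inv_logit equals the standard logistic density ...
u <- seq(-4, 4, by = 0.5)
all.equal(plogis(u) * (1 - plogis(u)), dlogis(u))  # TRUE
# ... and pushing a standard logistic variate through inv_logit is uniform(0, 1)
hist(plogis(rlogis(1e5)))  # approximately flat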

Thank you so much for this explanation! For anyone reading along, here is a link to the reference manual section on affine transformations. I admit the syntax is counterintuitive for me, since it really looks like the transformation is being implemented twice!

I also really appreciate the details about how constrained parameters are sampled! Thanks again :)

The only transform being implemented is one that multiplies the unconstrained parameters by \sigma and adds \mu. The idea then is that if you have an unconstrained value \theta^\text{unc} \in \mathbb{R}, then we map to \theta = \mu + \sigma \cdot \theta^\text{unc} and take \theta \sim \textrm{normal}(\mu, \sigma), which implies that \theta^\text{unc} \sim \textrm{normal}(0, 1).

So it gives us exactly the same distribution over \theta^\text{unc} as taking \theta^\text{unc} \sim \textrm{normal}(0, 1) and then setting \theta = \mu + \sigma \cdot \theta^\text{unc}.

The transformation does happen twice in some sense, because in the offset/multiplier version, Stan will calculate (\theta - \mu) / \sigma to evaluate \log \textrm{normal}(\theta \mid \mu, \sigma). So it’s a tiny bit more work than doing it all manually, but the result is the same sampling geometry over \theta^\text{unc}.
