Marginalizing over latent discrete parameters

I have a question about marginalizing latent discrete parameters. I have a model with some continuous parameters: theta (real), s_1 (simplex), s_2 (simplex), T_N (real), and deltas (vector). The likelihood of the data depends on both the continuous parameters and a discrete parameter array, int_param (int[]). I need to marginalize out those discrete parameters, but the number of possible values for the int_param array depends on the current values of the continuous parameters.

I tried to put a uniform prior, log(1/P) = -log(P), over the P different possible values of the int_param array.

Would this strategy cause HMC to favor values of the continuous parameters that admit more possible values of the int_param array?

data {
  real<lower=0.0, upper=2.0> K;
  int<lower=2> N;
  int<lower=2> n;
  int<lower=0> L;
  int<lower=0, upper=4> data_array[n, L];
}
parameters {
  vector<lower=0.0001>[N] deltas;
  real<lower=0.0> T_N;
  real<lower=0.0> theta;
  simplex[N] s_1;
  simplex[n] s_2;
}
transformed parameters {
  positive_ordered[n - 1] t = head(cumulative_sum(s_1), n - 1) * T_N;
}
model {
  vector[n - 1] J_diag;
  deltas ~ exponential(rep_vector(1, N));
  theta ~ exponential(1);
  T_N ~ conditionalDensityTN(deltas[N], n);
  // Jacobian adjustment for the change of variables from s_1 to t
  J_diag = rep_vector(T_N, n - 1);
  target += sum(log(J_diag));
  // this function sums over all values of int_param
  // compatible with the continuous parameters
  target += log_costum_likelihood(data_array, T_N, deltas, theta, t, s_1, s_2);
}

Hi Fabian,

I thought that P (the number of possible values) was determined by the values of the remaining covariates? Do you really want to inject information about the likely covariate values via a nongenerative prior in this way? Also, it seems like P must be an integer, but then it cannot be a parameter in the model (it would need to be marginalized) and cannot be given a prior directly. Furthermore, P is absent from the model code that you’ve included.

I think you might get better answers to your question if you include the full model specification, including the likelihood function.

One possible approach to the marginalization might be to sum over all of the parameter values that you expect to ever be possible, where values that are impossible for a given realization of the continuous parameters contribute zeros to the sum. I think (but I'm not sure and I haven't tried it) that in order for this to have a good chance of being effective in Stan, you'd hope that the likelihood of a given discrete value approaches zero as the parameters approach the boundary where that discrete value is disallowed. Otherwise, the likelihood function will be discontinuous in the continuous parameters, and Stan will very probably have trouble.
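To sketch that idea concretely, here is a toy Python version (the configurations, compatibility rule, and per-configuration likelihood below are all made up for illustration; in a Stan model you would do the same sum with `log_sum_exp` inside the model block):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)); tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical: every configuration that can ever occur (MaxP of them).
all_configs = [(16, 21), (34, 21), (23, 21), (7, 5)]

def is_compatible(config, t):
    # Stand-in for the real compatibility rule, which depends on t and deltas.
    return config[0] < 30 * t

def log_lik_given_config(config, t):
    # Stand-in for the per-configuration log likelihood.
    return -0.5 * (config[0] - 20 * t) ** 2

def marginal_log_lik(t):
    # Impossible configurations get log probability -inf, i.e. weight zero.
    terms = [log_lik_given_config(c, t) if is_compatible(c, t) else -math.inf
             for c in all_configs]
    return log_sum_exp(terms)
```

The fixed list `all_configs` is what makes the sum well defined for every value of `t`; the discontinuity concern above is about how `log_lik_given_config` behaves as a configuration crosses from compatible to incompatible.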


Yes, P (the integer number of possible values of the discrete array int_param of size N) is not fixed. P depends on the covariates: the vectors of continuous parameters t and deltas. For example, for some values of the covariates the possible values of int_param are [16, 21], [34, 21], and [23, 21] (P = 3 and N = 2). For other values of the covariates, the number of possible values of int_param can be different from P = 3. The maximum value of P can be computed analytically, MaxP = n^{N-1}, but the actual P can be very different from MaxP.

I used P just to put some prior probability to each “configuration”(a combination of values in the vector int_param) when I marginalize.

The full model is more than 1000 lines of code.

I will try your advice: use a uniform prior of 1/MaxP over all configurations.

Second that.


You cannot put a prior directly over the configurations, because the configurations are discrete and therefore a Stan model cannot be parameterized in terms of the configurations. Your prior over the configurations will be whatever is implied by your priors on the continuous parameters that determine what configurations are possible or likely.


Thanks for your answers.
In the change point model in the Stan User's Guide (7.2 Change point models), a uniform prior is used on the discrete parameter s.
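For what it's worth, the change-point trick works there because the support of s is fixed (1..T) and does not depend on the continuous parameters, so a uniform prior plus log-sum-exp marginalization is straightforward. A toy Python sketch of that marginalization, with a Poisson likelihood as in the User's Guide example (the data and rates are illustrative):

```python
import math

def pois_log_pmf(k, lam):
    # log of the Poisson pmf
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def changepoint_marginal_log_lik(y, e, l):
    """Marginalize the change point s over 1..T with a uniform prior,
    mirroring the construction in the change-point chapter."""
    T = len(y)
    lp = []
    for s in range(1, T + 1):
        term = -math.log(T)              # uniform prior: log(1/T)
        for t in range(1, T + 1):
            lam = e if t < s else l      # rate e before the change point, l after
            term += pois_log_pmf(y[t - 1], lam)
        lp.append(term)
    m = max(lp)                          # log-sum-exp over the T possibilities
    return m + math.log(sum(math.exp(x - m) for x in lp))
```

The key difference from your setting is that here the list of discrete values being summed over never changes with the continuous parameters.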

Here is the problem I am trying to model: the configurations are the combinations of indexes of the parent nodes of N subtrees (the biggest subtree is the whole tree itself). The positions of the continuous variables T_1 and T_2 determine the indexes of the roots of the subtrees; node 6 is always the root of the biggest subtree (the whole tree).

The first element in the configuration is the index of the root of the subtree below T_1 (there are 4 choices: 1, 2, 8, and 5), and the second element is the index of the root of the subtree below T_2 (there are 3 choices: 7, 8, and 5).
I want to infer the continuous T_1, T_2, and deltas given observed information at the tips.

The configurations determine a clustering on the tips for a fixed tree topology.


My intuition is that this will be really hard with Stan, because if I understand correctly the likelihood function will be discontinuous in T_1 and T_2.


I agree.

These pictures make me think BEAST may be of some use to you. I don't have any experience in phylogenetics or with BEAST, so I can't be sure.


I do, but I need to understand the problem better before making an informed recommendation.


The likelihood for the model that I am using is not implemented in BEAST, MrBayes, RevBayes, etc., so I cannot use BEAST directly (I would need to implement a new BEAST package).

phylostan might give you some implementation ideas. Also, what happens when you estimate parameters with tree structure fixed? I understand the drive to integrate over phylogenetic uncertainty, but a first stab would be understanding just how much variation we’re talking about in the context of your model and data.

Yes, the starting point for my implementation was phylostan. There are differences between phylostan and our model: 1) our model uses a more complex coalescent tree prior than the Kingman coalescent, and 2) there is an error model at the tips of the tree. Also, the phylostan code assumes a single population rather than several subpopulations (as in the question here).

In summary, our model with one population works very well (in 2000 iterations the Rhats are 1) and correctly infers the true parameters (for simulated data). I made improvements over phylostan in the implementation of the Felsenstein likelihood. For the Stan model with more than one subpopulation, one needs to either marginalize the positions of the MRCAs of the subpopulations or jointly infer those positions.
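In case it helps anyone reading along, here is a minimal Python sketch of the Felsenstein pruning recursion for a single binary character on a fixed toy tree (the tree, branch lengths, and symmetric flip model are illustrative only, not the model discussed here):

```python
import math

# Symmetric two-state model: P(state change over branch length b) = (1 - exp(-2b)) / 2.
def p_change(b):
    return 0.5 * (1.0 - math.exp(-2.0 * b))

def transition(b):
    pc = p_change(b)
    return [[1 - pc, pc], [pc, 1 - pc]]

def partial(node, tips):
    """Return [P(data below node | state 0), P(data below node | state 1)]."""
    if node["name"] in tips:             # leaf: indicator on the observed state
        obs = tips[node["name"]]
        return [1.0 if s == obs else 0.0 for s in (0, 1)]
    out = [1.0, 1.0]
    for child in node["children"]:       # postorder: combine child partials
        P = transition(child["length"])
        cp = partial(child, tips)
        for s in (0, 1):
            out[s] *= sum(P[s][c] * cp[c] for c in (0, 1))
    return out

def site_log_lik(tree, tips):
    root = partial(tree, tips)
    return math.log(0.5 * root[0] + 0.5 * root[1])  # uniform root frequencies

# Toy tree ((A:0.1, B:0.1):0.2, C:0.3)
tree = {"name": "root", "children": [
    {"name": "AB", "length": 0.2, "children": [
        {"name": "A", "length": 0.1, "children": []},
        {"name": "B", "length": 0.1, "children": []}]},
    {"name": "C", "length": 0.3, "children": []}]}
```

With the symmetric model and uniform root frequencies, flipping every tip state leaves the likelihood unchanged, which is a handy sanity check for an implementation.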

I also checked other papers related to inference on coalescent models:

Luiz Max Carvalho, we can collaborate if you are interested in this.


Sure, drop me a line and we can talk more. With the condition that whatever we find out in terms of Stan usability is posted here to give back to the community.


I was wondering whether there is a simplified version of this problem: just inferring the clustering tree and sampling over possible tree configurations, without caring about time depth.

I have a few considerations to make about that:

i. Branching order (“clustering tree”) and branch length (“time depth”) are inextricably linked, so while it is possible to sample complete trees and ignore their branch lengths, I have yet to see an approach where one disregards lengths entirely;
ii. The tree configurations are usually the major obstruction to efficient sampling from phylogenetic targets.

But if you can expand a bit more on the motivation, maybe there is some established technique that can be leveraged.


Thanks for your answer Max.

(Disclaimer that I know very little about inducing a phylogeny)

What I am thinking of is trying to sample clustering trees based on aligned binary sequences.
So, imagine we have four sequences: [110], [101], [111], [000]. We assume that these four sequences are the result of one original sequence (also 3 positions), and sequential flipping of 1 to 0 or 0 to 1 in each position.

Here is an example of two potential trees one could build from these sequences.

My naive approach would be that each bit is modeled something like this:

So, in the context of this problem, branch length is not really interesting. What is interesting is the clustering and the node parameters (a, b, …).

I don’t know if this makes any sense.


I should add, the motivation for doing this is that I'd like to expand this with spatial dependencies between nodes and leaf nodes. So the final model should assign some probability to the tree structure and some probability to spatial diffusion.


Thanks for sharing @Matias_Guzman_Naranjo. Your approach seems to be related to hierarchical clustering (which has been applied to linguistics before). I agree with @maxbiostat that the clustering tree (the topology of the tree) and the branch lengths are intrinsically linked: the longer the branch, the more changes you accumulate along it (true for genetic sequences and also for linguistics).

By the way, there is a Time Marginalized Coalescent (TMC) prior for when you are not interested in the branch lengths. (In my case, I am interested in marginalizing the mutations, not the branch lengths, which are informative about growth rates and other parameters of interest.)

Your approach seems to be related to hierarchical

Yes, that’s what I was wondering, whether these algorithms can be implemented in Stan. I’ve seen packages for Bayesian hierarchical clustering with Dirichlet diffusion trees, but haven’t been able to find Stan implementations.

By the way, there is a Time Marginalized Coalescent (TMC) prior for when you are not interested in the branch lengths.

Thanks for the paper, that’s very interesting!

@Matias_Guzman_Naranjo, I implemented the previous model in Stan with a fixed number of change points along the branches of the tree. The model did not mix well and did not converge. Probably the likelihood of the model is discontinuous across different placements of the change points on the tree.

Could you share it? I would like to take a look.