Constraining probabilities in an HMM

Hi everyone,

I am fairly new to coding in STAN, so apologies in advance if this is a simple question.

TLDR: How can I constrain specific probabilities in a simplex, e.g. set a min and a max for individual elements? Right now I am forcing this interval in the transformed parameters block by literally overwriting the values, but I don’t know if this is a good approach, as maybe internally the model just keeps trying to push in that direction. Would soft penalties or something else be a better approach?

For some context regarding the data, I am analysing time series from different sources that at each timepoint can show either an inactive state (0) or one of 11 active states (1-11). Key aspects: no two states can be active at the same time, and the inactive state is heavily over-represented (roughly 50% of the timepoints are in the inactive state).

I am developing a hidden Markov model to study these time series, with the assumption that states arise from different underlying rules and that those rules might change depending on characteristics of the sources, i.e. covariate metrics. Initially I coded a relatively simple HMM, but the posteriors were too broad and did not converge, so I started implementing some assumptions that are well supported by the underlying theory of my field.

One assumption is that the hidden states should have biased self-transitions. The way I did this was by implementing priors in the model:

parameters {
    simplex[M] meta_base_trans[M];  // base transition probabilities for the M metastates
}
model{
  for(m in 1:M){
    meta_base_trans[m] ~ dirichlet(append_row(append_row(rep_vector(1.0, m - 1), rep_vector(8.0, 1)), rep_vector(1.0, M - m)));
  }
}

(which, to guarantee it works, I also implemented as)

    real selfBias = 2.0;
    if (m == 1) {
      meta_base_trans[m] ~ dirichlet(append_row(rep_vector(selfBias, 1), rep_vector(1.0, M - 1)));
    } else if (m == M) {
      meta_base_trans[m] ~ dirichlet(append_row(rep_vector(1.0, M - 1), rep_vector(selfBias, 1)));
    } else {
      meta_base_trans[m] ~ dirichlet(append_row(append_row(rep_vector(1.0, m - 1), rep_vector(selfBias, 1)), rep_vector(1.0, M - m)));
    }

However, the problem is that very quickly all my models pushed the self-transition probabilities of the hidden states to ~1. This meant that the probability space was effectively not being explored, and for example I would get emission posteriors that only had samples for one metastate. So I came up with two ideas:

(1) Not allowing the transition probabilities between metastates to go below 0.05 or above 0.95, which I implemented as:

transformed parameters {
  simplex[M] meta_trans_perPat[P, M];                    // Transition probabilities between metastates for each patient
  vector[M] meta_trans_perPat_constrained[P, M];         // Constrained transition probabilities for each patient
  vector[M] meta_trans_perPat_constrained_logits[P, M];  // Logs of the constrained transition probabilities
  for (p in 1:P) {
    for (m_from in 1:M) {
      meta_trans_perPat[p, m_from] = softmax(log(meta_base_trans[m_from]) + to_vector(X[p] * beta_trans_meta));
      // constrain
      meta_trans_perPat_constrained[p, m_from] = (meta_trans_perPat[p, m_from] * 0.9) + 0.05;
      // normalise to ensure row sum to 1
      meta_trans_perPat_constrained[p, m_from] = meta_trans_perPat_constrained[p, m_from] / sum(meta_trans_perPat_constrained[p, m_from]);
      
      meta_trans_perPat_constrained_logits[p, m_from] = log(meta_trans_perPat_constrained[p, m_from]);
      
    }
  }
}

(2) Concentrating emissions of the inactive state in a dedicated inactive metastate. This is supported conceptually, and as you can see below I leave minimal leeway for an inactive state to occur sporadically in an active metastate, or for an active state to occur sporadically during the inactive metastate. I implemented this as:

transformed parameters {
  simplex[O] meta_emiss_constrained[M];  // Emission probabilities for each metastate
  vector[O] meta_emiss_logits[M];        // Logs of emission probabilities for each metastate
  for (m in 1:M) {
    if (m == 1) {
      // Metastate 1 only emits inactive states (with minimal leakage to the active states)
      meta_emiss_constrained[m]    = rep_vector(0.01 / (O - 1), O);
      meta_emiss_constrained[m][1] = 0.99;
    } else {
      meta_emiss_constrained[m] = meta_base_emiss[m];
    }
    meta_emiss_logits[m] = log(meta_emiss_constrained[m]);
  }
}

model {
  for (m in 1:M) {
    if (m == 1) { meta_base_emiss[m] ~ dirichlet(append_row(rep_vector(75.0, 1), rep_vector(0.1, O - 1))); }
    else { meta_base_emiss[m] ~ dirichlet(append_row(rep_vector(0.1, 1), rep_vector(100.0 / (O - 1), O - 1))); }
  }
}

Now my question is basically whether this is well implemented or not. Since I am new to STAN logic, I am not sure this is the best way to guarantee constraints on individual probability values. I know that in the parameters block I can declare, for example, int<lower=1, upper=5> X; but how would you do that for a single element of a simplex in a vector of simplexes, without doing what I did?

Thank you very much in advance for the help!


There’s no way to specify that directly in the constraint declaration; as Stan is currently implemented, it has to be constructed case by case.

Also, you have to make sure that the sum of the lower bounds is less than 1 and the sum of the upper bounds is greater than 1 for a solution to exist.

The Dirichlet you’ve constructed has a mean for the self element of selfBias / (selfBias + M - 1) and a mean for the other elements of 1 / (selfBias + M - 1). You don’t get a hard constraint. I’d actually suggest doing it this way if you have prior knowledge of the system.
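For reference, these means follow from the general Dirichlet mean, \mathbb{E}[\theta_j] = \alpha_j / \sum_k \alpha_k, applied to a concentration vector with selfBias on the self element and 1 on the other M - 1 elements:

\mathbb{E}[\theta_{m,m}] = \frac{\text{selfBias}}{\text{selfBias} + M - 1}, \qquad \mathbb{E}[\theta_{m,j}] = \frac{1}{\text{selfBias} + M - 1} \quad (j \ne m)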

The prior’s only going to drive it as indicated above; the posterior’s going to be driven by the likelihood (i.e., the data).

If you wanted a hard constraint, say that the self transition is at least 50%, you could do this:

parameters {
  array[N] simplex[N] theta_base;
...
transformed parameters {
  array[N] simplex[N] theta;
  for (n in 1:N) {
    // after this, sum(theta[n]) = 0.5:
    theta[n] = 0.5 * theta_base[n];
    // after this, theta[n, n] > 0.5 and sum(theta[n]) = 1:
    theta[n, n] += 0.5;
  }
}
...
model {
  theta ~ dirichlet(...whatever...);
  ...

You can do various things like this for lower and upper bounds, but be careful about unsatisfiable constraints—the sum of the lower bounds needs to be less than 1 and the sum of the upper bounds needs to be greater than 1.
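As an illustration (an untested sketch; the 0.3 and 0.9 bounds are just placeholders), combining a lower and an upper bound on each self-transition could look roughly like this:

parameters {
  array[N] real<lower=0.3, upper=0.9> theta_self;  // bounded self-transition probabilities
  array[N] simplex[N - 1] theta_rest;              // how the leftover mass is split across the other states
}
transformed parameters {
  array[N] simplex[N] theta;
  for (n in 1:N) {
    int k = 1;
    theta[n, n] = theta_self[n];
    for (j in 1:N) {
      if (j != n) {
        // scale the (N - 1)-simplex into the mass left over after the self-transition
        theta[n, j] = (1 - theta_self[n]) * theta_rest[n, k];
        k += 1;
      }
    }
  }
}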

P.S. append_row accepts scalars, so you don’t have to build up 1-vectors.
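For example, the middle case of the prior above could be written as something like

meta_base_trans[m] ~ dirichlet(append_row(append_row(rep_vector(1.0, m - 1), selfBias), rep_vector(1.0, M - m)));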

In case it’s helpful, one way to construct an n-dimensional simplex with arbitrary constraints on a single element is to declare the element with appropriate constraints, and then to multiply that element’s complement by an (n-1) dimensional simplex.

One way to construct a simplex with arbitrary, compatible (i.e. consistent) constraints on two elements is to declare one element with appropriate constraints, then the second element with appropriate constraints, truncated if necessary by the additional constraint that it cannot exceed the complement of the first element. Then multiply the complement of the sum of the first two elements by an (n-2) dimensional simplex.

This can be iterated, in a similar vein, for constraints on progressively more elements.
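For instance, here is a hypothetical sketch of the two-element case where the nominal upper bounds would otherwise sum past 1 (the specific bounds are invented for illustration, and it relies on Stan allowing bounds that depend on previously declared parameters):

parameters {
  real<lower=0.1, upper=0.7> x1;
  real<lower=0.2, upper=fmin(0.6, 1 - x1)> x2;  // x2's nominal upper bound, truncated by the complement of x1
  simplex[N - 2] x_rest;
}
transformed parameters {
  simplex[N] x;
  x[1] = x1;
  x[2] = x2;
  x[3:N] = (1 - x1 - x2) * x_rest;  // scale the remaining mass by an (N - 2)-simplex
}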

One general challenge is in ensuring that the implied prior on the overall simplex doesn’t get truncated/contorted into something undesirable.

Hi Bob,

Thank you very much for taking the time and the advice, especially the part about paying attention to the lower and upper bounds. I think that might have been problematic and led to the issues I was seeing. So the informed priors seem to be the way to go.

In the meantime I saw somewhere the concept of a soft penalisation, where basically I would penalise the target log density depending on how much a certain parameter went past a lower/upper bound. Something like this for a lower bound at a specific value:

real penalisation;
real lowerBound_log;
if (meta_trans_perPat_logits[1, 1, ipat] < lowerBound_log) {
  target += -penalisation * (meta_trans_perPat_logits[1, 1, ipat] - lowerBound_log)^2;
}

While this makes sense to me conceptually, I wouldn’t be sure of which values to use for the penalisation. In my implementation I have both meta_trans defined as a parameter, and meta_trans_perPat defined in the transformed parameters as a result of each patient’s covariate influence on meta_trans. So I suppose penalising based on meta_trans_perPat (which repeats for every patient) vs one time based on meta_trans after the patient cycle is done would imply different values.

Would definitely appreciate your thoughts on this and if it makes sense, then on what would be an effective range of values to use.

Thanks again for your time!

Hi Jsocolar,

Thanks for your time. Just to be sure I understand, are you saying I can declare elements individually and then assign them to the simplex, and STAN will guarantee that the constraints are respected? I am not sure I fully understood, so here’s some mock code:

data {  int<lower=1> N;  }

parameters {
  real<lower=0.1, upper=0.3> x1;  // Will be used to constrain x[1] between 0.1 and 0.3
  real<lower=0.2, upper=0.4> x2;  // same as above but 0.2 to 0.4
  simplex[N - 2] x_remaining; 
}

transformed parameters {
  vector[N] x_full;  
  real<lower=0, upper=1> remaining_sum;   // will have the probability left after x1 and x2 and used to scale

  remaining_sum = 1 - x1 - x2; 
  x[1] = x1;
  x[2] = x2;

  // Scale the remaining simplex to fill the remaining probability (otherwise can't guarantee < 1)
  for (i in 3:(N)) {
    x_full[i] = remaining_sum * x_remaining[i-2];
  }
}



By doing this STAN will always respect that x_full[1] has to be between 0.1 and 0.3? That's what I understood, but I guess that means there is some hidden internal logic?

Thanks!

where you have

x[1] = x1;
x[2] = x2;

you need

x_full[1] = x1;
x_full[2] = x2;

This straightforwardly guarantees that the constraints are respected, because x1 is constrained, and x_full[1] is equal to x1. Nothing mysterious internally here :)

It gets more complicated if the sum of the upper bound constraints on x1 and x2 exceeds 1. Then you need to constrain x2 to be less than the minimum of its upper bound and the complement of x1.

That’s exactly what my example does.

This is constraining the stick-breaking process we use to do the generation. We could build this in with interval constraints per element and we’d code it just as @jsocolar indicated.

If you construct it coherently, yes. (Also, it’s “Stan” because it’s not an acronym.) Your parameters block is fine, as there’s no way for x1 and x2 to run out of bounds (if there were, you’d need to do what @jsocolar was suggesting).

You don’t need the loop, though.

simplex[N] x_full = append_row(x1, append_row(x2, (1 - x1 - x2) * x_remaining));

This is a little less efficient than

simplex[N] x_full;
x_full[1] = x1;
x_full[2] = x2;
x_full[3:] = (1 - x1 - x2) * x_remaining;

Declaring as a simplex will provide a check that the simplex constraint is satisfied. It’s not necessary and adds a little overhead, but I feel that’s usually worth it for the safety.

Yup, I was just laying out the logic here so it would be clear how to generalize to arbitrary constraints, including simultaneous lower and upper bounds, constraints on multiple elements, etc. :)

Thanks again for the detailed explanation!

Could I also get your opinion on using a soft penalisation?

If by “soft penalisation” you mean adding a prior, then yes, you can do that. Stan lets you pile on priors because they don’t have to be normalized. You just have to be careful if there’s truncation, as there will be with a simplex.

For example, if you declare a simplex this way,

parameters {
  simplex[N] theta;
}

then it implicitly gets a uniform distribution over simplexes. You can then write things like this:

for (n in 1:N) {
  theta[n] ~ beta(alpha[n], beta[n]);
}

When you do that, the prior becomes this

p(\theta) = \text{I}_{\Delta^{N-1}}(\theta) \cdot \prod_{n=1}^N \text{beta}(\theta_n \mid \alpha_n, \beta_n)

The first term is an indicator that \theta is a simplex with N components (i.e., it’s N-1 dimensional as a manifold).

By soft penalisation I mean what I described in my earlier reply (which I think went unnoticed):

Basically it’s adding a negative penalty to the target log density depending on how much a specific parameter deviates from an upper/lower bound. I’ll repeat the code here:

real penalisation;
real lowerBound_log;
if (meta_trans_perPat_logits[1, 1, ipat] < lowerBound_log) {
  target += -penalisation * (meta_trans_perPat_logits[1, 1, ipat] - lowerBound_log)^2;
}

and my main concern/question:
"While this makes sense to me conceptually, I wouldn’t be sure of which values to use for the penalisation. In my implementation I have both meta_trans defined as a parameter, and meta_trans_perPat defined in the transformed parameters as a result of each patient’s covariate influence on meta_trans. So I suppose penalising based on meta_trans_perPat (which repeats for every patient) vs one time based on meta_trans after the patient cycle is done would imply different values.

Would definitely appreciate your thoughts on this and if it makes sense, then on what would be an effective range of values to use."

You’re right that I had missed that. Thanks for persisting.

You can do this, but in the Bayesian world, we call these “priors” rather than penalty functions (if you’re doing penalized maximum likelihood they’d be called “penalty functions”). And usually you want to make the priors proper densities so the whole result is a prior distribution over a simplex. Stan lets you just multiply unnormalized densities, so you can do what you’re trying to do above and there shouldn’t be a problem with propriety because the simplex values are bounded between 0 and 1.

You have to be careful about introducing points of non-differentiability, which can thwart HMC trying to cross those points. It’s better if everything is smooth.

So you might just want to think about finding something like a beta distribution that will do roughly the right thing more smoothly than having a hard cutoff.
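For instance, something along these lines (just a sketch; the shape values 8 and 4 are placeholders you would want to check with prior predictive simulations):

model {
  // smooth soft constraint: a beta with both shapes > 1 puts very little mass near 0 and 1,
  // so the self-transitions are pulled away from the boundaries without a hard cutoff
  for (m in 1:M)
    meta_base_trans[m, m] ~ beta(8, 4);
}

This just piles an extra unnormalized factor on top of whatever Dirichlet you already have on each row, which is fine in Stan for the reasons discussed above.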
