Bayesian finite mixture model in stan

Dear fellow stan users:

My model is a 2-component mixture of normal with a constraint such that the overall mean is modelled by a cyclic b-spline.

I am trying to model hour of day temporal effect where my y_it (i=1:13, t=1:24, in other words, there are 13 IID observations per hour of day) is simulated from

##simulate a single day total PNC
TT = rep(1:24)
T = length(TT)
N = 13

##first mixture component hourly temporal effect
sinT = 0.5*sin(2*pi*TT/24)
##second mixture component hourly temporal effect
cosT = 0.5*cos(2*pi*TT/24)+10
mu <-matrix(c(sinT, cosT),T,2);

sigma <- c(0.1);
lambda = runif(1,0,1) ## when simulating the data, mixture weights are not time-dependent but when modelling the 
##dynamic processes, we allow mixture weights to be time dependent to capture the correlation between observations. 

##simulate component indicator
z <-matrix(0, 13, T)
for(i in 1:T){
  z[,i] <- rbinom(N, 1, lambda) + 1;

y <- matrix(0, N, T)
for(i in 1:T){
  y[,i] <- mu[i,z[,i]] + rnorm(N,0, sigma)

I would like to estimate time-dependent mixing proportions theta (a vector of 24 elements), time-dependent location mu (a vector of 24 elements ) and a global sigma.

Specifically, I want to use a cyclic B-spline to capture the hourly temporal effect and I rather than modelling mixture component as 2 cyclic B-spline, I model the overall mean
that is theta[t] * mu[t,1] +(1-theta[t]) * mu[t,2] = f(x_t) where f(x_t) = B_T (cyclic bspline basis with 6 bases) * theta_cyclic (its corresponding spline coefficients to be estimated as well).

In addition, I impose a second-order penalty prior on theta_cyclic to ensure that the spline fit is not too wiggly.

With all the above specification, my model cannot run and it gives the following error message:
Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

I am assuming it has something to do with my highly constrained parameters or there are something I overlook?

Any suggestions would be greatly appreciate it.

Thanks, in advancedata.r (1.5 KB)
classical_gauss_mix.stan (3.8 KB)

bspline basis construction function.r (3.8 KB)

last file contains the functions for constructing cyclic bspline basis functions and 2nd order penalty prior for theta_cyclic parameters.

[edit: escaped code blocks]

This almost always arises for one of two reasons:

  1. missing constraint on a parameter—every legal value of the parameters matching the constraints should lead to a finite log density (i.e., non-zero density)

  2. the problem is so tightly constrained that random inits take you so far into the tail that there are numerical issues causing rejections.

You should be seeing error messages about which things failed, which should give you some hints. If you’re using a pre-2.16 Stan, you should upgrade because we fixed the warning message output during initialization.

You can also save yourself a whole bunch of cutting and pasting by just declaring

ordered[2] mu_order1; //ordered mean
ordered[2] mu_order2;
ordered[2] mu_order3;
ordered[2] mu_order4;
ordered[2] mu_order5;
ordered[2] mu_order6;
ordered[2] mu_order7;
ordered[2] mu_order8;
ordered[2] mu_order9;
ordered[2] mu_order10;
ordered[2] mu_order11;
ordered[2] mu_order12;
ordered[2] mu_order13;
ordered[2] mu_order14;
ordered[2] mu_order15;
ordered[2] mu_order16;
ordered[2] mu_order17;
ordered[2] mu_order18;
ordered[2] mu_order19;
ordered[2] mu_order20;
ordered[2] mu_order21;
ordered[2] mu_order22;
ordered[2] mu_order23;
ordered[2] mu_order24;


ordered[2] mu_order[24];

and then in transformed parameters, if you really need a 24x2 matrix for something, doing

matrxi[24, 2] mu;
for (n in 1:24) mu[n] = mu_ordered[n]';


 for(t in 1:T){
    mu[t,1] ~ normal(0,2);
    mu[t,2] ~normal(10,2);

can be just:

mu[, 1] ~ normal(0, 2);
mu[, 2] ~ normal(10, 2);

You should also check that tau_y * lambda_temp * K_T is always positive definite.

Hi Bob, thanks for your reply. I have fixed some of the errors in my model and updated the latest version and now I see the error message. The error message indicates that my mixing weights lambda_t is not between (0,1). However, given the error message, I am still unable to get my model running.

Essentially, my model has two parts.
one is the 2-component mixture per time period. At a given point in time, the overall mean of y_t_mix = lambda_t * mu1_t + (1-lambda_t) *mu2_t;

the second part is the cyclic b-spline where at a given point in time, the overall mean of y_t_spline = cyclic B-spline basis (constructed outside and input as data) %*% theta_cyclic (which is the coefficients to be estimated).

To link this 2 parts, I impose y_t_mix = y_t_spline and derive an equation on lambda_t.
lambda_t = (y_t_spline - mu2_t) / (mu1_t - mu2_t ) and the model likelihood is on the mixture model.

To figure out which part of my model goes wrong, I fit a model on bspline part only (where the likelihood function is on the spline part, hence response_t ~ normal( y_t_spline, sigma) but estimate lambda_t as above, importing simulated mu1_t and mu2_t and places no prior distribution or any constraints on lambda_t ( it should have a constraint between (0,1)). In this case, lambda_t can be estimated quite accurately as what I simulated. and the y_mix_spline also recovers my simulated data.

However, once I add on the (0,1) constraint and a customised prior on lambda_t , the model fails.
The customised prior is a penalised prior which penalises a large change in the mixing weights over time.

I am wondering are there any things that I overlook or simply my model is too restrictive to be modelled by stan?

Thanks in advance for any suggestions.


What guarantees that

lambda_t = (y_t_spline - mu2_t) / (mu1_t - mu2_t )`

is between zero and one? For a start, you are going to constrain

mu1_t - mu2_t > 0

y_t_spline > mu2_t

y_t_spline - mu2_t < mu1_t - mu_2_t

So you need to declare all these parameters with constraints that make that true.

The alternative is to make some predictor unconstrained then apply a logit link (that is, apply inv_logit(x) to some unconstrained value x, meaning you model the log odds directly rather than the probability).

Hi Bob:

thanks for your suggestion. I used the second option and now it runs.

Kindest regards,

Hi Bob and others stan users:

My model is currently running (at least not experience any numerical issues) however, still the outputs are not entirely as expected.

to recap, the model has two structures.
firstly, a 2-component mixture Gaussian model lambda_t * N(mu1, sigma1) + (1-lambda_t) * N(mu2, sigma2) where the mixing weights are time-dependent while the component locations and scales are the same for all time periods. My data y_(it) are simulated from this mixture model. Currently, y is a matrix of dimension 20*24, in other words, there are 20 IID observations per hour of the day.

The second part of my model is that I enforce the mean of the weighted mixtures of Gaussian to be modelled by a cyclic bspline model f(x_t) = B_T ( that is a cyclic bspline basis functions) * beta ( bspline coefficients). beta is a vector of 6 elements as I predetermine the number of basis in my cyclic bspline to be 6.
The enforced relationship is
B_T * beta = f(x_t) = lambda_t *mu1 + (1-lambda_t) *mu2.
However, I cannot let f(x_t) equal to B_T * beta and at the same time equal to lambda_t *mu1 + (1-lambda_t) *mu2 and thus I rewrite the above relationship in terms of a function of lambda_t

for(t in 1:24){
lambda_t = (f(x_t)-mu2) / (mu1-mu2) }

Also I enforce the variance of the weighted mixtures of Gaussian in the model to be
for(t in 1:24){
y_variance[t] = lambda[t] *(pow(sigma[1],2)+pow(mu[1],2)) + (1-lambda[t]) *(pow(sigma[2],2)+
pow(mu[2],2)) - pow((lambda[t]*mu[1]+(1-lambda[t])*mu[2]),2);

I assign uninformative priors on lambda_t. mu1, mu2, sigma1 and sigma2 and beta.
for the model likelihood, I use the log_mix

for(t in 1:24){
for(i in 1:N){
target += log_mix(theta[t], normal_lpdf(y[i,t] | mu[1], sigma[1]),
normal_lpdf(y[i,t] | mu[2], sigma[2]));



I would expect to see the posterior samples of f(x_t) to be exactly the same as samples constructed from lambda_t*mu1 + (1-lambda_t) *mu2.
However, this is not the case, while the posterior mean of the mixture model follows closely with the true simulated data mean trend, the bspline model tends to be more variable than the true simulated data mean trend.

What I really want is the equation of f(x_t) = lambda_t*mu1 + (1-lambda_t) *mu2 to hold while at the same time f(x_t) is constructed by B_T (input data ) * beta

I think my model makes logical sense but somehow I probably coded wrongly so my model is not producing what I want.

Any suggestions or feedback are greatly appreciated.

Thanks in advance,

functions { 
  //Define log probability density function 
  real my_penalised_beta_lpdf( vector theta, real phi, int period) {
    vector[period] loglikelihood;
    for(t in 1:period){
      loglikelihood[t] = (beta_lpdf(theta[t] | 1,1))
      - ((1/phi)*sum((theta[2:period]-theta[1:(period-1)]) .* (theta[2:period]-theta[1:(period-1)])))/period-
        ((1/phi)*sum(((1-theta[2:period])-(1-theta[1:(period-1)])) .* ((1-theta[2:period])-(1-theta[1:(period-1)]))))/period ;

data {
  int<lower=0> tt;//total number of time periods =96, repeat of 4 days 
  int<lower=0> N;//obs per day=20
  int<lower =0> num_basis; //currently 6 
  matrix[tt,num_basis+1] B_overall; //cyclic bspline for hourly temporal effect
  matrix[N,tt] y;
  matrix[num_basis,num_basis] K_T;
  //real phi;

  vector[num_basis] theta_cyclic;
  real<lower=0> lambda_temp;//penalty parameter for temporal effect
  real<lower=0> phi;//penalty parameter for temporal effect
  ordered[2] mu; //ordered mean
  vector<lower=0>[2] sigma;//component sigma

transformed parameters{
  vector[tt] theta;//matrix containing p_t for all N observations over time
  vector [tt] y_mean;
  real deltatheta_cyclic;
  vector[num_basis+1] theta_overall;
  vector[tt] y_variance;//also model on the overall variance of the bspline
deltatheta_cyclic = sum(B_overall[,2:(num_basis+1)] * theta_cyclic)/

theta_overall[2:(num_basis+1)] = theta_cyclic - deltatheta_cyclic;

theta_overall[1] = deltatheta_cyclic;

y_mean = (B_overall * theta_overall);

  for(t in 1:tt){
    theta[t] = inv_logit((y_mean[t]-mu[2])/(mu[1]-mu[2]));
  for(t in 1:tt){
    y_variance[t] = theta[t] *(pow(sigma[1],2)+pow(mu[1],2)) + (1-theta[t]) *(pow(sigma[2],2)+
    pow(mu[2],2)) - pow((theta[t]*mu[1]+(1-theta[t])*mu[2]),2);

  int     period;
  matrix[num_basis+1, num_basis+1] A0;
  period = tt;
  A0=diag_matrix(to_vector(rep_array(0, num_basis+1)));
  A0[1,1] = 0.1;
  A0[2:(num_basis+1),2:(num_basis+1)]= lambda_temp * K_T;
  lambda_temp~gamma(1,10);  //in STAN, the parameterisation is shape and rate and thus the prior rate = 10
  phi~gamma(1,10);  //in STAN, the parameterisation is shape and rate and thus the prior rate = 10
  //penalty prior on theta_overall
  target += multi_normal_prec_lpdf (theta_overall | to_vector(rep_array(0, (num_basis+1))),A0);
  //penalised prior for time-dependent weight
 target += my_penalised_beta_lpdf( theta|phi, period);
  //prior on mixture component location
  to_row_vector(mu) ~normal(0,10);
  //loglikelihood on the mixture model
    for(t in 1:tt){
       for(i in 1:N){
      target += log_mix(theta[t], normal_lpdf(y[i,t] | mu[1], sigma[1]), 
                       normal_lpdf(y[i,t] | mu[2], sigma[2]));
  } //checked total log likelihood is calculated as expected
 // for(t in 1:tt){
    //   for(i in 1:N){
      //y[i,t] ~ normal(y_mean[t], sqrt(y_variance[t]));
   // }
 // } //checked total log likelihood is calculated as expected

generated quantities{
    vector[tt] y_mix;
 for(t in 1:tt){
      y_mix[t] = theta[t] *mu[1]+(1-theta[t])*mu[2]; 

The above is my stan code

Here is my simulated data

##simulate an example where we have time-dependent cyclic mixing weights lambda 
##time-independent component location and scale parameters
TT = rep(c(1:24),1)
tt = length(TT)
N = 20

##first mixture component hourly temporal effect
lambda = 0.45*sin(2*pi*TT/24)+.5

res = matrix(0,N,tt)
z = matrix(0, N, tt)

for(i in 1:N)
  for(t in 1:tt){
    z[i,t] <- sample(1:2, size = 1, prob = c(lambda[t], 1-lambda[t]), replace = T)
    res[i,t] = mu[z[i,t]] + sigma[z[i,t]] * rnorm(1)

B_T<-bspline(TT,8,2,1,xl=min(TT),xr=max(TT)) ##basis function for f(hour_t) 
K_T<-make.Crw2(6)+diag(num_basis)*0.0001 ##2nd order penalty matrix for f(hour_t)

bspline basis construction function.r (3.8 KB)

with the attached function to create cyclic bspline basis function

[edit: escaped code]

I’m afraid these really complicated programs are difficult to debug. It doesn’t help that I’ve never understood spline models.

You may want to start with something simpler and try to build up to something this complex stepwise. It can be faster as when you introduce a wrong step, you’ll know where it came from.

I couldn’t quite tell from the description if you were testing posterior interval coverage. Just comparing the posterior means shouldn’t matter much.

Also, Michael Betancourt wrote a tutorial on mixture models and some of the problems with identifying them—it’s on our web site under users >> documentation >> case studies.

Hi Bob:

thanks for your reply. Without getting into the details of my model, I think what I want to achieve in this model is simply modelling y_it coming from a mixture of 2 Normal distributions with mixing weights lambda_t, component 1 parameters (mu1, sigma1) and component 2 parameters (mu2, sigma2). I have implemented this simple model in STAN and it works fine.

The only twist to the above model is that I would want the mean of the overall mixture distribution to equal to a function f(x_t) modelled by bsplines, but we can just think of as a function of x_t for now.

In other words, I want f(x_t) = lambda_t*mu1 + (1-lambda_t) *mu2 to hold while f(x_t) = B*beta. B is just a data design matrix and beta is to be estimated.

I would want to write in the transformed parameters block something like
f(x_t) = Bbeta= lambda_t*mu1 + (1-lambda_t) *mu2,
but it is not allowed and f(x_t) will be overwritten.
Therefore, I write lambda_t as a function of f(x_t) and mu1, mu2 such that the above relationship is satisfied.
Finally, I assign priors on lambda_t, mu1, mu2 and beta.

In my posterior samples, I would expect to see B beta= lambda_t*mu1 + (1-lambda_t) *mu2 should hold, however, it does not. that is what bugs me and I am trying to resolve at the moment.

Thanks again for your time,

It’s usually much simpler and better to stick to simple generative processes—without knowing how your data comes about, it’s hard to say more.

If you really want to go the constrained route, the easiest way to do this is to declare mu_1 as a parameter and then solve for mu_2 given the spline value. That gives you the one degree of freedom (relative to the spline) that you need. Then you can put a prior on mu_1, but if you want to put a prior on mu_2 you’ll need the Jacobian for the constraint solution. Instead, you should be putting priors on the parameters for the spline, not on mu_2, which is just a constrained solution, not a free parameter.

I think you can do what you want if you model y as a mixture of N(B * beta + nu1, sigma1) and (B * Beta + nu2, sigma2), where nu2 = - lambda_t / (1 - lambda_t).

mu1 = B*beta + nu1, mu2 = B*beta + nu2

Hi stijn:

Thanks for your suggestion, I will give it a try.

I have implemented what you have suggested however this does not seem to work. however, just to be assured that I did understand what you have suggested.
In this case I need f(x_t) = lambda_t*mu_1t+ (1-lambda_t) mu_2t to hold while f(x_t) = Bbeta.
mixing weights lambda_t. component locations mu_1t, mu_2t are time dependent, while beta is a vector of size (number of cyclic bspline basis function * 1) and B is a cyclic bspline basis function of size ( t * num_basis).

Following your suggestion, I declare in the parameter block

vector[num_basis] theta_cyclic;

vector<lower=0,upper=1> [tt] lambda;//matrix containing p_t for all N observations over time

vector<lower=0>[2] sigma;//component sigma

To ensure identifiability, I force mu_2t > mu_1t is valid for all time periods, which means in the model block I have

for(t in 1:tt){
for(i in 1:N){
target += log_mix(lambda[t], normal_lpdf(y[i,t] | (y_mean[t]-1), sigma[1]),
normal_lpdf(y[i,t] | y_mean[t]+nu2[t], sigma[2]));


} //checked total log likelihood is calculated as expected

where y_mean = B * beta and this is equivalent to your suggestion of modelling the mixture

lambda_t * N(B * beta -1 , sigma1) + (1-lambda_t) * (B * Beta + nu2, sigma2), where nu2 = lambda_t / (1 - lambda_t).

Also to model mu_1t and mu_2t, since one of them now will not be a free parameter. I think I can write it deterministically.

I have attempted to
declare mu_1t in the transformed parameter block as mu_1t = (B * beta) [t] -1 .
since I have assigned priors on beta then no further priors are assigned.
I do not know the appropriate place where I should declare mu_2t and its relationship to mu_1t.
right now, it has been declared in the generated quantities block, I declare mu_2t = mu_1t + 1+ nu2[t].

However, my model does not recover the “true” values of mu_1t and mu_2t I simulated, the model however, roughly recover the mean trend modelled by y_mean = B * beta.

I have read Stan website case study regarding the fitting a mixture model. If I remove the bspline part fits the data only with the mixture model then with the ordered exchangeable prior on mu_1t and mu_2t then the model can correctly estimate and recover the simulated values. However, it does not seem to work in the case where I impose a constraint that the overall mean of the mixture model should equal to a smooth function modelled by the bspline. I am still a bit confused and any ideas are greatly appreciated.

What I meant was the following but there was a typo in my post.

  vector[tt] nu1;
  vector<lower=0,upper=1> [tt] lambda;
transformed parameter{
  vector[tt]<lower = 0> nu2;
  nu2 = - nu1 * lambda/(1 - lambda);
  for(t in 1:tt){
    for(i in 1:N){
      target += log_mix(lambda[t], normal_lpdf(y[i,t] | y_mean[t] + nu1[t]), sigma[1]),
                                   normal_lpdf(y[i,t] | y_mean[t] + nu2[t], sigma[2]));

The idea is that with the restriction on nu2, the expected value of lambda * nu1 + (1 - lambda) * nu2 = 0. So the expected value for y[i,t] is y_mean[t].