Fitting a multinomial distribution model where the probability is not a vaild simplex


#1

My response variable is counts [B, T] where at each time period, there are B categories whose
probability being in bin b at time t is theta[b,t]
theta[b,t] is modelled by the following function

 for(b in 1:B){
      for(t in 1:tt){
        theta[b,t] = lambda[t] * ( normal_cdf(int_bounds[(b+1),t], y_mean[t,1], sigma[1])- normal_cdf(int_bounds[b,t],  y_mean[t,1], sigma[1]) ) +
                                                (1-lambda[t]) * (normal_cdf(int_bounds[(b+1),t],  y_mean[t,2], sigma[2])- normal_cdf(int_bounds[b,t],  y_mean[t,2], sigma[2])); 
      }
    
    } //raw p_bt and we need to standardise it
    
    for(b in 1:B){
      for(t in 1:tt){
        theta_std[t,b] = theta[b,t] /sum(theta[,t]) ; 
      }
    
    }

int_bounds are data, y_mean, sigma lambda are parameters to be estimated where lambda is a probability between (0,1).

theta[b,t] is unnormalised and thus I normalised and called it theta_std [b,t] before using it in the multinomial likelihood function

However, I keep getting the error message that theta_std is not a valid simplex. I debugged this problem by printing out theta and theta_std. I found that sometimes theta is completely fine but there are times they are estimated as all zeros and the resulting normalised theta_std to be na, causing divergent transitions in my model. I am a bit confused about why it can be estimated fine sometimes while other times it cannot and perhaps some ways that I can use to fix my model to solve this problem?

output from the print

Iteration: 1 / 1 [100%]  (Sampling)
theta=[[0.0128309,0.0246669,0.0105406,0.00857686,0.00618963,0.00339627,0.00272815,0.0162677,0.00302745,0.0166727,0.0176373,0.030756,0.0394797,0.0273606,0.0333928,0.0346152,0.0385122,0.0296332,0.033669,0.0289281,0.0433364,0.0427811,0.0327213,0.0314117],[0.00964845,0.0203007,0.00773848,0.00690544,0.00546539,0.00301671,0.00216195,0.0120791,0.0024735,0.0142189,0.0149331,0.0253678,0.0329118,0.0195693,0.0271032,0.0250392,0.0313865,0.0226862,0.0261797,0.0193635,0.0322829,0.036499,0.025693,0.0244181],[0.00715083,0.0164166,0.00560917,0.0055164,0.00480449,0.00267197,0.00170794,0.00876152,0.00201405,0.0120003,0.0125237,0.020574,0.0267834,0.0135102,0.0215668,0.0173852,0.0249821,0.0169334,0.0198747,0.0124775,0.0233958,0.0304668,0.0197771,0.0185765],[0.00523892,0.0130455,0.00402374,0.00437413,0.00420473,0.00235985,0.00134597,0.00620548,0.0016354,0.0100229,0.0104067,0.0164335,0.0212844,0.00904429,0.0168537,0.011637,0.0194517,0.0123669,0.0147845,0.00781909,0.0165932,0.0248802,0.0149295,0.0138428],[0.0... <truncated>
theta_std=[[0.331763,0.249476,0.184896,0.135461,0.0984047],[0.291515,0.239915,0.194012,0.154173,0.120384],[0.342518,0.251462,0.18227,0.130752,0.0929975],[0.297639,0.239637,0.191433,0.151794,0.119497],[0.254428,0.224657,0.197491,0.172838,0.150587],[0.251148,0.223081,0.197588,0.174507,0.153676],[0.303038,0.240146,0.189715,0.149508,0.117593],[0.341739,0.253748,0.184055,0.13036,0.0900979],[0.289004,0.236123,0.192264,0.156118,0.12649],[0.272434,0.232339,0.196086,0.163775,0.135366],[0.275279,0.233073,0.195466,0.162425,0.133757],[0.289943,0.239147,0.193955,0.154922,0.122033],[0.288218,0.24027,0.195529,0.155385,0.120599],[0.362929,0.259579,0.179208,0.119969,0.0783146],[0.298503,0.242279,0.192788,0.150657,0.115772],[0.359723,0.260208,0.180668,0.120932,0.0784688],[0.298161,0.242993,0.193411,0.150594,0.114841],[0.327476,0.250706,0.187131,0.136667,0.09802],[0.319683,0.248573,0.188707,0.140376,0.10266],[0.394096,0.263795,0.169985,0.106522,0.0656021],[0.340794,0.25387,0.183983,0.130488,0.0908652],[0... <truncated>

theta=[[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
theta_std=[[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan],[nan,nan,nan,nan,nan]]

In file included from C:/Program Files/R/R-3.3.1/library/BH/include/boost/config.hpp:39:0,
                 from C:/Program Files/R/R-3.3.1/library/BH/include/boost/math/tools/config.hpp:13,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/stan/math/rev/core/var.hpp:7,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/stan/math/rev/core/gevv_vvv_vari.hpp:5,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/stan/math/rev/core.hpp:12,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/stan/math/rev/mat.hpp:4,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/stan/math.hpp:4,
                 from C:/Program Files/R/R-3.3.1/library/StanHeaders/include/src/stan/model/model_header.hpp:4,
                 from file211039751bd1.cpp:8:
C:/Program Files/R/R-3.3.1/library/BH/include/boost/config/compiler/gcc.hpp:186:0: warning: "BOOST_NO_CXX11_RVALUE_REFERENCES" redefined
 #  define BOOST_NO_CXX11_RVALUE_REFERENCES
 ^
<command-line>:0:0: note: this is the location of the previous definition

SAMPLING FOR MODEL 'test_classical_gauss_mix_bspline_hist' NOW (CHAIN 1).
Rejecting initial value:
  Error evaluating the log probability at the initial value.
validate transformed params: theta_std[k0__] is not a valid simplex. sum(theta_std[k0__]) = nan, but should be 1

Any suggestions are greatly appreciated,

Thanks,

Jaslene

[edit: escaped code]


#2

softmax is the usual way to normalize as it then treats the values as being on the log odds scale. You can run into problems if all your input values are zero as you get a zero output value after dividing by zero. softmax solves that problem.


#3

Thanks, Bob, I will have a look at softmax.


#4

Another related problem, I slightly change my model
where the model has two components shared by all time periods
component 1 (mu1, sigma1) component 2 (mu2, sigma3)
whereas there is a mixing weight lambda_t that is time-dependent and changes over time.
Instead of assigning an independent prior on lambda_t ~ Dirichlet, I use the prior from the correlated topic model
where eta_t is the natural parameterisation of lambda_t and assigned a multinormal prior where the covariance matrix can model the dependency between mixing weights within a time period and then use softmax to backtransform to lambda_t.
Overall, I want the overall mixture mean lambda_t*mu1 + (1-lambda_t)mu2 = B * beta_spline (where B is a bspline basis matrix with 6 basis functions and beta_spline is a 61 basis coefficients)

In my simulated data mu2>mu1 and to ensure locations are identifiable, I use ordered[2] mu in the parameter block

data {
  int<lower=0> tt;//24 hours=1-day 
  
  int<lower=0> N;//obs per day=100
  
  int<lower =0> num_basis; //currently 6 
  
  matrix[tt,num_basis] B_T; //cyclic bspline for hourly temporal effect
  
  matrix[num_basis,num_basis] K_T; //cyclic bspline for hourly temporal effect
  
  matrix[N,tt] y;
  
  
  
}

parameters{
  
  vector[num_basis] theta_overall;//bspline coefficients for overall mixture mean
  
  vector<lower=0>[2] sigma;//scale of all 2 mixture components shared by all hourly blocks===Time invariant
  
  vector[2] eta[tt];//natural parameterisation of the time-dependent mixing weights dim tt*2
  
  vector[2] mu_eta;//hyperparameter mean for eta;
  
  corr_matrix[2] Omega;//correlation matrix for eta;
  
  vector<lower=0>[2] sigma_eta;//scales for eta;
  
 
}

transformed parameters{
  
  vector[tt] y_mean;//overall smooth time trend
  
  matrix[tt,2] lambda;//dim tt*B correlated time-dependent mixing weights
  
  ordered[2] mu; //location of first 2 mixture components shared by all hourly blocks===Time invariant

  
  cov_matrix[2] Sigma_eta; //covariance matrix for eta
  
  y_mean = B_T * theta_overall;
  
  for(t in 1:tt){
    lambda[t] = to_row_vector(softmax(eta[t]));
  }
  
  **mu = (inverse(lambda' * lambda) *(lambda'*(y_mean)));**
  
  for (m in 1:2){
Sigma_eta[m, m] = sigma_eta[m] * sigma_eta[m] * Omega[m, m];}

for (m in 1:(2-1)) {
for (n in (m+1):2) {
Sigma_eta[m, n] = sigma_eta[m] * sigma_eta[n] * Omega[m, n];
Sigma_eta[n, m] = Sigma_eta[m, n];
}
}  
   print("mu=",mu);
    
  }
  
  
model{
  int period;
  
  matrix[num_basis, num_basis] A0;
  
  vector[tt] lambda_1;//first component mixing weight
  
  A0=diag_matrix(to_vector(rep_array(0, num_basis)));
  
  A0[1:(num_basis),1:(num_basis)]= 0.1 * K_T;
  
  sigma ~ cauchy(0,2);
  
  mu_eta ~ normal(0,5);
  
  Omega ~lkj_corr(2.0);
  
  sigma_eta ~ cauchy(0,5);
  
  
  for(t in 1:tt){
    lambda_1[t] = lambda[t,1];
  }
  
  //mu[1]~normal(0,10);
  
  //mu[2]~normal(5,10);
  
  //penalty prior on theta_overall
  target += multi_normal_prec_lpdf (theta_overall | to_vector(rep_array(0, (num_basis))),A0);
  
  //prior for natural parameterisation on eta
  for(t in 1:tt){
    eta[t]~multi_normal(mu_eta, Sigma_eta);
  }
  
  //likelihood function for y_it
  
  for(t in 1:tt){
    for(i in 1:N){
      target += log_mix(lambda_1[t], normal_lpdf(y[i,t] | mu[1], sigma[1]), 
                        normal_lpdf(y[i,t] | mu[2], sigma[2]));
      
    }
  } //checked total log likelihood is calculated as expected
  
}

generated quantities{
  vector[tt] y_mix_overall;
  
  for(t in 1:tt){
    y_mix_overall[t] = lambda[t,1] *mu[1] + lambda[t,2] *mu[2];
    
  }
}

To enforce the relationship that the overall mixture mean is a smooth function modelled by a bspline, I derive
mixture components to be the least square solution highlighted above.

When running the programme, it gives me error that mu is not a valid ordered vector, so I debug by printing out mu values.
I see that in the first half of the runs, mu1<mu2 and they are estimated very close to the ‘true’ values (5,10)
However, they switch the position in the remaining iterations and thus result in: There were 502 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help.
I guess these divergent transitions are due to label switching in mu?

I also place ordered prior on mu[1] ~ normal(5,10) and mu[2] ~ normal(10,10) but it does not seem to solve the unidentifiable problem–which is the common remedy suggested in the case study.

I am wondering are there any other ways that I prevent mu from label switching?

Any ideas are greatly appreciated,
thanks in advance.

Jaslene

[edit: escaped model code]


#5

Using the code in the case study, I also plot the posterior chains for mu estimates.
Rplot.pdf (92.5 KB)

In this case, 4 chains are run. ‘True’ mu = c(5, 10).
Actually only chain 2 does not perform well while chains 1 3 and 4 all find the correct estimates of mu.

Any one has any ideas why this situation might occur? thanks in advance


#6

I have added a constraint which now my model can estimate mu1 and mu2 correctly. However, there are still error messages
There were 4463 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
However, when I plot the posterior plots for all 4 chains, my graphs show just a single mode
Rplot.pdf (399.8 KB)

In this case, for mu1 and mu2 they are estimated at the right spot (5, 10) in all 4 chains represented by different colours.
In this case, should I still be worried about the divergent transition error messages? I will increase adapt_delta>0.9 but I suspect it would anything to reduce the number of divergent transitions.

Or perhaps the divergent transitions come from other parameters that I will need to check by plotting the pairs plot.

Thanks in advance,
Jaslene


#7

Yes, the divergences can bias estimates. What they indicate is that Stan couldn’t follow the proper Hamiltonian trajectory numerically in its sampler.

This can happen if there are missing constraints on parameters, or if there’s a funnel-like posterior. You may also be running into problems with identifiability with softmax. Similarly, heavy tailed priors like cauchy(0, 5) can lead to problems when there’s not enough data to concentrate the posterior near reasonably small values.

A0=diag_matrix(to_vector(rep_array(0, num_basis)));

A0[1:(num_basis),1:(num_basis)]= 0.1 * K_T;

That first statement isn’t doing anything given that you reset it. And what you should be doing is just

matrix[num_basis, num_basis] A0 = 0.1 * K_T;

But then you can see that A0 doesn’t depend on any parameters, so you could put it in transformed data block, which is much more efficient. You should also define that vector of zeroes for multi_normal_prec.

transformed data {
  vector[num_basis] zeros = rep_vector(0, num_basis);
}
...
model {
  ...
  theta_overall ~ multi_normal_prec_lpdf(zeros, A0); 

Defining A0 as data and using the sampling statement will let it know it doesn’t need to calculate a determinant for A0 because it’s a constant.


#8

Hi Bob:

thanks again for your reply. I have incorporated what you have suggested into my model.

A perhaps more general question regarding the divergence.

While running, I will receive error message such as:

SAMPLING FOR MODEL ‘test_classical_gauss_mix_v8’ NOW (CHAIN 3).
Rejecting initial value:
Error evaluating the log probability at the initial value.
validate transformed params: mu is not a valid ordered vector. The element at 2 is -0.0591488, but should be greater than the previous element, 0.369184
Rejecting initial value:
Error evaluating the log probability at the initial value.
validate transformed params: y_mean[3] is -0.973971, but must be greater than or equal to -0.960758

I place some constraints on mu and y_mean in transformed parameter block.

However, when the programme terminates, I also receive:
There were 1150 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help

My understanding is that these 1150 divergent transitions actually refer to the failed fulfilment of parameters constraints and thus to get rid of these divergent transitions I will need find ways to ensure that my constraints met right?

Since the number of divergence is quite large and thus I should not trust my posterior samples for inferences right?

Thanks in advance,
Jaslene


#9

Divergences come from many sources. Yes, you can have failed parameter constraints, but you can also have areas of high curvature where the Hamiltonian solver we use just can’t follow along using only gradient information.

I definitely wouldn’t trust the posterior sample with a thousand divergent transitions.

It looks like you may be missing an ordering constraint somewhere.


#10

Thanks for your reply. It makes sense to me. I think the divergence transitions are mainly due to my parameter constraints and due to the nature of non-identifiability of my mixture model.
I have changed my model construction several times but now essentially, I have
two Gaussian mixture components that are shared in all time blocks. (mu1, sigma1) & (mu2, sigma2)
with time-dependent mixing weights lambda_t and thus
for ith observation in time period t, y_it ~ lambda_t * N( mu1, sigma1) + (1-lambda_t) * N(mu2, sigma2).
where following correlated topic model, I use natural parameterisation of lambda_t as theta_t and assign a MVN prior on theta_t and then use softmax to back-transform to simplex(2).
The twist here is that I need to have the overall mixture mean lambda_t * mu1 + (1-lambda_t) * mu2 to be a smooth function over time and thus this smooth function is modelled by a bspline with time as the basis function and coefficients.
the function is Bbeta ( where B is just a basis function and is known, beta is the coefficients to be estimated).
lambda_t * mu1 + (1-lambda_t) * mu2 = B
beta

Hence I solve for mu = inverse(lambda_t’ * lambda_t ) * lambda_t’ * B*beta

Firstly, I use centred-parameterisation where in the transformed parameter block, mu is defined as above, however to ensure mu are identifiable, I impose ordered[2] mu, however, this is where keep getting the error messages that mu[2] should be greater than mu[1] etc.

Further I tried non-centred parameterisation of mu, essentially I want to assign a MVN (u_raw, Sigma_u) prior on mu where mu defined as above.

Hence I first define [ordered] mu_raw = inverse(lambda_t’ * lambda_t ) * lambda_t’ * Bbeta
then the scale tau, L the cholesky factor of the correlation matrix and alpha.
following the manual, mu = mu_raw + tau.
(L*alpha)
with uninformative prior tau~ Cauchy (0,2)
L ~lkj_cholesky_corr(2) and alpha ~ normal(0,1)
However, NCP does not solve the problem of getting divergence message in the end.
In this case, I am a bit lost and do not know how to proceed next.

Naively, I fit a model that loses all the constraints on mu defining vector[2] mu and only specify that mu = inverse(lambda_t’ * lambda_t ) * lambda_t’ * B*beta in the transformed parameter block but with quite strong non-exchangeable
prior in the model block
mu[1] ~ normal(5,1)
mu[2] ~normal(10,1) since I know the ‘true’ mu = c( 5,10) , in this case, the model only shows less than 10 divergent transitions after warmup. However, it would become problematic when I fit the real data on hand since I have little knowledge regarding the values of the mixture component.

I have attached CP and NCP parameterisation of the model , any suggestions are greatly appreciated.

mix_v8_cp.stan (3.2 KB)

mix_v8_ncp.stan (3.4 KB)


#11

inverse() is very unstable. Rather than inverse(a) * b, you might try a \ b, which computes the same thing more stably.

You put ordered[2] mu in the transformd parameters. That only does validation. If it fails to validate, you get divergences (can’t get more divergent than a rejection).


#12

thanks Bob, I will incorporate what you have suggested into the model construction.