Chains not Mixing, Help! Mixed Model Ranking with Covariates & Heterogeneous Rankers

Hey Stan Users:

I’ve gotten in over my head with a mixed model that clusters respondents based on their rankings. I can’t get my alpha intercepts to mix, and the chains won’t converge; the runs come out bimodal. I bet I’m doing something stupid, and I’m hoping you could kindly point it out.

A couple quick facts:

  • I’m standardizing my covariates before running RStan.
  • I’m using some simple synthetic data, which I’ll include. (This data might be the cause of my problems?)
  • This work is based on a paper I’ll link below. Basically it describes a model for clustering rankers based on the covariates of the entities being ranked. I naively believe I can use the same model but adjust it to use covariates of rankers.
  • I am super new at this (RStan & Bayesian modeling)

Here’s the link to the paper: Bayesian Aggregation of Rank Data with Covariates & Heterogeneous Rankers

My Data:

  • ranker_covariates.csv (140 Bytes)
  • ranks.csv (472 Bytes)
  • Note: Basically it’s 20 rankers, with rankers 1-10 ranking items in exactly the opposite order from rankers 11-20, to test clustering. I’m using a single covariate, with the value 1 assigned to rankers 1-10 and the value 2 to rankers 11-20.

Here are the priors, and I’ve only been experimenting with alpha and beta:

  • scale_sd_prior = 0.1, # keeps the lognormal scale concentrated around 1
  • gamma = 1, # seemed appropriate given the paper below
  • sigma_alpha = 1, # the paper below suggests keeping alpha small to let beta do more of the work
  • sigma_beta = 100 # the paper below suggests a diffuse prior
  • M = 5 # my truncated Dirichlet process uses only a few clusters for testing right now

Thank you for your help!

-Mike


data {
  int<lower=1> N; // number of ranking lists (data points)
  int<lower=1> K; // number of items being ranked
  int<lower=1> C; // number of covariates
  int<lower=1> J; // number of rankers
  int<lower=1> M; // max number of clusters for the truncated Dirichlet process

  int<lower=1, upper=K> y[N, K]; // y[i] is the array of rankings for the ith list
  matrix[N, C] X;                // covariates, one row per ranking
  int<lower=1> rater[N];         // index of the ranker who produced ranking i

  real<lower=0> scale_sd_prior;  // scale of the lognormal prior on ranker-specific error
  real<lower=0> gamma;           // DP concentration parameter
  real<lower=0> sigma_alpha;     // prior sd for cluster intercepts (DP base measure)
  real<lower=0> sigma_beta;      // prior sd for cluster betas (DP base measure)
}

transformed data {
  int y_argsort[N, K];
  for (i in 1:N) {
    y_argsort[i] = sort_indices_asc(y[i]); // indices that sort ranking i ascending
  }
}

parameters {
  ordered[K] z_hat[N];                // estimated underlying scores of the items for each ranking
  matrix[C, K - 1] beta_cluster[M];   // per-cluster covariate effects; the first item's effect is fixed at zero
  matrix[K, M] alpha_cluster;         // per-cluster intercepts
  vector<lower=0>[J] scale;           // ranker-specific error scale
  vector<lower=0, upper=1>[M - 1] v;  // DP stick-breaking proportions
}

transformed parameters {
  matrix[N, K] mu_part_cluster[M];
  vector[K] mu_part_vector[M, N];
  vector[M] log_pi_raw;
  simplex[M] pi;

  // Build each cluster's mean matrix, prepending a column of zeros
  // (the first item is fixed at zero for identifiability)
  for (m in 1:M) {
    mu_part_cluster[m] = append_col(rep_vector(0, N), X * beta_cluster[m]);
    for (i in 1:N) {
      // permute row i into the order given by y_argsort
      mu_part_vector[m, i] = to_vector(mu_part_cluster[m][i, y_argsort[i]]);
    }
  }

  // Stick-breaking construction for the truncated Dirichlet process
  log_pi_raw[1] = log(v[1]);
  for (m in 2:(M - 1)) {
    log_pi_raw[m] = log(v[m]) + sum(log(1 - v[1:(m - 1)]));
  }
  log_pi_raw[M] = sum(log(1 - v));
  pi = softmax(log_pi_raw);
}

model {
  // prior on ranker-specific error
  scale ~ lognormal(0, scale_sd_prior);

  // prior on stick-breaking proportions (vectorized over m)
  v ~ beta(1, gamma);

  // priors on cluster betas
  for (m in 1:M) {
    to_vector(beta_cluster[m]) ~ normal(0, sigma_beta);
  }

  // prior on cluster intercepts
  to_vector(alpha_cluster) ~ normal(0, sigma_alpha);

  // likelihood, marginalizing out cluster membership per ranking
  for (i in 1:N) {
    vector[M] log_lik_cluster;
    for (m in 1:M) {
      log_lik_cluster[m] = normal_lpdf(z_hat[i] | alpha_cluster[, m] + mu_part_vector[m, i] / scale[rater[i]], 1);
    }
    target += log_sum_exp(log_lik_cluster + log(pi));
  }
}

My biggest piece of advice here is to build up one step at a time from a simpler model. What’s the closest model to this that you do have working?

I’d also urge you to keep your models well indented so they’re easier to read (and eliminate tabs if those are the source of the problem).

Is there a reason you’re not just implementing a standard Dirichlet mixture here? I don’t think you need your own stick-breaking process implementation.
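For reference, a minimal sketch of that alternative, assuming the rest of your model stays as posted (gamma / M here is one common finite-dimensional stand-in for a DP with concentration gamma):

parameters {
  simplex[M] pi;  // mixture weights drawn directly, no stick-breaking needed
  // ... other parameters as before ...
}
model {
  // symmetric Dirichlet prior approximating the truncated DP
  pi ~ dirichlet(rep_vector(gamma / M, M));
  // ... rest of the model unchanged, still marginalizing with log(pi) ...
}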

The way you compute the stick-breaking sums with log(1 - v) is numerically unstable. If there’s any chance v can be near zero, you want to use log1m(v) in Stan (and log1p(-v) in other software) for arithmetic stability, because log loses precision when its argument is near 1.
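Concretely, the stick-breaking lines in your transformed parameters block would become something like:

log_pi_raw[1] = log(v[1]);
for (m in 2:(M - 1)) {
  // log1m(x) computes log(1 - x) accurately even for x near zero
  log_pi_raw[m] = log(v[m]) + sum(log1m(v[1:(m - 1)]));
}
log_pi_raw[M] = sum(log1m(v));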

I’d recommend doing that covariate standardization in the transformed data block of the Stan program, to put less burden on users of your model and to make sure you control the transform in the right place.
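Something along these lines in the program itself (a sketch; X_std is a hypothetical name you’d then use in place of X downstream):

transformed data {
  matrix[N, C] X_std;  // standardized copy of the covariates
  for (c in 1:C) {
    X_std[, c] = (X[, c] - mean(X[, c])) / sd(X[, c]);
  }
}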

The synthetic data shouldn’t cause problems if you set reasonable priors and generate the data from those priors. If you just make data up, it may be inconsistent with the model, and Stan’s very bad at fitting misspecified models.

Thanks for linking, but this is the kind of detailed question that gets harder and harder to answer if we need to refer to external papers.

As I discuss in the User’s Guide, these open-ended mixture models are almost never identified. So if you want to use any of our posterior analysis tools, you should only run a single chain from a good initialization point (though none of those tools can really be trusted for a multimodal model, since you can’t fit one properly with sampling).

It’s best to allocate the vector of zeros once in transformed data and reuse it. The way you have it, it’s reallocated on every loop iteration of every log density evaluation.
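For example (zeros_N is just an illustrative name):

transformed data {
  vector[N] zeros_N = rep_vector(0, N);  // allocated once, when the data are loaded
}

// then in transformed parameters:
//   mu_part_cluster[m] = append_col(zeros_N, X * beta_cluster[m]);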

Too bad we don’t have tensor arithmetic!
