Hierarchical probit model with difficult structure

I am having trouble recovering known parameters from simulated data using Stan. I am translating the model into Stan from a Gibbs sampler hand-rolled in R. The Gibbs sampler works, but it is slow. I am using the same weakly informative priors in Stan as in the Gibbs sampler.

A wikisurvey is a kind of survey where questions take the form of pairwise comparisons between two items. I use K to represent the number of items. There are K*(K-1) possible comparisons, but
respondents only provide a limited set of responses. I use J to refer to the number of respondents.

The object of inference in this model is the K x J ‘opinion matrix’ Theta. Each entry Theta[k, j] represents respondent j’s opinion of item k. Phi(Theta[1, j] - Theta[2, j]) is the probability that respondent j prefers item 1 to item 2.

A second parameter, mu[K], allows for partial pooling of information about the items and estimates the overall ranking of the items. The model is hierarchical in that theta[k, j] ~ normal(mu[k], sigma).
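
Written out compactly, the generative model I am assuming (with sigma fixed for now, and mu_0 and tau_0 standing for the hyperparameters passed in as data) is

$$\mu_k \sim \mathrm{Normal}(\mu_0, \tau_0), \qquad \theta_{k,j} \sim \mathrm{Normal}(\mu_k, \sigma), \qquad \Pr(y_v = 1) = \Phi(\theta_{ll_v, j_v} - \theta_{rr_v, j_v}),$$

where y_v = 1 means respondent j_v chose the left item ll_v over the right item rr_v.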

I am aware of two reasons the model is difficult to fit:

  1. Most values of Theta are uninformed. Respondents do not generally see every item.
  2. The mapping between Theta and Y seems difficult to sample from.

Below is my Stan code. An R file with code to run the model on simulated data is attached: simulate_wikisurvey_unvectorized.R (2.2 KB)
Running the model gives me maximum treedepth warnings. I tried raising max_treedepth to 20, but the warnings persisted. I guess this means the problem is with the model itself.

I have also tried a couple of variations of the model, including treating sigma as a parameter, sampling theta once for each response, and a complicated (possibly buggy) vectorization scheme that eliminated the loops in the model. These attempts were fruitless. The following code is the version that most clearly aligns with the data-generating process.

Thank you for your help!

data{
    // number of respondents
    int<lower=1> J;

    // number of responses
    int<lower=1> V;

    // number of items
    int<lower=1> K;

    // the next four vectors describe each response
    // the item shown on the left
    int<lower=1, upper=K> ll[V];

    // the item shown on the right 
    int<lower=1, upper=K> rr[V];

    // respondent
    int<lower=1, upper=J> jj[V]; 

    // y = 1 if the left item is chosen
    int<lower=0, upper=1> y_vec[V];

    // hyper prior for variance of mu. set to 4
    real<lower=0> p_tau_2_0;

    // hyper prior for mean of mu. set to 0
    real p_mu_0;

    // for now I treat sigma as fixed rather than estimating it. set to 1
    real<lower=0> sigma;
 }

transformed data {
    vector[K] mu_0;
    vector[K] tau_2_0;

    for (k in 1:K){
       mu_0[k] = p_mu_0; 
       tau_2_0[k] = p_tau_2_0; 
    }      
  
   // pin an arbitrary mu to 0
   tau_2_0[1] = 0.000001;
}

parameters{
  matrix[K,J] theta;
  vector[K] mu;
}


model{
  mu ~ normal(mu_0, tau_2_0);
  // sample theta
  for(k in 1:K){
    theta[k] ~ normal(mu[k], sigma);
  }

  for (v in 1:V){
    y_vec[v] ~ bernoulli(Phi_approx(theta[ll[v],jj[v]] - theta[rr[v],jj[v]]));
  }
}

That (shrinking tau_2_0[1] to pin mu[1] near zero) is not a good way to fix one of the mu values. A better way is to declare

parameters {
  vector[J] theta_raw[K];
  vector[K - 1] mu_short;  // one fewer than K, since mu[K] will be fixed to zero
}
model {
  vector[K] mu;
  vector[V] p;
  mu[1:(K-1)] = mu_short;
  mu[K] = 0;
  ...
}

Then, you are going to need to Matt trick the thetas with

transformed parameters {
  vector[J] theta[K];
  for (k in 1:K) theta[k] = mu + sigma * theta_raw[k];
}

and that is particularly true when you get around to estimating sigma.

And then, your model will go faster if you create the probabilities as a vector and do the likelihood in one shot with

for (v in 1:V) p[v] = Phi_approx(theta[ll[v],jj[v]] - theta[rr[v],jj[v]]);
y_vec ~ bernoulli(p);

But I have a hard time seeing how it is reasonable to treat y_vec as conditionally independent. You are probably going to need something like multivariate probit, which is difficult. See
https://groups.google.com/d/msg/stan-users/GuWUJogum1o/LvxjlUBnBwAJ

Thank you for the helpful and rapid response!

I was wondering if this was better than fixed assignment. But I guess it is better not to treat a known (fixed) quantity as a parameter.

Oh fun! Now I get to go learn about the Matt trick. It seems like this was part of the problem. By the way, I think you got the J and K switched around. Also, shouldn’t the definition of mu be in the transformed parameters block?

Do you mind elaborating here? Are you suggesting a respondent-level rather than response-level model?

After taking your suggestions, it seems like something else is missing. I fit a model with max_treedepth = 10 that had divergent transitions (a new problem!) and still hit the maximum treedepth. I am currently trying with a greater treedepth, more warmup iterations, and denser data. Wish me luck!

Yes, mu should be in the transformed parameters block, or theta should be defined in the model block. But row_vector[J] theta[K] is analogous to matrix[K,J] theta, although it is easier to loop over from 1 to K.
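
To make the analogy concrete, here is a minimal toy program (hypothetical, just for illustration) showing that the two declarations are indexed the same way:

data {
  int<lower=1> J;
  int<lower=1> K;
}
parameters {
  // array of K row_vectors of length J
  row_vector[J] theta_a[K];
  // K x J matrix
  matrix[K, J] theta_m;
}
model {
  for (k in 1:K) {
    // theta_a[k] is the k-th row_vector in the array
    theta_a[k] ~ normal(0, 1);
    // theta_m[k] extracts row k of the matrix as a row_vector,
    // so both forms support theta_x[k] and theta_x[k, j]
    theta_m[k] ~ normal(0, 1);
  }
}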

I guess. All of the responses by the same respondent are presumably not conditionally independent.

Got it. It’s a row_vector.
Thanks for your help. I’ll update if things look like they are working.

Wait, so the Matt trick should be

for (k in 1:K) theta[k] = mu[k] + sigma * theta_raw[k];  

?

Still trying to wrap my head around what this should do.

That would correspond to what you had originally, although it implies all elements of theta[k] have the same expectation.

That is the assumption for now. My goal is to eventually relax that assumption and also pool information by respondents.

It seems like we got a working model specification!
Thanks for all the help. This is my first attempt at specifying my own Stan model and your help is invaluable.

Let me see if I can explain how the ‘Matt trick’ applies to this case.

The ‘Matt trick’ is to reparameterize variables that are on different scales so that they are all on the same scale.
In my case, each theta[k] is on a different scale given by mu[k].

for (k in 1:K) theta[k] = mu[k] + sigma * theta_raw[k];  

This sets up theta_raw to be standard normal. So we can do:

model{
  vector[V] p; 

  mu_short ~ normal(mu_0, tau_2_0);

  for(k in 1:K) theta_raw[k] ~ normal(0, 1);

  for (v in 1:V){
    p[v] = Phi_approx(theta[ll[v]][jj[v]] - theta[rr[v]][jj[v]]);
  }

  y_vec ~ bernoulli(p);

}
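
Putting the pieces together, the full specification I have now looks roughly like the program below. This is just a sketch of the assembled blocks (sigma is still passed in as fixed data, and mu[K] is pinned to zero via mu_short); I have not re-verified it against the attached simulation code.

data {
  // number of respondents
  int<lower=1> J;
  // number of responses
  int<lower=1> V;
  // number of items
  int<lower=1> K;
  // the item shown on the left
  int<lower=1, upper=K> ll[V];
  // the item shown on the right
  int<lower=1, upper=K> rr[V];
  // respondent
  int<lower=1, upper=J> jj[V];
  // y = 1 if the left item is chosen
  int<lower=0, upper=1> y_vec[V];
  // hyperparameters for the prior on mu, as above
  real<lower=0> p_tau_2_0;
  real p_mu_0;
  // sigma is still fixed for now; it would move to parameters when estimated
  real<lower=0> sigma;
}
parameters {
  // standard-normal raw opinions for the non-centered parameterization
  vector[J] theta_raw[K];
  // only K - 1 free means; mu[K] is pinned to zero
  vector[K - 1] mu_short;
}
transformed parameters {
  vector[K] mu;
  vector[J] theta[K];
  mu[1:(K - 1)] = mu_short;
  mu[K] = 0;
  // the 'Matt trick': shift and scale the standard-normal raw parameters
  for (k in 1:K)
    theta[k] = mu[k] + sigma * theta_raw[k];
}
model {
  vector[V] p;
  mu_short ~ normal(p_mu_0, p_tau_2_0);
  for (k in 1:K)
    theta_raw[k] ~ normal(0, 1);
  for (v in 1:V)
    p[v] = Phi_approx(theta[ll[v]][jj[v]] - theta[rr[v]][jj[v]]);
  y_vec ~ bernoulli(p);
}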

Yeah, it is really that each element of theta_raw has the same unit scale (a priori) and is independent of mu, sigma, and everything else. If you made theta primitive and did theta[k] ~ normal(mu, sigma), none of that holds, which usually makes it more difficult for NUTS.