Hierarchical probit model with difficult structure

specification
fitting-issues

#1

I am having trouble recovering known parameters from simulated data using Stan. I am translating the model into Stan from a Gibbs sampler hand-rolled in R. The Gibbs sampler works, but it is slow. I am using the same weakly informative priors in Stan as in the Gibbs sampler.

A wikisurvey is a survey in which each question asks the respondent to choose between two items. I use K to represent the number of items. There are K*(K-1) possible comparisons, but each respondent only provides a limited set of responses. I use J to refer to the number of respondents.

The object of inference in this model is the KxJ ‘opinion matrix’ Theta. Each entry Theta_k_j represents respondent j’s opinion of item k. Phi(Theta_1_j - Theta_2_j), where Phi is the standard normal CDF, is the probability that respondent j prefers item 1 to item 2.

A second parameter, mu[K], allows for partial pooling of information about the items and estimates their overall ranking. The model is hierarchical: theta_k_j ~ normal(mu[k], sigma).
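Putting this together, the full generative model I am trying to fit is:

$$
\begin{aligned}
\mu_k &\sim \mathrm{Normal}(\mu_0, \tau_0) \\
\theta_{k,j} &\sim \mathrm{Normal}(\mu_k, \sigma) \\
y_v &\sim \mathrm{Bernoulli}\!\left(\Phi(\theta_{l_v,\, j_v} - \theta_{r_v,\, j_v})\right)
\end{aligned}
$$

where $\mu_0$ and $\tau_0$ are fixed hyperparameters, and $l_v$, $r_v$, and $j_v$ are the left item, right item, and respondent for response $v$.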

I am aware of two reasons the model is difficult to fit:

  1. Most values of Theta are uninformed. Respondents do not generally see every item.
  2. The probit likelihood linking Theta to Y seems to produce a posterior that is difficult to sample from.

Below is my Stan code. An R file with code to run the model on simulated data is attached: simulate_wikisurvey_unvectorized.R (2.2 KB)

Running the model gives me max treedepth warnings. I tried raising max_treedepth to 20, but the warnings persisted. I guess this means the problem is with the model itself.

I have also tried a couple of variations of the model, including parameterizing sigma, sampling theta once for each response, and a complicated (possibly buggy) vectorization scheme that eliminated the loops in the model. These attempts were fruitless. The following code is the version that most clearly aligns with the data-generating process.

Thank you for your help!

data{
    // number of respondents
    int<lower=1> J;

    // number of responses
    int<lower=1> V;

    // number of items
    int<lower=1> K;

    // the next four vectors describe each response
    // the item shown on the left
    int<lower=1, upper=K> ll[V];

    // the item shown on the right
    int<lower=1, upper=K> rr[V];

    // respondent
    int<lower=1, upper=J> jj[V];

    // y = 1 if the left item is chosen
    int<lower=0, upper=1> y_vec[V];

    // hyper prior for variance of mu. set to 4
    real<lower=0> p_tau_2_0;

    // hyper prior for mean of mu. set to 0
    real p_mu_0;

    // for now sigma is fixed and passed in as data. set to 1
    real<lower=0> sigma;
}

transformed data {
    vector[K] mu_0;
    vector[K] tau_2_0;

    for (k in 1:K){
       mu_0[k] = p_mu_0; 
       tau_2_0[k] = p_tau_2_0; 
    }      
  
   // pin mu[1] to (approximately) 0 by giving it a tiny prior scale
   tau_2_0[1] = 0.000001;
}

parameters{
  matrix[K,J] theta;
  vector[K] mu;
}


model{
  mu ~ normal(mu_0, tau_2_0);
  // sample theta
  for(k in 1:K){
    theta[k] ~ normal(mu[k], sigma);
  }

  for (v in 1:V){
    y_vec[v] ~ bernoulli(Phi_approx(theta[ll[v],jj[v]] - theta[rr[v],jj[v]]));
  }
}

#2

Pinning an element of mu with a tiny prior scale is not a good way to do that. A better way is to declare

parameters {
  vector[J] theta_raw[K];
  vector[K-1] mu_short;
}
model {
  vector[K] mu;
  vector[V] p;
  mu[1:(K-1)] = mu_short;
  mu[K] = 0;
  ...
}

Then, you are going to need to Matt trick the thetas with

transformed parameters {
  vector[J] theta[K];
  for (k in 1:K) theta[k] = mu + sigma * theta_raw[k];
}

and that is particularly true when you get around to estimating sigma.
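For example, when you do estimate sigma it would move out of the data block and get its own prior; a minimal sketch, assuming a half-normal prior (that choice is just a placeholder, not something from your model):

parameters {
  vector[J] theta_raw[K];
  vector[K-1] mu_short;
  real<lower=0> sigma;   // no longer passed in as data
}
model {
  sigma ~ normal(0, 1);  // half-normal via the lower=0 constraint; placeholder prior
  ...
}

with the transformed parameters block for theta left as above.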

And then, your model will go faster if you create the probabilities as a vector and do the likelihood in one shot with

for (v in 1:V) p[v] = Phi_approx(theta[ll[v],jj[v]] - theta[rr[v],jj[v]]);
y_vec ~ bernoulli(p);

But I have a hard time seeing how it is reasonable to treat the elements of y_vec as conditionally independent. You are probably going to need something like a multivariate probit, which is difficult. See
https://groups.google.com/d/msg/stan-users/GuWUJogum1o/LvxjlUBnBwAJ


#3

Thank you for the helpful and rapid response!

I was wondering whether this was better than fixing the value directly. But I guess it is better not to treat known quantities as parameters.

Oh fun! Now I get to go learn about the Matt trick. It seems like this was part of the problem. By the way, I think you got the J and K switched around. Also, shouldn’t the definition of mu be in the transformed parameters block?

Do you mind elaborating here? Are you suggesting a respondent-level rather than response-level model?

After taking your suggestions, it seems like something else is missing. I fit a model with max_treedepth = 10 that had divergent iterations (a new problem!) and still hit the max tree depth. I am currently trying a greater tree depth, more warmup iterations, and denser data. Wish me luck!


#4

Yes, mu should be in the transformed parameters block, or theta should be defined in the model block. But row_vector[J] theta[K] is analogous to matrix[K,J] theta, although it is easier to loop over from 1 to K.
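To spell out the equivalence (the _mat / _arr names below are just for illustration):

// two equivalent ways to hold the K x J opinions
matrix[K, J] theta_mat;        // theta_mat[k] is row k, a row_vector[J]
row_vector[J] theta_arr[K];    // theta_arr[k] is also a row_vector[J]
// either way, respondent j's opinion of item k is theta_mat[k, j] or theta_arr[k, j],
// and "for (k in 1:K) theta_arr[k] ~ normal(mu[k], sigma);" loops cleanly over items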

I guess. All of the responses by the same respondent are presumably not conditionally independent.


#5

Got it. It’s a row_vector.
Thanks for your help. I’ll update if things look like they are working.


#6

Wait, so the Matt trick should be

for (k in 1:K) theta[k] = mu[k] + sigma * theta_raw[k];  

?

Still trying to wrap my head around what this should do.


#7

That would correspond to what you had originally, although it implies all elements of theta[k] have the same expectation.


#8

That is the assumption for now. My goal is to eventually relax that assumption and also pool information by respondents.

It seems like we got a working model specification!
Thanks for all the help. This is my first attempt at specifying my own Stan model and your help is invaluable.

Let me see if I can explain how the ‘Matt trick’ applies to this case.

The ‘Matt trick’ re-parameterizes variables that live at different locations and scales so that the parameters the sampler actually sees are all standard normal.
In my case, each theta[k] is centered at a different location mu[k].

for (k in 1:K) theta[k] = mu[k] + sigma * theta_raw[k];  

This sets up theta_raw to be standard normal, so we can write:

model {
  vector[V] p;

  mu_short ~ normal(p_mu_0, p_tau_2_0);

  for (k in 1:K) theta_raw[k] ~ normal(0, 1);

  for (v in 1:V) {
    p[v] = Phi_approx(theta[ll[v]][jj[v]] - theta[rr[v]][jj[v]]);
  }

  y_vec ~ bernoulli(p);
}
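For anyone who finds this thread later, here is roughly how the pieces above fit together in the version that now seems to sample for me (sigma is still fixed and passed in as data for the moment; double-check it against your own setup before reusing):

data {
  int<lower=1> J;                    // number of respondents
  int<lower=1> V;                    // number of responses
  int<lower=1> K;                    // number of items
  int<lower=1, upper=K> ll[V];       // item shown on the left
  int<lower=1, upper=K> rr[V];       // item shown on the right
  int<lower=1, upper=J> jj[V];       // respondent
  int<lower=0, upper=1> y_vec[V];    // 1 if the left item was chosen
  real p_mu_0;                       // hyperprior mean for mu
  real<lower=0> p_tau_2_0;           // hyperprior scale for mu
  real<lower=0> sigma;               // fixed for now
}
parameters {
  vector[J] theta_raw[K];
  vector[K-1] mu_short;
}
transformed parameters {
  vector[K] mu;
  vector[J] theta[K];
  mu[1:(K-1)] = mu_short;
  mu[K] = 0;                         // pin the last item instead of the tiny-variance hack
  for (k in 1:K) theta[k] = mu[k] + sigma * theta_raw[k];
}
model {
  vector[V] p;
  mu_short ~ normal(p_mu_0, p_tau_2_0);
  for (k in 1:K) theta_raw[k] ~ normal(0, 1);
  for (v in 1:V)
    p[v] = Phi_approx(theta[ll[v]][jj[v]] - theta[rr[v]][jj[v]]);
  y_vec ~ bernoulli(p);
}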

#9

Yeah, it is really that each element of theta_raw has the same unit-scale (a priori) and is independent of mu, sigma, and everything else. If you made theta primitive and did theta[k] ~ normal(mu, sigma) none of that holds, which usually makes it more difficult for NUTS.

