Problems modeling choice probabilities directly with a Dirichlet distribution instead of multinomial count data

Hey all,

I’m working with a model with a multinomial likelihood to estimate parameters from responses in 5 different categories of a task. The data looks like this (a frequency table):

> data
        ID Freetime    PC IIP IOP DIP DOP NPL
  [1,]   1      0.0 0.476 238  86 107  38  31
  [2,]   1      0.2 0.674 337  73  16  47  27
  [3,]   1      0.4 0.710 355  76  15  33  21
  [4,]   1      0.8 0.790 395  49   9  26  21

IIP to NPL are frequencies of the given responses per category. As the parameter recovery of this particular model is not that good, I wanted to try to model the data at a higher “resolution” and use the probabilities directly to inform the parameters. I wanted to do this with a Dirichlet distribution. As \alpha for the Dirichlet, I wanted to use the overall counts of the response set. For example, the response set for the 5 categories is

     IIP IOP DIP DIOP NPL
[1,]   1   4   1    4  16

This means there is 1 possible alternative for the first category IIP, 4 for IOP, and so on. This seemed pretty straightforward to me, but the model is not performing at all, meaning that the parameter estimates for the hyperparameters are totally off. Here is my model so far, with the Dirichlet:

I use a probability matrix as data instead of the pure counts:

        IIP   IOP   DIP   DOP   NPL
  [1,] 0.476 0.172 0.214 0.076 0.062
  [2,] 0.674 0.146 0.032 0.094 0.054
  [3,] 0.710 0.152 0.030 0.066 0.042
  [4,] 0.790 0.098 0.018 0.052 0.042

Here’s the model:

functions{
  .....  // user-defined helpers (e.g. flatten_lower_tri) omitted
}

data {
  int <lower=0> N;  // number of subjects
  int <lower=0> K;  // number of response categories
  int <lower=0> J;  // dimension of the covariance matrix (number of model parameters)
  int <lower=0> Con;  // number of Freetime conditions
  vector[K] R;         // number of response alternatives per category (the response set)
  vector [K] p[N*Con];   // observed response proportions per subject x condition
  real scale_b; // set scaling for background noise
  vector[Con] Freetime;  // freetime conditions after distractor encoding (e.g. 500 ms, 1000 ms or 1500 ms) 
  int retrievals;

}

parameters {
  
  // Defining vector for hyper and subject parameters 
  
  cholesky_factor_corr[J] L_Omega;
  vector<lower=0>[J] sigma;
  vector[J] hyper_pars;
  matrix[J,N] theta;
  
}

transformed parameters {
  vector [K] count[N*Con]; 
   
   
  
  // non-centered multivariate
  matrix[J,N] subj_pars =  (
    diag_pre_multiply( sigma, L_Omega )
    * theta
    + rep_matrix(hyper_pars,N)
    ) ;
    
    // Transform f Parameter
    real mu_f = inv_logit(hyper_pars[3]);
    row_vector[N] f = inv_logit(subj_pars[3,]);
    
    // activations
    real acts_IIP[N*Con];
    real acts_IOP[N*Con];
    real acts_DIP[N*Con];
    real acts_DIOP[N*Con];
    real acts_NPL[N*Con];
    
    
    // probabilities
    vector[K] probs[N*Con];
    real SummedActs[N*Con];
    
    
    // loop over subjects and conditions to compute activations and probabilities
    
    
    for (i in 1:N){ // for each subject
    
    for(j in 1:Con) {
      
      acts_IIP[j + (i-1)*Con] = scale_b +  ((1+subj_pars[4,i]*Freetime[j])* subj_pars[1,i]) + subj_pars[2,i]; // Item in Position                      
      acts_IOP[j + (i-1)*Con] = scale_b + subj_pars[2,i];        // Item in Other Position
      acts_DIP[j + (i-1)*Con] = scale_b + f[i]*((exp(-subj_pars[5,i]*Freetime[j])*subj_pars[1,i]) + subj_pars[2,i]); // Distractor in Position
      acts_DIOP[j + (i-1)*Con] = scale_b + f[i]*subj_pars[2,i]; // Distractor in other Position
      acts_NPL[j + (i-1)*Con] = scale_b; // non presented Lure
      
      SummedActs[j + (i-1)*Con] = R[1] * acts_IIP[j + (i-1)*Con] + R[2] * acts_IOP[j + (i-1)*Con] + R[3] * acts_DIP[j + (i-1)*Con]+ // account for different numbers of DIP / DIOP alternatives
      R[4] * acts_DIOP[j + (i-1)*Con]+ R[5] * acts_NPL[j + (i-1)*Con];
      
      probs[j + (i-1)*Con,1] = (R[1] * acts_IIP[j + (i-1)*Con]) / SummedActs[j + (i-1)*Con];
      probs[j + (i-1)*Con,2] = (R[2] * acts_IOP[j + (i-1)*Con]) / SummedActs[j + (i-1)*Con];
      probs[j + (i-1)*Con,3] = (R[3] * acts_DIP[j + (i-1)*Con]) / SummedActs[j + (i-1)*Con];
      probs[j + (i-1)*Con,4] = (R[4] * acts_DIOP[j + (i-1)*Con]) / SummedActs[j + (i-1)*Con];
      probs[j + (i-1)*Con,5] = (R[5] * acts_NPL[j + (i-1)*Con]) / SummedActs[j + (i-1)*Con];
    }
    }
}


model {
  
  // priors for hyper parameters
  
  hyper_pars[1] ~ normal(20,10); // c
  hyper_pars[2] ~ normal(2,10); // a
  hyper_pars[3] ~ normal(0,10); // f
  hyper_pars[4] ~ normal(1,10); // EE
  hyper_pars[5] ~ normal(1,10); // r
  
  
  // priors for the correlation matrix and scale parameters
  L_Omega ~ lkj_corr_cholesky(2);
  sigma ~ gamma(1,0.01);
  
  
  // Loop over subjects
  
  for (i in 1:N) 
  {
    
    theta[,i] ~ normal(0,1);
    
  }
  
  
  
  for (j in 1:Con) {
    for (i in 1:N) {
  // draw data from probabilities determined by MMM parms
      
     p[j + (i-1)*Con] ~ dirichlet(R);
      
      
    }
  }
}

generated quantities{
  
  vector[(J*(J-1))%/%2] cor_mat_lower_tri;
  //real p_rep[N*Con,K];
  
  
  cor_mat_lower_tri = flatten_lower_tri(multiply_lower_tri_self_transpose(L_Omega));
 
}


I’m using Luce’s choice rule to normalize the activations that arise from the estimated parameters into probabilities (end of the transformed parameters block), where R is the composite or “weight” for the response category - which is already taken into account in this normalization. I don’t know whether I’m totally off here, but as I understood it, the Dirichlet should do the trick; I just don’t get it to work. Or should I in general try to use a multivariate beta instead?
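Just to make the normalization explicit, what the probs loop computes is

p_k = \dfrac{R_k \, a_k}{\sum_{k'=1}^{K} R_{k'} \, a_{k'}}

with a_k the activation of category k (my notation, simply restating the code above).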

thanks in advance

jan

Hi, @jangoe. I think it may take a couple of questions to get on the same page with terminology.

I don’t know what PC or Freetime or ID are, but if you look at the last 5 columns, that looks like multinomial data. In which case, letting y_1 = (238, 86, 107, 38, 31) and so on, the natural way to model this is:

\alpha \sim ...

\theta \sim \textrm{Dirichlet}(\alpha)

y_n \sim \textrm{Multinomial}(\theta) \textrm{ for } n \in 1{:}N

This lets you take the counts into consideration as well as the proportions. You usually don’t want to use a Dirichlet regression on the proportions unless you don’t have the counts, because modeling the proportions directly loses information.
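A minimal Stan sketch of that setup (my own variable names, not matched to your model; here \alpha is passed in as data rather than given its own prior, and y holds the raw counts):

data {
  int<lower=1> N;              // number of observations (rows of the frequency table)
  int<lower=2> K;              // number of response categories
  int<lower=0> y[N, K];        // observed counts per row
  vector<lower=0>[K] alpha;    // Dirichlet concentration
}
parameters {
  simplex[K] theta;            // response probabilities shared across rows
}
model {
  theta ~ dirichlet(alpha);
  for (n in 1:N)
    y[n] ~ multinomial(theta); // model the counts, not the proportions
}

If you want a separate \theta_n per row, declare simplex[K] theta[N]; and move the Dirichlet statement inside the loop.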

So I couldn’t figure out what was going on in the rest of the code and discussion. Are you perhaps trying to build a logistic multivariate normal instead of a Dirichlet? That works exactly the same way, only with a multivariate normal prior instead of the Dirichlet. It potentially lets you fit correlations as well as concentration, but you need a fair amount of data, or good priors, to do this well.
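For what it’s worth, a rough sketch of that logistic multivariate normal variant (again my own variable names, not your model; note the usual softmax identifiability issue, which you would normally handle by fixing one category’s log-odds to zero or using a sum-to-zero constraint):

data {
  int<lower=1> N;                        // number of observations
  int<lower=2> K;                        // number of response categories
  int<lower=0> y[N, K];                  // observed counts per row
}
parameters {
  vector[K] mu;                          // mean log-odds per category
  cholesky_factor_corr[K] L_Omega;       // correlations between categories
  vector<lower=0>[K] sigma;              // scales
  vector[K] eta[N];                      // per-row log-odds
}
model {
  mu ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(2);
  sigma ~ normal(0, 2);
  for (n in 1:N) {
    eta[n] ~ multi_normal_cholesky(mu, diag_pre_multiply(sigma, L_Omega));
    y[n] ~ multinomial(softmax(eta[n])); // softmax maps the log-odds onto the simplex
  }
}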

I don’t know what a response set is. I thought the responses were the five headers here. Or is each of these broken up into subcategories? If so, do you have the subcategory counts?

Hey Bob, thanks for reaching out and explaining the logic!

Yes, it is multinomial data; as you already pointed out, y_1 are category counts. These counts come from a set of alternatives from which one has to be chosen.

The “response set” (sorry for not addressing it in the original question) is the set of alternatives initially available for one choice process - that means that within each response set there are n alternatives for each category. So here we have

1 alternative for category A
4 for category B
1 for category C
4 for category D
and 16 for category E

So in total one has to choose 1 answer out of 26 possibilities for every attempt. So the prior guessing probability of choosing the correct answer for one position would be 1/26, and the probability of choosing category D would be 4/26. As this is close to (at least) my understanding of \alpha in the Dirichlet, I thought I could (or had to) use this information to model it as well.

In your explanation, you wrote

\alpha ~ ...

what exactly would \alpha be here - the probability vector given by the data, or some general prior? The rest is totally clear to me.
Thanks for your kind explanation and your time!

Greetings

Jan

I am not totally sure what your model is trying to accomplish, but if you are trying to model the probability of choosing one option out of K, where K is greater than 2, you can use softmax/logit to transform the values influencing the probability of each choice into values from 0 to 1 that sum to 1 (i.e., normalize them), and then model categorical outcomes, all using the categorical_logit function in Stan. This is implemented with code like this, again assuming I’m not mixing up what your code is doing:

choice[i, j] ~ categorical_logit(to_vector(values_influencing_choice[:, i, j]));

where choice is N x Con (I think this is variable p in your code) and takes integer values 1:K, and values_influencing_choice is K x N x Con (I think this would be the numerator of what is going into your probs variable). Note that I’m using 2 or 3D variables to index subjects, categories, etc. instead of using a 1D variable and calculating the position with something like j + (i-1)*Con - this tends to be easier to read and less error-prone.