Hierarchical multi-logit regression


#1

Hi,

I am a newbie trying to use STAN for a very specific problem.

Has anyone tried to code a hierarchical multi-logit regression??? I am in marketing and am doing Conjoint Analysis. I was given some legacy WinBugs codes by my employer, but I am trying to switch to STAN.

I am following Examples 8.6 and 8.9 from the user manual and hoping to combine and modify those STAN codes. But I find myself having to declare a 3-dimensional matrix for the beta coefficients - for all the possible choice outcomes (K), for all the attributes (D), and for all the people (L).

parameters {
matrix [K,D] mu;
matrix [K,D] sigma;
matrix [K, D, L] beta;
}

My questions are:

  1. Is 3-dimensional matrix even allowed in STAN?
  2. The manual seems to suggest that an array of vectors is more efficient than matrices. But I don’t know how to express a 3-dim matrix as an array of vectors…

Sorry if my questions are too basic. Any help is appreciated.

Cheers~


#2

There is no such thing as a three-dimensional matrix in Stan where a matrix is defined in the linear algebra sense. You can, however, have a set (called an array in Stan) of L matrices that each have K rows and D columns. To do so, write

parameters {
  matrix[K,D] mu;
  matrix<lower=0>[K,D] sigma; // impose the constraint
  matrix[K,D] beta[L];
}

The beta object can be accessed (almost) like a list of KxD matrices in R, except that you just use single brackets (as in beta[1]) rather than double brackets (like beta[[1]]) to access the matrix for the first person in the data.

However, you need to keep in mind that in multinomial logit, it is conventional to restrict one column of beta[i] to be all zeros, so really everything would be declared with D - 1 columns. Also, in a hierarchical model, you probably want to use a non-centered re-parameterization of the normal prior, but we can get to that after you have something running with your parameterization.


#3

Yes, thanks for the reminder. I plan to add a vector of zeros, as discussed after Example 8.6.

parameters {
matrix[K,D] mu;
matrix<lower=0>[K,D] sigma; // impose the constraint
matrix[K,D] beta[L];
}

Great! I just have a follow-up question.
In my problem, L, the number of people, is going to be greater than K and D. Is it better to do: matrix[L,K] beta[D] or matrix[L,K] beta[D] ?


#4

It is going to be better to do

parameters {
  matrix[K, D - 1] mu;
  matrix<lower=0>[K, D - 1] sigma;
  matrix[K, D - 1] beta[L];
}

in order to align beta[i] with mu and sigma.


#5

I was trying to do this:
for (i in 1:L)
beta[i] ~ normal(mu, sigma)

But it gives me an error. The error message says the output argument for the normal() is either real or vector. But I cannot sample a matrix as the output, which is what beta[i] is. This is in spite of the fact that mu and sigma are matrices of the same dimension.

Hmm…but how can I rectify this issue, given how I declare beta???


#6
vector[K * (D - 1)] mu_vec = to_vector(mu);
vector[K * (D - 1)] sigma_vec = to_vector(sigma);
for (i in 1:L) to_vector(beta[i]) ~ normal(mu_vec, sigma_vec);

#7

For these two lines:

vector[K * (D - 1)] mu_vec = to_vector(mu);
vector[K * (D - 1)] sigma_vec = to_vector(sigma);

Should I put them in the model{} block? Or should I put them in the transformed parameters{} block?

Thanks again.


#8

model


#9

Thanks for all the support so far. My model is running and I am just tuning my hyperpriors. However, now I have another programming-related question.

parameters {
matrix[K, D - 1] mu;
matrix<lower=0>[K, D - 1] sigma;
matrix[K, D - 1] beta[L];
}
in order to align beta[i] with mu and sigma.

In my setup, beta is a K by (D-1) matrix of parameters. Now, suppose I want to restrict some of the elements of beta to be 0. The easiest thing to do this in my mind is to define another K by (D-1) matrix that consists of only 1’s and 0’s. I can even supply this binary matrix as data input. Then I can do an element-by-element matrix multiplication between beta and this user-supplied matrix in the transformed parameters block. But is this doable in Stan? If not, do you have any work-around to recommend? Thanks again!


#10

That is not a good way to do it in Stan as you won’t be able to identify the multiplied parameters that get forgotten. Instead, read the missing data chapter of the manual. You need a parameter vector the size of the number of non-fixed values, then you need to copy those into a matrix and fill in the known values. Then you won’t have redundant and unconstrained parameters.


#11

Thanks. I understand that I do not want to have redundant parameters.

But can STAN do element-by-element multiplication between two matrices of the same size? I am just trying to figure out programmatically how to fill in the desired matrix with known and unknown values. But I also prepare that I may have to rethink my whole setup, too…


#12

Yes, it is the .* operator.


#13

And even if it couldn’t, it’s a simple loop. Don’t be afraid of loops in Stan—especially when there aren’t any densities inside. It’s not slow to loop like an interpreted language such as R or Python or JAGS or BUGS.


#14

It’s good to know about the loop. I will give it a shot!

I have another clarifying question.

parameters {
matrix[K,D] beta;
}
model {
for (k in 1:K)
beta[k] ~ normal(0, 5);
for (n in 1:N)
y[n] ~ categorical(softmax(beta * x[n]));
}

This is from the manual. It is understood to have omitted the transformed parameters block for the zero coefficients.

But my question is hopefully silly. How does the categorical function (and also categorical_logit) know that the Pr(y=1) is aligned with beta[1]? There is no a priori sorting done on the data of y, so y[1], the first observation can be anything from 1,…,K.

Thanks again.


#15

Yes, y[1] can be any of 1:K—I don’t understand why you think that’s a problem. beta[1] is a D-vector representing the coefficients for the outcome 1 by definition. You can look up the definition of categorical_lpmf in the manual. beta * x[n] is just multiplying a matrix beta by a vector x[n].


#16

Dear Stan Board,

I have a follow up to this discussion about left hand side sampling statements when working with multidimensional arrays. I have a multivariate hierarchical distribution across seasons (S). For each season I want to have a hierarchical distribution with seasonal means, but a common covariance matrix. Where I am getting stuck is how to make sure for each season S, the parameter vector theta is drawn from a multivariate normal distribution. I have been trying the to_vector command, but without much success. Ignoring the seasonal piece, the following works fine where theta is declared as a vector[K] theta[G], and mu_theta as a vector [K], e.g.,

theta ~ multi_normal_cholesky(mu_theta, diag_pre_multiply(sigma, Lcorr));

Now, however, with the seasons, I have theta declared as matrix[G, K] theta [S]. I have played around with the to_vector command suggested above, but haven’t found the right way to make it work. Below is the relevant code for my model. Any suggestions would be much appreciated.

cholesky_factor_corr[K] Lcorr;
  vector<lower=0> [K] tau; 
  matrix[G, K] theta [S]; // Parameters for Each Year.  
  vector<lower=0> [K] mu_theta[S]; // Means for Parameters by Season

model {
  tau ~ student_t(4, 0, 1); //Prior on Standard Deviations
  Lcorr ~ lkj_corr_cholesky(1);
  
  //prior for hierarchical means
  for(i in 1:S){
  mu_theta[i,1] ~ normal(.2, 0.1);  
  mu_theta[i,2] ~ normal(.5, 0.2);
  mu_theta[i,3] ~ normal(225, 15);
  mu_theta[i,4] ~ lognormal(3, 0.5);
  mu_theta[i,5] ~ lognormal(3, 0.5);
  }
  

  for(j in 1:K){
  for(i in 1:S){
  theta[i,,j] ~ multi_normal_cholesky(mu_theta[i], diag_pre_multiply(tau, Lcorr));
  }
  }

#17

I figured out the issue and below is the solution in case anyone else runs into the same problem with indexing over a 3d array and using the multivariate normal density.

  for(i in 1:S){
  for(j in 1:G){
  row(theta[i], j) ~ multi_normal_cholesky(mu_theta[i], diag_pre_multiply(tau, Lcorr));
  }
  }

#18

I’m not sure the type of theta here, but you can probably just use theta[i, j] ~ ....


#19

Hi Bob,

Thanks for the note. Theta is defined as matrix[G, K] theta [S]. I was thinking the cost of loops was low in Stan given it compiles in C++, but will give your suggestion a shot.

~Colin


#20

That was just a suggestion to replace row(theta[i], j) with theta[i, j] — they’ll resolve to the same thing.

But now that I see what you’re doing, I can suggest what you really want to do. Let K = size(tau), and code this as:

{ 
  vector[K] theta_flat[S * G];
  int i = 1;
  for (s in 1:S) {
    for (g in 1:G) {
      theta_flat[i] = theta[s, g] - mu_theta[s];
      i += 1;
    }
  }
  theta_flat ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_mutiply(tau, Lcorr));
}

Loops are fast, but having to continually diag-pre-multiply and factor isn’t. This makes sure you only factor the matrix once. This should be a lot faster than what you were doing.