Hierarchical multi-logit regression

sonicking · May 19, 2017, 6:34pm

Hi,

I am a newbie trying to use STAN for a very specific problem.

Has anyone tried to code a hierarchical multi-logit regression??? I am in marketing and am doing Conjoint Analysis. I was given some legacy WinBugs codes by my employer, but I am trying to switch to STAN.

I am following Examples 8.6 and 8.9 from the user manual and hoping to combine and modify those STAN codes. But I find myself having to declare a 3-dimensional matrix for the beta coefficients - for all the possible choice outcomes (K), for all the attributes (D), and for all the people (L).

parameters {
matrix [K,D] mu;
matrix [K,D] sigma;
matrix [K, D, L] beta;
}

My questions are:

Is 3-dimensional matrix even allowed in STAN?
The manual seems to suggest that an array of vectors is more efficient than matrices. But I don’t know how to express a 3-dim matrix as an array of vectors…

Sorry if my questions are too basic. Any help is appreciated.

Cheers~

bgoodri · May 19, 2017, 6:45pm

There is no such thing as a three-dimensional matrix in Stan where a matrix is defined in the linear algebra sense. You can, however, have a set (called an array in Stan) of L matrices that each have K rows and D columns. To do so, write

parameters {
  matrix[K,D] mu;
  matrix<lower=0>[K,D] sigma; // impose the constraint
  matrix[K,D] beta[L];
}

The beta object can be accessed (almost) like a list of KxD matrices in R, except that you just use single brackets (as in beta[1]) rather than double brackets (like beta[[1]]) to access the matrix for the first person in the data.

However, you need to keep in mind that in multinomial logit, it is conventional to restrict one column of beta[i] to be all zeros, so really everything would be declared with D - 1 columns. Also, in a hierarchical model, you probably want to use a non-centered re-parameterization of the normal prior, but we can get to that after you have something running with your parameterization.

sonicking · May 19, 2017, 7:01pm

Yes, thanks for the reminder. I plan to add a vector of zeros, as discussed after Example 8.6.

parameters {
matrix[K,D] mu;
matrix<lower=0>[K,D] sigma; // impose the constraint
matrix[K,D] beta[L];
}

Great! I just have a follow-up question.
In my problem, L, the number of people, is going to be greater than K and D. Is it better to do: matrix[L,K] beta[D] or matrix[L,K] beta[D] ?

bgoodri · May 19, 2017, 8:16pm

It is going to be better to do

parameters {
  matrix[K, D - 1] mu;
  matrix<lower=0>[K, D - 1] sigma;
  matrix[K, D - 1] beta[L];
}

in order to align beta[i] with mu and sigma.

sonicking · May 23, 2017, 8:36pm

I was trying to do this:
for (i in 1:L)
beta[i] ~ normal(mu, sigma)

But it gives me an error. The error message says the output argument for the normal() is either real or vector. But I cannot sample a matrix as the output, which is what beta[i] is. This is in spite of the fact that mu and sigma are matrices of the same dimension.

Hmm…but how can I rectify this issue, given how I declare beta???

bgoodri · May 23, 2017, 11:09pm

vector[K * (D - 1)] mu_vec = to_vector(mu);
vector[K * (D - 1)] sigma_vec = to_vector(sigma);
for (i in 1:L) to_vector(beta[i]) ~ normal(mu_vec, sigma_vec);

sonicking · May 24, 2017, 12:06am

For these two lines:

vector[K * (D - 1)] mu_vec = to_vector(mu);
vector[K * (D - 1)] sigma_vec = to_vector(sigma);

Should I put them in the model{} block? Or should I put them in the transformed parameters{} block?

Thanks again.

bgoodri · May 24, 2017, 1:45am

model

sonicking · May 25, 2017, 8:58pm

Thanks for all the support so far. My model is running and I am just tuning my hyperpriors. However, now I have another programming-related question.

parameters {
matrix[K, D - 1] mu;
matrix<lower=0>[K, D - 1] sigma;
matrix[K, D - 1] beta[L];
}
in order to align beta[i] with mu and sigma.

In my setup, beta is a K by (D-1) matrix of parameters. Now, suppose I want to restrict some of the elements of beta to be 0. The easiest thing to do this in my mind is to define another K by (D-1) matrix that consists of only 1’s and 0’s. I can even supply this binary matrix as data input. Then I can do an element-by-element matrix multiplication between beta and this user-supplied matrix in the transformed parameters block. But is this doable in Stan? If not, do you have any work-around to recommend? Thanks again!

Bob_Carpenter · May 26, 2017, 4:33am

That is not a good way to do it in Stan as you won’t be able to identify the multiplied parameters that get forgotten. Instead, read the missing data chapter of the manual. You need a parameter vector the size of the number of non-fixed values, then you need to copy those into a matrix and fill in the known values. Then you won’t have redundant and unconstrained parameters.

sonicking · May 26, 2017, 3:47pm

Thanks. I understand that I do not want to have redundant parameters.

But can STAN do element-by-element multiplication between two matrices of the same size? I am just trying to figure out programmatically how to fill in the desired matrix with known and unknown values. But I also prepare that I may have to rethink my whole setup, too…

bgoodri · May 26, 2017, 4:09pm

Yes, it is the .* operator.

Bob_Carpenter · May 26, 2017, 8:34pm

And even if it couldn’t, it’s a simple loop. Don’t be afraid of loops in Stan—especially when there aren’t any densities inside. It’s not slow to loop like an interpreted language such as R or Python or JAGS or BUGS.

sonicking · May 26, 2017, 11:58pm

It’s good to know about the loop. I will give it a shot!

I have another clarifying question.

parameters {
matrix[K,D] beta;
}
model {
for (k in 1:K)
beta[k] ~ normal(0, 5);
for (n in 1:N)
y[n] ~ categorical(softmax(beta * x[n]));
}

This is from the manual. It is understood to have omitted the transformed parameters block for the zero coefficients.

But my question is hopefully silly. How does the categorical function (and also categorical_logit) know that the Pr(y=1) is aligned with beta[1]? There is no a priori sorting done on the data of y, so y[1], the first observation can be anything from 1,…,K.

Thanks again.

Bob_Carpenter · May 28, 2017, 12:18am

Yes, y[1] can be any of 1:K—I don’t understand why you think that’s a problem. beta[1] is a D-vector representing the coefficients for the outcome 1 by definition. You can look up the definition of categorical_lpmf in the manual. beta * x[n] is just multiplying a matrix beta by a vector x[n].

cbeck · October 23, 2017, 11:18pm

Dear Stan Board,

I have a follow up to this discussion about left hand side sampling statements when working with multidimensional arrays. I have a multivariate hierarchical distribution across seasons (S). For each season I want to have a hierarchical distribution with seasonal means, but a common covariance matrix. Where I am getting stuck is how to make sure for each season S, the parameter vector theta is drawn from a multivariate normal distribution. I have been trying the to_vector command, but without much success. Ignoring the seasonal piece, the following works fine where theta is declared as a vector[K] theta[G], and mu_theta as a vector [K], e.g.,

theta ~ multi_normal_cholesky(mu_theta, diag_pre_multiply(sigma, Lcorr));

Now, however, with the seasons, I have theta declared as matrix[G, K] theta [S]. I have played around with the to_vector command suggested above, but haven’t found the right way to make it work. Below is the relevant code for my model. Any suggestions would be much appreciated.

cholesky_factor_corr[K] Lcorr;
  vector<lower=0> [K] tau; 
  matrix[G, K] theta [S]; // Parameters for Each Year.  
  vector<lower=0> [K] mu_theta[S]; // Means for Parameters by Season

model {
  tau ~ student_t(4, 0, 1); //Prior on Standard Deviations
  Lcorr ~ lkj_corr_cholesky(1);
  
  //prior for hierarchical means
  for(i in 1:S){
  mu_theta[i,1] ~ normal(.2, 0.1);  
  mu_theta[i,2] ~ normal(.5, 0.2);
  mu_theta[i,3] ~ normal(225, 15);
  mu_theta[i,4] ~ lognormal(3, 0.5);
  mu_theta[i,5] ~ lognormal(3, 0.5);
  }
  

  for(j in 1:K){
  for(i in 1:S){
  theta[i,,j] ~ multi_normal_cholesky(mu_theta[i], diag_pre_multiply(tau, Lcorr));
  }
  }

cbeck · November 1, 2017, 4:47pm

I figured out the issue and below is the solution in case anyone else runs into the same problem with indexing over a 3d array and using the multivariate normal density.

  for(i in 1:S){
  for(j in 1:G){
  row(theta[i], j) ~ multi_normal_cholesky(mu_theta[i], diag_pre_multiply(tau, Lcorr));
  }
  }

Bob_Carpenter · November 1, 2017, 6:23pm

I’m not sure the type of theta here, but you can probably just use theta[i, j] ~ ....

cbeck · November 1, 2017, 6:29pm

Hi Bob,

Thanks for the note. Theta is defined as matrix[G, K] theta [S]. I was thinking the cost of loops was low in Stan given it compiles in C++, but will give your suggestion a shot.

~Colin

Bob_Carpenter · November 1, 2017, 6:47pm

That was just a suggestion to replace row(theta[i], j) with theta[i, j] — they’ll resolve to the same thing.

But now that I see what you’re doing, I can suggest what you really want to do. Let K = size(tau), and code this as:

{ 
  vector[K] theta_flat[S * G];
  int i = 1;
  for (s in 1:S) {
    for (g in 1:G) {
      theta_flat[i] = theta[s, g] - mu_theta[s];
      i += 1;
    }
  }
  theta_flat ~ multi_normal_cholesky(rep_vector(0, K), diag_pre_mutiply(tau, Lcorr));
}

Loops are fast, but having to continually diag-pre-multiply and factor isn’t. This makes sure you only factor the matrix once. This should be a lot faster than what you were doing.

Topic		Replies	Views
Speeding up a hierarchical multinomial logit model Modeling	27	8163	October 18, 2022
Dirichlet-Multinomial for Transition Matrix Estimation Modeling rstan , specification , math	23	3568	February 27, 2022
Help speeding up a non-standard multi-logit model Modeling	11	977	February 23, 2018
Efficient computation of multi-logit Modeling	4	177	December 16, 2024
Array function coding error for multlevel logit model with multivariate priors Modeling rstan , specification	20	2364	April 2, 2022

Hierarchical multi-logit regression

Related topics