Priors on beta matrix rows

Hello, community!

I’m working on a categorical_logit regression model where each column of a matrix \beta holds the parameters associated with the j-th category/answer, and each row holds the parameters associated with the k-th covariate across all categories.

For a specific reason, I need to place priors on each row instead of each column.

From the Multi-logit regression page, I understood that the normal priors are associated with each column of beta, i.e., with each category/answer’s set of parameters:

model {
  matrix[N, K] x_beta = x * beta;

  to_vector(beta) ~ normal(0, 5);

  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');
  }
}

I’m wondering whether doing something like the following would give me what I want (priors on rows):

model {
  matrix[N, K] x_beta = x * beta;

  to_row_vector(beta) ~ normal(0, 5);

  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');
  }
}

I read the mixed-operations page, but I’m not sure if I got it right :/


Can anyone help with that, please?

Thanks in advance!

Both statements imply identical priors:

to_vector(beta) ~ normal(0, 5);
to_row_vector(beta) ~ normal(0, 5);

The normal() distribution is univariate, so when you specify a normal() prior for a vector or row vector, Stan automatically broadcasts it so that each element gets the same univariate normal prior.
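In other words (using D and K for the dimensions of beta), both statements are just a compact way of writing the element-wise loop:

for (d in 1:D) {
  for (k in 1:K) {
    beta[d, k] ~ normal(0, 5);  // the same independent univariate prior on every element
  }
}

so neither version treats rows any differently from columns.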

See the functions reference manual for more details: 14.8 Vectorization | Stan Functions Reference

Hello, @andrjohns!

Thanks for the clarification, it made sense to me.

I realized my approach was wrong; I should set a multivariate normal prior on each row of \beta.

Let me reformulate and specify my question better.

I have a set of \mathcal{B}_t^i varying according to t, as below:

\begin{equation} \mathcal{B}_t^{i}=\begin{bmatrix} \beta^{i,1}_{0,t} & \beta^{i,2}_{0,t} & \ldots & \beta^{i,S-1}_{0,t} \\ \beta^{i,1}_{1,t} & \beta^{i,2}_{1,t} & \ldots & \beta^{i,S-1}_{1,t} \\ \vdots & \vdots & \ddots & \vdots \\ \beta^{i,1}_{K-1,t} & \beta^{i,2}_{K-1,t} & \ldots & \beta^{i,S-1}_{K-1,t} \end{bmatrix} =\begin{bmatrix} \boldsymbol{\beta}^{i}_{0,t} \\ \boldsymbol{\beta}^{i}_{1,t} \\ \vdots \\ \boldsymbol{\beta}^{i}_{K-1,t} \end{bmatrix}_{K\times (S-1)}. \end{equation}

What I was wondering is whether there is a vectorized way to assign a prior to each row of this matrix, rather than looping over the rows:

parameters {
  matrix[D, K] beta;

  cholesky_factor_corr[K] Lcorr;   // Cholesky factor (L_u matrix for R)

  vector<lower=0>[K] sigma;        // standard deviations for the elements of each beta row
}

model {
  for (d in 1:D) {
    // mean vector must have length K to match beta[d] and the K x K covariance factor
    target += multi_normal_cholesky_lpdf(to_vector(beta[d]) | rep_vector(0, K), diag_pre_multiply(sigma, Lcorr));
  }
...
}

Furthermore, when working with the set of \mathcal{B}_t^i, \; t = 1, 2, 3, \ldots, I’d like to use the posterior mean of \mathcal{B}_{t-1}^i as the prior mean for \mathcal{B}_t^i, i.e.,

\mathcal{B}_t^i[d]\sim \text{MVN}(\mathcal{B}_{t-1}^i[d], ...)
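To make the idea concrete, here is a rough sketch of what I have in mind, where beta_prev_mean would hold the posterior means from the fit at t-1 and be passed in as data (the names and dimensions are just placeholders following my code above):

data {
  int<lower=1> D;
  int<lower=1> K;
  matrix[D, K] beta_prev_mean;    // posterior means of B_{t-1}, passed in as data
}
parameters {
  matrix[D, K] beta;
  cholesky_factor_corr[K] Lcorr;
  vector<lower=0>[K] sigma;
}
model {
  for (d in 1:D) {
    // row d of B_{t-1} becomes the prior location for row d of B_t
    target += multi_normal_cholesky_lpdf(to_vector(beta[d]) | to_vector(beta_prev_mean[d]),
                                         diag_pre_multiply(sigma, Lcorr));
  }
  // ... likelihood as before
}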

Based on the section mentioned in the quote below, if both matrices have the same dimensions, do you think that specifying them in the right way in the model would make things work properly in a vectorized way?

Thanks again and sorry for any kind of confusion.

I’m afraid not. But the loop isn’t the problem here; it’s the multi_normal_cholesky solves that will dominate the computation (still much better than working with dense covariance matrices, since the solves are only quadratic instead of cubic).

I don’t think you can vectorize this in a useful way, but it’s not going to matter. Stan’s loops are very fast. The only thing that’s useful about vectorizing is that it can reduce the size of the autodiff expression graph for derivatives, but that requires customization on a per-function basis, and we don’t have anything like that built-in. The penalty for loops here will be negligible compared to the cost of the density evals and gradients.

Thanks for the inputs, @Bob_Carpenter!

So, what you mean is that the way I’m doing it is probably the best one we have so far, and that, even though it consumes a lot of computational effort, using the multi_normal_cholesky distribution is also the best choice, right?

Do you think I can reduce this computational time/cost by using reduce_sum?

I tried this approach here with a simpler example, but it seems the code didn’t get faster.
I’d appreciate your opinion on this topic as well, if you can.
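For reference, the structure I tried was roughly along these lines (a simplified sketch; the partial_row_prior name and the grainsize of 1 are just placeholders):

functions {
  // evaluates the row priors for a slice of the rows of beta
  real partial_row_prior(array[] vector beta_slice, int start, int end,
                         vector mu, matrix L) {
    return multi_normal_cholesky_lpdf(beta_slice | mu, L);
  }
}
model {
  array[D] vector[K] beta_rows;
  for (d in 1:D) {
    beta_rows[d] = to_vector(beta[d]);
  }
  // grainsize = 1 leaves the slicing to the scheduler; parallel speedup also
  // requires the model to be compiled with threading support and num_threads > 1
  target += reduce_sum(partial_row_prior, beta_rows, 1,
                       rep_vector(0, K), diag_pre_multiply(sigma, Lcorr));
  // ... likelihood as before
}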

Thanks a lot!