Efficiently sampling a matrix of independent parameters

performance

#1

What is the most efficient way to sample a matrix of parameters drawn independently from the same distribution? I’m working on a multinomial logit model where I have a matrix of coefficients \beta that includes one column for each of K-1 outcomes in the data. I have \beta declared as a D \times K-1 matrix where each element is distributed \mathcal{N}(\mu_\beta, \sigma_\beta), where \mu_\beta and \sigma_\beta are hyperpriors.

I’m currently using to_vector to draw \beta because you can’t have a matrix on the left hand side of a normal_lpdf. A MWE of the model is included below. Is this the most efficient approach to sampling these parameters?

data {
  
  int<lower=1> N; // number of observations
  int<lower=1> D; // number of predictors (p + 1)
  int<lower=2> K; // number of alternatives
  int<lower=1,upper=K> Y[N]; // response variable
  vector[D] X[N]; // design matrix

}

parameters {
  
  matrix[K-1,D] beta_free_raw; // regression coefficeients for K - 1 alternatives
  real mu_beta;
  real<lower=.001> sigma_beta;
  
}

transformed parameters {
  
  matrix[K-1,D] beta_free;
  matrix[K,D] beta;
  beta_free = mu_beta + sigma_beta * beta_free_raw; // re-center parameters
  beta = append_row(rep_row_vector(0, D), beta_free); // append row of 0s to top
  
}

model {
  
  mu_beta ~ normal(0, 5);
  sigma_beta ~ cauchy(0, 2.5);
  to_vector(beta_free_raw) ~ normal(0, 1);
  
  for (n in 1:N) Y[n] ~ categorical_logit(to_vector(beta * X[n]));
  
}

#2

Using to_vector(beta_free_raw) ~ normal(0, 1); but you should not read that as beta_free_raw is drawn from a standard normal distribution. Rather, the standard normal PDF in log form is being tacked on to the kernel for each element of beta_free_raw. That is why it is clearer to write target += normal_lpdf(to_vector(beta_free_raw) | 0, 1);.


#3

Got it. What are the practical differences between those two statements? Also, are there any other ways to write this model more efficiently?


#4

The practical difference is that it does not confuse yourself or others as to what Stan is doing. That and the target += version calculates constants, which is somewhat slower but necessary in some situations.


#5

Sorry if I wasn’t clear. I meant why is it important to remind yourself/others that you’re incrementing the log probability instead of drawing from a standard normal?


#6

Because in order to sample from a posterior distribution in Stan, the Stan program has to commit to a log-kernel of a density function. And the shape of that log-kernel affects how well NUTS can draw from that posterior distribution. All of the tricks to improve the effective sample size are based on manipulating the log-kernel to get better geometry. When you ask “Also, are there any other ways to write this model more efficiently?” that is equivalent to asking “Also, are there any other ways to manipulate the log-kernel to get better geometry?”


#7

It’s often possible (and useful) to think of a model generatively, but that’s not what the computation is doing. The computation is just computing a log density and its gradient and the geometry of that log density is what will determine statistical sampling efficiency.

I’d qualify that. There are ways to make the iterations go faster (usually precomputation and vectorization) and there are ways to make the geometry better so that mixing is better (usually with reparameterizations).