Kronecker Product with Vector

Dear Stan Board,

I am trying to fit a spatially varying parameter model and am running into some trouble with Kronecker products. This is the general form of the distribution I am trying to specify.

(\theta_{1s}, \theta_{2s}, \theta_{3s}) \sim \text{MVN}(\mathbf{1} \otimes \mu_s, H(\phi) \otimes \Sigma).

I am trying to figure out the best way to code the mean vector. It looks like you can’t use the rep functions with a vector (e.g., rep(mu_s, 10) to stack copies of a vector). I also looked at append_row, but it is not obvious how to use it iteratively with replicates of the same vector. Does anyone know of a simple way to replicate \mu_s for varying lengths of \mathbf{1}?

Colin

For the \mu_s replication, I think you could first do

```
matrix[N, 10] copies = rep_matrix(mu_s, 10);
```

to get a matrix where each column is \mu_s, and then you could vectorize it:

```
vector[10 * N] copy_vector = to_vector(copies);
```

which I think will work in Stan, as it is column-major IIRC.
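(Or in one line, assuming the same names: `vector[10 * N] copy_vector = to_vector(rep_matrix(mu_s, 10));`)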

I would advise against such copying approaches. The reason is that you will increase the size of the autodiff stack by a lot, which will make your program very slow. This type of model is better expressed with loops, which can perhaps take advantage of symmetries in your expression to avoid unneeded iterations.

Thanks for the suggestions. I also wrote a little function to do the product, but perhaps the copies approach would be faster. @wds15, is there a way to avoid copying if I need the full mean vector to specify a multivariate normal distribution? In other words, if I specify an MVN distribution with 80 dimensions, I assume I must pass a vector of length 80 as the mean.

~ Colin

```
vector kron_mean(matrix A, vector B) {
  // stack rows(A) copies of B, i.e., the mean 1 (x) B
  vector[rows(A) * rows(B)] kron;
  for (j in 1:rows(A)) {
    kron[((j - 1) * rows(B) + 1):(j * rows(B))] = B;
  }
  return kron;
}
```

Yes, there is no way around that. However, if you have many of those 80-dimensional observations, then it is likely better to perform the Kronecker product in a for loop for each observation and add each observation one-by-one to the target log density.
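For the covariance itself, the Kronecker product can be built block-wise in a loop, something like this (just a sketch; R and P are hypothetical block dimensions for H and Sigma):

```
// K = H (x) Sigma, built block by block (hypothetical names and dims).
matrix[R * P, R * P] K;
for (i in 1:R)
  for (j in 1:R)
    K[((i - 1) * P + 1):(i * P), ((j - 1) * P + 1):(j * P)] = H[i, j] * Sigma;
```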

I am going to trust your expertise on this, but it runs a bit counter to my intuitions — maybe you can explain?

Wouldn’t the copy operation only add a rep_matrix and a to_vector node to the graph? Those are fairly simple linear operations. The loop approach would instead add N nodes, which seems like it would be slower?

Isn’t this in direct opposition to the usual advice about vectorizing where possible?

Sorry, there might be some distinction I am not grasping here.

No need to trust me on this one… I encourage you to try things out; I may have misunderstood your model. Coding this up in different variants is helpful for understanding the inner mechanics of Stan. The logic of what makes a program fast or slow is sometimes counter-intuitive. One example of that is rep_matrix in your case; another is diag_matrix (which creates N×N items on the stack, not just N).

No, the rep_matrix is the bummer. It creates a new element on the stack for every element of the new matrix.

Yes, it is. Stan is different to some extent here. However, if you need matrix products, use them!

Programming advice is always language-specific! There’s a chapter in the manual trying to explain the efficiency concerns for Stan (which are not only in computing the log density efficiently, but also in parameterizing efficiently).

For Stan, vectorizing where possible is good, all else being equal. What you have to watch out for is implicitly introducing new variables. So writing diag_matrix(sigma) unfortunately creates a whole new matrix and then copies sigma into the diagonal and copies 0 values into the other positions. That puts a lot of things on the stack that get derivatives computed even though they’re zero and we don’t care about them. So in this case, it’s better to just add directly in a loop.
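For example, to add noise to the diagonal of a covariance matrix, instead of going through diag_matrix you can touch just the diagonal (a minimal sketch; K, sigma, and N are hypothetical):

```
// Wasteful: creates and differentiates N * N new entries, mostly zeros.
// K = K + diag_matrix(rep_vector(square(sigma), N));

// Cheaper: only the N diagonal entries are touched.
for (n in 1:N)
  K[n, n] = K[n, n] + square(sigma);
```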

As explained in the manual chapter, loops themselves aren’t the culprit, since that’s all compiled down to efficient C++.

If we can figure out how to add the right expression templates, this advice might change for Stan in the future.


Well, I did mean for Stan :) There is a lot of advice about vectorizing, both here on Discourse and in the manual.

I can see how diag_matrix instantiates a matrix, which of course takes up some memory. But why does it increase the auto-diff stack disproportionately? Shouldn’t it just be treated as a vector->matrix function with a known derivative? Are you saying that all of the zero elements are differentiated through separately?

Yes. This is what happens.

To know the cost of some function, you have to trace the code and look at how many var instances are created. It is not the number of function calls that matters, but mostly the number of var variables created. For diag_matrix we throw N×N − N zeros onto the stack… which is why a loop is better in most cases.

Is auto-differentiation always in terms of scalars in Stan? I don’t know the internals, but it seems to be hinted that vectorizing code elsewhere (e.g., not looping over Gaussian variables) can lead to more efficient autodiff.

We track scalars, yes. Vectorizing is usually a very good thing, because things can be expressed much more efficiently arithmetically if code is vectorized. For example, doing y ~ normal(mu, sigma) with y and sigma being vectors is a lot better than doing the for loop over each element! So do vectorize as a default, but there are these exceptions (whenever lots of unneeded vars are created). Another performance bummer is multiplication of a large and very sparse matrix with a vector (the current sparse matrix-vector multiply is not the fastest, and not exploiting sparseness will make Stan track every 0).
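Concretely, the two variants look like this (a sketch with hypothetical names):

```
// Looped: one normal_lpdf call, with its own autodiff nodes, per element.
for (n in 1:N)
  y[n] ~ normal(mu[n], sigma);

// Vectorized: a single call that shares computation (e.g., log(sigma))
// across all N elements.
y ~ normal(mu, sigma);
```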

I am not allowed to edit … I meant y and mu being vectors…

No, there is matrix autodiff, but it all bottoms out in scalars. There’s an arXiv paper that explains how it all works, but it’s pretty standard autodiff.

You can’t edit your own posts? We should change config if possible!

@wds15 do you not see an option or does it give you an error or something? You should be able to edit your post. (Edit: like I just did to mine.)

@Bob_Carpenter I looked, and there does seem to be a Discourse admin setting limiting edits to within 86400 minutes (60 days, but yes, the setting is in minutes!) of the original post. Probably the default setting unless we change it, which we could. Not that it would help in the case of Sebastian’s post from 2 days ago. I don’t know why that doesn’t work.

I’ve had some success fitting spatially varying parameters with a matrix normal distribution.
If your \theta's are parameters, you may be able to solve your problem by using the non-centered parameterization. That should be able to avoid the Kronecker product entirely, I think.
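Roughly, the idea looks like this as a sketch, with hypothetical names and dimensions (L_H and L_Sigma would be Cholesky factors of H(\phi) and \Sigma):

```
// If Z is a matrix of std normals, then Theta = M + L_H * Z * L_Sigma'
// has vec(Theta) ~ MVN(vec(M), Sigma (x) H), so the Kronecker product
// never needs to be formed explicitly.
// M, L_H, L_Sigma are defined elsewhere (data / transformed parameters).
parameters {
  matrix[R, P] Z;  // R sites, P parameters (hypothetical)
}
transformed parameters {
  matrix[R, P] Theta = M + L_H * Z * L_Sigma';
}
model {
  to_vector(Z) ~ normal(0, 1);
}
```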


I can edit some posts (presumably those which I created), but here I do see the “edit” link. When I click on it, I can enter my text, but after clicking “save edits” I see an “is forbidden” window, which forces me to abandon the edits.

Thanks everyone for the suggestions about the Kronecker product notation for representing matrix normal distributions. Another approach I am trying, to avoid using the matrix normal, is to break the spatial covariance and the covariance between parameters into separate pieces, e.g.,

\epsilon \sim N_3(0, \Sigma)
\alpha_i \sim N_{28}(0, H(\phi_i)), where i = 1 \dots 3 for the 3 parameters,
and then putting these pieces together with the mean as
(\theta_1, \theta_2, \theta_3) = \mu_\theta + \alpha_{.} + \epsilon

This makes the model easier to follow, I think. However, I am getting some really bizarre posteriors for the parameters: some are well behaved and others do not fit the data at all. I know Stan is an imperative programming language, and I am wondering if the sequence in which I build the covariance matrix is having an effect on the sampler. Has anyone tried such an approach before?
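In Stan pseudocode, the decomposition I have in mind looks roughly like this (just a sketch; the names and dimensions are placeholders):

```
// matrix[28, 3] alpha;     // spatial effects, one column per parameter
// vector[3] epsilon[28];   // cross-parameter effects, one per site
for (i in 1:3)
  col(alpha, i) ~ multi_normal(rep_vector(0, 28), H[i]);  // spatial piece
for (s in 1:28)
  epsilon[s] ~ multi_normal(rep_vector(0, 3), Sigma);     // parameter piece
// then theta[s] = mu_theta + alpha[s]' + epsilon[s]
```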

That should work for general covariance if it’s block diagonal. No idea how that ties into Kronecker products, but you should be able to check that you’re getting the right answer.

The sequence of building it shouldn’t matter—just the eventual result.

Bob,

Thanks for the reply. This is how I was trying to re-code the problem. Going from this:

\theta^{(\delta, \sigma_1, \sigma_2)}_{rs} \stackrel{ind}\sim N_{84}(\mathbf{1} \otimes \mu_{s,3:5}, H(\phi)\otimes \Sigma)

to

\epsilon_{rs} \stackrel{ind}\sim N_{3}(0, \Sigma)
\alpha_{.si} \stackrel{ind}\sim N_{28}(0, H(\phi_i))
\theta^{(\delta, \sigma_1, \sigma_2)}_{rs} = \mu^{(\delta, \sigma_1, \sigma_2)}_{s} + \alpha_{rs.} + \epsilon_{rs}

where i = 1 \dots 3 indicates the parameter.

I am going to simulate some data from my model to check my code and see if that turns up any issues.

Colin