Defining an array of matrices (or vectors) of variable length

Hi dear Stan community,

I am still a Stan rookie, having worked with it for the last 2 months, and I am still amazed by the concentration of technology and knowledge that is in there :-)

My post concerns the ability (or not) to define arrays of matrices (or vectors) of variable length in Stan programs.

I am working on a research project that consists in analyzing proteomics data acquired by mass spectrometry. The problem is composed of a large series of linear models (one model per protein k: Y_k = X_k\beta_k + \epsilon_k for k = 1,\ldots,K with K > 1000), which I would like to connect together by defining a hierarchical structure on some of the \beta_k and/or \sigma_k parameters. However, some of the regressors are not common to all linear sub-models, so the number of columns p_k in the X_k matrices can vary a great deal between sub-models. The number of observations n_k also varies between sub-models.

In order to keep the semantics clear, and in a kind of naive fashion, I’d like to define my input data like this:

data {
  int<lower=1> K;       // number of proteins
  int<lower=1> N[K];    // number of observations per protein
  int<lower=1> M[K];    // number of parameters in linear model (per protein)
  vector[N] y[K];       // vectors of responses (per protein, each length = N[k])
  matrix[N, M] X[K];    // model matrices (per protein, each nrow = N[k], ncol = M[k])
}

But Stan complains that the dimensions given for the y vectors and the X matrices must be scalar integers, not integer arrays.
I tried two solutions to circumvent this issue, neither of them completely satisfactory:

  1. Define the dimensions of the y’s and the X’s as the maximum dimensions encountered in any model. This works, but at the cost of a huge waste of memory (a few models have much bigger N and M than the others), which in turn even precludes parallelizing the simulation on my laptop due to memory constraints.
  2. Flatten everything into long vectors and keep track of each model’s position with intermediate index variables. This also works, and seems efficient in terms of running time and memory consumption, but it loses the semantics, causing big headaches and extra post-processing work to find back ‘what parameter is what’ at analysis time.
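For concreteness, here is a minimal sketch of option 2, under assumed variable names: each X_k is flattened column-major into `x_flat`, the responses and coefficients are concatenated across proteins, and `segment` plus `to_matrix` recover each sub-model inside the loop. The totals `N_total` and `X_total` would be precomputed outside Stan.

```stan
data {
  int<lower=1> K;                 // number of proteins
  array[K] int<lower=1> N;        // observations per protein
  array[K] int<lower=1> M;        // coefficients per protein
  int<lower=1> N_total;           // sum(N): total observations
  int<lower=1> X_total;           // sum over k of N[k] * M[k]: total X entries
  vector[N_total] y;              // all responses, concatenated across proteins
  vector[X_total] x_flat;         // each X_k flattened column-major, then concatenated
}
parameters {
  vector[sum(M)] beta;            // all coefficients, concatenated across proteins
  vector<lower=0>[K] sigma;       // one residual scale per protein
}
model {
  int y_pos = 1;                  // running offsets into the flat containers
  int x_pos = 1;
  int b_pos = 1;
  for (k in 1:K) {
    // rebuild X_k from its flat slice and apply the k-th linear model
    matrix[N[k], M[k]] X_k
      = to_matrix(segment(x_flat, x_pos, N[k] * M[k]), N[k], M[k]);
    segment(y, y_pos, N[k]) ~ normal(X_k * segment(beta, b_pos, M[k]), sigma[k]);
    y_pos += N[k];
    x_pos += N[k] * M[k];
    b_pos += M[k];
  }
}
```

This uses the current `array[K] int` declaration syntax; hierarchical priors on slices of `beta` or on `sigma` would be added in the model block as needed.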

I assume I am not the first one to run into this kind of question/issue, so would someone have a ‘magic trick’ to solve it?

Thanks a lot,


There has been some work on implementing support for ragged arrays, but for now the two options you outline are what is available, as far as I know.

You may have looked at it already, but the user guide has some advice:

When dealing with similar problems, I’ve gone for the index-fiddling approach. The segment function is really useful for that. And if you pass the indices as data, you can reuse them in postprocessing.
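To illustrate that suggestion: if the flat containers are built in protein order, the start indices can either be passed as data (so they are available again at post-processing time) or derived once in `transformed data`, as in this sketch with assumed names:

```stan
transformed data {
  array[K] int y_start;   // first position of protein k in the flattened y
  array[K] int b_start;   // first position of protein k's coefficients in beta
  {
    int pos = 1;
    for (k in 1:K) {
      y_start[k] = pos;
      pos += N[k];
    }
    pos = 1;
    for (k in 1:K) {
      b_start[k] = pos;
      pos += M[k];
    }
  }
}
```

With these, the model block reduces to calls like `segment(y, y_start[k], N[k])` and `segment(beta, b_start[k], M[k])`; passing the same index arrays as data instead of computing them here makes it easy to map draws of `beta` back to proteins after sampling.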


Hi Erling,

Thanks for your answer.
I must say I was expecting it, but at least I have double-checked that I was not overlooking anything.
And indeed I had read the paragraph on the ragged data structures in the user guide.