I’m trying to code a model where there are 90 input features that can be divided into 13 mutually exclusive groups. Features from the same group follow the same normal distribution for each subject, so we end up with 13 normal distributions per subject. We use the means of these normal distributions as new features, which can be seen as a dimension reduction from 90 features to 13. We then use the reduced 13 features to fit a regression model with a continuous outcome. If we don’t give the true group of each feature, but only the total number of groups, then the group parameter for each of the 90 features, a simplex vector with 13 elements representing the probabilities that the feature belongs to each group, can be generated from a multinomial distribution. How do we code this in Stan?
Hi, @Wei_Jia and sorry it’s taken us so long to respond. You’d write the model this way to extract the normal fits.
data {
int<lower=0> N_subjects;
array[90] int<lower=1, upper=13> group_id;
matrix[N_subjects, 90] x;
}
parameters {
matrix[N_subjects, 13] mu;
matrix<lower=0>[N_subjects, 13] sigma;
}
model {
for (n in 1:N_subjects) {
for (p in 1:90) {
x[n, p] ~ normal(mu[n, group_id[p]], sigma[n, group_id[p]]);
}
}
}
This assumes you know the group IDs and the data is univariate.
Are you in the situation where you don’t have the group IDs or the data is multivariate per “feature”? If so, it’s a clustering or mixture problem, which is harder. Presumably there the simplex is the probability of each of the 13 elements in a 13-dimensional mixture, but then you want the posterior probabilities of each of the 90 input features belonging to each of the 13 elements, and you want to then work in expectation as in the EM algorithm. But it’s not an ordinary mixture problem, in that you have N 90-vectors of observations and want to group the elements of those 90-vectors into 13 categories, presumably in a way that’s consistent among the N subjects.
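One way to make that tractable is to marginalize each feature’s group assignment separately: given the means and scales, the features are conditionally independent, so each feature contributes a 13-way log-sum-exp summed over subjects, which keeps the assignment consistent across subjects. A minimal sketch of that idea, assuming a single mixing simplex `theta` shared by all 90 features (the variable names here are illustrative, not from the original post):

```stan
data {
  int<lower=1> N_subjects;
  matrix[N_subjects, 90] x;
}
parameters {
  simplex[13] theta;                     // probability a feature falls in each group
  matrix[N_subjects, 13] mu;             // per-subject, per-group location
  matrix<lower=0>[N_subjects, 13] sigma; // per-subject, per-group scale
}
model {
  for (p in 1:90) {
    // marginalize feature p's group assignment over all 13 groups,
    // accumulating the likelihood across every subject for each candidate group
    vector[13] lp = log(theta);
    for (k in 1:13)
      for (n in 1:N_subjects)
        lp[k] += normal_lpdf(x[n, p] | mu[n, k], sigma[n, k]);
    target += log_sum_exp(lp);
  }
}
```

The posterior probability that feature p belongs to each group can then be recovered in `generated quantities` by recomputing `lp` and applying `softmax(lp)`. Note this sketch omits priors on `mu` and `sigma`, which you’d want in practice, and mixtures like this are prone to label switching and multimodality.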
Hi @Bob_Carpenter , thank you so much for your reply!
I’ve been searching on this problem since then. What I’m trying to build is a stochastic block model (SBM), where for each subject I have a graph with 90 nodes, and these nodes can be grouped into 13 groups. What I have as input data is a 90×90 connectivity matrix for each subject. The SBM can cluster nodes based on their connectivity (e.g., connectivities of nodes from the same group would have a similar mean, i.e., come from the same normal distribution). Can you give me some hints on how to code this in Stan? Please let me know if I did not explain the problem clearly.
Thank you!
I looked it up: Stochastic block model - Wikipedia
Like all clustering models, statistically this looks like a mixture. You might want to look at K-means clustering (just a normal mixture) to get an idea of how to code clustering in Stan (there’s a chapter on mixtures in the User’s Guide). You need to marginalize out the cluster assignments, which winds up looking a lot like the “soft” version of K-means (i.e., the one based on running EM on the proper model, not the usual K-means that quantizes assignments to 0/1).
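For reference, the marginalized one-dimensional normal mixture (the “soft K-means” case, along the lines of the mixtures chapter in the User’s Guide) looks like this — a sketch assuming K components and a vector of observations `y`:

```stan
data {
  int<lower=1> N;
  int<lower=1> K;
  vector[N] y;
}
parameters {
  simplex[K] theta;        // mixing proportions
  ordered[K] mu;           // ordered to mitigate label switching
  vector<lower=0>[K] sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 1);
  for (n in 1:N) {
    // marginalize the discrete component assignment for observation n
    vector[K] lp = log(theta);
    for (k in 1:K)
      lp[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lp);
  }
}
```

The `log_sum_exp` over components is exactly the marginalization step; `softmax` of `lp` gives the soft responsibilities that EM would compute.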
The stochastic model is more challenging as it doesn’t have the conditionally i.i.d. formulation of a simple mixture. Like K-means clustering, the general maximum likelihood problem is NP-complete, so there’s not going to be a general sub-exponential algorithm. That’s not to say things can’t work on typical problems, but it’s cause for concern. I don’t see how to do the required marginalization for a stochastic block model without a combinatorial explosion—I’m not sure it’s even possible.
Discrete sampling might be possible here (not in Stan, though), but those algorithms usually struggle with mixtures due to the strong correlations of the discrete parameters.