Hi all!
I am trying to use a hierarchical mixture model for a classification task. After fitting this model to some training data I would like to evaluate it on some validation or test data, without further fitting of the hyperparameters. How would I do this in Stan?
There are two reasons why I would like to avoid having an influence of the validation datapoints on the hyperparameters:
- Saving computation: fitting the model on the relatively large dataset takes a while, and in application the model is meant to predict classifications for a single new group of data. It seems wasteful to repeat this computation for each new classification, and keeping all the data in would not be representative of this application.
- The validation data could be useful for learning a better model, but to make a fair comparison to models which cannot take this into account, I would like to prevent this from happening.
Concretely, I have the following setup:
The data come from K subjects with N_k observations each, which are meant to be classified into two classes based on a feature f that contains a random offset shared among all observations from the same subject. For simplicity's sake, assume that there is only one feature dimension, which I model as f = \mu_C + \mu_k + N(0, \sigma_C). In Stan notation this is:
data {
  int<lower=1> N;                    // number of datapoints
  int<lower=1> K;                    // number of subjects
  real latents[N];                   // feature
  int<lower=1, upper=2> targets[N];  // targets, i.e. categories (used to index mu and sigma)
  int<lower=1, upper=K> subject[N];  // which subject goes with which observation
}
parameters {
  real mu_k[K];             // subject shifts
  real mu[2];               // class means
  real<lower=0> sigma[2];   // standard deviation per class
  real<lower=0> sigma_pop;  // standard deviation of mu_k
}
model {
  for (k in 1:K) {
    mu_k[k] ~ normal(0, sigma_pop);
  }
  for (n in 1:N) {
    latents[n] ~ normal(mu[targets[n]] + mu_k[subject[n]], sigma[targets[n]]);
  }
}
This is shortened and simplified from the actual program. Please excuse typos etc.
After sampling from this model's parameters (means and variances for the classes and sigma_pop) I get a new subject with N_k new observations f_val, and I would like to use Stan to sample its mean mu_k and the class memberships for the new observations, without updating any estimates for the training data.
So far I have found two approaches, neither of which achieves my aims:
- pass f_val to the original model fit, which leads to the model fit depending on this data
- use the transformed parameters block or a new model, with which I did not manage to sample the mu_k for the new validation subject
Any ideas on how this could or should be done?
You can use the generated quantities block to do calculations based on each posterior draw that do not influence the inference at all (generic docs are here).
For instance with your model:
data {
  int<lower=1> N;                              // number of datapoints
  int<lower=1> K;                              // number of subjects
  real latents[N];                             // feature
  int<lower=1, upper=2> targets[N];            // targets, i.e. categories
  int<lower=1, upper=K> subject[N];            // which subject goes with which observation
  int<lower=1> N_pred;                         // number of predictions to make
  int<lower=1, upper=2> targets_pred[N_pred];  // targets we're gonna make predictions on
  int<lower=1, upper=K> subject_pred[N_pred];  // subject for each predicted observation
}
...
generated quantities {
  vector[N_pred] latent_pred;
  for (n in 1:N_pred) {
    latent_pred[n] = normal_rng(mu[targets_pred[n]] + mu_k[subject_pred[n]], sigma[targets_pred[n]]);
  }
}
So we can do calculations in generated quantities that don’t affect the inference.
Hi!
First of all: Thanks for your answer!
Unfortunately your answer is not exactly what I was searching for. I am generally aware of the generated quantities block, but it does not seem to work for the model I am looking at, as the validation data will contain entirely new subjects, i.e. I have to be able to sample mu_k for new subjects.
Also, a small point: The intended inference is the other way around: new latent vectors will be provided and the aim is to infer the target variable.
So this will probably have to be part of your model. Like, if you have a hierarchical model, you say:
\theta_{g[i]} \sim N(\mu, \tau)\\
y_i \sim N(\theta_{g[i]}, \sigma)
And if you have a new group come along, call it \theta_k, you'd just draw it from the population-level distribution: \theta_k \sim N(\mu, \tau). So that comes from the generative part of the model (groups come from a group-level distribution).
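In generated quantities that might look something like this (a minimal sketch in the notation above; theta_new and y_new are illustrative names, not anything from your program):

generated quantities {
  real theta_new = normal_rng(mu, tau);       // draw a new group's parameter from the population
  real y_new = normal_rng(theta_new, sigma);  // then draw an observation from that new group
}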
So if you need to think about new groups, then think about how that would come up in the model. Are new groups the same as everything else?
Thanks for bearing with me. The solution you now suggest is unfortunately exactly what I do not want to do. It is essentially the first insufficient solution I mention in my original post.
To make my problem clearer, let me explain that your solution would probably run and work, but not do what I would like to do:
Essentially, your solution would result in me adding the new groups of observations to my model and estimating their individual parameters while fitting the model. If I also add the information about which targets the individual observations belong to as discrete parameters, I would get a posterior over which target the observations belong to as well. In fact this system would be quite bright, as it would be able to use the very data it's supposed to produce predictions for to do semi-supervised learning.
If I were only interested in getting the best possible guess for the data I currently have, this would be the best solution. However, I am in a typical machine-learning prediction setting, where I want to be able to produce good predictions for new groups of data coming in.
Applying this solution would imply that I need to resample the whole model, including all the previous data, to make a prediction for a new group. That would be computationally very inefficient, and additionally it would not be comparable to competitor models, which are not given access to the prediction values for adjusting their parameters.
Instead, I am searching for a method to sample only the parameters relevant for the new group of data to be classified, i.e. without further adjusting the distributions over the model parameters like the population means and standard deviations.
Not necessarily. Just draw the group-level parameters from the parent distribution in the generated quantities block. Then draw the membership conditional on the value of the feature. In short, if your model is fully generative then you should have no issues simulating variables from the model in the generated quantities block.
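For your model, a minimal sketch of that idea (assuming a single new feature value f_new is passed in as data and equal prior class probabilities; mu_k_new and z_new are illustrative names):

generated quantities {
  real mu_k_new = normal_rng(0, sigma_pop);  // new subject's shift, drawn from the population
  int z_new;                                 // class membership for one new observation
  {
    vector[2] lp;                            // log prior + log likelihood per class
    for (c in 1:2) {
      lp[c] = log(0.5) + normal_lpdf(f_new | mu[c] + mu_k_new, sigma[c]);
    }
    z_new = categorical_rng(softmax(lp));    // membership conditional on the feature value
  }
}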
@maxbiostat: Thank you for your comment, but that would be simulation, and I want to do inference on new data. My model is fully generative, but I need to run it in the opposite direction.
I have:
global parameters → group parameters → observations
Now I want to go from observations back to the group parameters for new data, without resampling the global parameters and the parameters for the other groups which constrain the global parameters.
Your idea would only allow me to move in the other direction. Of course I can simulate new groups and observations, but that is not what I want to do.
What you want to do is not posterior prediction, then. If you really want to learn the group parameters only for the new data, you can either fix the nuisance parameters to their posterior means/medians, or approximate their posteriors using some parametric family and then plug those in as informative priors.
In a second model, that is.
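As a minimal sketch, assuming the training posteriors are summarized by normal approximations passed in as data, and equal prior class probabilities (names like f_val, mu_mean, mu_k_val are illustrative):

data {
  int<lower=1> N_val;             // observations for the new subject
  real f_val[N_val];              // features for the new subject
  vector[2] mu_mean;              // posterior means of the class means
  vector<lower=0>[2] mu_sd;       // posterior sds of the class means
  vector<lower=0>[2] sigma_mean;  // posterior means of the class sds
  vector<lower=0>[2] sigma_sd;    // posterior sds of the class sds
  real<lower=0> sigma_pop_mean;   // posterior mean of sigma_pop
  real<lower=0> sigma_pop_sd;     // posterior sd of sigma_pop
}
parameters {
  real mu[2];
  real<lower=0> sigma[2];
  real<lower=0> sigma_pop;
  real mu_k_val;                  // the only genuinely new parameter
}
model {
  // informative priors built from the training fit
  mu ~ normal(mu_mean, mu_sd);
  sigma ~ normal(sigma_mean, sigma_sd);
  sigma_pop ~ normal(sigma_pop_mean, sigma_pop_sd);
  mu_k_val ~ normal(0, sigma_pop);
  // marginalize over the unknown class of each validation observation
  for (n in 1:N_val) {
    target += log_mix(0.5,
                      normal_lpdf(f_val[n] | mu[1] + mu_k_val, sigma[1]),
                      normal_lpdf(f_val[n] | mu[2] + mu_k_val, sigma[2]));
  }
}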
Great, yes this would be a solution.
But by recommending this you are also saying that I cannot directly use the posterior samples I get from the first model to specify the informative priors for the second model, correct?
i.e. I have to summarize the samples into a point estimate or a distribution for which I have a density formula to make them usable for the second Stan run.
If you really want to, you could fit the second model n different times, fixing the nuisance parameters to n different samples from their joint posterior, and then combine the posterior chains from these n runs for inference (a sketch follows the list below). This will be desirable, in particular, if
- You need to account for uncertainty in the nuisance parameters, and
- the nuisance parameters aren’t orthogonal, and it’s hard to write down a parametric approximation to their joint posterior distribution.
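As a sketch of that variant, the same second model but with the nuisance parameters moved into the data block, so that each of the n runs conditions on one posterior draw from the training fit (equal prior class probabilities are again an assumption):

data {
  int<lower=1> N_val;
  real f_val[N_val];
  real mu[2];               // one posterior draw of the class means
  real<lower=0> sigma[2];   // one posterior draw of the class sds
  real<lower=0> sigma_pop;  // one posterior draw of the population sd
}
parameters {
  real mu_k_val;            // only the new subject's shift is sampled
}
model {
  mu_k_val ~ normal(0, sigma_pop);
  for (n in 1:N_val) {
    target += log_mix(0.5,
                      normal_lpdf(f_val[n] | mu[1] + mu_k_val, sigma[1]),
                      normal_lpdf(f_val[n] | mu[2] + mu_k_val, sigma[2]));
  }
}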
Great, that would indeed do what I want.
That sounds really inefficient, though. I had hoped that there was a way to take the already-sampled parameters into account in a single Stan run.
In theory, something like basing each step in the chain on a random sample from the training inference should be possible. Being possible in theory does not imply being efficient, useful, or implemented, of course.
Even if you could hack Stan to do this, you’d have to think very carefully about whether doing so would lead to problems with MCMC central limit theorems, geometric ergodicity, and/or effective adaptation.
Don't worry, I am not going to hack Stan to do something it is not supposed to do. I just wanted to express my belief that a more efficient method for this could be possible. If Stan does not have a method for this, I am not going to hack something together, but rather stick to the established methods. For those, at least, someone already thought very carefully about the potential problems you mention.