Hi everyone!
I hope you don’t mind a noob question; I’ve recently started exploring Stan. I’m trying to replicate a matrix completion method detailed in this paper, to be used in the prediction of other thermodynamic properties: https://pubs.acs.org/doi/suppl/10.1021/acs.jpclett.9b03657/suppl_file/jz9b03657_si_001.pdf
In any case, I was able to run the following code in CmdStan using variational inference.
data {
  int<lower=0> I;           // solute
  int<lower=0> J;           // solvent
  int<lower=0> K;           // latent dimension
  real ln_gamma[I, J];      // matrix with missing data = -99.0
  real<lower=0> sigma_0;    // prior std dev
  real<lower=0> lambda;     // likelihood scale
}
parameters {
  vector[K] u[I];           // solute feature vectors
  vector[K] v[J];           // solvent feature vectors
}
model {
  // prior: draw feature vectors for all solutes and solvents
  for (i in 1:I)
    u[i] ~ normal(0, sigma_0);
  for (j in 1:J)
    v[j] ~ normal(0, sigma_0);
  // likelihood: model the probability of ln_gamma as a Cauchy distribution
  // around the dot product of the feature vectors
  for (i in 1:I) {
    for (j in 1:J) {
      if (ln_gamma[i, j] != -99.0) {  // train on available data only
        ln_gamma[i, j] ~ cauchy(u[i]' * v[j], lambda);
      }
    }
  }
}
How do I move forward with this in CmdStan to generate the predictions from the feature vectors?
Thank you!
Hi Joshua, welcome to the forums!
If you want to generate the predicted value for each combination of solute and solvent feature vectors, you just need to add a generated quantities block:
generated quantities {
  real ln_gamma_pred[I, J];
  for (i in 1:I) {
    for (j in 1:J) {
      ln_gamma_pred[i, j] = cauchy_rng(u[i]' * v[j], lambda);
    }
  }
}
I’ve used real[I,J] above for consistency with the rest of your code, but you should really use the matrix[I,J] type, as indexing is faster. Additionally, it will be more efficient to use dot_product(u[i], v[j]) rather than the explicit transpose and multiplication.
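Putting both suggestions together, a minimal sketch of the generated quantities block might look like the following (the matrix declaration and dot_product call just illustrate the points above, keeping your variable names):
generated quantities {
  matrix[I, J] ln_gamma_pred;  // matrix type instead of a 2-D real array
  for (i in 1:I) {
    for (j in 1:J) {
      // dot_product avoids the explicit transpose-and-multiply
      ln_gamma_pred[i, j] = cauchy_rng(dot_product(u[i], v[j]), lambda);
    }
  }
}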
Thank you for your prompt reply, andrjohns! Just to confirm, does this mean that I need to compile a separate executable to generate the quantities?
You add the generated quantities block to your model, so your model would then look like:
data {
  int<lower=0> I;           // solute
  int<lower=0> J;           // solvent
  int<lower=0> K;           // latent dimension
  real ln_gamma[I, J];      // matrix with missing data = -99.0
  real<lower=0> sigma_0;    // prior std dev
  real<lower=0> lambda;     // likelihood scale
}
parameters {
  vector[K] u[I];           // solute feature vectors
  vector[K] v[J];           // solvent feature vectors
}
model {
  // prior: draw feature vectors for all solutes and solvents
  for (i in 1:I)
    u[i] ~ normal(0, sigma_0);
  for (j in 1:J)
    v[j] ~ normal(0, sigma_0);
  // likelihood: model the probability of ln_gamma as a Cauchy distribution
  // around the dot product of the feature vectors
  for (i in 1:I) {
    for (j in 1:J) {
      if (ln_gamma[i, j] != -99.0) {  // train on available data only
        ln_gamma[i, j] ~ cauchy(u[i]' * v[j], lambda);
      }
    }
  }
}
generated quantities {
  real ln_gamma_pred[I, J];
  for (i in 1:I) {
    for (j in 1:J) {
      ln_gamma_pred[i, j] = cauchy_rng(u[i]' * v[j], lambda);
    }
  }
}
Then your output will contain the predicted values.
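In case it helps, the CmdStan workflow would look roughly like the sketch below; the names matrix_completion and matrix_completion.data.json are placeholders for whatever you actually call your model and data files, and the commands are run from the CmdStan directory:
# compile the model
make path/to/matrix_completion
# fit with variational inference; draws (parameters and generated quantities) go to output.csv
./path/to/matrix_completion variational data file=path/to/matrix_completion.data.json output file=output.csv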
Thank you! I’ve been experimenting with these. It seems that I get a posterior sample of size 1000. If, say, I want the posterior means, stansummary in CmdStan is the right tool for that, correct?
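For reference, assuming the variational draws were written to output.csv as in the sketch above, CmdStan's stansummary tool prints a summary table whose Mean column gives the posterior means, roughly:
bin/stansummary output.csv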