data {
  int<lower=1> I;
  int<lower=1> L;
  int<lower=0, upper=1> obs[I, L];
  real<lower=0> sigma_mu;    // standard deviation of the prior for elements of mu
  matrix[L, L] Sigma_prior;
}
parameters {
  vector<lower=-2 * sigma_mu, upper=2 * sigma_mu>[L] mu;  // elements of vector mu
  cov_matrix[L] Sigma;
  vector[L] raw_sig_vec[I];
}
transformed parameters {
  vector[L] sig_vec[I];
  for (i in 1:I) {
    sig_vec[i] = inv_logit(raw_sig_vec[i]);  // Gaussian sample to probabilities
  }
}
model {
  real item_prob;
  mu ~ normal(0, sigma_mu);
  Sigma ~ inv_wishart(L + 1, Sigma_prior);
  for (i in 1:I) {
    raw_sig_vec[i] ~ multi_normal(mu, Sigma);
  }
  for (i in 1:I) {
    item_prob = 1.0;
    for (l in 1:L) {
      item_prob *= obs[i, l] * sig_vec[i, l] + (1 - obs[i, l]) * (1 - sig_vec[i, l]);
    }
    target += log(item_prob);
  }
}
I am a Stan newbie. I am trying to run this model for roughly I = 10000 and L = 10, but unfortunately it is too slow: running it for I = 1000, L = 2 already takes 200+ seconds. If I replace the inv_logit with a softmax it becomes even slower, which makes me think this transformation is where the time is going. Is there a way I can optimize this to make it run faster?
Any help is highly appreciated! :)
Wait. You are either doing something advanced that I’ve never seen before, or you’re doing something wrong. Can you explain your model more? I’ve never seen a model where binomial data are multiplied by parameters and the result simply log transformed to increment the log probability.
It is totally possible that I might be making a mistake.
What I am trying to capture here is that the different components of any item i are correlated; the correlation is captured by the multivariate normal draw, which is then converted into probabilities.
The generative model is as follows:
Sample the mean vector mu ~ N(0, sigma_mu) and the covariance matrix Sigma ~ inv_wishart(nu, Sigma_prior).
For each item i in [1, I] do the following:
(i) Sample a raw vector of length L: raw_sig_vec[i] ~ MVN(mu, Sigma).
(ii) Obtain a probability vector of length L: sig_vec[i] = inv_logit(raw_sig_vec[i]).
(iii) For each l in [1, L], observe obs[i,l] ~ Bernoulli(sig_vec[i,l]).
The idea of the line

  item_prob *= obs[i, l] * sig_vec[i, l] + (1 - obs[i, l]) * (1 - sig_vec[i, l]);

is that it accumulates the probability of observing obs[i,1], obs[i,2], …, obs[i,L] for a particular i. Then I add up the log probabilities of all the items i.
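To make that concrete, this is roughly how I would simulate a dataset from the generative process (a sketch written as a Stan generated quantities block; it reuses I, L, sigma_mu and Sigma_prior from the data block above and takes nu = L + 1, as in my model):

generated quantities {
  // one simulated dataset drawn from the generative process described above
  matrix[L, L] Sigma_sim = inv_wishart_rng(L + 1, Sigma_prior);
  vector[L] mu_sim;
  int obs_sim[I, L];
  for (l in 1:L)
    mu_sim[l] = normal_rng(0, sigma_mu);
  for (i in 1:I) {
    vector[L] raw = multi_normal_rng(mu_sim, Sigma_sim);  // correlated latent vector for item i
    for (l in 1:L)
      obs_sim[i, l] = bernoulli_rng(inv_logit(raw[l]));   // one Bernoulli draw per component
  }
}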
Why not just use the built-in bernoulli() function? In general you should avoid coding things up yourself when the devs have already done the work to create a verified, performant function.
That makes a lot of sense! I changed my implementation to the following:
data {
  int<lower=1> I;
  int<lower=1> L;
  int<lower=0, upper=1> obs[I, L];
  real<lower=0> sigma_mu;    // standard deviation of the prior for elements of mu
  matrix[L, L] Sigma_prior;
}
parameters {
  vector<lower=-2 * sigma_mu, upper=2 * sigma_mu>[L] mu;  // elements of vector mu
  cov_matrix[L] Sigma;
  vector[L] raw_sig_vec[I];
}
model {
  mu ~ normal(0, sigma_mu);
  Sigma ~ inv_wishart(L + 1, Sigma_prior);
  for (i in 1:I) {
    raw_sig_vec[i] ~ multi_normal(mu, Sigma);
    obs[i] ~ bernoulli_logit(raw_sig_vec[i]);
  }
}
Unfortunately, the runtime of this is also the same as the earlier code. Are there any other tricks I could use to speed this up?
We usually recommend LKJ priors rather than inverse-Wisharts for covariance matrices. That may not be what is causing the problem here, but in general we don't recommend the inverse-Wishart. Also, the hard constraints on mu look weird.
There are a couple of things you can try. The first is to use a Cholesky factor and an LKJ prior instead of the inverse-Wishart; you can then also use the vectorised multi_normal_cholesky distribution rather than a loop:
data {
  int<lower=1> I;
  int<lower=1> L;
  int<lower=0, upper=1> obs[I, L];
  real<lower=0> sigma_mu;    // standard deviation of the prior for elements of mu
  matrix[L, L] Sigma_prior;  // no longer used with the LKJ prior
}
parameters {
  vector[L] mu;              // elements of vector mu
  cholesky_factor_corr[L] Sigma;
  vector[L] raw_sig_vec[I];
}
model {
  mu ~ normal(0, sigma_mu);
  Sigma ~ lkj_corr_cholesky(1);  // you'll need to work out the appropriate prior
  raw_sig_vec ~ multi_normal_cholesky(mu, Sigma);
  for (i in 1:I) {
    obs[i] ~ bernoulli_logit(raw_sig_vec[i]);
  }
}
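One thing to be aware of: cholesky_factor_corr fixes the marginal standard deviations of raw_sig_vec to 1. If you want to keep them free, a common pattern is to add a scale vector and build the covariance Cholesky factor with diag_pre_multiply, and a non-centered parameterisation often samples much better for this kind of hierarchical model. A sketch (the priors on tau and the LKJ shape are placeholders you'd want to think about):

data {
  int<lower=1> I;
  int<lower=1> L;
  int<lower=0, upper=1> obs[I, L];
  real<lower=0> sigma_mu;
}
parameters {
  vector[L] mu;
  vector<lower=0>[L] tau;            // marginal scales of raw_sig_vec
  cholesky_factor_corr[L] L_Omega;   // Cholesky factor of the correlation matrix
  vector[L] z[I];                    // standard-normal innovations (non-centered)
}
transformed parameters {
  matrix[L, L] L_Sigma = diag_pre_multiply(tau, L_Omega);  // Cholesky factor of the covariance
  vector[L] raw_sig_vec[I];
  for (i in 1:I)
    raw_sig_vec[i] = mu + L_Sigma * z[i];  // implies raw_sig_vec[i] ~ MVN(mu, Sigma)
}
model {
  mu ~ normal(0, sigma_mu);
  tau ~ normal(0, 2.5);              // placeholder half-normal prior on the scales
  L_Omega ~ lkj_corr_cholesky(1);    // placeholder LKJ shape
  for (i in 1:I) {
    z[i] ~ normal(0, 1);
    obs[i] ~ bernoulli_logit(raw_sig_vec[i]);
  }
}

The second thing is the sufficient statistics trick: when several Bernoulli observations share the same probability, you can collapse them into a single binomial, which cuts down the number of likelihood terms.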
I am trying to figure out what would be a good way of applying the sufficient statistics trick. For a given i and l, I currently have just one observation.
I wonder if I could instead apply it to the whole length-L vector for each item, but in that case I don't know what the sufficient statistic would be. It surely wouldn't be the mean, as that wouldn't capture the correlation between the different components.
It'd be great if you could give me some direction on how I should be approaching this. :)
Oh, sorry, I was misreading the model. I see now that, as you say, you only have one observation per combination of i and l, so the sufficient statistics trick doesn't apply.
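For reference, this is the kind of thing the trick would buy you if you did have replicates: a sketch under the (hypothetical) assumption of N Bernoulli observations per (i, l) cell, summed into counts before fitting.

data {
  int<lower=1> I;
  int<lower=1> L;
  int<lower=1> N;                         // hypothetical number of replicates per (i, l) cell
  int<lower=0, upper=N> successes[I, L];  // sum of the N Bernoulli outcomes in each cell
  real<lower=0> sigma_mu;
}
parameters {
  vector[L] mu;
  cholesky_factor_corr[L] Sigma;
  vector[L] raw_sig_vec[I];
}
model {
  mu ~ normal(0, sigma_mu);
  Sigma ~ lkj_corr_cholesky(1);
  raw_sig_vec ~ multi_normal_cholesky(mu, Sigma);
  for (i in 1:I)
    successes[i] ~ binomial_logit(N, raw_sig_vec[i]);  // one binomial per cell replaces N Bernoullis
}

Since you only have one observation per cell, this reduces to exactly what you already have, so there is nothing to gain from it.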
I have a follow-up question on this. When I run this using PyStan's sampling() method, it works well. However, when I try optimizing(), I get an error:
Initial log joint probability = -5435.98
Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes
99 8196.55 0.00643506 2.28183e+06 0.03834 0.6638 157
Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes
199 10122.9 0.0126388 5.66937e+08 1.242 0.1242 296
Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
214 11170.1 0.0376207 4.53054e+09 1e-12 0.001 406 LS failed, Hessian reset
Error evaluating model log probability: Non-finite gradient.
Error evaluating model log probability: Non-finite gradient.
Optimization terminated with error:
Line search failed to achieve a sufficient decrease, no more progress can be made
stderr: