We are very happy to announce that CmdStan 2.39.0 is out!
This release cycle brings the embedded Laplace approximation, new functions, and many fixes.
For details, see the blog post: Release of CmdStan 2.39 – The Stan Blog
We are very happy to announce that CmdStan 2.39.0 is out!
This release cycle brings the embedded Laplace approximation, new functions, and many fixes.
For details, see the blog post: Release of CmdStan 2.39 – The Stan Blog
Thank you @WardBrian and everyone else who contributed!
Thank you for introducing the embedded Laplace approximation in this release. Could you please point to a case study or example showing how to use it for fitting hierarchical models, such as a random-intercept and random-slope model, using CmdStanR?
Thanks
Hi @satpal -
I believe these kinds of materials are still being prepared by @charlesm93, @avehtari and company. The older examples I link to in the blog post may be informative for the general modeling technique, but the nuts and bolts of the syntax have changed and it wasn’t plugged all into cmdstanr at the time.
Hopefully an increasing number of resources will become available soon!
As varying intercept and slope models work usually quite well with HMC/NUTS and as INLA, TMB handle these if more speed is needed, and for bigger data full Bayes is not needed and lme4 can be used, we have not even planned this kind of case studies. I would expect than in Stan there wpould not be much speed benefit. The users guide has an example for Gaussian processes, and the paper Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond has and example with regularized horseshoe prior, which are more towards where Stan had benefit over those other packages and where the posterior has more challenging shape than in regular varying intercept and slope models. I’ll add examples for using Laplace to improve leave-one-out and leave-one-group-out cross-validation. If you think you have an example where HMC/NUTS struggles and there are Gaussian latent variables, start a new topic, post the example, and we can take a look whether the new embedded Laplace helps.
Can I perhaps suggest that Logistic Normal Multinomial (LNM) model could be a good application to document, where the explicit residual parametrisation is very hard to model at the moment.
I tag @Yingnan_Gao who is currently trying to adapt your binomial gaussian process example to LNM.
@Yingnan_Gao feel free to start a new post with your 3 component example (stan code and data) where the model struggles a lot. Please add “[Laplace approximation]” in the title.
Thanks @WardBrian and @avehtari
Below is an example of the type of hierarchical model I am trying to speed up. The example involves fitting a quadratic polynomial model to longitudinal data, but the time variable is transformed nonlinearly within the polynomial function as `(t−B) * exp(C)`, where B and C are parameters.
The actual model is much more complex, involving splines rather than a simple polynomial function, and it includes a larger number of parameters.
###########################################
# Stan code
###########################################
scode <-
"
functions {
// Define function to create polynomial design matrix
matrix poly_mat(vector t, int degree) {
int N = size(t);
int K = degree;
matrix[N, K] X;
for(i in 1:K) {
X[, i] = t ^ i;
}
return X;
}
// Evaluate polynomial
vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
int degree = 2;
int N = size(t);
int K = degree;
vector[N] mu;
vector[N] T;
T = (t - b) .* exp(c);
matrix[N, K] X = poly_mat(T, K);
mu = a + X[, 1] .* beta1 + X[, 2] .* beta1;
return mu;
}
}
data {
int<lower=1> N;
vector[N] Y;
vector[N] time;
int<lower=1> J;
array[N] int<lower=1> id;
}
transformed data {
int<lower=1> K = 3; // Number of population level effects
int<lower=1> M = K; // Number of individual levels effects
int<lower=1> NC_1 = (K*(K-1))/2; // Number of correlation parameters
}
parameters {
real mu_a;
real mu_b;
real mu_c;
real mu_beta1;
real mu_beta2;
real<lower=0> sigma;
matrix[M, J] z;
vector<lower=0>[M] sd;
cholesky_factor_corr[M] L;
}
transformed parameters {
matrix[J, M] r_1 = transpose(diag_pre_multiply(sd, L) * z);
vector[J] re_a = r_1[, 1];
vector[J] re_b = r_1[, 2];
vector[J] re_c = r_1[, 3];
}
model {
vector[N] mu;
vector[N] A;
vector[N] B;
vector[N] C;
for (n in 1:N) {
A[n] = mu_a + re_a[id[n]] ;
B[n] = mu_b + re_b[id[n]] ;
C[n] = mu_c + re_c[id[n]] ;
}
vector[N] Beta1;
vector[N] Beta2;
for (n in 1:N) {
Beta1[n] = mu_beta1 ;
Beta2[n] = mu_beta2 ;
}
mu = (poly_eval(time , A , B , C, Beta1, Beta2));
target += normal_lupdf(Y | mu, sigma);
target += normal_lupdf(mu_a | 160, 20);
target += normal_lupdf(mu_b | 0, 1.0);
target += normal_lupdf(mu_c | 0, 0.1);
target += normal_lupdf(mu_beta1 | 5, 2);
target += normal_lupdf(mu_beta2 | 1, 1);
target += normal_lupdf(sigma | 1, 1);
target += normal_lupdf(sd[1] | 10, 10);
target += normal_lupdf(sd[2] | 1, 1);
target += normal_lupdf(sd[3] | 0.1, 0.5);
target += lkj_corr_cholesky_lupdf(L | 1);
target += std_normal_lupdf(to_vector(z));
}
generated quantities {
corr_matrix[M] Cor_1 = multiply_lower_tri_self_transpose(L);
vector<lower=-1,upper=1>[NC_1] cor_1;
for (k in 1:M) {
for (j in 1:(k - 1)) {
cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
}
}
// predict Y (population) and Yhat (individual) trajectories
vector[N] pred_Y;
vector[N] pred_Yhat;
{
vector[N] A;
vector[N] B;
vector[N] C;
for (n in 1:N) {
A[n] = mu_a + re_a[id[n]] ;
B[n] = mu_b + re_b[id[n]] ;
C[n] = mu_c + re_c[id[n]] ;
}
vector[N] Beta1;
vector[N] Beta2;
for (n in 1:N) {
Beta1[n] = mu_beta1 ;
Beta2[n] = mu_beta2 ;
}
pred_Y = (poly_eval(time, rep_vector(mu_a, N),
rep_vector(mu_b, N),
rep_vector(mu_c, N),
rep_vector(mu_beta1, N),
rep_vector(mu_beta1, N)
));
pred_Yhat = (poly_eval(time , A, B, C, Beta1, Beta2));
}
}
"
###########################################
# Data
###########################################
text_dat <- " id age height
1 1 9.544 136.8
2 1 10.040 139.9
3 1 10.538 142.3
4 1 11.058 145.9
5 1 12.055 155.0
6 1 12.550 159.4
7 1 13.070 161.9
8 1 13.569 163.1
9 1 14.067 164.3
10 1 14.543 165.4
11 1 15.157 165.5
12 1 15.576 165.8
13 1 15.964 165.8
14 2 9.503 134.6
15 2 9.999 137.6
16 2 10.497 140.5
17 2 11.017 143.7
18 3 9.396 129.6
19 3 10.390 133.8
20 3 10.910 135.5
21 3 11.389 137.4
22 3 11.907 140.4
23 3 12.402 145.7
24 3 12.923 150.2
25 4 9.240 135.7
26 4 9.736 138.2
27 4 10.234 141.1
28 4 10.754 143.9
29 4 11.233 146.1
30 4 11.751 149.4
31 4 12.246 152.1
32 4 12.767 156.3
33 4 13.763 164.3
34 4 14.240 165.9
35 4 14.853 168.0
36 4 15.277 168.4
37 5 10.182 148.3
38 5 10.678 151.6
39 5 11.176 154.6
40 5 12.175 160.6
41 5 12.693 163.6
42 5 13.188 166.3
43 5 13.708 167.4
44 5 14.207 168.6
45 5 14.705 169.9
46 5 15.181 170.5
47 5 15.795 170.7
48 5 16.214 171.4
49 5 16.602 171.4
50 6 9.547 142.2
51 6 10.042 146.0
52 6 10.541 149.1
53 6 11.061 152.4
54 6 11.540 154.1
55 6 12.057 155.3
56 6 12.553 156.6
57 6 13.073 155.4
58 6 13.572 156.7
59 6 14.070 156.3
60 6 14.546 156.5
61 6 15.159 156.5
62 6 15.578 155.8
63 6 15.967 156.2
64 7 8.912 123.0
65 7 9.407 125.1
66 7 9.906 127.8
67 7 10.426 130.3
68 7 10.905 132.5
69 7 11.422 134.2
70 7 11.918 136.3
71 7 12.438 138.4
72 8 9.229 128.7
73 8 9.725 130.9
74 8 10.223 134.2
75 8 10.743 136.2
76 8 11.222 140.0
77 8 11.740 142.9
78 8 12.235 146.6
79 8 12.756 150.6
80 8 13.254 154.7
81 8 13.752 158.4
82 8 14.229 161.2
83 8 14.842 162.0
84 8 15.261 162.5
85 8 15.650 162.8
86 9 9.440 139.2
87 9 9.936 143.5
88 9 10.434 148.5
89 9 10.954 152.0
90 9 11.433 154.7
91 9 12.446 158.3
92 9 12.966 159.2
93 9 13.963 160.7
94 10 8.312 123.4
95 10 8.808 125.8
96 10 9.306 127.6
97 10 9.826 130.5
98 10 10.305 133.1
99 10 10.823 137.0
100 10 11.318 140.4
101 10 11.838 145.0
102 10 12.337 148.5
103 10 12.835 150.8
104 10 13.311 152.6
105 10 14.344 154.9
106 10 14.732 155.3
107 11 9.670 140.4
108 11 10.166 142.8
109 11 10.664 145.7
110 11 11.184 148.4
111 11 11.663 150.4
112 11 12.181 154.0
113 11 12.676 158.4
114 11 13.196 162.6
115 11 13.695 165.9
116 11 14.193 167.6
117 11 14.669 169.3
118 11 15.283 170.1
119 11 15.707 170.6
120 11 16.090 170.0
121 12 10.103 140.3
122 12 10.598 142.5
123 12 11.097 144.9
124 12 11.617 147.3"
data <- read.table(text = text_dat, header = TRUE)
data$id <- as.integer(data$id)
###########################################
# Data - list format
###########################################
stan_data <- list(
N = nrow(data),
J = length(unique(data$id)),
id = data$id,
time = data$age,
Y = data$height
)
###########################################
# Model fit
###########################################
fit <- rstan::stan(
model_code = scode,
data = stan_data,
chains = 2,
cores = 2,
iter = 2000,
warmup = 1000,
seed = 12345,
init = 0,
control = list(adapt_delta = 0.95, max_treedepth = 12)
)
Thanks
For any model with a large number of observations N, crucial part is to have sparse Hessian of the log likelihood. This is the same with INLA and TMB, too, but they are specialized for this and do the sparse work automatically. Stan doesn’t have sparse matrices, and currently only block sparse Hessian with equal block sizes is supported (unequal Hessian block sizes can be managed in some cases by calling laplace_marginal function once per each block). The first step here would be to figure out whether Hessian of normal_lupdf(Y | mu, sigma) has block structure with respect to r_1 and conditional on mu_a, mu_b, mu_c, mu_beta1, mu_beta2, sigma. This is equivalent to checking that there are small groups of the log-likelihood vector that each depend only on small number of terms in
r_1. Can you do that?
Thanks @avehtari for looking into it
checking that there are small groups of the log-likelihood vector that each depend only on small number of terms in
r_1. Can you do that?
I don’t know how to check it explicitly, but I think each child (individual/subject) contributes its own local likelihood, so it is a block-by-child structure.
Is Y ordered so that observations for each child are together and are there equal number of observations per child?
The observations are arranged by time, but the design is unbalanced. Each child has their own ordered measurement times, and the ages are not the same across children. Therefore, the data are longitudinal and clustered within child but not balanced in the sense of equal numbers of observations at the same time points for all children.
For example, suppose there are three children:
Child 1: ages 2, 4, 6, 8
Child 2: ages 3, 5, 7
Child 3: ages 1, 2, 4, 6, 9, 12
Here, the measurements are still ordered within each child, but each child has a different number of observations and the ages do not match across children.
Because height typically increases with age, ordering the time points also implies that the height measurements should generally increase over time, aside from random measurement error.
It sounds like this would require general unstructured sparse matrix support which Stan doesn’t have. Sorry to disappoint you
Thank you for clarifying.
I appreciate you checking, and although this is disappointing, it helps narrow down the possible directions.
We only have sparse-matrix/vector products if you can finagle your sparse matrix into the right representation.
I’m guessing this is a bug and the second value should be beta2 since it was passed in as an argument and not used.
You’ll find it much more efficient to not create that matrix and just code the operations directly:
vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
int N = size(t);
vector[N] T = (t - b) .* exp(c);
vector[N] mu = a + T .* beta1 + T^2 .* beta2; // fixed typo?
return mu;
}
I actually find it more readable, too. It won’t generalize beyond K = 2, but then neither will the original code as written with a hard coded value of 2. I took the liberty of rewriting the rest of your model (I changed some names to match our conventional choices). It may dump out a little more than your original example in that I’m writing out A, B, C, Beta1, Beta2 rather than recomputing. I’ve also moved declarations closer to usage and removed some redundant parens and spaces. I also added offset and scaling to parameter so initializations are closer to from the prior and adaptation will go faster if those are reasonable priors. The variable K never gets used. poly_eval() hard coded K = 2, so I wasn’t sure if the K = 3 constant was a typo or what.
functions {
vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
int K = 2;
int N = size(t);
vector[N] T = (t - b) .* exp(c);
vector[N] mu = a + T .* beta1 + T^2 .* beta1;
return mu;
}
}
data {
int<lower=1> N;
vector[N] Y;
vector[N] time;
int<lower=1> J;
array[N] int<lower=1> id;
}
transformed data {
int<lower=1> K = 3;
int<lower=1> M = K;
int<lower=1> NC_1 = (K * (K - 1)) / 2;
}
parameters {
real<offset=160, scale=20> mu_a;
real mu_b;
real<multipler=0.1> mu_c;
real<offset=5, multiplier=2> mu_beta1;
real mu_beta2;
real<lower=0> sigma;
matrix[M, J] z;
vector<lower=0>[M] sd;
cholesky_factor_corr[M] L;
}
transformed parameters {
matrix[J, M] r_1 = (diag_pre_multiply(sd, L) * z)';
vector[N] A = mu_a + r_1[ , 1];
vector[N] B = mu_b + r_1[ , 2];
vector[N] C = mu_c + r_2[ , 3];
vector[N] Beta1 = rep_vector(mu_beta1, N);
vector[N] Beta2 = rep_vector(mu_beta2, N);
}
model {
vector[N] mu = poly_eval(time, A, B, C, Beta1, Beta2);
target += normal_lupdf(Y | mu, sigma);
target += normal_lupdf(mu_a | 160, 20);
target += normal_lupdf(mu_b | 0, 1.0);
target += normal_lupdf(mu_c | 0, 0.1);
target += normal_lupdf(mu_beta1 | 5, 2);
target += normal_lupdf(mu_beta2 | 1, 1);
target += normal_lupdf(sigma | 1, 1);
target += normal_lupdf(sd[1] | 10, 10);
target += normal_lupdf(sd[2] | 1, 1);
target += normal_lupdf(sd[3] | 0.1, 0.5);
// target += lkj_corr_cholesky_lupdf(L | 1); // this is uniform, so a no-op
target += std_normal_lupdf(to_vector(z));
}
generated quantities {
corr_matrix[M] Omega = multiply_lower_tri_self_transpose(L);
vector[N] pred_Y = poly_eval(time, rep_vector(mu_a, N), rep_vector(mu_b, N),
rep_vector(mu_c, N), rep_vector(mu_beta1, N),
rep_vector(mu_beta1, N));
vector[N] pred_Yhat = poly_eval(time, A, B, C, Beta1, Beta2);
}
There’s not a lot more you can do to speed this model up in terms of its code. I didn’t want to go too crazy or I would have merged A and B and C into a matrix, and Beta1 and Beta2 into another and I would have renamed sd to tau (there’s already a sigma acting like a standard deviation).
The original question was about laplace_marginal function and the sparse matrix operations would be needed in C++ part of the laplace_marginal function.