Release of CmdStan 2.39.0

We are very happy to announce that CmdStan 2.39.0 is out!

This release cycle brings the embedded Laplace approximation, new functions, and many fixes.

For details, see the blog post: Release of CmdStan 2.39 – The Stan Blog

Thank you @WardBrian and everyone else who contributed!

Thank you for introducing the embedded Laplace approximation in this release. Could you please point to a case study or example showing how to use it for fitting hierarchical models, such as a random-intercept and random-slope model, using CmdStanR?

Thanks

Hi @satpal -

I believe these kinds of materials are still being prepared by @charlesm93, @avehtari and company. The older examples I link to in the blog post may be informative for the general modeling technique, but the nuts and bolts of the syntax have changed and it wasn’t plugged all into cmdstanr at the time.

Hopefully an increasing number of resources will become available soon!

As varying intercept and slope models work usually quite well with HMC/NUTS and as INLA, TMB handle these if more speed is needed, and for bigger data full Bayes is not needed and lme4 can be used, we have not even planned this kind of case studies. I would expect than in Stan there wpould not be much speed benefit. The users guide has an example for Gaussian processes, and the paper Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond has and example with regularized horseshoe prior, which are more towards where Stan had benefit over those other packages and where the posterior has more challenging shape than in regular varying intercept and slope models. I’ll add examples for using Laplace to improve leave-one-out and leave-one-group-out cross-validation. If you think you have an example where HMC/NUTS struggles and there are Gaussian latent variables, start a new topic, post the example, and we can take a look whether the new embedded Laplace helps.

Can I perhaps suggest that Logistic Normal Multinomial (LNM) model could be a good application to document, where the explicit residual parametrisation is very hard to model at the moment.

I tag @Yingnan_Gao who is currently trying to adapt your binomial gaussian process example to LNM.

@Yingnan_Gao feel free to start a new post with your 3 component example (stan code and data) where the model struggles a lot. Please add “[Laplace approximation]” in the title.

Thanks @WardBrian and @avehtari

Below is an example of the type of hierarchical model I am trying to speed up. The example involves fitting a quadratic polynomial model to longitudinal data, but the time variable is transformed nonlinearly within the polynomial function as `(t−B) * exp(C)`, where B and C are parameters.

The actual model is much more complex, involving splines rather than a simple polynomial function, and it includes a larger number of parameters.



###########################################
# Stan code
###########################################

scode <-
  "
functions {
 // Define function to create polynomial design matrix
 matrix poly_mat(vector t, int degree) {
    int N = size(t);
    int K = degree;
    matrix[N, K] X;
    for(i in 1:K) {
     X[, i] = t ^ i;
    }
    return X;
  }
  // Evaluate polynomial
  vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
    int degree = 2;
    int N = size(t);
    int K = degree;
    vector[N] mu;
    vector[N] T;
    T = (t - b) .* exp(c);
    matrix[N, K] X = poly_mat(T, K);
    mu = a + X[, 1] .* beta1 + X[, 2] .* beta1;
    return mu;
  }
}
data {
  int<lower=1> N;  
  vector[N] Y;
  vector[N] time;
  int<lower=1> J; 
  array[N] int<lower=1> id;  
}
transformed data {
 int<lower=1> K = 3;               // Number of population level effects
 int<lower=1> M = K;               // Number of individual levels effects
 int<lower=1> NC_1 = (K*(K-1))/2;  // Number of correlation parameters
}
parameters {
  real mu_a;  
  real mu_b;  
  real mu_c;  
  real mu_beta1; 
  real mu_beta2; 
  real<lower=0> sigma; 
  matrix[M, J] z;  
  vector<lower=0>[M] sd;  
  cholesky_factor_corr[M] L;  
}
transformed parameters {
  matrix[J, M] r_1 = transpose(diag_pre_multiply(sd, L) * z);
  vector[J] re_a = r_1[, 1];
  vector[J] re_b = r_1[, 2];
  vector[J] re_c = r_1[, 3];
}
model {
  vector[N] mu;
  vector[N] A;
  vector[N] B;
  vector[N] C;
  for (n in 1:N) {
    A[n] = mu_a + re_a[id[n]] ;
    B[n] = mu_b + re_b[id[n]] ;
    C[n] = mu_c + re_c[id[n]] ;
  }
  vector[N] Beta1;
  vector[N] Beta2;
  for (n in 1:N) {
    Beta1[n] = mu_beta1  ;
    Beta2[n] = mu_beta2  ;
  }
  mu = (poly_eval(time , A , B , C, Beta1, Beta2));
  target += normal_lupdf(Y | mu, sigma);
  target += normal_lupdf(mu_a | 160, 20);
  target += normal_lupdf(mu_b | 0, 1.0);
  target += normal_lupdf(mu_c | 0, 0.1);
  target += normal_lupdf(mu_beta1 | 5, 2);
  target += normal_lupdf(mu_beta2 | 1, 1);
  target += normal_lupdf(sigma | 1, 1);
  target += normal_lupdf(sd[1] | 10, 10);
  target += normal_lupdf(sd[2] | 1, 1);
  target += normal_lupdf(sd[3] | 0.1, 0.5);
  target += lkj_corr_cholesky_lupdf(L | 1);
  target += std_normal_lupdf(to_vector(z));
}
generated quantities {
  corr_matrix[M] Cor_1 = multiply_lower_tri_self_transpose(L);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  for (k in 1:M) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
  // predict Y (population) and Yhat (individual) trajectories
  vector[N] pred_Y;
  vector[N] pred_Yhat;
  {
    vector[N] A;
    vector[N] B;
    vector[N] C;
    for (n in 1:N) {
      A[n] = mu_a + re_a[id[n]] ;
      B[n] = mu_b + re_b[id[n]] ;
      C[n] = mu_c + re_c[id[n]] ;
    }
    vector[N] Beta1;
    vector[N] Beta2;
    for (n in 1:N) {
      Beta1[n] = mu_beta1  ;
      Beta2[n] = mu_beta2  ;
    }
    pred_Y    = (poly_eval(time, rep_vector(mu_a, N), 
    rep_vector(mu_b, N), 
    rep_vector(mu_c, N),
    rep_vector(mu_beta1, N),
    rep_vector(mu_beta1, N)
    ));
    pred_Yhat = (poly_eval(time , A, B, C, Beta1, Beta2));
  }
  
}
"

###########################################
# Data
###########################################

text_dat <- "    id    age height
1    1  9.544  136.8
2    1 10.040  139.9
3    1 10.538  142.3
4    1 11.058  145.9
5    1 12.055  155.0
6    1 12.550  159.4
7    1 13.070  161.9
8    1 13.569  163.1
9    1 14.067  164.3
10   1 14.543  165.4
11   1 15.157  165.5
12   1 15.576  165.8
13   1 15.964  165.8
14   2  9.503  134.6
15   2  9.999  137.6
16   2 10.497  140.5
17   2 11.017  143.7
18   3  9.396  129.6
19   3 10.390  133.8
20   3 10.910  135.5
21   3 11.389  137.4
22   3 11.907  140.4
23   3 12.402  145.7
24   3 12.923  150.2
25   4  9.240  135.7
26   4  9.736  138.2
27   4 10.234  141.1
28   4 10.754  143.9
29   4 11.233  146.1
30   4 11.751  149.4
31   4 12.246  152.1
32   4 12.767  156.3
33   4 13.763  164.3
34   4 14.240  165.9
35   4 14.853  168.0
36   4 15.277  168.4
37   5 10.182  148.3
38   5 10.678  151.6
39   5 11.176  154.6
40   5 12.175  160.6
41   5 12.693  163.6
42   5 13.188  166.3
43   5 13.708  167.4
44   5 14.207  168.6
45   5 14.705  169.9
46   5 15.181  170.5
47   5 15.795  170.7
48   5 16.214  171.4
49   5 16.602  171.4
50   6  9.547  142.2
51   6 10.042  146.0
52   6 10.541  149.1
53   6 11.061  152.4
54   6 11.540  154.1
55   6 12.057  155.3
56   6 12.553  156.6
57   6 13.073  155.4
58   6 13.572  156.7
59   6 14.070  156.3
60   6 14.546  156.5
61   6 15.159  156.5
62   6 15.578  155.8
63   6 15.967  156.2
64   7  8.912  123.0
65   7  9.407  125.1
66   7  9.906  127.8
67   7 10.426  130.3
68   7 10.905  132.5
69   7 11.422  134.2
70   7 11.918  136.3
71   7 12.438  138.4
72   8  9.229  128.7
73   8  9.725  130.9
74   8 10.223  134.2
75   8 10.743  136.2
76   8 11.222  140.0
77   8 11.740  142.9
78   8 12.235  146.6
79   8 12.756  150.6
80   8 13.254  154.7
81   8 13.752  158.4
82   8 14.229  161.2
83   8 14.842  162.0
84   8 15.261  162.5
85   8 15.650  162.8
86   9  9.440  139.2
87   9  9.936  143.5
88   9 10.434  148.5
89   9 10.954  152.0
90   9 11.433  154.7
91   9 12.446  158.3
92   9 12.966  159.2
93   9 13.963  160.7
94  10  8.312  123.4
95  10  8.808  125.8
96  10  9.306  127.6
97  10  9.826  130.5
98  10 10.305  133.1
99  10 10.823  137.0
100 10 11.318  140.4
101 10 11.838  145.0
102 10 12.337  148.5
103 10 12.835  150.8
104 10 13.311  152.6
105 10 14.344  154.9
106 10 14.732  155.3
107 11  9.670  140.4
108 11 10.166  142.8
109 11 10.664  145.7
110 11 11.184  148.4
111 11 11.663  150.4
112 11 12.181  154.0
113 11 12.676  158.4
114 11 13.196  162.6
115 11 13.695  165.9
116 11 14.193  167.6
117 11 14.669  169.3
118 11 15.283  170.1
119 11 15.707  170.6
120 11 16.090  170.0
121 12 10.103  140.3
122 12 10.598  142.5
123 12 11.097  144.9
124 12 11.617  147.3"


data    <- read.table(text = text_dat, header = TRUE)
data$id <- as.integer(data$id)


###########################################
# Data - list format
###########################################

stan_data <- list(
  N = nrow(data),
  J = length(unique(data$id)),
  id = data$id,
  time = data$age,
  Y = data$height
  )



###########################################
# Model fit
###########################################

fit <- rstan::stan(
  model_code = scode,
  data = stan_data,
  chains = 2, 
  cores = 2,
  iter = 2000,
  warmup = 1000,
  seed = 12345,
  init = 0,
  control = list(adapt_delta = 0.95, max_treedepth = 12)
  )

Thanks

For any model with a large number of observations N, crucial part is to have sparse Hessian of the log likelihood. This is the same with INLA and TMB, too, but they are specialized for this and do the sparse work automatically. Stan doesn’t have sparse matrices, and currently only block sparse Hessian with equal block sizes is supported (unequal Hessian block sizes can be managed in some cases by calling laplace_marginal function once per each block). The first step here would be to figure out whether Hessian of normal_lupdf(Y | mu, sigma) has block structure with respect to r_1 and conditional on mu_a, mu_b, mu_c, mu_beta1, mu_beta2, sigma. This is equivalent to checking that there are small groups of the log-likelihood vector that each depend only on small number of terms in
r_1. Can you do that?

Thanks @avehtari for looking into it

checking that there are small groups of the log-likelihood vector that each depend only on small number of terms in
r_1. Can you do that?

I don’t know how to check it explicitly, but I think each child (individual/subject) contributes its own local likelihood, so it is a block-by-child structure.

Is Y ordered so that observations for each child are together and are there equal number of observations per child?

The observations are arranged by time, but the design is unbalanced. Each child has their own ordered measurement times, and the ages are not the same across children. Therefore, the data are longitudinal and clustered within child but not balanced in the sense of equal numbers of observations at the same time points for all children.

For example, suppose there are three children:

  • Child 1: ages 2, 4, 6, 8

  • Child 2: ages 3, 5, 7

  • Child 3: ages 1, 2, 4, 6, 9, 12

Here, the measurements are still ordered within each child, but each child has a different number of observations and the ages do not match across children.

Because height typically increases with age, ordering the time points also implies that the height measurements should generally increase over time, aside from random measurement error.

It sounds like this would require general unstructured sparse matrix support which Stan doesn’t have. Sorry to disappoint you

Thank you for clarifying.

I appreciate you checking, and although this is disappointing, it helps narrow down the possible directions.

We only have sparse-matrix/vector products if you can finagle your sparse matrix into the right representation.

I’m guessing this is a bug and the second value should be beta2 since it was passed in as an argument and not used.

You’ll find it much more efficient to not create that matrix and just code the operations directly:

  vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
    int N = size(t);
    vector[N] T = (t - b) .* exp(c);
    vector[N] mu = a + T .* beta1 + T^2 .* beta2;  // fixed typo?
    return mu;
  }

I actually find it more readable, too. It won’t generalize beyond K = 2, but then neither will the original code as written with a hard coded value of 2. I took the liberty of rewriting the rest of your model (I changed some names to match our conventional choices). It may dump out a little more than your original example in that I’m writing out A, B, C, Beta1, Beta2 rather than recomputing. I’ve also moved declarations closer to usage and removed some redundant parens and spaces. I also added offset and scaling to parameter so initializations are closer to from the prior and adaptation will go faster if those are reasonable priors. The variable K never gets used. poly_eval() hard coded K = 2, so I wasn’t sure if the K = 3 constant was a typo or what.

functions {
  vector poly_eval(vector t, vector a, vector b, vector c, vector beta1, vector beta2) {
    int K = 2;
    int N = size(t);
    vector[N] T = (t - b) .* exp(c);
    vector[N] mu = a + T .* beta1 + T^2 .* beta1;
    return mu;
  }
}
data {
  int<lower=1> N;  
  vector[N] Y;
  vector[N] time;
  int<lower=1> J; 
  array[N] int<lower=1> id;  
}
transformed data {
  int<lower=1> K = 3;
  int<lower=1> M = K;
  int<lower=1> NC_1 = (K * (K - 1)) / 2;
}
parameters {
  real<offset=160, scale=20> mu_a;  
  real mu_b;  
  real<multipler=0.1> mu_c;  
  real<offset=5, multiplier=2> mu_beta1; 
  real mu_beta2; 
  real<lower=0> sigma; 
  matrix[M, J] z;  
  vector<lower=0>[M] sd;  
  cholesky_factor_corr[M] L;  
}
transformed parameters {
  matrix[J, M] r_1 = (diag_pre_multiply(sd, L) * z)';
  vector[N] A = mu_a + r_1[ , 1];
  vector[N] B = mu_b + r_1[ , 2];
  vector[N] C = mu_c + r_2[ , 3];
  vector[N] Beta1 = rep_vector(mu_beta1, N);
  vector[N] Beta2 = rep_vector(mu_beta2, N);
}
model {
  vector[N] mu = poly_eval(time, A, B, C, Beta1, Beta2);
  target += normal_lupdf(Y | mu, sigma);
  target += normal_lupdf(mu_a | 160, 20);
  target += normal_lupdf(mu_b | 0, 1.0);
  target += normal_lupdf(mu_c | 0, 0.1);
  target += normal_lupdf(mu_beta1 | 5, 2);
  target += normal_lupdf(mu_beta2 | 1, 1);
  target += normal_lupdf(sigma | 1, 1);
  target += normal_lupdf(sd[1] | 10, 10);
  target += normal_lupdf(sd[2] | 1, 1);
  target += normal_lupdf(sd[3] | 0.1, 0.5);
  // target += lkj_corr_cholesky_lupdf(L | 1);  // this is uniform, so a no-op
  target += std_normal_lupdf(to_vector(z));
}
generated quantities {
  corr_matrix[M] Omega = multiply_lower_tri_self_transpose(L);
  vector[N] pred_Y = poly_eval(time, rep_vector(mu_a, N), rep_vector(mu_b, N),
			       rep_vector(mu_c, N), rep_vector(mu_beta1, N),
			       rep_vector(mu_beta1, N));
  vector[N] pred_Yhat = poly_eval(time, A, B, C, Beta1, Beta2);
}

There’s not a lot more you can do to speed this model up in terms of its code. I didn’t want to go too crazy or I would have merged A and B and C into a matrix, and Beta1 and Beta2 into another and I would have renamed sd to tau (there’s already a sigma acting like a standard deviation).

The original question was about laplace_marginal function and the sparse matrix operations would be needed in C++ part of the laplace_marginal function.