Log_prob_grad via csv files for cmdstan

I want to be able to compute the gradient of the log posterior for a provided set of parameters using cmdstan(py), as described e.g. for pystan/rstan in Sampling on a fixed grid or on a specified list of points - #9 by avehtari.

The most efficient and least awkward way to do this appears to be to add this functionality to cmdstan directly.

Is there interest in this? If so, when I implement it for myself I’ll take extra care and turn it into a pull request (conditional on me actually finding a good solution). Is this the right place to ask, or should this be done on GitHub?

Reason for implementation: to investigate the impact of ODE solver tolerances on the accuracy of the gradient of the log posterior.
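For context, the linked thread does this through the in-memory interfaces. Below is a minimal sketch of that route, assuming PyStan 2.x (unconstrain_pars, log_prob and grad_log_prob are the PyStan 2 fit methods; names and signatures differ in newer PyStan versions), and using the regression example written to regr_example.stan further down this thread:

import numpy as np
import pystan

# toy data for the regr_example.stan model shown in the post below
np.random.seed(3)
N, K = 35, 3
x = np.sort(np.random.randn(K, N), axis=1)
y = x * np.c_[[2.3, 4.5, 6.6]] + 10 + np.random.randn(K, N) / 10
data = dict(N=N, K=K, x=x, y=y)

sm = pystan.StanModel(file="regr_example.stan")
fit = sm.sampling(data=data, iter=200, chains=1)

# constrained parameter values at which to evaluate lp__ and its gradient
pars = {"b": 1.0, "m": [0.1, 0.2, 0.3], "sigma": [1.0, 1.0, 1.0]}

# map to Stan's unconstrained space, then evaluate the target and its gradient
upars = fit.unconstrain_pars(pars)
lp = fit.log_prob(upars, adjust_transform=True)
grad = fit.grad_log_prob(upars, adjust_transform=True)

Having the same thing available through cmdstan/cmdstanpy is what this request is about.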


There is this hacky way of doing it with model.sample + inits.

If step_size and inv_metric are given, the results are wrong (?); I'm not sure why.

model_code = """
data {
    int N;
    int K;
    vector[N] x[K];
    vector[N] y[K];
}
parameters {
    real b;
    real m[K];
    real<lower=0> sigma[K];
}
model {
    for (k in 1:K) {
        y[k] ~ normal(x[k] * m[k] + b, sigma[k]);
    }
    b ~ std_normal();
    m ~ std_normal();
}
"""

stan_path = "regr_example.stan"
with open(stan_path, "w") as f:
    print(model_code, file=f)

import numpy as np
np.random.seed(3)
N = 35
K = 3
x = np.sort(np.random.randn(K, N), axis=1)

m_true = np.c_[[2.3, 4.5, 6.6]]
b_true = 10

y_raw = x * m_true + b_true
noise = np.random.randn(K, N) / 10
y = y_raw + noise

data = dict(
    N = N,
    K = K,
    x = x,
    y = y,
)

import json
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file=stan_path)

fit = model.sample(iter_sampling=100, data=data, sig_figs=16)

values = fit.stan_variables()

lp_values = []
for chain in range(fit.chains):
    lp_chain = []
    #inv_metric_path = f"metric_file_chain_{chain}.json"
    #inv_metric = fit.metric[chain].tolist()
    #with open(inv_metric_path, "w") as f:
    #    json.dump({"inv_metric": inv_metric}, f)
            
    for draw in range(fit.num_draws_sampling):
        i = chain * fit.num_draws_sampling + draw
        sub_values = {key: draws[i] for key, draws in values.items()}
        
        fit_ = model.sample(
            iter_warmup=0, 
            iter_sampling=1, 
            chains=1, 
            #step_size=fit.step_size[chain],
            #metric=inv_metric_path,
            adapt_engaged=False,
            data=data, 
            inits=sub_values, 
            sig_figs=16
        )
        lp_value = fit_.sampler_variables()["lp__"].ravel()[0]
        lp_chain.append(lp_value)

    lp_values.append(lp_chain)
lp__new = np.array(lp_values).T

lp__original = fit.sampler_variables()["lp__"]

print("max absolute error", abs(lp__new - lp__original).max())

# max absolute error 3.19745083743328e-08

print("max relative error", (abs(lp__new - lp__original) / lp__original).max())

# max relative error 2.739793033513317e-10

Doesn’t this change the parameters?

You sample one draw, which is the init position.

Edit: of course one could use different parameter values (no pre-sampling needed).

Your report of different/wrong values when step size and metric are provided is consistent with this (edit: init == sample) only holding because the sampler diverges immediately.

But I haven’t gotten around to running the code yet.

Edit2:

So I just ran the code. If you provide the metric and step size, divergent__ == 0, n_leapfrog__ != 1, and the parameter values differ, while if you do not provide the metric and step size, divergent__ == 1, n_leapfrog__ == 1, and the parameter values are identical.
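For anyone repeating this check: the diagnostics of each single-draw fit tell you whether the reported lp__ really belongs to the init position. A small sketch, sticking with the sampler_variables() accessor used in the code above (newer cmdstanpy releases expose the same columns under a different method name):

# inspect the sampler diagnostics of the single-draw fit fit_ from the loop above
diag = fit_.sampler_variables()
divergent = diag["divergent__"].ravel()[0]
n_leapfrog = diag["n_leapfrog__"].ravel()[0]

# the reported lp__ can only belong to the init position if the very first
# leapfrog step diverged, i.e. the proposal was rejected immediately
if not (divergent == 1 and n_leapfrog == 1):
    print("warning: the sampler moved away from the init; lp__ is not lp(init)")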

Hm, so it looks like currently the simplest (yet absurdly hacky) way is to specify a preposterous step size (e.g. 1e16) to force an immediate divergence, so that the hack above works. This appears to work, and while I only have a simple serial Python solution, parallelization via threads should be straightforward. I was hoping this would be possible via fixed_param=True, but fixed_param does not write the gradients to the diagnostic file.

Edit2: Another potentially significant drawback of this approach is that it always evaluates the log prob (and the gradient or most of the work needed for it?) twice, once at the original position and then at a potentially very awkward position.
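For reference, here is a sketch of the step-size trick combined with the diagnostic file, reusing model, data and sub_values from the code above. Assumptions to check for your cmdstanpy version: the keyword is save_latent_dynamics (older releases call it save_diagnostics), the diagnostic CSV name contains "diagnostic", its g_* columns hold the gradient on the unconstrained scale, and the sign convention of those columns should be verified against a known case:

import glob
import pandas as pd

# single draw at the init position, forced to diverge immediately by a
# preposterous step size, with the diagnostic (latent dynamics) CSV saved
outdir = "grad_run"
fit_ = model.sample(
    iter_warmup=0,
    iter_sampling=1,
    chains=1,
    step_size=1e16,
    adapt_engaged=False,
    data=data,
    inits=sub_values,
    sig_figs=16,
    output_dir=outdir,
    save_latent_dynamics=True,  # assumption: save_diagnostics in older cmdstanpy
)

# the diagnostic CSV holds lp__, the sampler diagnostics, the unconstrained
# parameters, momenta (p_<name>) and gradients (g_<name>) per draw;
# the filename pattern below is an assumption, adjust the glob if needed
diag_csv = sorted(glob.glob(f"{outdir}/*diagnostic*.csv"))[-1]
diag = pd.read_csv(diag_csv, comment="#")
grad = diag.filter(regex="^g_").iloc[0]
print(grad)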

Anyhow, for 4x10 samples from the prior, the planetary motion example reveals the following, using rk45 with rel_tol==abs_tol==tolerance and max_no_steps==1e8:

      tolerance  max parameter diff.  max lp__ diff.  max rel. lp__ diff.  max grad lp__ diff.  max rel. grad lp__ diff.
0  1.000000e-03                  0.0    1.284610e+17         3.306828e+11        8.067190e+294             1.610180e+290
1  1.000000e-04                  0.0    2.735000e+03         7.205161e-03        3.085950e+164             3.894940e+158
2  1.000000e-05                  0.0    1.000000e+01         1.468984e-05         1.399810e+80              1.765037e+74
3  1.000000e-06                  0.0    2.000000e+00         8.489000e-06         6.318570e+46              1.176854e+42
4  1.000000e-07                  0.0    2.000000e+00         5.148376e-06         2.993760e+18              3.099934e+13
5  1.000000e-08                  0.0    0.000000e+00         0.000000e+00         1.381629e+08              5.408111e+02
6  1.000000e-09                  0.0    0.000000e+00         0.000000e+00         4.788780e+05              1.255518e+00
7  1.000000e-10                  0.0    0.000000e+00         0.000000e+00         2.172300e+04              5.913800e-02
8  1.000000e-11                  0.0    0.000000e+00         0.000000e+00         2.707200e+04              6.945842e-02

(Edit: compared against a reference solution obtained with tolerance=1e-14.)
I.e. while the trajectory/lp__ appears to converge rather quickly, the gradients remain really badly approximated (for some draws) for “much” longer.

Tagging some relevant people @charlesm93 (your model), @betanalpha (having voiced concerns about potential accuracy/tuning problems for the current adjoint implementation) and @wds15 (we should investigate this for the adjoint solver).

I think one possible conclusion of this very limited convergence study is that we have to tell people to really check whether their solver configuration is appropriate.
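As a concrete form of that advice, a hedged sketch of such a check: it assumes an ODE model and data that expose rel_tol and abs_tol through the data block (hypothetical field names), a dict of parameter values for one draw (like sub_values above), and it wraps the divergence/diagnostic-file hack from above into a hypothetical helper grad_lp:

import glob
import numpy as np
import pandas as pd

def grad_lp(model, data, inits, output_dir):
    # gradient of lp__ at the init position, via the immediate-divergence +
    # diagnostic-file hack sketched earlier in this thread
    fit = model.sample(
        iter_warmup=0, iter_sampling=1, chains=1,
        step_size=1e16, adapt_engaged=False,
        data=data, inits=inits, sig_figs=16,
        output_dir=output_dir, save_latent_dynamics=True,
    )
    csv = sorted(glob.glob(f"{output_dir}/*diagnostic*.csv"))[-1]
    return pd.read_csv(csv, comment="#").filter(regex="^g_").iloc[0].to_numpy()

# draw = parameter values of one draw (dict), model/data = the ODE model and
# its data; sweep the solver tolerance (assumed to be passed via the data
# block) and compare the gradient against a tight reference tolerance
ref = grad_lp(model, {**data, "rel_tol": 1e-14, "abs_tol": 1e-14}, draw, "tol_ref")
for tol in [1e-3, 1e-6, 1e-9, 1e-12]:
    g = grad_lp(model, {**data, "rel_tol": tol, "abs_tol": tol}, draw, f"tol_{tol:.0e}")
    print(tol, np.max(np.abs(g - ref)), np.max(np.abs((g - ref) / ref)))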


Before I forget: this is great and nice… but you should be aware that RStan already has this feature. It’s just that RStan still uses a very old Stan unless you install the 2.26 version from the Stan mirrors.

Having this functionality for cmdstan would also be nice, of course.


Yes, this is where I got the name from :)

PyStan has it as well, but for people exclusively using CmdStanX, this would make things much easier.