Pattern for a single Stan file for simulation & inference

motivation I’d like to check that I can recover the parameters of a simulated model. After quite some testing I find this quite annoying for complex models, and I’d like to find a way to write a single model file. For purposes of discussion, here’s a simple simulation file

model { }
generated quantities {
    real mu = normal_rng(0.0, 1.0);
    real x = normal_rng(mu, 1.0); 
}

and a corresponding inference model

data {
    real x;
}
parameters {
    real mu_h;
}
model {
    mu_h ~ normal(0.0, 1.0);
    x ~ normal(mu_h, 1.0);
}

I can open both files in my editor and be confident they’re consistent, but only for simple cases. For more complex cases, translating _rng calls into sampling statements (and maintaining both) is onerous (though HOFs, or a language feature allowing a function to be used as both _lp and _rng, would ameliorate the situation).

question I started to think I could use a single file for both, with a flag variable in the data block to switch the likelihood on or off, on the assumption that, when the data are not conditioned on, the sampler will generate samples equivalent to those _rng calls,

data {
    real x;
    int use_data;
}
parameters {
    real mu_h;
}
model {
    mu_h ~ normal(0.0, 1.0);
    if (use_data == 1)
        x ~ normal(mu_h, 1.0);
}
generated quantities {
    real x_pp = normal_rng(mu_h, 1.0);
}

Obviously, when doing inference on simulated data, I’d s/x_pp/x/g, but are there other obvious problems with this second, single-file approach? My guess is that sampling for simulation is a bit costlier, but not by much compared with the guarantee that the simulation and inference are consistent.
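As a sanity check on the equivalence being assumed here, the simulation file above can be mirrored directly outside Stan (a hedged sketch in plain Python, not part of the Stan workflow): drawing mu ~ normal(0, 1) and then x ~ normal(mu, 1) gives a marginal for x with mean 0 and standard deviation sqrt(2), which is what the flag-off sampler run should reproduce up to Monte Carlo error.

```python
import math
import random

def simulate(n, seed=0):
    """Mirror of the generated-quantities block above:
    mu ~ normal(0, 1), then x ~ normal(mu, 1)."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)
        x = rng.gauss(mu, 1.0)
        xs.append(x)
    return xs

xs = simulate(100_000)
m = sum(xs) / len(xs)
sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
# Marginally x ~ normal(0, sqrt(2)), so m should be near 0 and sd near 1.414.
print(round(m, 2), round(sd, 2))
```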

If there are better ways, or examples I should have found, I’m all ears; I did try searching for this but didn’t turn up much.

edit there’s still duplicated code between the if (use_data == 1) sampling statement and the generated quantities _rng section, though, so any tips to reduce that would be great too.

There’s no problem doing things in one file if that’s helpful, but my main suggestion is to encapsulate the majority of parameter transformations in functions that get used in both the generated quantities and the model block. If you’re using rstan, that also lets you write tests for the functions in R.

True, and I remember trying to write R code once… In practice I’m appending some extern "C" functions to model.hpp, running make model, loading the result with ctypes.CDLL('model'), and calling the wrappers via Python’s ctypes, which works for the pure functions.

I still don’t think that’s enough, though. Consider, for example, a transformation fc_ab(matrix, real),

real[] fv_ab_ij(real sc_ij, real scvar) {
    real ab[2];
    real u;
    real v;
    if (sc_ij==0.0) {
        u = 1e-6;
        v = 1e-3;
    } else {
        u = sc_ij;
        v = u * scvar;
    }
    ab[1] = gamma_a_from_u_v(u, v);
    ab[2] = gamma_b_from_a_u(ab[1], u);
    return ab;
}

real[,,] fc_ab(matrix sc, real scvar) {
    int nn = rows(sc);
    real ab[nn, nn, 2];
    for (i in 1:nn)
        for (j in 1:nn)
            ab[i, j] = fv_ab_ij(i==j ? 0.0 : sc[i,j], scvar);
    return ab;
}

void fc_lp(matrix fch, matrix sc, real scvar) {
    int nn = rows(sc);
    real ab[nn, nn, 2];
    ab = fc_ab(sc, scvar);
    for (i in 1:nn)
        for (j in 1:nn)
            // sampling statements aren't allowed in function bodies,
            // so an _lp function increments target directly
            target += gamma_lpdf(fch[i, j] | ab[i, j, 1], ab[i, j, 2]);
}

matrix fc_rng(matrix sc, real scvar) {
    int nn = rows(sc);
    matrix[nn, nn] fc;
    real ab[nn, nn, 2];
    ab = fc_ab(sc, scvar);
    for (i in 1:nn)
        for (j in 1:nn)
            fc[i, j] = gamma_rng(ab[i, j, 1], ab[i, j, 2]);
    return fc;
}
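The helpers gamma_a_from_u_v and gamma_b_from_a_u aren’t shown above, but their names suggest moment-matching a Gamma(a, b) (shape/rate) distribution to a desired mean u and variance v; under that reading, a = u²/v and b = a/u. That assumption is easy to sanity-check outside Stan (hypothetical Python sketch; the function names and formulas are my reconstruction, not taken from the thread):

```python
def gamma_a_from_u_v(u, v):
    # Shape of a Gamma(a, b) with mean u = a/b and variance v = a/b**2,
    # assuming the moment-matching reading of the helper names.
    return u * u / v

def gamma_b_from_a_u(a, u):
    # Rate recovered from shape and mean: u = a/b  =>  b = a/u.
    return a / u

u, v = 0.5, 0.02
a = gamma_a_from_u_v(u, v)
b = gamma_b_from_a_u(a, u)
# Round trip: the mean and variance implied by (a, b) match (u, v).
print(a / b, a / b**2)  # → 0.5 0.02
```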

Writing a single model and using the sampler instead of _rng allows for considerable simplification.

And more importantly, you can use the functions from your Stan program to do simulations in R, which is ultra-fast since it’s based on compiled C++ code. Search the forum for how to cache these binaries, which makes this a very convenient approach.

The short answer: you can, but unless the model is simple, you shouldn’t.

The longer answer is multi-faceted.

  • Yes, you can write your model like that. A long time ago (in Stan history), we made sure 0-length vectors and 0x0-sized matrices could be used to generate data (often from just the prior). There was a time when 0-sized structures caused problems.
  • For generating data, Monte Carlo simulation is much better than Markov chain Monte Carlo sampling. There’s no need to assess convergence; it’s a given by construction.
  • The Stan language is very flexible. It allows users to specify a log joint probability density in the model block that does not necessarily correspond to the model in the generated quantities block. This isn’t a bug. We’ve been pushing the Stan language so that it’s easier to do difficult things, not so that easy things get easier. Every few years, this discussion of having a single Stan program do both tasks comes up, but there isn’t a good way to enforce this without overly restricting the Stan language.
  • That said, if you conditioned on a subset of the language, say a representation of the model that could be represented as a graph, you would be able to generate both the statistical model and the generated quantities from that one representation. Btw, BUGS, JAGS, and Nimble encode directed acyclic graphs (DAGs) and are thus able to do things like this. RStanArm is another example of how you can do this once you condition on a subset of the language; they’re using R’s formula interface to limit the space of expressible Stan models. The Stan language is much broader than what a DAG can encode. The best example is functions. In general, functions can’t be implemented in DAGs; under stronger conditions and needing to be much more clever, it can be done, but this is why functions aren’t really available in BUGS and JAGS and they are in Stan. I’d much rather have functions and not have to guarantee a DAG vs having DAGs, generative code with the model, and no functions.
  • (This bullet point is opinion.) Although it’s annoying to have to write the model and the generative process separately, I’ve never found it to be a deal-breaker. I can’t say that’s the same for everyone else, but I’ve caught lots of little errors by doing the process twice. There isn’t a good statistical model debugger yet; it’s difficult because I don’t think we (as a community) have really figured out how to write good tests for a complex model. Deterministic code needs to take input and produce known output. Stan programs need to take input and then estimate parameters, but we wouldn’t be doing this if we already knew the answer or could do it analytically. And there’s the problem, also found in normal programming, that what you intend could be very different from what you implement.

Indeed, as noted, I’m calling those functions from Python through ctypes.

Perhaps the example I showed was excessively simple… I’m porting stochastic delay differential equation models for neuroimaging data, and the last thing I want is to duplicate code. Finding ways to reduce the amount of code needed to express the same mathematics, without hurting maintainability, is paramount there. Functions help, and I am using them, but higher-order functions, or some way to deduplicate expressions between _rng and _lp, are what I was getting at with my question.

But… there’s nothing unsound: using HMC in the absence of data, the samples will be determined by the priors, just as if I had drawn them using _rng functions in the generated quantities section, right? If that lifts a developer burden and introduces no unsound steps, the tradeoff is clear to me.
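This claim can be checked in miniature outside Stan: an MCMC chain whose target is only the prior (likelihood switched off) produces draws with the same moments as direct _rng-style sampling, just with autocorrelation. A hedged Python sketch using a toy random-walk Metropolis sampler (not HMC) targeting mu ~ normal(0, 1):

```python
import math
import random

def mh_prior_samples(n, step=2.0, seed=1):
    """Random-walk Metropolis chain whose target is just the prior
    mu ~ normal(0, 1), i.e. log-posterior = log-prior."""
    rng = random.Random(seed)
    def log_prior(mu):
        return -0.5 * mu * mu  # normal(0, 1) up to a constant
    mu, out = 0.0, []
    for _ in range(n):
        prop = mu + rng.gauss(0.0, step)
        delta = log_prior(prop) - log_prior(mu)
        # Metropolis accept/reject; min() keeps exp() from overflowing.
        if rng.random() < math.exp(min(0.0, delta)):
            mu = prop
        out.append(mu)
    return out

draws = mh_prior_samples(50_000)
m = sum(draws) / len(draws)
sd = (sum((d - m) ** 2 for d in draws) / len(draws)) ** 0.5
# Moments match direct normal_rng(0, 1) draws up to Monte Carlo error.
print(round(m, 2), round(sd, 2))
```

The caveat from the reply below still applies: the equivalence holds only to the extent the chain mixes well, which is exactly the convergence assessment that direct Monte Carlo sidesteps.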

In theory, yes. In practice, it depends on the convergence and what you want to do with it.

There shouldn’t be any unsound steps.