Standalone generated quantities - comments welcome

Stan’s standalone generated quantities service stan::service::standalone_gqs now takes as input the fitted parameter values on the constrained scale, i.e., as output by the sampler. This means that it should be relatively simple to plumb calls to this service through the interfaces.

I’ve updated the cmdstan issue here: Add standalone generated quantities options to cmdstan · Issue #594 · stan-dev/cmdstan · GitHub
Here’s what it says:

The standalone_generate function requires as inputs:

  • the model
  • the data used to fit it
  • the set of draws from the posterior.

The command line arguments required for this feature are:

  • method = proposing new option: generate
  • data = same dataset as before
  • fitted_param_values = a new sub-option which specifies a single input file containing the fitted parameter values
  • output = name of output file
  • seed = specify random seed - optional

The input file of fitted parameter values can be in one of the following formats

  • Rdump
  • JSON
  • csv (?) needs investigation

thoughts on workflow for RStan and other interfaces welcome ,

cheers,
Mitzi

Just to clarify, this list of inputs suggests that information created/derived in the transformed data block would not be able to be used in these standalone GQ. Is that correct?

incorrect.

information created/derived in the transformed data block is available in the GQ block.

statements in the transformed data block are executed in the model’s constructor, i.e., on model instantiation. the standalone gqs service takes an instantiated model, therefore any variable declared in the top-level of the transformed data block will be available in the generated quantities block.

cheers,
Mitzi

1 Like

After getting this working preliminarily and discussion with @seantalts, I’ve revised this issue w/r/t to allowed input formats of the sample from the fitted model. In order to make this feature easy to use, the sample should be in its default format - for CmdStan, this is the stan csv file format.

The question is - should CmdStan try to also recognize csv data from RStan or PyStan?

Discussion from the updated issue is as follows:

It is easy to assemble the matrix of fitted parameter values with the correct ordering from Cmdstan’s output.csv file, (or concatenated output.csv files).

It is also possible to parse a csv file created from an RStan stan_fit object which has been saved as a csv file (without row names), because the first N columns correspond to the parameters on the unconstrained scale in the correct order.

Not sure whether or not something similar is possible in PyStan.

@jonah, @bgoodri - is it worth having this feature in CmdStan for RStan users?
@ariddell - is there a way to dump the sample in csv format such that the parameters are in declaration order?

I guess. If the comments in the csv file are not relevant, then it is just a question of writing the columns in the correct order.

right.
looks like RStan does this right thing by default -
given a stan_fit object foo_fit, the command
write.csv(as.matrix(foo_fit), file="foo.csv", row.names=FALSE)
produces a csv file with parameters in declaration order.

the column order in a file generated in this way differs from the column order in the current CmdStan “output.csv” format in that the CmdStan file has “lp__” as the first column, and cols 1:6 are the sampler state diagnostic variables, where in the RStan stan_fit object written as CSV the sampler state diagnostics aren’t there, and “lp__” is the last column. otherwise, everyone is in the same order. heuristics are trivial to see which kind of csv file is which - does header start with “lp__” or not?

if feature standalond generated quantities is available directly in RStan interface, should we invest extra work to make CmdStan handle csv files generated in this way?

I don’t know that it is a huge priority but I’m sure someone will come up with a use case for it.

I started implementing this in RStan, but it is segfaulting pretty badly when I call it. I have a Stan program mu_sigma.stan (750 Bytes) that draws mu and sigma in a normal model. And I have another Stan program gqs.stan (850 Bytes)
that is the same except it has no model block and adds a generated quantities block that draws from the posterior predictive distribution.

I run the first model and create a matrix of draws from the posterior. I compile the second model to a stanmodel called sm. I call (in the develop branch of RStan)

test <- gqs(sm, data = list(N = N, y = y), draws)

The draws of mu and sigma look fine in C++

draws = 
    0.465417     0.905675
    0.422443     0.816499
...
   0.0239443     0.827866
    0.036482     0.855883

but then it segfaults with this backtrace:

(gdb) bt
#0  0x00007fffebbc947d in stan::services::util::gq_writer::write_gq_names<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (this=0x7ffffffeefe0, model=...)
    at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/util/gq_writer.hpp:57
#1  0x00007fffebbc762e in stan::services::standalone_generate<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (model=..., draws=..., seed=660085397, interrupt=..., logger=..., 
    sample_writer=...) at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/sample/standalone_gqs.hpp:101
#2  0x00007fffeba894fc in rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::standalone_gqs (this=0x55555a2d1c20, pars=0x5555607acce0, 
    seed=0x55555dff2180) at /home/ben/r-devel/library/rstan/include/rstan/stan_fit.hpp:1225
#3  0x00007fffebbc5cac in Rcpp::CppMethod2<rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >, SEXPREC*, SEXPREC*, SEXPREC*>::operator() (
    this=0x55555c184320, object=0x55555a2d1c20, args=0x7ffffffef6e0) at /home/ben/r-devel/library/Rcpp/include/Rcpp/module/Module_generated_CppMethod.h:195
#4  0x00007fffebab52eb in Rcpp::class_<rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::invoke_notvoid (this=0x55555ce7fe80, 
    method_xp=0x55556040cea8, object=0x55555dcb79d0, args=0x7ffffffef6e0, nargs=2) at /home/ben/r-devel/library/Rcpp/include/Rcpp/module/class.h:234
#5  0x00007ffff1273c15 in CppMethod__invoke_notvoid(SEXPREC*) () from /home/ben/r-devel/library/Rcpp/libs/Rcpp.so
#6  0x000055555563f48d in do_External (call=0x555560102ea0, op=0x555555bfae40, args=<optimized out>, env=0x55555dc956b8) at dotcode.c:548
#7  0x0000555555685104 in Rf_eval (e=<optimized out>, rho=rho@entry=0x55555dc956b8) at eval.c:723
...

I poke around in the zero frame

(gdb) frame 0
#0  0x00007fffebbc947d in stan::services::util::gq_writer::write_gq_names<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (this=0x7ffffffeefe0, model=...)
    at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/util/gq_writer.hpp:57
57                sample_writer_(gq_names);
(gdb) print gq_names
$1 = warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
std::vector of length 1600, capacity 1600
(gdb) print gq_names[0]
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
Error while executing Python code.
(gdb) print include_tparams
$2 = false
(gdb) print include_gqs
$3 = true
(gdb) print names[0]
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
Error while executing Python code.

It seems as if it is getting the names wrong. Any ideas?

is the model block the same?

checking your programs now… - will try running in cmdstan

I deleted the model block from the second one, thinking that it was irrelevant to the standalone generated quantities. Could that have caused the problem somehow?

missing model block doesn’t make a difference - in the generated c++ code the methods that report the parameter names are identical.

current method signature is:

        /**
         * Constructor.
         *
         * @param[in,out] sample_writer samples are "written" to this stream
         * @param[in,out] logger messages are written through the logger
         * @param[in] num_constrained_params offset into write_array gqs
         */
        gq_writer(callbacks::writer& sample_writer, callbacks::logger& logger,
                  int num_constrained_params): sample_writer_(sample_writer),
                  logger_(logger),
                  num_constrained_params_(num_constrained_params) { }

the param num_constrained_params is used in some ptr arithmetic - should it be declared as size_t?

I don’t think the int vs size_t thing should make a difference. Does it not crash in CmdStan?

doesn’t crash in cmdstan.

data file: data.json:

{
    "N" : 10,
    "y" : [0.0,1.1,2.2,-1.1,-2.2,3.3,-1.1,0.5,-0.5,20.1]
}

compile mu_sigma, run in cmdstan, rename file “output.csv” to “mu_sigma_fit.csv”:

> make mu_sigma
> make gqs
> ./mu_sigma sample data file=data.json
> mv output.csv mu_sigma_fit.csv

now run data and fit through gqs

> ./gqs generate_quantities fitted_params=mu_sigma_fit.csv data file=data.json
> head -30 output.csv 
# stan_version_major = 2
# stan_version_minor = 18
# stan_version_patch = 1
# model = gqs_model
# method = generate_quantities
#   generate_quantities
#     fitted_params = mu_sigma_fit.csv
# id = 0 (Default)
# data
#   file = data.json
# init = 2 (Default)
# random
#   seed = 1874183497
# output
#   file = output.csv (Default)
#   diagnostic_file =  (Default)
#   refresh = 100 (Default)
y_rep.1,y_rep.2,y_rep.3,y_rep.4,y_rep.5,y_rep.6,y_rep.7,y_rep.8,y_rep.9,y_rep.10
-2.77147,11.1755,-2.94387,-0.149251,16.7854,-4.46741,7.68876,-4.59758,0.625983,1.79926
-5.4313,-3.21489,1.82186,7.55181,-6.68378,5.3052,-5.60962,-12.2762,-0.331983,6.12462
-1.66626,-19.1516,-1.0209,5.55801,-4.69572,-1.93266,-14.9463,-6.4146,-4.90327,8.32455
19.3073,2.25695,6.50002,14.0943,3.47019,13.2952,10.3881,2.47026,-1.9345,6.74451
2.86956,-16.5261,1.14381,-9.08778,-1.31738,-1.65913,7.57881,2.61252,-0.877738,-3.43483
15.9574,-8.08627,20.0695,-1.4178,-2.32336,-4.97237,-3.06483,-18.561,2.0951,-14.7014
3.60437,-1.26338,-2.17529,-13.162,11.8011,7.64686,-0.333673,-4.31193,2.22699,10.813
7.64112,-2.01195,-4.4741,5.75201,5.58956,0.678053,2.59562,1.11146,0.424284,-6.30433
-6.11477,1.5185,-5.27225,-16.3766,5.0236,4.01305,-11.3282,16.3483,-7.28889,17.9621
3.02672,18.2838,8.80931,10.3096,8.94595,-7.71211,1.37242,12.745,7.94828,13.5966
-13.7672,-7.66666,-3.70309,-4.43902,7.27395,8.80424,-3.84288,-4.9865,0.943423,-2.1204
-2.2052,4.8987,1.29588,-1.20765,3.93648,17.9908,-2.00135,8.16654,5.80836,2.42973

Upon further review, it definitely looks like the pointer arithmetic is messed up somewhere, but it is messed up before it even gets to gq_writer. In frame 1, which is just standalone_generate, I am getting things like

(gdb) frame 1
#1  0x00007fffebbcec0e in stan::services::standalone_generate<model20463ca05adc_gqs_namespace::model20463ca05adc_gqs> (model=..., draws=..., seed=0, interrupt=..., logger=..., 
    sample_writer=...) at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/sample/standalone_gqs.hpp:101
101       writer.write_gq_names(model);
(gdb) print p_names.size()
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
$7 = 64

I have no idea how it gets to 64 there. It should be two or maybe three if you count lp__. And then

(gdb) print gq_names.size()
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
$8 = 1664

which should be 52 or 53.

is this an R cpp problem? it can’t figure out that pnames is std::vector<std::string>?

I don’t recall ever having a problem with the parameter names for regular Stan programs. I can even get the parameter names for sm inside R before calling standalone_gqs.

and get the size as well?

Yeah, I think gdb’s environment was corrupted from an incorrectly initialized writer. Now, I am getting a different error that I am looking into.

OK - could change gq_writer so that num_constrained_params_ is type size_t - that would change initialization.

I made this change and ran the unit tests - seems fine.