Standalone generated quantities - comments welcome

mitzimorris · January 21, 2019, 7:24pm

Stan’s standalone generated quantities service stan::service::standalone_gqs now takes as input the fitted parameter values on the constrained scale, i.e., as output by the sampler. This means that it should be relatively simple to plumb calls to this service through the interfaces.

I’ve updated the cmdstan issue here: Add standalone generated quantities options to cmdstan · Issue #594 · stan-dev/cmdstan · GitHub
Here’s what it says:

The standalone_generate function requires as inputs:

the model

the data used to fit it

the set of draws from the posterior.

The command line arguments required for this feature are:

method = proposing new option: generate

data = same dataset as before

fitted_param_values = a new sub-option which specifies a single input file containing the fitted parameter values

output = name of output file

seed = specify random seed - optional

The input file of fitted parameter values can be in one of the following formats

Rdump

JSON

csv (?) needs investigation

thoughts on workflow for RStan and other interfaces welcome ,

cheers,
Mitzi

lcomm · January 21, 2019, 10:42pm

Just to clarify, this list of inputs suggests that information created/derived in the transformed data block would not be able to be used in these standalone GQ. Is that correct?

mitzimorris · January 21, 2019, 11:28pm

incorrect.

information created/derived in the transformed data block is available in the GQ block.

statements in the transformed data block are executed in the model’s constructor, i.e., on model instantiation. the standalone gqs service takes an instantiated model, therefore any variable declared in the top-level of the transformed data block will be available in the generated quantities block.

cheers,
Mitzi

mitzimorris · January 23, 2019, 1:36am

After getting this working preliminarily and discussion with @seantalts, I’ve revised this issue w/r/t to allowed input formats of the sample from the fitted model. In order to make this feature easy to use, the sample should be in its default format - for CmdStan, this is the stan csv file format.

The question is - should CmdStan try to also recognize csv data from RStan or PyStan?

Discussion from the updated issue is as follows:

It is easy to assemble the matrix of fitted parameter values with the correct ordering from Cmdstan’s output.csv file, (or concatenated output.csv files).

It is also possible to parse a csv file created from an RStan stan_fit object which has been saved as a csv file (without row names), because the first N columns correspond to the parameters on the unconstrained scale in the correct order.

Not sure whether or not something similar is possible in PyStan.

@jonah, @bgoodri - is it worth having this feature in CmdStan for RStan users?
@ariddell - is there a way to dump the sample in csv format such that the parameters are in declaration order?

bgoodri · January 23, 2019, 1:38am

I guess. If the comments in the csv file are not relevant, then it is just a question of writing the columns in the correct order.

mitzimorris · January 23, 2019, 3:03am

right.
looks like RStan does this right thing by default -
given a stan_fit object foo_fit, the command
write.csv(as.matrix(foo_fit), file="foo.csv", row.names=FALSE)
produces a csv file with parameters in declaration order.

the column order in a file generated in this way differs from the column order in the current CmdStan “output.csv” format in that the CmdStan file has “lp__” as the first column, and cols 1:6 are the sampler state diagnostic variables, where in the RStan stan_fit object written as CSV the sampler state diagnostics aren’t there, and “lp__” is the last column. otherwise, everyone is in the same order. heuristics are trivial to see which kind of csv file is which - does header start with “lp__” or not?

if feature standalond generated quantities is available directly in RStan interface, should we invest extra work to make CmdStan handle csv files generated in this way?

bgoodri · January 23, 2019, 4:50am

I don’t know that it is a huge priority but I’m sure someone will come up with a use case for it.

bgoodri · March 10, 2019, 7:22am

I started implementing this in RStan, but it is segfaulting pretty badly when I call it. I have a Stan program mu_sigma.stan (750 Bytes) that draws mu and sigma in a normal model. And I have another Stan program gqs.stan (850 Bytes)
that is the same except it has no model block and adds a generated quantities block that draws from the posterior predictive distribution.

I run the first model and create a matrix of draws from the posterior. I compile the second model to a stanmodel called sm. I call (in the develop branch of RStan)

test <- gqs(sm, data = list(N = N, y = y), draws)

The draws of mu and sigma look fine in C++

draws = 
    0.465417     0.905675
    0.422443     0.816499
...
   0.0239443     0.827866
    0.036482     0.855883

but then it segfaults with this backtrace:

(gdb) bt
#0  0x00007fffebbc947d in stan::services::util::gq_writer::write_gq_names<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (this=0x7ffffffeefe0, model=...)
    at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/util/gq_writer.hpp:57
#1  0x00007fffebbc762e in stan::services::standalone_generate<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (model=..., draws=..., seed=660085397, interrupt=..., logger=..., 
    sample_writer=...) at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/sample/standalone_gqs.hpp:101
#2  0x00007fffeba894fc in rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::standalone_gqs (this=0x55555a2d1c20, pars=0x5555607acce0, 
    seed=0x55555dff2180) at /home/ben/r-devel/library/rstan/include/rstan/stan_fit.hpp:1225
#3  0x00007fffebbc5cac in Rcpp::CppMethod2<rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >, SEXPREC*, SEXPREC*, SEXPREC*>::operator() (
    this=0x55555c184320, object=0x55555a2d1c20, args=0x7ffffffef6e0) at /home/ben/r-devel/library/Rcpp/include/Rcpp/module/Module_generated_CppMethod.h:195
#4  0x00007fffebab52eb in Rcpp::class_<rstan::stan_fit<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::invoke_notvoid (this=0x55555ce7fe80, 
    method_xp=0x55556040cea8, object=0x55555dcb79d0, args=0x7ffffffef6e0, nargs=2) at /home/ben/r-devel/library/Rcpp/include/Rcpp/module/class.h:234
#5  0x00007ffff1273c15 in CppMethod__invoke_notvoid(SEXPREC*) () from /home/ben/r-devel/library/Rcpp/libs/Rcpp.so
#6  0x000055555563f48d in do_External (call=0x555560102ea0, op=0x555555bfae40, args=<optimized out>, env=0x55555dc956b8) at dotcode.c:548
#7  0x0000555555685104 in Rf_eval (e=<optimized out>, rho=rho@entry=0x55555dc956b8) at eval.c:723
...

I poke around in the zero frame

(gdb) frame 0
#0  0x00007fffebbc947d in stan::services::util::gq_writer::write_gq_names<model4b20476a92e7_gqs_namespace::model4b20476a92e7_gqs> (this=0x7ffffffeefe0, model=...)
    at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/util/gq_writer.hpp:57
57                sample_writer_(gq_names);
(gdb) print gq_names
$1 = warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
std::vector of length 1600, capacity 1600
(gdb) print gq_names[0]
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
Error while executing Python code.
(gdb) print include_tparams
$2 = false
(gdb) print include_gqs
$3 = true
(gdb) print names[0]
Python Exception <class 'gdb.error'> Cannot perform pointer math on incomplete type "std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >", try casting to a known type, or void *.: 
Error while executing Python code.

It seems as if it is getting the names wrong. Any ideas?

mitzimorris · March 10, 2019, 3:02pm

is the model block the same?

checking your programs now… - will try running in cmdstan

bgoodri · March 10, 2019, 3:14pm

I deleted the model block from the second one, thinking that it was irrelevant to the standalone generated quantities. Could that have caused the problem somehow?

mitzimorris · March 10, 2019, 3:56pm

missing model block doesn’t make a difference - in the generated c++ code the methods that report the parameter names are identical.

mitzimorris · March 10, 2019, 4:04pm

current method signature is:

        /**
         * Constructor.
         *
         * @param[in,out] sample_writer samples are "written" to this stream
         * @param[in,out] logger messages are written through the logger
         * @param[in] num_constrained_params offset into write_array gqs
         */
        gq_writer(callbacks::writer& sample_writer, callbacks::logger& logger,
                  int num_constrained_params): sample_writer_(sample_writer),
                  logger_(logger),
                  num_constrained_params_(num_constrained_params) { }

the param num_constrained_params is used in some ptr arithmetic - should it be declared as size_t?

bgoodri · March 10, 2019, 4:22pm

I don’t think the int vs size_t thing should make a difference. Does it not crash in CmdStan?

mitzimorris · March 10, 2019, 5:03pm

doesn’t crash in cmdstan.

data file: data.json:

{
    "N" : 10,
    "y" : [0.0,1.1,2.2,-1.1,-2.2,3.3,-1.1,0.5,-0.5,20.1]
}

compile mu_sigma, run in cmdstan, rename file “output.csv” to “mu_sigma_fit.csv”:

> make mu_sigma
> make gqs
> ./mu_sigma sample data file=data.json
> mv output.csv mu_sigma_fit.csv

now run data and fit through gqs

> ./gqs generate_quantities fitted_params=mu_sigma_fit.csv data file=data.json
> head -30 output.csv 
# stan_version_major = 2
# stan_version_minor = 18
# stan_version_patch = 1
# model = gqs_model
# method = generate_quantities
#   generate_quantities
#     fitted_params = mu_sigma_fit.csv
# id = 0 (Default)
# data
#   file = data.json
# init = 2 (Default)
# random
#   seed = 1874183497
# output
#   file = output.csv (Default)
#   diagnostic_file =  (Default)
#   refresh = 100 (Default)
y_rep.1,y_rep.2,y_rep.3,y_rep.4,y_rep.5,y_rep.6,y_rep.7,y_rep.8,y_rep.9,y_rep.10
-2.77147,11.1755,-2.94387,-0.149251,16.7854,-4.46741,7.68876,-4.59758,0.625983,1.79926
-5.4313,-3.21489,1.82186,7.55181,-6.68378,5.3052,-5.60962,-12.2762,-0.331983,6.12462
-1.66626,-19.1516,-1.0209,5.55801,-4.69572,-1.93266,-14.9463,-6.4146,-4.90327,8.32455
19.3073,2.25695,6.50002,14.0943,3.47019,13.2952,10.3881,2.47026,-1.9345,6.74451
2.86956,-16.5261,1.14381,-9.08778,-1.31738,-1.65913,7.57881,2.61252,-0.877738,-3.43483
15.9574,-8.08627,20.0695,-1.4178,-2.32336,-4.97237,-3.06483,-18.561,2.0951,-14.7014
3.60437,-1.26338,-2.17529,-13.162,11.8011,7.64686,-0.333673,-4.31193,2.22699,10.813
7.64112,-2.01195,-4.4741,5.75201,5.58956,0.678053,2.59562,1.11146,0.424284,-6.30433
-6.11477,1.5185,-5.27225,-16.3766,5.0236,4.01305,-11.3282,16.3483,-7.28889,17.9621
3.02672,18.2838,8.80931,10.3096,8.94595,-7.71211,1.37242,12.745,7.94828,13.5966
-13.7672,-7.66666,-3.70309,-4.43902,7.27395,8.80424,-3.84288,-4.9865,0.943423,-2.1204
-2.2052,4.8987,1.29588,-1.20765,3.93648,17.9908,-2.00135,8.16654,5.80836,2.42973

bgoodri · March 10, 2019, 6:15pm

Upon further review, it definitely looks like the pointer arithmetic is messed up somewhere, but it is messed up before it even gets to gq_writer. In frame 1, which is just standalone_generate, I am getting things like

(gdb) frame 1
#1  0x00007fffebbcec0e in stan::services::standalone_generate<model20463ca05adc_gqs_namespace::model20463ca05adc_gqs> (model=..., draws=..., seed=0, interrupt=..., logger=..., 
    sample_writer=...) at /home/ben/r-devel/library/StanHeaders/include/src/stan/services/sample/standalone_gqs.hpp:101
101       writer.write_gq_names(model);
(gdb) print p_names.size()
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
$7 = 64

I have no idea how it gets to 64 there. It should be two or maybe three if you count lp__. And then

(gdb) print gq_names.size()
warning: Type size unknown, assuming 1. Try casting to a known type, or void *.
$8 = 1664

which should be 52 or 53.

mitzimorris · March 10, 2019, 6:46pm

is this an R cpp problem? it can’t figure out that pnames is std::vector<std::string>?

bgoodri · March 10, 2019, 6:51pm

I don’t recall ever having a problem with the parameter names for regular Stan programs. I can even get the parameter names for sm inside R before calling standalone_gqs.

mitzimorris · March 10, 2019, 7:08pm

and get the size as well?

bgoodri · March 10, 2019, 7:40pm

Yeah, I think gdb’s environment was corrupted from an incorrectly initialized writer. Now, I am getting a different error that I am looking into.

mitzimorris · March 10, 2019, 7:57pm

OK - could change gq_writer so that num_constrained_params_ is type size_t - that would change initialization.

I made this change and ran the unit tests - seems fine.

Topic		Replies	Views
Adding standalone generated quantities option to cmdstan (and rstan and pystan) Interfaces	9	1211	December 15, 2017
Standalone generated quantities usage help : cmdstanr CmdStan techniques , fitting-issues , algorithms	3	991	April 23, 2020
CmdStan generate_quantities and stansummary CmdStan	14	1325	January 9, 2022
Shinystan with separate generated quantities Developers cmdstan , shinystan	4	762	March 9, 2022
Generate Quantities - Mismatch between model and fitted_parameters CmdStan	3	593	September 28, 2022

Standalone generated quantities - comments welcome

Related topics