Var_context, what do we want from it?

I was going to write some var_contexts for binary data, and I don’t mind folding in some other rewriting of the var_context machinery while I’m at it. So far:

  • I’d like easy construction from c++
  • I’d like easy construction from binary files
  • I’m assuming the var_context API isn’t changing (?)

If I missed some discussion about this I’d appreciate any pointers.

The main problem right now is that the var_context concept is extraordinarily monolithic and wraps all of the variable transformations. To clean that up we’d need to pull all of the transforms exposed in https://github.com/stan-dev/stan/blob/develop/src/stan/io/reader.hpp and https://github.com/stan-dev/stan/blob/develop/src/stan/io/writer.hpp into their own separate classes, each handling the constraining/unconstraining for one transform. For example,

template <typename T, typename T_LB>
class vector_lower_bound: public vector_transform<T> {
private:
  T_LB lower_bound_;

public:
  vector_lower_bound(T_LB lower_bound, size_t N):
    vector_transform<T>(N), lower_bound_(lower_bound) {}

  size_t unconstrained_dim() { return N_; }

  void validate(const vector_t& constr_val) {
    stan::math::check_greater_or_equal("stan::io::vector_lower_bound",
                                       "Lower bound constraint", constr_val,
                                       lower_bound_);
  }

  template <bool Jacobian>
  vector_t constrain(const vector_t& unconstr_val, lp_accumulator<T_LB>& acc) {
    vector_t constr_val(N_);
    for (idx_t n = 0; n < N_; ++n) {
      if (Jacobian) {
        T_LB lp = 0;
        constr_val(n) = stan::prob::lb_constrain(unconstr_val[n], lower_bound_, lp);
        acc.push_back(lp);
      } else {
        constr_val(n) = stan::prob::lb_constrain(unconstr_val[n], lower_bound_);
      }
    }
    return constr_val;
  }

  void constrain(const vector_t& unconstr_val,
                 std::vector<double>& constr_val) {
    for (idx_t n = 0; n < N_; ++n)
      constr_val.push_back(stan::prob::lb_constrain(unconstr_val[n],
                                                    lower_bound_));
  }

  void unconstrain(const vector_t& constr_val, vector_t::InnerIterator& it) {
    for (idx_t n = 0; n < N_; ++n)
      *(it++) = stan::prob::lb_free(constr_val[n], lower_bound_);
  }
};
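
For context, the example above leans on a small base class that doesn’t exist yet; something like the following sketch is what I have in mind (vector_transform, vector_t, idx_t, and N_ are just the names used above, and the lp_accumulator used in constrain() is likewise assumed, not something already in the codebase):

#include <cstddef>
#include <Eigen/Dense>

// Hypothetical base class for all vector-shaped transforms; it only owns
// the constrained dimension and the vector typedefs the derived classes use.
template <typename T>
class vector_transform {
protected:
  size_t N_;  // constrained dimension of the variable

public:
  typedef Eigen::Matrix<T, Eigen::Dynamic, 1> vector_t;
  typedef typename vector_t::Index idx_t;

  explicit vector_transform(size_t N): N_(N) {}
  virtual ~vector_transform() {}

  // constrained size of the variable
  size_t size() const { return N_; }

  // most transforms keep the same dimension on the unconstrained scale
  virtual size_t unconstrained_dim() { return N_; }
};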

I already did a bunch of these in the old branch, https://github.com/stan-dev/stan/tree/feature/refactor_io_reader_writer/src/stan/io/transforms.

Once we have all the transforms defined in their own classes we can then clean up the var_context itself. First, a rename – what has been called a “variable context” is more properly a “data access layer”. The new DAL would then have a single constrain/unconstrain method that takes in the transform classes, either through a base class or a template. For example,

// base_dal
vector_t get_vector(const std::string& name, size_t N) {
  if (!contains_r(name)) {
    std::stringstream msg;
    msg << "Variable " << name
        << " not found in variable context" << std::endl;
    throw std::runtime_error(msg.str());
  }

  std::vector<size_t> dims = dims_r(name);
  if (dims.size() != 1) {
    std::stringstream msg;
    msg << "Requested the vector " << name
        << ", but " << name
        << " is not defined as a vector in variable context"
        << std::endl;
    throw std::runtime_error(msg.str());
  }

  if (dims[0] != N) {
    std::stringstream msg;
    msg << "Requested the vector " << name
        << " with size " << N << ", but " << name
        << " is defined with size " << dims[0] << " in variable context"
        << std::endl;
    throw std::runtime_error(msg.str());
  }

  std::vector<double> vals = vals_r(name);
  vector_t output(N);
  for (vector_t_idx n = 0; n < N; ++n)
    output(n) = vals[n];
  return output;
}

// constr_dal
template <typename T>
void get_constrained_vector(const std::string& name,
                            const stan::io::transform& transform,
                            vector_t& param) {
  param.resize(transform.size());
  param = get_vector(name, transform.size());
  transform.validate(param);
}

// unconstr_dal
template <typename T>
void get_unconstrained_vector(const std::string& name,
                              const stan::io::transform& transform,
                              vector_t::InnerIterator& it) {
  vector_t constr_val = get_vector(name, transform.size());
  transform.unconstrain(constr_val, it);
}

Then we can drastically clean up (and simplify) the generated model code to look something like

// constructor
model_name {
  ...
  vector_lb_transform<input_type, lb_type>
    param_name1_transform__(lb_value);
  data_var_context.get_constrained_vector("param_name1",
                                          param_name1_transform__,
                                          param_name1__);
  ...
}

// init
void init(vector_t& unconstr_state) {
  vector_t::InnerIterator it(unconstr_state);

  ...
  vector_lb_transform<input_type, lb_type>
    param_name1_transform__(lb_value);
  get_unconstrained_vector("param_name1", param_name1_transform__, it);
  ...
}

// what write_array was
void constr_state(vector_t unconstr_state,
                  std::vector<double>& constr_state) {
  ...
  vector_lb_transform<input_type, lb_type>
    param_name1_transform__(lb_value);
  param_name1_transform__.constrain(unconstr_state, constr_state);
  ...
}

// log_prob
template <bool Jacobian, typename T>
T log_prob(vector_t& unconstr_state) {
  lp_accumulator<T> lp_acc__;

  ...
  vector_lb_transform<input_type, lb_type>
    param_name1_transform__(lb_value);
  vector_t param_name1__ =
    param_name1_transform__.constrain<Jacobian>(unconstr_state, lp_acc__);
  ...
}

At this point DAL implementations are solely responsible for providing the actual data access. A C++ DAL, for example, would then be extremely easy to write, something along the lines of the sketch below.
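
To make that concrete, here is a rough sketch of an in-memory DAL that can be filled directly from C++ containers; the class name and the add_r method are made up for illustration, and only the contains_r/vals_r/dims_r accessors from the existing var_context interface are assumed:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory data access layer, filled directly from C++.
class inmemory_dal {
private:
  std::map<std::string, std::vector<double> > vals_;
  std::map<std::string, std::vector<size_t> > dims_;

public:
  // register a real-valued variable along with its dimensions
  void add_r(const std::string& name,
             const std::vector<double>& vals,
             const std::vector<size_t>& dims) {
    vals_[name] = vals;
    dims_[name] = dims;
  }

  bool contains_r(const std::string& name) const {
    return vals_.count(name) > 0;
  }

  // callers are expected to check contains_r() first, as in get_vector() above
  std::vector<double> vals_r(const std::string& name) const {
    return vals_.find(name)->second;
  }

  std::vector<size_t> dims_r(const std::string& name) const {
    return dims_.find(name)->second;
  }
};

Filling one of these from whatever C++ structures the caller already has and handing it to the model constructor would cover the “easy construction from C++” item at the top of the thread.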

Let me also note that none of this redesign is controversial – everyone agreed upon it a few years ago and the spec has been sitting on my desktop buried under other issues.

Great, I never understood why that was all tied together. Question about the current transforms: there are all sorts of no-ops sprinkled around; why do we need them?

For example, in the reader there’s this one:

      /**
       * Return the next scalar in the sequence, incrementing
       * the specified reference with the log absolute Jacobian determinant.
       *
       * <p>With no transformation, the Jacobian increment is a no-op.
       *
       * <p>See <code>scalar_constrain()</code>.
       *
       * @param log_prob Reference to log probability variable to increment.
       * @return Next scalar.
       */
      T scalar_constrain(T& /*log_prob*/) {
        return scalar();
      }

Nope, that’s not a no-op method; I get it now.

Could you share that spec here? Seems like it’s time to re-review this.

Agreed.

Mine, too.

I put what @betanalpha gave me up on a stan-dev/stan Wiki for the var_context API spec.

There are two things: design and implementation. I’ll stay out of the current implementation since I don’t really want to dig into the code right now.

I think the redesign isn’t controversial – it just takes work and some coordination.

There’s a use case that pops up now and again that makes the design a little cumbersome. That said, it ties directly into Stan’s type system, so there may not be a better design out there. That use case is streaming data.

The reason this is hard, in my mind, is because the language is really flexible. It’s not like N always means number of observations and we can subsample anything indexed with N. I can’t really imagine a design that would be amenable to Stan’s full expressibility. That said, I bet there’s a way to design something that rstanarm could use to subsample that data in a sensible manner.

Anyway, if I wanted to build a streaming algorithm, I wouldn’t have any problem using the current interface and providing new var_context instances that have chunks of data.

Can you be clearer about this?

Sure. I’m thinking about Matt’s online LDA paper.

I can imagine there being a really large data set that possibly can’t be stored in memory. I can imagine we can ask for a subset of the data essentially infinitely many times and it’d return a subset of the data with the right sizes. To make it more concrete, think about a linear model.

data {
  int N;
  int M;
  real y[N];
  matrix[N, M] X;
}
parameters {
  vector[M] beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(X * beta, sigma);
}

This is way overkill, but suppose M was something like 10 and the real data had N at 1 billion. I could imagine M always being 10, but every time you asked for new data, it’d pull something like N = 1000. If this were some streaming variational approximation, then you’d use the first data set for a certain number of iterations, draw a new data set, go on for another few iterations, and rinse and repeat until the solution converges (converges meaning it stays stable, not that it gets to the right solution). A rough sketch of how such a context might pull batches is below.
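
Just to pin down the kind of thing I mean, here is a very rough sketch of the subsampling piece. All of the names are made up, it keeps the full data in memory for simplicity (a real streaming version would read each batch from disk or a socket), and it only produces the raw value vectors that a var_context/DAL built on top of it would expose as y and X with N equal to the batch size:

#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper that draws a fresh minibatch of rows from the full data.
class subsampling_source {
private:
  size_t N_total_, M_, batch_size_;
  std::vector<double> y_;  // length N_total_
  std::vector<double> X_;  // column-major, N_total_ x M_

public:
  subsampling_source(const std::vector<double>& y,
                     const std::vector<double>& X,
                     size_t M, size_t batch_size):
    N_total_(y.size()), M_(M), batch_size_(batch_size), y_(y), X_(X) {}

  // Fill y_batch and X_batch with batch_size_ rows sampled with replacement;
  // the model would then see N = batch_size_ on every refresh.
  void next_batch(std::vector<double>& y_batch,
                  std::vector<double>& X_batch) const {
    y_batch.resize(batch_size_);
    X_batch.resize(batch_size_ * M_);
    for (size_t n = 0; n < batch_size_; ++n) {
      size_t i = std::rand() % N_total_;  // crude sampling, fine for a sketch
      y_batch[n] = y_[i];
      for (size_t m = 0; m < M_; ++m)     // keep the column-major layout
        X_batch[m * batch_size_ + n] = X_[m * N_total_ + i];
    }
  }
};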

That make it any clearer? I could write up something more substantial later if you want.

That’s plenty, thanks. What I don’t see is how the new design interferes with that.

It doesn’t interfere; it doesn’t make it better or worse.

Just re-read your original post and saw this bit:

I imagine most users who want to use mini-batches would be willing to supply the mini-batches themselves, but I see where you were coming from now.

Yup. I had hours of discussions with Alp about this when he was implementing ADVI.

We can always write a subsampling var_context that’s specific to the form of the data and what the unit of observation is, but we can’t assume that structure in general. Using it properly within Stan would be a matter of convention. Implementing it wouldn’t be too difficult, but it requires configuration from the interfaces (and a lot of coordination).

It’s total abuse of the context concept to write that; to me it makes more sense for it to take a vector of files in its constructor and move to the next file on request…

Even that seems like something that should be wrapping the var contexts, not be built into them.
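
Something like this is what I mean by wrapping: a sketch of a combinator that holds several existing contexts and just forwards to the current one. The rotating_context name and the advance() method are made up, and only the usual contains_r/vals_r/dims_r read methods are assumed.

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wrapper that cycles through existing contexts instead of
// baking "move to the next file" into a single context implementation.
template <typename context_t>
class rotating_context {
private:
  std::vector<context_t*> contexts_;  // not owned by the wrapper
  size_t current_;

public:
  explicit rotating_context(const std::vector<context_t*>& contexts):
    contexts_(contexts), current_(0) {
    if (contexts_.empty())
      throw std::invalid_argument("rotating_context needs at least one context");
  }

  // move to the next chunk of data, wrapping around at the end
  void advance() { current_ = (current_ + 1) % contexts_.size(); }

  // forward the usual read-only interface to the current context
  bool contains_r(const std::string& name) const {
    return contexts_[current_]->contains_r(name);
  }
  std::vector<double> vals_r(const std::string& name) const {
    return contexts_[current_]->vals_r(name);
  }
  std::vector<size_t> dims_r(const std::string& name) const {
    return contexts_[current_]->dims_r(name);
  }
};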


The tee implements the basic interface given two other implementors. It doesn’t add an option to the basic interface.

Yeah, I can see how wrapping would be cleaner. No matter what, under the hood you’d be reading from disk or a socket to pre-cache some batches and serve them from RAM as needed.