CV - indexing and multiple tilde assignment


Hi everybody!

I am trying to do cross-validation within my stan model. I haven’t found any method to do that on its own (rstanarm has it, but I rather want to have it in the model itself) - is there anything else?

  1. The first question is about indexing. In Stan I can use int arrays for indexing like y[idx1] where y is what I want to subset with the indices in the integer array idx1. But can I somehow exclude certain indices? In R this would be y[ -(idx1)], but this doesn’t work in Stan. Or can I index with a condition? Again in R, this would be y[idx[i] == 2] or y[idx[i] != 3].

  2. My second questions is about tilde assignments: Can I use one element multiple times on the left side of a tilde? In cross validation I would have to use y[1] for training every group of parameters except the one that the first observation is in itself, so K-1 times.
    If multiple assignments are not possible, my first workaround idea was to replicate the data y so that I have a 2-dimensional array and then I would not assign y[1] multiple times but rather use y[1, 1], then y[2, 1], …, y[K, 1].



No, except by constructing a complimentary set of indices

Not easily



Fantastic, thanks so much!
I will have to construct the not_group_k index sets then.


First, it’s not doing an assignment.

y ~ foo(theta);

is just shorthand for

target += foo_lpdf(y | theta);

with some constants dropped out. There’s no assignment—y must be defined before the call.

If you do

y ~ foo(theta);
y ~ bar(phi);

then it’s equivalent to saying

p(y | theta, phi) prop-to foo(y | theta) * bar(y | phi).


Ah, yes. I was wondering how to say that before I wrote my post, hoping it would go unnoticed. ;)
Therefore: How would you say it?

And additionally, regarding the following:

Ouh, is that so? I thought (and was using it now) as two different statements, wanting both to be valid. Eg two equations, I could say

y = a1 + b1 * x
y = a2 + b2 * x

where y and x are given. This is what I want for the cross validation, that each value of y used for the estimation of all parameters which do not belong to the group this particular value is in. Like

y[1] ~ a[1] + b[1] * x[1]
y[1] ~ a[2] + b[2] * x[1]

where y[1] is in say group 3 of parameters for CV. And I want a[1] as well as a[2] to “learn” from y[1].

And as far as I can see it works. But if it was proportional to the product of the two then this would work neccessarily, would it?


Say what? We call it a “sampling statement” even though it doesn’t explicitly call a random-number generator. The point is that you can use them to mirror the generative story in building up a graphical model (like in BUGS or JAGS).

I don’t know what you think would happen—it’s not an equation!

It’s proportional to the product. The point is that sampling statements just increment the log density—that’s equivalent to multiplying in a term on the linear scale.

Usually for cross-validation the data is divided into a training set and a test set, the model estimated on the training set, then measured on the test set (usually some expectation like an event probability or posterior parameter estimate).


Ah, okay, that’s what I meant. What is your name for a statement of the form “a ~ b” (and it is ‘sampling statement’). Thanks!

Exactly, that’s what I am doing. But as I am not only dividing into training and testing once, but multiple times, every observation y[i] is used more than once for training (and thus in my code it is the left-hand-side of a sampling statement more than once).
Imagine having three observations y1, y2 and y3 and doing 3-fold CV on them.
This could mean in the first go only y2 and y3 are used for training (and then the obtained samples are used only with y1 to produce a prediction, which will be compared to the observed outcome for y1).
y2 ~ 'function of parameters_CV_group[1] and data for observation 2'
y3 ~ 'function of parameters_CV_group[1] and data for observation 3'

In the second “round” though, y1 and y3 are used for training. This means that y3 is used on the left side of a ~ again (and indeed all of them are used twice here), like
y1 ~ 'function of parameters_CV_group[2] and data for observation 1'
y3 ~ 'function of parameters_CV_group[2] and data for observation 3'

This is how I set up my cross validation in the stan model. Now I am wondering if my wrong understanding of sampling statements (as simultaneously valid equations) will create problems. All right-hand-sides for the same y will never contain the same parameter again (as they cycle through all CV groups), does this extra condition help?

So far, I only tried with two groups to check the indexing (here there is only one sampling statement for every y) and with three groups it worked as well and produced sensible results. I hope I did not go too far astray with explaining. If anything is unclear, please let me know.


Usually you want to set up the K-fold cross-validation on the outside, not within the Stan program. The Stan program then fits based on the training data and predicts for the test data.

What you’re doing isn’t going to work unless you have 10 copies of the parameters for 10-fold cross-validation.


Yeah, why? I had the feeling that it would be easier to diagnose and extract if all was done within one stan function call. Otherwise I would have to collect all the samples together from the K fits. Obviously this is easily arranged in R, but I thought I would lose the “compact-ness”.

Yes, that is what I am doing. My parameters have an additional dimension which represents to which group they belong. The parameters for group k are then only used once for training (with all data (here y and X) except for the data belonging to group k) and later for creating replicated data (with all the data belonging to group k).

Side note: Would I/you/the Bayesian community call this “replicated” data or “simulated” data (as I do not use the same data points for training and testing), is there a consensus?


It’s usually more efficient and stable to fit ten little models than one big model. If the models are really simple, that may reverse.

It also seems easier to do all the data munging in R, which has many more tools suited to the task than Stan.

But you’re right that you lose the unification into one program.

If you generate data from the priors using pseudorandom number generators, we’d call it “simulated data” (Andrew also calls it “fake data”). The word “replicate” is very overloaded. You can replicate whole experiments, replicate a statistical analysis, etc. You usually need some kind of exchangeaability if you’re going to estimate any kind of parameter.

In cross validation, you typically take a big data set, and break it into K “folds”, each of which is used in turn for testing with everything else used for training.

If you can generate your own simulated data, it’s better still to replicate at the simulation level. That is, generate multiple simulated data sets. That’s what the Cook-Gelman-Rubin diagnostics do (our marketing department’s working out what to call that procedure becuase we’re doing it more and more and Andrew doesn’t like procedures named after people, even if he’s one of them).


Ah okay. I’ll try to get it running just for the sake of it, but then double-check the time and stability compared to K little models.

Thanks for the word clarification as well. As I am explaining what I’m doing anyway, it is easy to follow, but I didnt’t want to cross with existing connotations too much.

I will check out C-G-R diagnostics more over the next week, but was unsure about

I am not generating data which I then use for estimating the parameter, but only simulate/replicate data using the MCMC samples I reiceived to do predictive checking of my model. Thus, I am unsure what you mean by “generating multiple simulated data sets”. Currently, I produce 6000 replicated sets from my 6000 samples (4 chains, 2000 iter, 500 warmup -> 4 * 1500 samples after warmup). Is that what you mean already?


It should be clearer when you check out the CGR diagnostics. You generate multiple complete data sets starting from generating parameters from the prior, then data from the parameters.