Using Stan for SPRITE

Can one use Stan to generate many samples that satisfy certain criteria exactly? I’m thinking of what is described here as SPRITE. Ideally, I’d like to generate multiple possible datasets that have a specific mean, standard deviation, and a given histogram (the same every time, for one value of n). For the histogram, one could presumably adapt what is described on page 202 of the manual (for measurement error from rounding), but I’m not sure how to go about this in a way that yields datasets that have exactly the right mean, sd, and counts in each histogram bin.

What is the generative process for the datasets? For example, if you want to define a K-vector alpha with mean mu, you can do

parameters {
  vector[K - 1] alpha_raw;

transformed parameters {
  vector[K] alph;
  mu[1:(K-1)] = alpha_raw;
  mu[K] = mu - sum(alpha_raw);

If you want a parameter beta that has a given variance sigma, you can write

parameters {
  unit_vector beta_raw;

transformed parameters {
  vector[K] beta = beta_raw * sigma^2;

You can do other tricks to keep things inside a histogram (each bin needs its own variables with the right counts, defined with the right constraints).

Combining any two of these is going to be a lot harder, and combining all three would be a real challenge. I’m guessing it would be possible if your constraints are consistent (obviously if you want a lot of observations in distant bins, that puts bounds on what the mean and variance can be).

Yeah the challenge is keeping things consistent with the bins. Suppose you think that the distribution is exponential, and someone is willing to send you a histogram and summary stats (the mean, sd, and n) but not the raw data. The challenge is to sample from that histogram in a way that’s consistent (exactly) with the mean and sd provided. The first idea was something simple where you get the density for each bin and sample from there, with transformations afterward to get the right mean and sd. The issue is that any data points won’t (can’t) fit a distribution exactly, so you have to deal with uncertainty about the parameters that define the distribution. That’s why I figured Stan could be useful, but I don’t know where to go with that.

I don’t see how to do this easily, even assuming you have laid down constraints (bin counts and boundaries, mean, and sd) that admit a solution. The question is what you want the sampling distribution to be in that constrained space. If you generate in the bins (easy to do in Stan), then any posterior adjustment destroys those sampling properties.

Stan’s not set up to conjoin several external constraints easily. The problem is that you need the Jacobians in order to turn it into a general model, and those interact in nasty ways when you start combining constraints.

What if I had some sense of the sampling distribution? To give a simple example, I think one thing that should be easily doable is if I had just two bins: one from negative infinity to 0 and 0 to infinity. Then, I could (quite easily in Stan I think) sample $n_{bin} - 1$ points for each side from a half-normal which is close to what I want, and get exactly the right two points needed to get exactly the right mean and sd. However, I’m not sure how I’d even go about implementing that, let alone generalize that process.

To do this in Stan, you’d need to write down proper densities for the variables you’re sampling with matching parameter constraints. Why do you even care about doing it probabilistically?

I’d like to be make predictions about what “future” histograms would look like, without using some simplistic Dirichlet-multinomial model which works bin by bin.

Then you need a proper (in the sense of being normalizable) probabilistic model. As soon as you have that, you should be able to code it in Stan, although it may be a lot of work, depending on how the constraints pan out and the complexity of the Jacobians introduced.

What if I did something much simpler to start with? I’m thinking of the example on page 203 of the Stan 2.15.0 manual (the model with latent parameters). Is there an easy way to modify that code so that I can get samples from z of size n given measurements of y (for now ignoring that we have information about mean and sd)?

Not sure I follow. The latent values there are true values. Are you going to have more than one measurement for the same true value?

But to answer the direct question, yes, you can always add moe parameters. Just change the declarations and sampling statements.

No, only one measurement for the same true value. But is it possible to sample the latent values given that one has the rounded values?

Yes, you can. It depends on the rounding process. If you round to the nearest integer, then a measured value of 6 means 5.5 to 6.5, so if you want to sample in that region, you just have to do

parameters {
  real<lower=5.5, upper=6.5> x_true;
...
model {
  x_true ~ foo(theta);

Ah, but is there a way to modify the code in the manual so I do not have to specify all of the constraints manually? Moreover, different observations will have different associated bounds. What would be a simple way of getting estimates (e.g) for mu and sigma of a Gaussian when observations are rounded, together with samples for x_true?

If the rounding is all unit rounding, then you can code a vector of zero-centered unit intervals and add the means. Stan’s just a programming language, so you can just code all these things up directly.

data {
  int<lower=0> N;
  vector[N] y_rounded;
  ...
parameters {
  real<lower=-0.5, upper=0.5> rounding_error[N];
  ...
model {
  y_rounded + rounding_error ~ foo(theta);
...

Or if they’re not all unit rounding you can write your own transforms and include the log Jacobian adjustments as given in the manual chapter on transforms.

1 Like