Apparently I do a terrible job of communicating this concept; this isn’t the first time I’ve managed to make @betanalpha think I’m totally nuts. I chalk this up to us talking past each other, so I take the blame here.
If we have a known p(Data | Params) over our data, and for some reason we want to express it in our model using a transformed version of our data, p(F(Data) | Params), then p(F(Data) | Params) needs to be a particular push-forward measure from the Data space onto the F space, and we need to do all that stuff Mike mentioned; otherwise we’re not using something equivalent to our known p(Data | Params). I hope at least here Mike and I are on the same page.
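For concreteness, here’s a minimal Stan sketch of this first case (a made-up model, all names illustrative): Data is known to be lognormal given the Params, but we choose to write the likelihood on F(Data) = log(Data), so we add the log absolute Jacobian of F by hand.

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] y;  // Data
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  log(y) ~ normal(mu, sigma);  // likelihood expressed on F(Data); Stan may warn about a Jacobian
  target += -sum(log(y));      // log |dF/dy| = -log(y); makes this match y ~ lognormal(mu, sigma)
}
```

Since F here depends only on the data, the correction is a constant during sampling, but including it is exactly what makes the target equivalent to the known p(Data | Params).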
On the other hand, it seems often enough that people don’t know a distribution over their data; instead they assign a distribution over some function of their data. So Some_F_Distro(F(Data) | Params) is given by their modeling choices, and this implies that the likelihood over the data is the push-forward measure induced by Finverse (when this thing exists, so F needs to be not too weird). This is the same as saying foo ~ normal(0,1), getting samples of foo, and then later computing expfoo = exp(foo): you wind up with a lognormal distribution over expfoo, because that’s the push-forward measure of a normal distribution pushed through the exp function.
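A tiny Stan illustration of that push-forward (illustrative names):

```stan
parameters {
  real foo;
}
model {
  foo ~ normal(0, 1);
}
generated quantities {
  real expfoo = exp(foo);  // draws of expfoo are lognormal(0, 1): the push-forward through exp
}
```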
If you insist on expressing your model in a generative Data ~ SomeDistro(Params) manner, then you will need to do all the stuff Mike mentions to recover SomeDistro from Some_F_Distro. On the other hand, if you’re happy with your choice of p(F(Data) | Params), then you can use p(F(Data) | Params) in Stan as a kind of likelihood, since it implies some unknown push-forward measure onto Data space, provided the F function is sufficiently nice. In Stan that looks like
```
F(Data) ~ KnownDistro(Params)
```
This will throw a Jacobian warning, because Stan assumes you KNOW the distribution you want on Data and that you need to correct the distribution on F to account for the Jacobian. In this case the warning doesn’t apply: we don’t know the distribution on Data; we accept that Data just has the push-forward measure implied by choosing KnownDistro for F and pushing this measure through Finverse. (And this is a modeling assumption we should check.)
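A minimal sketch of this second reading, again with the made-up F(Data) = log(Data):

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] y;  // Data
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  log(y) ~ normal(mu, sigma);  // the modeling choice itself; intentionally no Jacobian term
}
```

Because F here involves only data, this and the Jacobian-corrected version above differ by a constant and give the same posterior; the two readings genuinely diverge once F involves the Parameters, as in the implicit-relationship case below.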
Then if you want to predict new data, you create, say, an FF parameter, assign it the same distribution, sample in the F space, and then numerically solve FF = F(Data) for Data using the FF samples, to get samples in Data space. It’s the same as calculating expfoo = exp(foo), except that instead of “exp(foo)” you need to numerically solve.
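Here’s a hedged sketch of that prediction step, using a made-up F(d) = d + 0.1 * d^3 that is monotone but has no closed-form inverse, so the back-transform is a numerical solve (via Stan’s algebra_solver; mu and sigma stand in for the Params):

```stan
functions {
  vector F_residual(vector d, vector ff, data array[] real x_r, data array[] int x_i) {
    vector[1] r;
    r[1] = d[1] + 0.1 * d[1]^3 - ff[1];  // F(d) - FF
    return r;
  }
}
data {
  real mu;             // stand-ins for Params
  real<lower=0> sigma;
}
transformed data {
  array[0] real x_r;
  array[0] int x_i;
}
parameters {
  real FF;  // lives in F-space
}
model {
  FF ~ normal(mu, sigma);  // same distribution as F(Data)
}
generated quantities {
  // numerically invert F: draws of data_pred carry the push-forward
  // measure of FF through Finverse
  real data_pred = algebra_solver(F_residual, [0.0]', [FF]', x_r, x_i)[1];
}
```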
I tried to lay this out clearly here: http://models.street-artists.org/2017/09/25/followup-on-implicit-function-theorem-likelihoods/
Going further, if you have a “relationship” such as F(Data, Covariates, Parameters) = 0 + error, where the implicit predictor function that gets you Data = f(Covariates, Parameters) is unknown, non-separable, etc., the same type of thing applies. If you know the distribution of the error, you can say
```
F(Data,Covariates,Parameters) ~ distribution_for_error()
```
which gives you a weighting function over the Parameters, for fixed Data and Covariates, that deforms your prior into your posterior distribution. Since it’s evaluated at the data samples, and those don’t change during sampling, it’s just a pure function of the Parameters.
And if you want to predict new Data, you can create a parameter perr, give it distribution_for_error(), sample in it, and numerically solve for Data as a function of perr, thereby inducing a push-forward measure onto your Data space. (And again, you should check that this push-forward measure onto Data space makes sense, just like you should check any model.)
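Here’s a sketch of both pieces for the implicit case, with a made-up relationship F(Data, Covariates, a) = Data + a * Data^3 - Covariates, assumed to equal 0 plus normal error (perr and all other names are illustrative):

```stan
functions {
  vector F_err(vector d, vector theta, data array[] real x_r, data array[] int x_i) {
    // theta = [a, x_new, perr]; residual of F(d, x_new, a) - perr
    vector[1] r;
    r[1] = d[1] + theta[1] * d[1]^3 - theta[2] - theta[3];
    return r;
  }
}
data {
  int<lower=1> N;
  vector[N] y;  // Data
  vector[N] x;  // Covariates
  real x_new;   // covariate value at which to predict
}
transformed data {
  array[0] real x_r;
  array[0] int x_i;
}
parameters {
  real<lower=0> a;
  real<lower=0> sigma;
}
model {
  a ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y + a * (y .* y .* y) - x ~ normal(0, sigma);  // F(Data, Covariates, Parameters) ~ error
}
generated quantities {
  real perr = normal_rng(0, sigma);  // sample in the error space
  // numerically solve F(d, x_new, a) = perr for d; the root is unique
  // because F is strictly increasing in d when a > 0
  real y_pred = algebra_solver(F_err, [0.0]', [a, x_new, perr]', x_r, x_i)[1];
}
```

The sampling statement on F is the weighting function from above, and the generated quantities block is the push-forward onto Data space.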
The qualifications about “usually (mostly?)” etc. all come from the fact that a numerical solver for Data as a function of perr needs to give you a unique answer if you want this to be straightforward, needs to give you a countable set of possible answers if you’re willing to make it more complicated, and if it gives you a continuum of answers, like @Bob_Carpenter mentioned, then you’re most likely doing something wrong.
When you choose to do this kind of thing, you have to check that it makes sense, in the same way that if you do a linear regression you have to check that a linear regression makes sense, and if you fit a Fourier series you have to check that the Fourier series makes sense. If you use an implicit function, you have to check that when you sample in the intermediate space and solve for Data numerically, the solutions you get make sense.
Finally, the ABC method is typically an example of this, where you’re working in an intermediate space through a non-invertible transform, but on purpose. For example, you take some weather-predictor black-box model, project the massive global prediction output down to a small number of “statistics”, and then assign a “likelihood” over those statistics:
```
Statistics_of_Simulation(Full_simulation_output(Parameters)) ~ some_distribution(Parameters)
```
And you definitely can’t recover the Full_simulation_output from the small set of statistics in the usual case; it’s not invertible. Typically what you have is: for a given set of parameters, you randomly choose some initial conditions based on those parameters, run your black box forward, and then compute some summary statistics from the output. Even though the parameters don’t define a one-to-one mapping to outputs, we still get useful inferences that pick out Parameter values that at least approximately match the summary statistics of our simulation.
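The summary-statistics “likelihood” itself is expressible in Stan even though the black-box simulation is not. A degenerate, hedged sketch, where the statistic’s sampling distribution happens to be known in closed form (all names illustrative):

```stan
data {
  real ybar;            // observed summary statistic (a sample mean)
  int<lower=1> n;       // number of raw observations behind ybar
  real<lower=0> sigma;  // treated as known, to keep the sketch minimal
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 1);
  ybar ~ normal(mu, sigma / sqrt(n));  // Statistic ~ some_distribution(Params)
}
```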
I think it’s possible to call ABC “not fully Bayesian”, but I don’t choose to do that. The end result is that every model over measurements is really a model of some summary of a much more complicated reality. For example, a model for the trajectory of the center of mass of a ball is “really” a model for a mapping from statistics of the initial conditions, to detailed initial conditions of trillions of molecules, to a projection forward in time of those trillions of molecules, and then a collapse back to the summary statistic “center of mass”. The success of a formula for a ball falling from a given height comes down to the fact that the answer for what happens to the center of mass is totally insensitive to the details of the individual molecules.
Your mileage may vary as to what you want to accept in your modeling. The key is to figure out where you have given information and where you have an induced push-forward measure, and to make sure the dog (the measure you actually specified) is wagging the tail (the push-forward it induces).