Non-invertible change of variables?


#6

Nice. Thanks!


#7

More fundamentally, anything expressable as

F(Data,Params) ~ distribution(constants)

is actually a likelihood function of a sort. Even if you can’t solve for the data values explicitly. The meaning of statements like this is “downweight the probability density in the vicinity of the vector of Params to the extent that it makes F(Data,Params) take on values away from the peak of distribution(constants)”

Whether you admit such likelihoods or not in your modeling is kind of a matter of taste. They can be difficult to debug as it can be hard to generate fake Data depending on how complicated your F function is (and technically I think you can express models here which are non-generative in the sense that they don’t imply a normalizable probability distribution over the Data). But often (mostly?) they’re totally legit and can express fairly complicated models in a simple manner.

They are, sort of, the statistical analog of the implicit function theorem https://en.wikipedia.org/wiki/Implicit_function_theorem


#8

Many thanks — I was pretty sure there was something fundamental I was missing.


#9

It’s the human version of Gelman’s folk theorem “things are never that difficult; you just haven’t thought about it enough.”


#10

Well, my position on this is not without controversy, and has taken a while for me to understand, but in the case where you can solve for the data it’s pretty obvious what’s going on.


#11

I wrote this up as a blog post explaining how the implicit function theorem relates to this issue for posterity.

http://models.street-artists.org/2017/09/22/the-implicit-function-theorem-and-likelihood-functions/

I see that some related posterity recently hit the discourse forum as well where someone linked to our old google groups discussion about rounding that initially got me thinking about this issue.

@Bob_Carpenter

Would it be hard for Stan to determine that the left hand side of a sampling statement included a transform of both parameters and data, and then alter the warning? Perhaps the warning could say something like “The left hand side of this sampling statement contains a transform of both parameters and data. This may define an implicit likelihood or may be part of a multi-parameter reparameterization which would require a Jacobian correction. It is the responsibility of the user to ensure that any required Jacobian corrections are included explicitly as Stan does not automatically calculate anything in this case.”

Or the like. I also would be happy to donate some version of my blog post to a section in the Stan manual if it seems useful (and the warning could refer to that section). I think we’ve seen several cases in the last few weeks where the warning about the Jacobian correction leads people in the wrong direction when in fact what they’re doing is an implicit likelihood. I do think the Jacobian warnings are very helpful, but they currently don’t detect the difference between say a pure invertible parameter transform requiring a Jacobian correction, and a relationship between data and a parameter that defines an implicit likelihood and so they sometimes lead people astray.


#12

This is a dangerous benchmark. it’s easy enough to get rid of Jacobian warnings where you really should be calculating one. Just using target += won’t catch any potential Jacobian issues.

In a very real sense these problems are all undecideable because Stan gives you a Turing-equivalent imperative language so you can’t decide statically which statements will get executed in a call to the log density. So it’s all going to come down to heuristics.

I completely agree. I’m not the one to talk to about this becuase I barely understand implicit likelihood functions. Ben Goodrich and Michael Betancourt are the ones to talk to.


Does the soft-constraint converge to rigid-constraint?
#13

That’s what’s going on with something like a simplex. We map K - 1 unconstrained parameters ot a K-simplex. The Jacobian is (K - 1) x (K - 1) and the last value in the simplex is just one minus the sum of the first K - 1. So what’s really going on is that the simplex is really only (K - 1)-dimensioanl.

The bad situation is where you have something like parameters a and b, then define c = a / b and try to identify both a and b by setting c ~ foo(theta). It won’t work because there’s an infinite ridge. You can identify this by putting a prior on a, b. This is what’s actually going on in our unit vector parameterization—it is really only K - 1, but we overparameterize with K parameters then throw an identifying distribution on them.


#14

Yes, but the current Jacobian warnings have the same issue right? There must be some heuristic involved where if the thing on the left hand side of the ~ involves a parameter, spit out a warning. I’m just suggesting that if the thing on the left hand side of the ~ involves both parameters and data spit out a slightly different more broadly applicable warning. I think the average user takes the warning as gospel, as if Stan were much more sure of itself, and so they think “Gee Stan has detected that I’m definitely doing something wrong, and it is telling me exactly what I need to do to correct it, but now I really HAVE to figure out a Jacobian and apply it… but… I can’t figure out what makes sense here, so I must be doing something wrong” whereas in fact sometimes having a transformed value on the left hand side is an implicit likelihood, and likelihoods are only probability densities over the data and since they’re evaluated at particular data points, with respect to the parameters they’re just functions and subject to no such Jacobian corrections (note, they would in some sense be subject to Jacobian corrections if you decide what the distribution should be on the data, but you have transformed data to work with instead of raw data and you don’t want to un-transform first).

Having the warning express less certainty about what is going on, to match the uncertainty involved in the heuristic used to decide whether to spit out a warning… leaves the user in the position of “Stan is telling me I should think carefully about this” not “Stan is telling me I need to do something very specific here but I don’t understand how or why”

This stuff about implicit likelihoods is surprisingly easy to crop up in models motivated by physics or other existing theoretical constructs that are expressed naturally in an implicit way. Whenever you have algebra that can be naturally expressed as F(data,parameters)+epsilon = 0 and the data isn’t easily “solved for” the notion naturally crops up to simply do

F(data,parameters) ~ distribution_for_epsilon()

And this will currently spit out a warning that isn’t applicable.


#15

This unfortunately is completely incorrect. Having “often (mostly?)” as a best case is very good indicator that the problem is being approached incorrectly.

Change of variables follows from probability theory which is airtight in what you can and cannot do.

  1. If you have a one-to-one transformation then the transformed density is given by the original density evaluated at the transform inverse augmented with the absolutely value of the Jacobian determinant.

  2. If you have a one-to-many transformation then you do the above but sum over all of the possible values of the transform inverse. This behaves well except in cases where there are infinite number of values, for example when you try to map a circle into the real line by transforming (0, 2pi) to (-\infty, \infty), in which case the transform becomes ill-posed.

  3. If you have a many-to-one transformation then you have to be very careful to ensure that you can self-consistently define a probability distribution on the transformed space. There is no “implicit distribution theorem” – the only way to do this is to “buffer” the output with additional variables so that you can define a one-to-one transformation, solve for the appropriate density on the buffered space, and then marginalize out the buffer variables. This is, for example, how one goes from a 2D Gaussian to a Rayleigh distribution or a 3D to a Maxwell, or the N simplex to the N-1 dimensional unconstrained real space we use in Stan (as @Bob_Carpenter notes below) . Typically this approach is limited to cases where the marginalization can be done analytically so it’s hard to apply to complex problems like this.

There are many heuristics that capture some of these properties, but as with many heuristics they do not capture all of them and hence inevitably lead one to invalid answers. And unfortunately those violations tend to be in exactly the difficult problems of applied interest!


Does the soft-constraint converge to rigid-constraint?
#16

I’d clarify here “one-to-one transformation of a parameter vector”, that is, taking some parameter vector Foo and expressing it as Foo = F(Bar) with F invertible. I agree with you totally here.

Again, applies to transforms of parameter vectors, yes.

The thing about all this is that for example in this original case, the factor that @James_Savage wanted to add to his model was NOT a probability distribution over parameters. It was a non-negative weighting function over the parameters.

I think you’ll agree that if p(Parm)dParm is a probability distribution and q(Parm) is a non-negative function such that Z=integrate(q(Parm)p(Parm)dParm) exists then q(Parm)p(Parm)/Z is also a probability distribution, deformed by the q function.

Then if p(Parm) is a prior density, what kinds of q(Parm) functions will you allow yourself to multiply your prior by in order to get a posterior? What q functions express the model that you want to use? There is no single right answer here it’s a modeling choice.

That’s the question being asked here. @James_Savage wanted to multiply his prior by a function expressed not as

X1 ~ distribution( X2,theta,beta,sigma)

which corresponds to p(X1 | X2,theta,beta,sigma) where the parameters are given

but instead he had

F(X1,theta) ~ normal(x2*beta,sigma)

which corresponds to p(F(X1,theta) | X2,theta,beta,sigma) again, a statement of probability about F values that you get when you plug in observed data X1 for given theta, beta, sigma, and so a modeling choice.

Here X1 and X2 are data. And because I was able to re-work his problem into a form where X1 was on the left hand side, it was easy to see, he was in fact defining something equivalent to a regular old likelihood.

The next question is whether you would accept a q(Params) type non-negative weighting function that was expressed like:

F(X1,theta) ~ normal(x2*beta,sigma)

Well, if you tell me that your modeling assumption is:

F(X1,theta) = 0 + epsilon

and epsilon ~ normal(X2*beta,sigma)

then those modeling assumptions imply that you should accept

F(X1,theta) ~ normal(x2*beta,sigma)

as a deforming function for your prior.

This is the source of the caveat “often (mostly?)”. Since you’ll only be plugging in the X1 values you actually observe, it’s possible to be working with a non-generative model for X1, or to be accidentally making a modeling assumption you don’t really want to make. As long as the resulting q(Params) leads to a normalizable probability distribution, you’ll get a probability distribution over the parameters… the question is just whether it’s one you believe in at the modeling level, not at the probability theory level.


#17

Note that if he had given us a probability distribution on X1 and asked to express an equivalent probability statement on F(X1,theta) then we’d need to do all the stuff you mention in your list of possibilities. But, instead of giving us a distribution on X1 he gave us a distribution on epsilon in an expression F(X1,theta) = 0 + epsilon, and we just used the given distribution on epsilon directly.


#18

Apparently I do a terrible job of communicating this concept, this isn’t the first time I’ve managed to make @betanalpha think I’m totally nuts. I chalk this up to talking past each other, so I take the blame here.

If we have a known p(Data | Params) over our data, and we for some reason want to express it in our model using a transformed version of our data p(F(Data)|Params) then we need to make p(F(Data)|Params) be a particular push-forward measure from the Data space onto the F space, and we need to do all that stuff Mike mentioned, otherwise we’re not using something equivalent to our known p(Data | Params). I hope at least here Mike and I are on the same page.

On the other hand, it seems often enough that people don’t know a distribution over their data, they assign a distribution over some function of their data, so Some_F_Distro(F(Data)|Params) is given by their modeling choices and then this implies that the likelihood over the data is the push-forward measure induced by Finverse (when this thing exists, so F needs to be not too weird). This is the same as saying foo ~ normal(0,1) getting samples of foo, and then later saying expfoo = exp(foo) you wind up with a distribution over expfoo which is lognormal, because that’s the push-forward measure of a normal distribution pushed through the exp function.

If you insist on expressing your model in a generative Data ~ SomeDistro(Params) manner, then you will need to do all the stuff Mike mentions to find yourself SomeDistro from Some_F_Distro. On the other hand if you’re happy with your choice of p(F(Data) | Params) then you can use this p(F(Data)|Params) in Stan as a kind of likelihood, since it implies some unknown push-forward measure onto Data space, provided the F function is sufficiently nice.

F(Data) ~ KnownDistro(Params)

which will throw a Jacobian warning because it assumes you KNOW the distribution you want on Data and that you need to correct the distribution on F to account for the Jacobian. in this case, it’s wrong, because we don’t know the distribution on Data we accept that Data just has the push-forward measure implied by choosing KnownDistro for F and pushing this measure through Finverse. (and, this is a modeling assumption we should check)

Then if you want to predict new data, you create say a FF parameter, assign it the same distribution, sample in the F space, and then numerically solve for Data using FF samples to get samples in Data space. Same as calculating expfoo = exp(foo) except instead of “exp(foo)” you need to numerically solve.

I tried to lay this out clearly here: http://models.street-artists.org/2017/09/25/followup-on-implicit-function-theorem-likelihoods/

Finally, if you have a “relationship” such as F(Data,Covariates,Parameters) = 0 + error, where the implicit predictor function that gets you Data = f(Covariates,Parameters) is unknown, non-separable, etc the same type of thing applies. If you know the distribution of error, you can say

F(Data,Covariates,Parameters) ~ distribution_for_error()

which gives you a weighting function over the Parameters for fixed Data,Covariates that deforms your prior into your posterior distribution. Since it’s evaluated at the data samples, and they don’t change during sampling, it’s just a pure function of the Parameters.

and if you want to predict new Data you can create a parameter perr, give it distribution_for_error() and sample in it, and numerically solve for Data as a function of perr thereby inducing a push-forward measure onto your Data space. (and again you should check that this push-forward measure onto data space makes sense, just like you should check any model)

The qualifications about “usually (mostly?)” etc all come from the fact that say a numerical solver for Data as a function of perr needs to give you a unique answer if you want this to be straightforward, and needs to give you a countable set of possible answers if you want it to be more complicated, and if it gives you a continuum of answers like @Bob_Carpenter mentioned, then you’re most likely doing something wrong.

When you choose to do this kind of thing, you have to check that it makes sense, in the same way that if you do a linear regression you have to check to make sure that a linear regression makes sense, and if you do a fourier series, you have to check the fourier series make sense. If you a do an implicit function you have to check to make sure when you sample in the intermediate space and solve for Data numerically the solutions you get make sense.

Finally, the ABC method is typically an example of this where you’re actually doing something in an intermediate space, using a non-invertible transform, but on purpose. For example, you take some weather predictor black box model, you project the massive global prediction output down to a small number of “statistics” and then you assign a “likelihood” over those statistics:

Statistics_of_Simulation(Full_simulation_output(Parameters)) ~ some_distribution(Parameters)

And you definitely can’t recover the Full_simulation_output from the small set of statistics in the usual case. It’s not invertible. Typically what you have is for a given set of parameters, you randomly choose some initial conditions based on those parameters, run your black box forward, then from the output compute some summary statistics, nevertheless even though the parameters don’t define a one-to-one mapping to outputs, we get useful inferences that pick out Parameter values that at least approximately match the summary statistics of our simulation.

I think it’s possible to call ABC “Not fully Bayesian” but I don’t choose to do that. The end result is actually that every model over measurements is really a model of some summary of a much more complicated reality, so for example a model for the trajectory of the center of mass of a ball is “really” a model for a mapping from statistics of the initial conditions, to a detailed initial conditions of trillions of molecules, to a projection forward in time of the trillions of molecules, and then a collapse back to the summary statistic “center of mass”. The success of a formula for a ball falling from a given height is down to the fact that the answer for what happens to the center of mass is totally insensitive to the details of the individual molecules.

Your mileage may vary as to what you want to accept in your modeling. The key is to figure out where you have given information, and where you have an induced push-forward measure, and to make sure the dog (base measure) is wagging the tail (inducing the push-forward).


#19

Right—it’s where the the false positives crop up you mention.

Specifically, Stan warns if the variable on the left is a local variable or transformed parameter or an explicit transform of a function by a function not known to be linear.

We’ve been debating what to do about this from the get go. We have a very multimodal user community, some of whom take every warning to be the end of the world and some of whom ignore every warning as long as a number comes out the other end.

I don’t understand what you’re proposing here and why data being involved would matter—data in the Stan sense just provide constants. No difference between

data {
  real x;
  ...

  ... x * y ...

and

data {
  ...

  ... 3 * y ...

if x is 3.


#20

The issue is that sometimes people write down stuff that is in essence defining a kind of likelihood, and in that case, the Jacobian correction message is wrong, there is no Jacobian that should be involved. But, this case could only happen when Data values are involved and so checking to see whether a data value is involved could help with the warning message.

Likelihoods come about when you assign a probability distribution over the data conditional on parameter values, and then evaluate it at the data points. When you do this, the resulting thing is just a number. When you allow the parameters to vary, it’s a function of the parameters which multiplies the prior, and after renormalization, provides the posterior. No Jacobian is involved, because the thing assigns probability measure in data space and the data values are constant. If you want to get a probability valued result you’d need to multiply by an infinitesimal interval of data space dData and since you’ve assigned this probability as P(Data|Parameters)dData the dData is constant and there’s no Jacobian correction. So, p(Data | Parameters) evaluated at particular Data values is just a function of the Parameters that multiplies the prior, downweighting regions of bad fit.

My “likelihoodish” examples come about when instead of knowing what probability distribution to assign to the data, we know what probability distribution to assign to some function of the data, a function that might be parameterized by parameters. When we insist on doing this, we are then forced to accept that there is an implied “likelihood” over the data space, which is the push-forward measure of the intermediate quantity (or to check this assumption and reject it and alter our model). But, again, the intermediate quantity is just evaluated at given values, by plugging in the data, and is conditional on the given values of the parameters, so they are also plugged in:

p(F(Data) | Parameters)dF = Some_distribution() dF (BY ASSUMPTION)

Is a statement that “the Data takes on the push-forward measure defined by pushing the interval of width dF backwards through the Inverse transformation of F, whatever that push-forward measure is”

Since it’s evaluated at given Data and Parameters values, it is again the case that F is really just a kind of data. To get things into probability units, you’d multiply p(F(Data)|Parameters) by a fixed dF. Of course, this fixed dF implies a variable sized dData, but in assigning a probability to “F space” for fixed dF you must implicitly accept whatever the measure is in data space, which includes the affect of the inverse F transform on the size of the dData interval.

Also, since Parameters are given in the conditioning, F can depend on those parameter values too F(Data,Parameters) and if this occurs, Stan will throw warnings because it misunderstands what you’re doing since you’ve got parameter relationships on the left hand side of a sampling statement.

This is similar to what @James_Savage was doing, where some modeling step has something like

F(Data,Parameters) = G(Covariate,ParticularParameter) + error

If you can separate this into

Data = G(Covariate,ParticularParameter)+H(Parameters)+error

then obviously you can get an explicit likelihood statement:

Data ~ SomeDIstribution(G,H,etc)

which eventually is what I showed James, and that solved his problem. He’d just failed to solve for the data and used some expression that he had written down on paper probably as-is.

But if you can’t separate Data onto one side of an equation, and you do:

F(Data,Parameters) ~ SomeDIstribution(G,Constants,etc)

This is an “implicit likelihood” because it is the statement that you believe Data takes on whatever the push-forward measure is when you push SomeDistribution backwards through the inverse transformation of F (provided it exists and is measureable etc)

As in any choice of likelihood you should check what your statement implies about Data. You can do this by creating an F parameter with the same distribution, sample that F, and then numerically solve for Data from the samples. Just as you don’t do a jacobian correction when you sample a normal random variable and then exponentiate it to get a lognormal one, you don’t apply any jacobian correction to the stated F distribution because the F space is where you’ve assigned a very specific distribution, and whatever you get when you invert F is what that assignment implies about your predictive distribution for Data. This is why I say you have to be sure the dog is wagging the tail. Whatever space you assign probability is given a specific assigned distribution and a “Jacobian correction” to this assignment means you’re changing your assigned distribution which is to say you’re changing your model.

Nevertheless Stan’s Heuristic doesn’t handle this so well and the message leads people down the garden path towards looking for a Jacobian correction that actually would alter the meaning of their model if they carried it out by rote.

On the other hand, if Stan detects that data appears as part of what’s on the left hand side of the sampling statement, then perhaps this kind of thing is what is going on, and the heuristic involved could alter the message so as not to lead people down a particular garden path… a message that says something closer to "Stan doesn’t know what you’re doing here but you should check to see whether you’re defining a prior, or defining a pseudo-likelihood. Plus, then perhaps link them to a section of the manual where a version of this whole discussion is explained. (and, I realize that @betanalpha as far as I know still thinks I’m crazy here, but perhaps by more explicitly trying to put the explanation of what I’m talking about into measure theory language we’ll finally wind up with a meeting of the minds).

EDIT: also I should say that of course this doesn’t always work perfectly, you don’t always get a valid push-forward measure from intermediate F space going back through an invertible measurable transform to Data space. In that case, you can still potentially get inference on parameters for fixed Data, but you have a non-generative model for Data you can’t sample F and get meaningful Data values from it. For example a weather model where you use F is the global average temperature and global average precipitation… that’s not enough info to back out the actual weather at every point on the globe. You can still get inferences on parameters though, that’s what the ABC method is really all about.


#21

I agree that we should improve the way the Jacobian (and other) warnings work to make it clearer they’re based on heuristics and the user needs to exercise judgement.

A Jacobian adjustment involving only data will be a constant, so it shouldn’t alter the result from sampling. This is related to the normal of a log-transformed variable and a lognormal—the lognormal builds in the Jacobian. Specifically,

lognormal(y | mu, sigma)
= normal(log(y) | mu, sigma) |d.log(y) / d.y|
= normal(log(y) | mu, sigma) (1 / y)

But since that Jacobian term (1 / y) only involves y, you can get away with coding this in Stan either way if y is data. If y is a parameter, you get different results with lognormal(y | mu, sigma) and normal(log(y) | mu, sigma).


#22

Is there a section of the manual that you could point to? Preferably with an example where you do need to add a Jacobian and one where you don’t. (Given that most people don’t understand how to do a change of variables, or when they need to do one…)


#23

Yes, there’s a change of variables chapter with contrasting examples where you do and don’t need Jacobians. But you’re right—it’s vexing when it gets to multiple dimensions.

The standard problem users run into is when they define two parameters, but want to put a prior on some function of the two—it’s improper without a prior on one of the parameters or some other function. Conveying that intrinsic dimensionality argument and why you get ridges is also in the problematic posteriors chapter of the manual.


#24

Yes

a ~ normal(1,1)
b ~ normal(1,1)
a/b ~ normal(3,.2)

with a,b confined to positive values is a perfectly well defined prior for example, but you can think of it as independent (truncated) normals, multiplied by a further weighting function that describes the dependency structure between the two variables. If you try to do some kind of Jacobian “correction” to the last line you get a different model. What model do you want?

Also, in the lognormal example

y ~ normal(0,1);
ylognormal = exp(y);

gives ylognormal (a transformed parameter) the push-forward measure caused by pushing the normal through the exp function, whereas with ylognormal a parameter

log(ylognormal) ~ normal(0,1)

doesn’t give ylognormal a lognormal density, but, it does give it a density. So whether you need a Jacobian correction or not is down to which density you want to have on ylognormal. The reason it’s “obvious” which one we want is because we immediately assume something about motivations of the programmer. In a much more complicated model where motivations are not so clear, the importance of considering the fact that you have to think about what you want carefully and get the tail wagged by the dog is higher.


#25

Yup, not the right situation for a Jacobian. You only want the Jacobians when you have a parameter x and define y = f(x) for some monotonic invertible f and want to define p(x) in terms of p(y).