Prior on distances between latent locations in an MDS model

I am interested in building a multidimensional scaling-type model where the parameters are locations in a 2D space, and I’d like to be able to put a prior on the distances between the locations – something like the following:

data {
  int<lower=2> Nz;   // number of points
  ...
}
parameters {
  matrix[Nz, 2] z;   // 2-D location of each point
}
transformed parameters {
  vector[Nz * (Nz - 1) / 2] dist_vec;
  {
    int pos = 1;
    for (j in 2:Nz) {
      for (i in 1:(j - 1)) {
        dist_vec[pos] = distance(z[i], z[j]);   // Euclidean distance between points i and j
        pos += 1;
      }
    }
  }
}
model {
  dist_vec ~ std_normal();
  ...
}

Presumably this would require a Jacobian adjustment, but the number of z parameters (2 * Nz) is in general not equal to the length of dist_vec (Nz * (Nz - 1) / 2). Therefore it isn’t possible to write down a square Jacobian like in the example provided here: 24.3 Changes of variables | Stan User’s Guide. I’d guess this is straightforward for someone with better calculus/modeling knowledge than me. Is there any guidance in the docs for what to do in a situation like this? Is a model like this feasible?
Thanks so much for the help!

Edit:
It seems like this question was answered by @betanalpha’s response here. I’m a bit confused as to how adding the extra function(s) of the parameters would work. If there is an example of this anywhere, it would be much appreciated!


Sorry I won’t have the time to go into the nitty-gritty of your model, but since this has been here for a while I thought I’d chime in and maybe someone else can pick up where I left or correct me if I don’t get it right for your specific case.

I am assuming you tried just running what you implemented and you are getting the warning about the Jacobian correction for nonlinear transformations. I guess the summary of this warning is that when you transform a variable and specify a distribution for it, you end up with a different posterior from the one implied by the untransformed parameterization. But that may be ok, as long as the transformed-variable posterior makes sense.
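
For reference, the warning relates to the usual change-of-variables rule: for a one-to-one, differentiable transform y = f(\theta), placing a density p_y on y while sampling in \theta requires the density p_\theta(\theta) = p_y(f(\theta)) \left| \det J_f(\theta) \right|, which in Stan means adding the log absolute Jacobian determinant to target. The complication raised in this thread is that the map from the 2 * Nz position parameters to the Nz * (Nz - 1) / 2 distances is neither one-to-one nor square, so that rule does not apply directly.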

So an alternative parameterization of your model could be to specify the distances as parameters and place the priors directly on them (you’d have to deal with the change in the number of parameters when going from positions to distances, but assuming you could work with a matrix of distances, it would be equivalent). If the model you end up with after all of that is reasonable, you can just sample from that, or you can go back to your original formulation, but knowing that what you are actually sampling from is the reparameterized model.


This isn’t really an answer to the question, but rather a general set of cautions. Even if you can work out a correct Jacobian adjustment, it might not achieve what you intend here.

It’s important to keep in mind that distances in Euclidean space are constrained in weird ways. We can already see this with three points in a 2-D plane. Given the distance ab and the distance bc, the distance ac is constrained to be less than or equal to the sum ab + bc and greater than or equal to the absolute value of the difference |ab - bc|. As the number of points gets much larger than the dimensionality of the space, this constraint becomes ever more convoluted, because it’s hard to come up with a set of \frac{N(N-1)}{2} distances that are actually a valid set of distances between N points in the plane (see the small worked example after the list below). This has a few implications:

  • It means that we cannot readily declare the distances themselves as parameters and place priors directly on them, unless we can come up with some constraining transform that ensures that we end up with a set of valid distances. That seems like it’d be very, very hard to do.

  • It means that if we write distances ~ std_normal(); we will not end up with a standard normal joint prior on the distances, even if we apply the correct Jacobian adjustment. I doubt we would even get standard normal margins, though perhaps we would get very close (??). We certainly would not get a standard normal joint distribution, because most of this distribution will sit over disallowed distance combinations.
  • Therefore, it means that we will not be able to write down a generative prior for the distances.
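
To make the triangle-inequality constraint concrete, here is a small worked example with numbers chosen purely for illustration. Take a = (0, 0), b = (3, 0), c = (0, 4). Then ab = 3, ac = 4, bc = 5, and indeed |ab - bc| = 2 ≤ ac = 4 ≤ ab + bc = 8. By contrast, the candidate set ab = 1, bc = 1, ac = 3 cannot be realized by any three points in the plane, because ac > ab + bc; yet a joint standard normal on (ab, bc, ac) assigns positive density to that combination (and to negative values), which is one way to see why the pushforward cannot actually be standard normal.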

In fact, there’s yet another reason why we cannot write down a generative prior for the distances. For any set of points, we can apply any translation we wish and retain the same set of distances. Thus, in addition to a prior statement on the distances, to get a generative prior we’ll also need some kind of prior regularization of the location of the point cloud in the plane (e.g. priors on the positions of the points themselves). But this prior will interact with whatever prior we write down on the distances. The interaction might be subtle or even negligible (I think this becomes likely if the scale of the prior on z is large compared to the prior on the distances, but in this case the prior model becomes very weakly identified, since we can translate the points around with considerable flexibility). In other settings, the interaction might be very strong. You can see this intuitively if we imagine that we have z ~ normal(0, 0.1); and distance ~ std_normal();. No way will we end up with standard normal distances, because with that prior on z nearly all of the points sit within a few tenths of the origin, so nearly every pairwise distance will be only a few tenths at most.

I’m raising these points because one of the main motivations for applying Jacobian adjustments is to ensure that the prior pushforward on the transformed variable (distance in this case) actually corresponds to the distribution that is syntactically implied by the prior statement (e.g. target += normal_lpdf(distance | 0, 1); is supposed to yield a joint normal distribution for the prior pushforward of distance). Since this is guaranteed not to happen for you, it’s worth thinking very carefully about what you want this prior to achieve, and whether that would be properly achieved by a standard normal increment to the density accompanied by the Jacobian adjustment.

Since the prior-plus-Jacobian will not yield a generative prior no matter what, you can think of it more like a penalty term that penalizes model configurations with large distances. Seen this way, there’s nothing magical about working out the correct Jacobian; what you really want is to increment the target with something in place of the prior-plus-Jacobian that induces a reasonable penalty. You’ll be helped here by the symmetry of the transform; I’m pretty sure you can expect that penalizing distances won’t distort the underlying points anisotropically, for example. So you could tinker with some penalties, find one that seems to yield reasonable prior pushforwards, and just roll with it.
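
To illustrate that kind of tinkering, here is a minimal prior-only sketch (no data model at all) that could be used to inspect the pushforward of a candidate penalty. The normal(0, 5) prior on z and the standard-normal penalty on the distances are purely illustrative assumptions, and there is deliberately no Jacobian term, so this is a penalty rather than a generative prior on the distances.

data {
  int<lower=2> Nz;   // number of points
}
parameters {
  matrix[Nz, 2] z;
}
transformed parameters {
  vector[Nz * (Nz - 1) / 2] dist_vec;
  {
    int pos = 1;
    for (j in 2:Nz) {
      for (i in 1:(j - 1)) {
        dist_vec[pos] = distance(z[i], z[j]);
        pos += 1;
      }
    }
  }
}
model {
  // illustrative weak prior on the positions, keeping the point cloud near the origin
  to_vector(z) ~ normal(0, 5);
  // candidate penalty on the distances, applied without a Jacobian adjustment
  target += normal_lpdf(dist_vec | 0, 1);
}

Sampling this model with no likelihood and plotting the draws of dist_vec shows the pushforward the penalty actually induces; other candidate penalties can be swapped in and checked the same way.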

A final note: even if you put a penalty on the absolute magnitude of z in order to get a proper prior (i.e. one that doesn’t allow you to translate the point cloud out to infinity), you will still have a prior model that doesn’t identify the positions of the points up to rotations and reflections. For the purposes of efficient sampling, it might be wise to fix the positions of two points in z, which will preclude such non-identifiabilities. However, in so doing, you are implicitly setting a distance scale for the model that might conflict with the scale that you desire via the prior.
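
For what it’s worth, here is a minimal sketch of one way such anchoring could look. The particular anchor locations (the origin and (1, 0)) and the normal(0, 5) prior are purely illustrative choices, and, as noted above, fixing the anchors implicitly sets a distance scale.

data {
  int<lower=3> Nz;   // number of points (two anchors plus at least one free point)
}
parameters {
  matrix[Nz - 2, 2] z_free;   // coordinates of the unanchored points
}
transformed parameters {
  matrix[Nz, 2] z;
  z[1] = [0, 0];   // first anchor fixed at the origin
  z[2] = [1, 0];   // second anchor fixed on the positive x-axis
  for (k in 3:Nz) {
    z[k] = z_free[k - 2];
  }
}
model {
  // illustrative weakly informative prior on the free coordinates
  to_vector(z_free) ~ normal(0, 5);
  // ... likelihood or penalty terms involving z (or derived distances) would go here
}

Note that a reflection across the axis through the two anchors remains; one option is to additionally constrain the sign of one coordinate of a third point, which is in the spirit of the Bakker and Poole identification mentioned later in this thread.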


That’s a great point, and it really shows the complicated part that I didn’t go into at all. However, more than being constrained, the distances will actually be completely determined by the positions, as long as you are able to work out the full relationship, i.e. you need to be able to invert the transformation. I’d guess you could do that, at least up to the translation of all points / change of coordinate system mentioned by @jsocolar.

Again, without working out the details I’d say you don’t need that, as long as you fix the coordinates of a single point, which together with all distances would remove all degrees of freedom for translation. If all that matters is the distances, the specific choice of reference shouldn’t matter.

I completely agree with this, but it may also mean we are not being very formal about our posterior. Then again, practitioners (especially in Machine Learning) will use all kinds of penalties, from traditional Ridge/LASSO to very empirical ones like early stopping and dropout, so if you accept that MDS is more of a results-oriented approach than a formally-motivated one, you might as well go for a more empirical approach. The only caveat (that I can remember now, at least) when doing something like this is that you have to make sure it doesn’t generate weird or impossible assumptions (e.g. placing a prior with negative-only support on a transform that can only take positive values).

Thank you @caesoma and @jsocolar for the detailed responses! I probably should have provided more details here – I didn’t go into the nature of the prior on z that we were using in order to focus on the main question, but as you both mention, there are a number of non-identifiabilities in MDS that need to be considered. This question was, in fact, largely motivated by an attempt to implement the solution proposed by Bakker and Poole, which fixes one point at the origin and fixes one dimension of another point, as well as its direction on the other dimension (they also generalize their solution beyond 2D, but we’re focused on 2D here). However, implementing their solution in Stan, along with a weakly informative independent normal prior on the unconstrained coordinate locations, still often seems to result in multimodal posteriors (and therefore high R-hats), despite the theory.

I thought perhaps one way towards resolving some of the non-identifiabilities was regularizing the distances, hence putting the prior directly on the distances. So I’m not currently particularly concerned about the exact form of the prior (and I definitely should have mentioned this instead of putting std_normal in), but rather want to use it just as a penalty, as @caesoma suggested (although, assuming it works, it might be interesting to examine the resulting prior pushforward and see if it makes sense). As both of you mention, the distances themselves are highly constrained, so unfortunately the model has to be written this way, with the positions as parameters and the distances derived from them.

I imagined that possibly the Jacobian was necessary for the HMC sampler to work properly but, assuming I’m understanding both of you correctly, it sounds like we can still get a reasonable answer without it, if we don’t have an analytic prior we are particularly tied to (aka “not being very formal about our posterior”).

Yep, that’s right. If the standard normal density (without a Jacobian term) is a useful penalty term to apply to the distances… well then it’s a useful penalty.

If, on the other hand, your intuition is that the useful penalty is the thing that ensures that the distribution of the distances is proportional to a standard normal density everywhere within the range of the transform, then to achieve that penalty you would need to adjust for the curvature with a Jacobian.

So I definitely take both of your points that in this situation the Jacobian might not really be worth it, since the combined Gaussian prior + constraints is not a distribution we have a great intuition for. I don’t want to waste anyone’s time but, for my own education, I’d definitely be curious how it might look to construct the Jacobian in this case, because it’s not obvious to me how to make it a square matrix. If anyone has any examples of the approach suggested by @betanalpha (adding additional functions and then marginalizing), I’d definitely appreciate it.