Estimate mean and SD of polygon area in Stan

I am trying to calculate the mean and standard deviation of the area of a polygon of which I have sample (x,y) coordinates of the polygon’s coordinates (but I don’t know the real (x,y) coordinates of the polygon’s corners).

This is my model, but I am not sure I am doing this write (it compiles and runs, but I think I am not getting values that make sense.

data {
  int<lower=0> N;            // Number of data points
  int<lower=0> C;
  matrix[C, N] x;               // x coordinates
  matrix[C, N] y;               // y coordinates
}

parameters {
  real<lower=0> mean_a;
  real<lower=0> sd_a;
}

model {
  real area = 0.5 * abs(sum(x[1:N-1,] .* y[2:N,] - y[1:N-1,] .* x[2:N,]));

  // Prior for mean_a and sd_a
  mean_a ~ normal(0, 50);
  sd_a ~ cauchy(0, 20);
  
  // Likelihood
  target += normal_lpdf(area | mean_a, sd_a);
}

and this is an example data set:

{
  "N": 3,
  "C": 4,
  "x": [
    [1, 2, 1],
    [4, 4.5, 4.7],
    [2, 1.5, 1.5],
    [3.8, 4.2, 4]
  ],
  "y": [
    [1, 2, 1],
    [4, 4.5, 4.7],
    [2, 1.5, 1.5],
    [3.8, 4.2, 4]
  ]
}

chatGPT helped with the model specification but it doesn’t look right…

Note that the area can be computed as a generated quantity conditional on estimates for the true positions of the coordinates and knowledge of which order to connect the coordinates in. So all you really need here is a model for the true locations of the coordinates, and then a rule for understanding what order to connect the coordinates in.

I don’t think that your question or Stan code provides a clear answer to three key questions:

  1. Given observations of the coordinates, what do you want to assume about the probable values of the true coordinates? For example, you might choose to assume that the observed values arise from a bivariate Gaussian (with known covariance matrix) centered on the true values, or something similar.
  2. Note that from the same set of coordinates, it is often possible to draw more than one polygon (i.e. if the polygon is not convex). Thus, you probably either need to assume convexity, or you need to assume that the order in which you connect the vertices is known a priori. Which would you like to assume?
  3. Is it possible that your polygon has a self-intersection or a hole in it?

Depending on your answers about convexity and holes, there might be a follow-up question or two to make sure your problem is well defined.

1 Like

Hi! Thanks for pointing out those questions! They are helpful.

For 1, I would say a bivariate Gaussian is a good assumption, in the absence of any more specific assumption.

For 2, the specific case I am thinking of has only one possible polygon/order of connecting the points (see below).

For 3, it follows from 2 that there are no self-intersections or holes.

Basically, the specific case I am thinking of is a set of F1 and F2 values of (for example) 6 peripheral vowels (F1 and F2 are the first and second formant frequencies). See figure below. Imagine I have a list of F1/F2 values of multiple tokens of each vowel because the real F1/F2 position of each vowel is not known. What I am trying to do is to obtain an estimate of the “vowel space” (i.e. the area of the polygon created by the vowels). Since I can only estimate the centroid of each vowel (from the F1/F2 values of each token) I would like for this uncertainty to be reflected in the estimate of the area (something like error propagation).

The standard way to model this would be to declare the unknown true positions as a parameter, put some prior on the unknown true positions, and then in the model block write that each observed position is a realization of a bivariate Gaussian centered on the true position. There is no information in the data to estimate the scale/covariance of this bivariate gaussian, so you’ll need to provide the assumed covariance matrix as data. Then (probably easier outside of Stan, though it can be done in Stan if you really want, and you’ll have to do it inside of Stan if you want to do something like putting a prior directly on the area), compute the area of the resulting polygon iteration-wise to get a posterior for the area.

However, while I know nothing about linguistics, I’d still caution that this notion of area might not behave intuitively for non-convex polygons, including the example you’ve shown. I don’t know the name of that i with a horizontal bar through it, nor how to type it, so I’m going to call it j.

For example, suppose a second language is identical to the one displayed in your figure, except that it lacks j. Do you really want to say that that second language occupies more vowel space than this one? One solution to eliminate this behavior would be to use a minimum convex hull.

Furthermore, you used some rule to decide to connect this polygon as a-e-i-j-u-o-a. You could have done something else, e.g. a-e-i-u-j-o-a, yieling a different area. You’ll need to make that rule explicit. Depending on what that rule is, and how uncertain the positions are, you should be aware that weirdness can happen. For example, if the rule is “always connect a-e-i-j-u-o”, what happens if the position of j is really uncertain, such that it might in fact be “southeast” of o? Either you’ll have a polygon with a self-intersection, or if you want to disallow self-intersections you’ll have a prior on the true locations that is truncated by the no-self-intersections constraint.

All of these problems vanish if you switch to a convex hull.

Thank you, that’s all very interesting! So, to make sure I understand, you are proposing to

  1. Estimate the position of each vowel with a bivariate Gaussian.
  2. Get the area of the minimum convex hull from each draw from 1. and that should give the posterior for the area.

Is this right?

To answer your questions on the specifics of the vowel. Now that you mention it, convex hulls are the standard in most acoustic work on vowels, so good idea to go that route.

As far as the order of the vowels, there is only one theoretically sensible order and for each set of vowels there is one and one one order (the space is defined only by the peripheral vowels, meaning the vowels that are at the periphery of the entire vowel space) so once you know which vowels are peripheral you know the order.

But I guess a convex hull would obviate the problem of having to specify the order.

Now I wonder. Would a multivariate model in brms work as well? (I am quite used to brms and just started musing about stan recently and thought that of course brms wouldn’t have been able to calculate the area as I wanted, but now that you suggest to “outsource” the area calculation then I can just do that in R).

Sorry for bothering you with all this and thank you for your help! :)

Yes, that’s right.

Whether and how to do this in brms depends on what you want the priors on the x and y values to be, and if the priors are hierarchical whether or not you want to pool information across x and y via the prior.

Epic, thanks! I am not sure how to go about fitting a bivariate Gaussian model in Stan. Is this an example in the right direction: Simon Jackman’s Bayesian Model Examples in Stan?

If you want to assume null covariance, then you can just fit the x’s and y’s as univariate gaussians and the result will be a bivariate gaussian. If you want to assume a nonzero covariance, then you might, for example, store your data as an array of vectors [x,y], and assume each of these arises from your desired multivariate normal 25.1 Multivariate normal distribution | Stan Functions Reference

1 Like