Disk galaxy rotation modeling with Stan

techniques
fitting-issues

#1

I’m just doing this for fun, so advice is welcome but not at all urgent.

A paper titled “2D Bayesian automated tilted-ring fitting of disk galaxies in large HI galaxy surveys: 2DBAT” showed up on arxiv.org a few weeks ago that addressed a problem I’ve been thinking about off and on for a while. I liked the idea of what the authors are doing, but I didn’t like the implementation. So I decided to try my own in Stan. If you take a look at the paper equations 1-4 describe the model; I’d suggest not reading past section 3.1.

For a little bit of context disk galaxies are usually assumed to be round, with stars and gas in more or less circular orbits around the center. If a galaxy is inclined to the plane of the sky it appears elliptical – assuming it’s actually round you can deduce the inclination angle. The other relevant angle is the orientation on the sky, which is conventionally defined to be the angle measuring counterclockwise from north to the apparent major axis. With spatially resolved spectroscopy you can measure a 2D field of radial velocities – the data I have came from SDSS-IV MaNGA. The problem is to derive the rotation curve from the velocity field.

Here is my current Stan model, which slightly simplifies the model described in the paper:

functions {
  real sump(vector c, real r, int order) {
    real s = c[1];
    for (i in 2:order) {
      s = s + c[i]*r^(i-1);
    }
    return s;
  }
}

data {
  int<lower=1> order;  //order for polynomial representation of velocities
  int<lower=1> N;
  real x[N];
  real y[N];
  real v[N];  //measured los velocity
  real dv[N]; //error estimate on v
  int<lower=1> N_r; //number of points to fill in
  real r_post[N_r];
}

parameters {
  unit_vector[2] phi;
  unit_vector[2] incl;  //disk inclination
  real x_c;  //kinematic centers
  real y_c;
  real v_sys;     //system velocity offset (should be small)
  real v_los[N];  //latent "real" los velocity
  vector[order] c_rot;
  vector[order] c_exp;
  real<lower=0.> sigma_los;
}

transformed parameters {
  real<lower=0.> r[N];
  
  for (i in 1:N) {
    r[i] = sqrt((-(x[i] - x_c) * phi[1] + (y[i] - y_c) * phi[2])^2 +
                (((x[i] - x_c) * phi[2] + (y[i] - y_c) * phi[1])/incl[2])^2);
  }
}

model {
  x_c ~ normal(0, 2);
  y_c ~ normal(0, 2);
  v_sys ~ normal(0, 50);
  sigma_los ~ normal(0, 20);
  for (i in 1:order) {
    c_rot[i] ~ normal(0, 100);
    c_exp[i] ~ normal(0, 100);
  }
  
  for (i in 1:N) {
    v[i] ~ normal(v_los[i], dv[i]);
    v_los[i] ~ normal(v_sys + incl[1] * (
                      sump(c_rot, r[i], order) * (-(x[i] - x_c) * phi[1] + (y[i] - y_c) * phi[2])  + 
                      sump(c_exp, r[i], order) * (-(x[i] - x_c) * phi[2] - (y[i] - y_c) * phi[1]) / incl[2]),
                      sigma_los);
  }                                           
}

generated quantities {
  real v_rot[N];
  real v_exp[N];
  real v_model[N];
  real vrot_post[N_r];
  
  for (i in 1:N) {
    v_rot[i] = fabs(r[i]*sump(c_rot, r[i], order));
    v_exp[i] = r[i]*sump(c_exp, r[i], order);
    v_model[i] = v_sys + incl[1] * (
                      sump(c_rot, r[i], order) * (-(x[i] - x_c) * phi[1] + (y[i] - y_c) * phi[2])  + 
                      sump(c_exp, r[i], order) * (-(x[i] - x_c) * phi[2] - (y[i] - y_c) * phi[1]) / incl[2]);
  }
  for (i in 1:N_r) {
    vrot_post[i] = fabs(r_post[i] * sump(c_rot, r_post[i], order));
  }
}

I quickly discovered that making the two relevant angles parameters won’t work, so those are both declared as 2D unit vectors instead. The velocities are measured with error and I have estimates for the uncertainties, so I follow the measurement error discussion in the Stan language manual and model the observations as coming from normal distributions with unknown means and known standard deviations. Those latent means are then modeled as coming from a gaussian with mean described by equations 1-3 in the arxiv paper and for simplicity constant variance, which I model.

I decided to declare the deprojected radius in the transformed parameters block. I also tried declaring it as local variables in the model block but it seemed to make no difference to performance. The other parameters in the model are the cartesian coordinates of the kinematic center, which should be small adn the overall system radial velocity which should also be small (velocities are measured relative to the overall system redshift). The authors of the arxiv paper used splines for the radial and peculiar velocity components. For a first pass I just use a low order polynomial, so the coefficients are the parameters.

I’ve tried this on 3 data sets so far and it works, but it’s very fragile. I find I have to fiddle the target acceptance rate and maximum treedepth – setting adapt_delta in the range 0.95 - 0.975 and max_treedepth=15 usually works. It often takes several runs with different random initializations to get reasonable looking results, but it isn’t too difficult to tell when the results look reasonable.

Here are some results from one successful run. First the measured velocity field and the (posterior mean) modeled velocities:

The thing you’re most interested in is the rotation curve. Here is the joint density of the deprojected radius and rotation velocity:

Both the magnitude and shape of the curve are very reasonable, but I haven’t seen independent estimates in the literature for this particular galaxy.

Here’s one traceplot for the first element of the parameter phi:


These are inherently multimodal and in this run fortunately both modes were discovered well within the warmup period.

Questions:

  1. Is there an obvious reparametrization of this model that would make it less fragile?

  2. The general “tilted ring” model allows the inclination and orientation angles to vary with radius, and the arxiv paper uses splines for that purpose. I don’t see an obvious way to do something equivalent with unit vectors.


#2

is this “inherent multimodality” based on a symmetry, like you can have the unit vector pointing in direction A or direction -A and it means the same thing? Perhaps you should put a strong prior on say phi[1] to ensure you get only the positive version:

phi[1] ~ normal(0.5,.3);

or something like that.


#3

Yes, exactly. If you look at the first image I posted, by the astronomers’ convention the angle from north to the point of maximum recession velocity is, by eye, about 30° so phi[i] should be around 0.5, and sure enough that’s where one out of 4 chains settles down to. But the algorithm doesn’t care about the convention and -0.5 works too provided the radial velocity flips sign as well (or in this model the coefficients of the polynomial representation). In the successful runs it appears that the parameters representing angles settle down to one or both modes before adaptation ends and the velocity coefficients likewise find the right mode(s).

Anyway I may have been too quick to dismiss using angles as parameters. My mistake was to declare them as bounded variables. What works instead is to leave them unbounded and, as you suggest, use a tight enough prior to keep them from hopping modes.

Thanks.


#4

In general it’s better to eliminate the symmetry either through hard restrictions, or “soft” restrictions of a prior, because you will ultimately get more sensible inference, larger effective sample size, meaningful mean values, etc. If you’re combining inference from 4 chains, and 3 of them found one mode, and 1 found another… unless you post-process to remove the symmetry, you will have a hard time doing anything useful with the combined chains for example (means and medians for example will be meaningless)

If you declare the bounds for your angle as something like 0,pi I would expect that to work, but in addition providing some prior on top of the bounds could still be useful. If the unbounded variable + prior works well that could be fine also.


#5

Just throwing this out here, but I don’t think hard restrictions or soft restrictions solve this problem. Ofc. the only practical solution right now I know of is hard constraints or soft constraints haha.

I don’t think the hard restrictions solve the problem because symmetry in these problems seems like it’s a function of the solution.

Like in this case, if A is a solution, A rotated 180 degrees is a solution. Which seems like a flip across an axis (of which you could conceivably use to cut parameter space in two), but it’s a different axis for every different angle and isn’t something you can set before hand.

If you say the angle is between 0 and 180, then what happens if it’s a normal looking lump around 180? It’s like the transformed space just made something that was unimodal multimodal (think this has come up somewhere on the forums here).

Also it’s hard to use a soft prior for the same reasons it’d be hard to set up a hard one. You really don’t know where stuff is.

So it seems like it’d be nice to reduce all the multimodal answers to one answer, but I just don’t think that’s possible cause it’s not easy to mark out any sorta boundary that’ll work cleanly.

The thing that isn’t so bad in these systems sometimes is computing the distances between solutions under symmetry. So in this case, you could take two angles A and B, and then compare A with B and then A with B + 180 degrees and take the minimum distance.

It seems like there could be some sorta diagnostic set up on these distances, but I dunno.


#6

True, though sometimes the boundaries are really meaningful, like if you’re talking about observations through a telescope, you can’t look down through the center of the earth and see a star, so you know that the angle is always somewhere between 0 meaning “towards the north pole” and pi meaning “towards the south pole”. Even though in a pure math sense you could describe the location in terms of a vector pointed towards the ground, and a negative distance.

On the other hand, other times you can’t use those kinds of simplifications and you’re right, the topology of a circle is not the same as the topology of a line. If the symmetry is purely formal, then restricting yourself to one of the modes is purely a formal choice, and so you can for example fit your model, discover the location of the stuff, and then restrict your parameters to identify a single mode and refit. If you use the poor heuristic of “don’t use the data twice” then this might seem non-kosher, but if you understand that what you’re doing is applying real prior knowledge that all of the modes are equivalent (or any other kind of prior knowledge you have, like only one of the modes makes sense in some other reason) then this technique simply allows you to engage that prior knowledge after you’ve discovered the necessary information required to engage it.


#7

Thanks for the discussion. I’m marking this as solved for now because @dlakelan gave me the hint I needed.

There are two angles in the model. One of them measures the orientation of the disk relative to a more or less arbitrary north. In the second of the likelihood statements (the one with v_los[i] ~ on the left) there’s a rotation matrix that’s transforming from one cartesian coordinate system to another. In the transformed coordinate system the long axis of the disk aligns with the y axis and +y is the receding side. That’s unproblematic in practice, I think. The reason is because you can plot the velocities and eyeball that angle with reasonable precision, and use that to define a fairly tight prior. That might be cheating a little bit but I’m comfortable with that.

The other angle is an inclination, and that’s the troublesome one. First, I think it’s only determined modulo pi and as far as I know you can’t force that if your parameter is a unit vector. You could declare it to be a real with limits [0, pi()] or [-pi()/2, pi()/2] but that doesn’t work either because the model breaks down at inclinations near 0 or pi/2. If the inclination is 0 you’re looking at the disk exactly face on. All you can measure are radial velocities, so any net velocities you measure are bulk motions perpendicular to the disk. That might be interesting but it’s not rotation which isn’t accessible at all. If the disk is exactly edge on you’re measuring rotation velocities directly and the model simplifies considerably, but the model formulation here is wrong.

The final consideration here is the third term in the likelihood has a factor 1/cos(i), the purpose of which is to stretch the ellipse back to a circle. cos(i) < 0 just has the effect of flipping the transformed x axis, which we don’t need to do. So my solution is to replace the parameter representing inclination with a new one declared as real<lower=0., upper=1> si, representing the sine of the inclination. I use the elementary trig identity cos(i) = sqrt(1. - si^2) in the likelihood. This eliminates the sign ambiguity on this angle, which also eliminates the sign ambiguities on the coefficients of the two velocity components.

Here’s the parameter and model blocks of the revised model. I pass the best guess of the orientation angle and its uncertainty as data and that seems sufficient to keep the rotation well behaved. The logit transform of si should be well behaved because inclinations where this model works well would typically be in the range 30-60 degrees or so.

parameters {
  real phi;
  real<lower=0., upper=1.> si;  //sine disk inclination
  real x_c;  //kinematic center
  real y_c;
  real v_sys;     //system velocity offset (should be small)
  real v_los[N];  //latent "real" los velocity
  vector[order] c_rot;
  vector[order] c_exp;
  real<lower=0.> sigma_los;
}

transformed parameters {
  real<lower=0.> r[N];
  
  for (i in 1:N) {
    r[i] = sqrt((-(x[i] - x_c) * sin(phi) + (y[i] - y_c) * cos(phi))^2 +
                (((x[i] - x_c) * cos(phi) + (y[i] - y_c) * sin(phi))/sqrt(1-si^2))^2);
  }
}

model {
  phi ~ normal(phi0, sd_phi0);
  x_c ~ normal(0, 2);
  y_c ~ normal(0, 2);
  v_sys ~ normal(0, 50);
  sigma_los ~ normal(0, 20);
  for (i in 1:order) {
    c_rot[i] ~ normal(0, 100);
    c_exp[i] ~ normal(0, 100);
  }
  
  for (i in 1:N) {
    v[i] ~ normal(v_los[i], dv[i]);
    v_los[i] ~ normal(v_sys + si * (
                      sump(c_rot, r[i], order) * (-(x[i] - x_c) * sin(phi) + (y[i] - y_c) * cos(phi))  + 
                      sump(c_exp, r[i], order) * (-(x[i] - x_c) * cos(phi) - (y[i] - y_c) * sin(phi)) / sqrt(1-si^2)),
                      sigma_los);
  }                                           
}

I’ve only tried this on one data set, but sampling went considerably faster than previously, effective sample sizes with 4 chains and 1000 post warmup iterations were in the range 2500-4000 for just about all parameters and generated quantities, and no warnings were thrown. I may be able to adjust control parameters back towards default values too – the maximum treedepth actually used post-warmup was 9, although adaptation is still pretty slow.


#8

If you’re interested in making this model faster, there’s a lot of places you can vectorize operations and avoid recomputing quantities.


#9

Vectorizing reduced execution time by at least 50%, so yes thanks it was well worthwhile. It also seemed to reduce the variance in execution time across chains, although maybe that was a coincidence.

Shorter version of what I wrote the other day: I found a mix of soft constraints in the form of fairly tight priors and hard constraints in the form of bounds on one angle solved my convergence issues. That probably doesn’t generalize to other models involving angles.


#10

can you post the vectorized code for comparison?


#11

Sure.

functions {
  vector sump(vector c, vector r, int order) {
    int N = num_elements(r);
    vector[N] s;
    for (n in 1:N) {
      s[n] = c[1];
      for (i in 2:order) {
        s[n] = s[n] + c[i]*r[n]^(i-1);
      }
    }
    return s;
  }
}

data {
  int<lower=1> order;  //order for polynomial representation of velocities
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector[N] v;  //measured los velocity
  vector[N] dv; //error estimate on v
  real phi0;
  real<lower=0.> sd_phi0;
  real si0;
  real<lower=0.> sd_si0;
  int<lower=1> N_r; //number of points to fill in
  vector[N_r] r_post;
}

parameters {
  real phi;
  real<lower=0., upper=1.> si;  //sine disk inclination
  real x_c;  //kinematic centers
  real y_c;
  real v_sys;     //system velocity offset (should be small)
  vector[N] v_los;  //latent "real" los velocity
  vector[order] c_rot;
  vector[order] c_exp;
  real<lower=0.> sigma_los;
}

transformed parameters {
  vector<lower=0.>[N] r;
  
  for (i in 1:N) {
    r[i] = sqrt((-(x[i] - x_c) * sin(phi) + (y[i] - y_c) * cos(phi))^2 +
                (((x[i] - x_c) * cos(phi) + (y[i] - y_c) * sin(phi))/sqrt(1-si^2))^2);
  }
}

model {
  phi ~ normal(phi0, sd_phi0);
  si ~ normal(si0, sd_si0);
  x_c ~ normal(0, 2);
  y_c ~ normal(0, 2);
  v_sys ~ normal(0, 50);
  sigma_los ~ normal(0, 20);
  c_rot ~ normal(0, 100);
  c_exp ~ normal(0, 100);
  
  v ~ normal(v_los, dv);
  v_los ~ normal(v_sys + si * (
                  sump(c_rot, r, order) .* (-(x - x_c) * sin(phi) + (y - y_c) * cos(phi))  + 
                  sump(c_exp, r, order) .* (-(x - x_c) * cos(phi) - (y - y_c) * sin(phi)) / sqrt(1-si^2)),
                  sigma_los);
}

generated quantities {
  vector[N] v_rot;
  vector[N] v_exp;
  vector[N] v_model;
  vector[N] v_res;
  vector[N_r] vrot_post;
  
  v_rot = fabs(r .* sump(c_rot, r, order));
  v_exp = r .* sump(c_exp, r, order);
  v_model = v_sys + si * (
                      sump(c_rot, r, order) .* (-(x - x_c) * sin(phi) + (y - y_c) * cos(phi))  + 
                      sump(c_exp, r, order) .* (-(x - x_c) * cos(phi) - (y - y_c) * sin(phi)) / sqrt(1-si^2));
  v_res = v - v_model;
  vrot_post = fabs(r_post .* sump(c_rot, r_post, order));
}

I also tried an expanded transformed parameters block that make the code more transparent and allowed less duplication, but for some reason this makes performance worse:

transformed parameters {
  vector[N] xp;
  vector[N] yp;
  vector<lower=0.>[N] r;
  
  xp = -(x - x_c) * sin(phi) + (y - y_c) * cos(phi);
  yp = -(x - x_c) * cos(phi) - (y - y_c) * sin(phi);
  
  r = sqrt(xp .* xp + yp .* yp / (1.-si^2));
}


#12

What you want to do is save -(x - x_c) and reuse it for both xp and yp. Probably not a big deal.