Constraints on a transformation of multiple parameters

Thanks for the help! That constraint makes a lot of sense, and it works.

I’m trying to reproduce the results from a paper and uniform(1, 200) is what they use. I’ll try out your suggestion though and see what difference that makes.

@nhuurre I’m trying to add another layer to the model, and y has been moved to the transformed parameters block. I think this means that p_phi0 has to be moved there as well, because its lower bound depends on y, which isn’t available in the parameters block. Now my problem is that the transformed parameters block doesn’t enforce the constraints. How would I go about solving this? Thanks!

So, if I’m reading this right, you transform the observed proper motion to galactic coordinates and y[1] is the speed in the galactic plane, y[2] is the total speed, and y[3] is some distance (from the galactic center?).
You didn’t explain what change requires y to be a transformed parameter but my first guess is that you’re modeling the uncertainty in proper motion by making pmra_true a parameter. If that’s the case then I think it would be easier to parametrize the velocity in galactic coordinates and calculate the proper motion as a transformed parameter. That way the constraints are more natural.
Secondly, what is the physical mechanism that enforces the constraint E > 0?
Is it an inviolable law with no exceptions, or a probabilistic one where exceptions are possible but very unlikely? If it’s the latter then consider omitting the constraint and modeling the mechanism directly.

The observed proper motions are transformed into spherical velocities in galactic coordinates (output of the transform_vels function) and then retransformed into y because the distribution function takes in y. y[1] is the total speed, y[2] is the tangential speed relative to the galactic center, and y[3] is the distance from the galactic center.

Yes, you’re totally right—I’m trying to incorporate uncertainties. I’ve included some code below that probably explains things better than I can. The main change is that I have some pmra_err data coming in, and instead of transforming the pmra_true values to get the velocities, I use a new pmra parameter which is normally distributed around pmra_true. Then these velocities are put into y in the same way as before.

I don’t have a great understanding of this either, but E is the relative energy, and for E<=0 (sorry this should be <= and not <) the distribution function should be 0. From what I can tell, there is an assumption that p_phi0 is chosen such that the DF is 0 when E<=0, so the range of support of the DF is only for E>0. This is what I was trying to do with the negative_infinity() thing, which I guess is supposed to be 0. I just tried it again with 0, and this also does not work (“Stan can’t start sampling from this initial value”).

functions {
    
    real df_lpdf(real[] y, real phi0, real g, real b, real a) {

        // y[1] = v, y[2] = v_t, y[3] = r;
        real E = -square(y[1]) / 2. + phi0 / pow(y[3], g);
        real L = y[3] * y[2];

        // terms in numerator
        real num_t1 = -2 * b * log(L);
        real num_t2 = (b * (g - 2.) / g + a / g - 3. / 2.) * log(E);
        real num_t3 = lgamma(a / g - 2. * b / g + 1.);
        
        // terms in denominator
        real denom_t1 = 0.5 * log(8. * pow(pi(), 3.) * pow(2., -2. * b));
        real denom_t2 = (-2. * b / g + a / g) * log(phi0);
        real denom_t3 = lgamma(b * (g - 2.) / g + a / g - 1. / 2.);
        
        real numerator = num_t1 + num_t2 + num_t3;
        real denominator = denom_t1 + denom_t2 + denom_t3;

        // this ~should~ never happen
        if (E < 0.) {
            return 0.;
        }         

        return numerator - denominator;
    }

    real phi_lower(real[,] y, real g) {
      real p[size(y)];
      for (i in 1:size(y)) {
        p[i] = 0.5 * square(y[i,1]) * pow(y[i,3], g);
      }
      return max(p);
    }

    real shifted_gamma_lpdf(real y, real a, real b, real shift) {
        real shifted_y = y - shift;
        return a * log(b) - lgamma(a) + (a - 1) * log(shifted_y) - b * shifted_y;
    }

    vector get_angles(real x, real y, real z) {
        real projR = sqrt(square(x) + square(y));
        real R = sqrt(square(x) + square(y) + square(z));

        real sintheta = y / projR;
        real costheta = x / projR;

        real sinphi = projR / R;
        real cosphi = z / R;

        return to_vector([sintheta, costheta, sinphi, cosphi]);
    }

    vector transform_vels(real ra, real dec, real pmra, real pmdec, real vlos, real plx, real sintheta, real costheta, real sinphi, real cosphi) {
        matrix[3,3] T = [
            [-0.06699, -0.87276, -0.48354],
            [ 0.49273, -0.45035,  0.74458],
            [-0.86760, -0.18837,  0.46020]
        ];
        real deg2rad = pi() / 180.;
        real sinra = sin(ra * deg2rad);
        real cosra = cos(ra * deg2rad);
        real sindec = sin(dec * deg2rad);
        real cosdec = cos(dec * deg2rad);
        matrix[3,3] A = [
            [cosra * cosdec, -sinra, -cosra * sindec],
            [sinra * cosdec,  cosra, -sinra * sindec],
            [        sindec,      0,          cosdec]
        ];
        matrix[3, 3] B = T * A;
        real k = 4.74057;
        vector[3] solarmotion = to_vector([11.1, 232.24, 7.25]); // including rotation

        vector[3] dat = to_vector([vlos, k * pmra / plx, k * pmdec / plx]);
        vector[3] uvw = B * dat + solarmotion;

        matrix[3, 3] ptz_mat = [
            [ costheta, sintheta, 0],
            [-sintheta, costheta, 0],
            [        0,        0, 1]
        ];
        vector[3] ptz = ptz_mat * uvw;

        matrix[3, 3] rtp_mat = [ 
            [sinphi, 0,  cosphi], 
            [     0, 1,       0],
            [cosphi, 0, -sinphi]
        ];
        vector[3] rtp = rtp_mat * ptz;

        return rtp / 100;
    }

}

data {

    int<lower=1> N; // total length of data
    
    vector[N] ra;
    vector[N] dec;
    vector[N] plx;

    vector[N] Xgc;
    vector[N] Ygc;
    vector[N] Zgc;

    // measurements
    vector[N] pmra_true;
    vector[N] pmdec_true;
    vector[N] vlos_true;
    vector[N] r_true;
    
    // measurement uncertainties
    vector[N] pmra_err; 
    vector[N] pmdec_err;
    vector[N] vlos_err;
    vector[N] r_err;

}

transformed data {
    vector[N] sintheta;
    vector[N] costheta;
    vector[N] sinphi;
    vector[N] cosphi;

    for (i in 1:N) {
        vector[4] angles = get_angles(Xgc[i], Ygc[i], Zgc[i]);
        sintheta[i] = angles[1];
        costheta[i] = angles[2];
        sinphi[i] = angles[3];
        cosphi[i] = angles[4];
    }
}

parameters {
    real<lower=0> p_gamma; 
    real<upper=1> p_beta;
    real<lower=(p_beta * (2. - p_gamma) + p_gamma / 2.)> p_alpha; 

    vector<lower=min(pmra_true) - 5 * max(pmra_err), upper=max(pmra_true) + 5 * max(pmra_err)>[N] pmra;
    vector<lower=min(pmdec_true) - 5 * max(pmdec_err), upper=max(pmdec_true) + 5 * max(pmdec_err)>[N] pmdec;
    vector<lower=min(vlos_true) - 5 * max(vlos_err), upper=max(vlos_true) + 5 * max(vlos_err)>[N] vlos;
    vector<lower=min(r_true) - 5 * max(r_err), upper=max(r_true) + 5 * max(r_err)>[N] r;
} 

transformed parameters {

    real y[N, 3]; // wrap up input data as (v, v_t, r)

    // the constraint doesn't work here. 
    real<lower=phi_lower(y, p_gamma)> p_phi0; // scale factor for gravitational potential 

    {
        for (i in 1:N) {
            vector[3] vels_sph = transform_vels(ra[i], dec[i], pmra[i], pmdec[i], vlos[i], plx[i], sintheta[i], costheta[i], sinphi[i], cosphi[i]); 
            y[i, 2] = sqrt(square(vels_sph[2]) + square(vels_sph[3]));
            y[i, 1] = sqrt(square(vels_sph[1]) + square(y[i, 2]));
            y[i, 3] = r[i];

        }
    }

   
}


model {

    for (i in 1:N) {
        pmra[i] ~ normal(pmra_true[i], pmra_err[i]); 
        pmdec[i] ~ normal(pmdec_true[i], pmdec_err[i]); 
        vlos[i] ~ normal(vlos_true[i], vlos_err[i]); 
        r[i] ~ normal(r_true[i], r_err[i]); 
    }

    p_gamma ~ normal(0.5, 0.06);
    p_phi0 ~ normal(0., 150.)T[0, 200];
    p_beta ~ uniform(-0.5, 1.);
    p_alpha ~ shifted_gamma(2.99321604, 2.82409927, 3.);

    for (i in 1:N) {
        y[i] ~ df(p_phi0, p_gamma, p_beta, p_alpha);
    }
    
}

generated quantities {
}

Ah, makes sense. I mixed up y[1] and y[2] and didn’t follow the details of the transform.

The correct representation of zero probability is negative_infinity() as lpdf is on the log-scale.
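
As a minimal sketch (the exponential density and the name positive_only_lpdf are only illustrative, not part of your model):

functions {
    // A log-density that is zero outside its support (here x > 0).
    // "Probability zero" on the log scale is negative_infinity();
    // returning 0 would instead mean a density of exp(0) = 1.
    real positive_only_lpdf(real x, real lambda) {
        if (x <= 0)
            return negative_infinity();
        return log(lambda) - lambda * x;  // exponential(lambda) log-density
    }
}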

That’s what I thought.
But note that the “data generating process” goes in the other direction. The observed star has some physical velocity, and when viewed from Earth the projection of that vector appears as proper motion. The proper motion is then measured with some error, so that the recorded value is a random number drawn from the distribution normal(pmra, pmra_err).

Stan allows you to model the observation process directly.

data {
  // measurements
  vector[N] pmra_true;
  vector[N] pmdec_true;
  vector[N] vlos_true;
  // uncertainties
  vector[N] pmra_err;
  vector[N] pmdec_err;
  vector[N] vlos_err;
  ...
}
parameters {
  vector[3] vels_sph[N];
}
transformed parameters {
  vector[N] pmra;
  vector[N] pmdec;
  vector[N] vlos;
  for (i in 1:N) {
    // the coordinate transform, basically transform_vels() backwards
    vector[3] pm = inv_transform_vels(vels_sph[i], ...);
    pmra[i] = pm[1];
    pmdec[i] = pm[2];
    vlos[i] = pm[3];
  }
}
model {
  for (i in 1:N) {
    pmra_true[i] ~ normal(pmra[i], pmra_err[i]); 
    pmdec_true[i] ~ normal(pmdec[i], pmdec_err[i]); 
    vlos_true[i] ~ normal(vlos[i], vlos_err[i]); 
  }
}

The advantage of this formulation is that now the tricky constraint is simply an upper bound on the length of the vector vels_sph.
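Spelled out with the quantities from df_lpdf, E > 0 means

    -|vels_sph|^2 / 2 + p_phi0 / r^p_gamma > 0,   i.e.   |vels_sph| < sqrt(2 * p_phi0 * pow(r, -p_gamma)),

which is the max_vel that appears in the code below.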
Such a constraint can be implemented by a custom transform, for example

parameters {
  vector[3] raw_vels[N];
}
transformed parameters {
  vector[3] vels_sph[N];
  for (i in 1:N) {
    real max_vel = sqrt(2*p_phi0*pow(r[i], -g));
    vels_sph[i] = max_vel * raw_vels[i]/sqrt(1+dot_self(raw_vels[i]));
  }
}
model {
  for (i in 1:N) {
    // implies uniform density for vels_sph
    real max_v2 = 2*p_phi0*pow(r[i], -g);
    target += 1.5*log(max_v2) - 2.5*log1p(dot_self(raw_vels[i]));
  }
}

And finally you have

y[1] = sqrt(dot_self(vels_sph[i]));
y[2] = sqrt(dot_self(vels_sph[i,1:2]));
y[3] = r[i];
target += log(fabs(vels_sph[i,3])) - log(y[1]) - log(y[2]);
y ~ df(p_phi0, p_gamma, p_beta, p_alpha);

The target+= statement is a Jacobian adjustment and it’s there because df() computes the density of y but we need to compute the density for the parameter vels_sph.

But note that the “data generating process” goes in the other direction.

Ok, this makes sense.

Sorry, I took a while to try to go through your solution but I’m having a hard time following. Here are a bunch of questions:

  • We can’t put the constraint on p_phi0, so we instead put an equivalent constraint on the velocity (just rearrange the inequality). That’s where max_vel comes from. Why can we not use that as the upper bound for raw_vels instead of doing the whole raw_vels to vels_sph thing?

  • I don’t understand the raw_vels thing and where the following two lines in your second code block came from:

     vels_sph[i] = max_vel * raw_vels[i]/sqrt(1+dot_self(raw_vels[i]));
    
     target += 1.5*log(max_v2) - 2.5*log1p(dot_self(raw_vels[i]));
    

    I tried un-logging the second one and got pow(max_vel, 3) * 1. / pow(sqrt(1. + dot_self(raw_vels[i])), 5) which is not quite the first one cubed… ?

  • Why can’t pmra just be a free parameter instead of being defined by the inverse transform?

  • I don’t understand where the idea for the inverse transform came from. Is this because now that the constraint goes the other way, the transform also has to go the other way?

I know that you came up with this pretty quickly so it’s probably obvious to you, but if you could walk through your thought process in coming to this solution that would be really really helpful. Thanks.

Stan interprets an upper bound on a vector to mean an upper bound on each component independently. So vector<lower=-1,upper=1>[3] would constrain the parameter to lie inside a cube, but for a length constraint it should lie inside a ball. A custom constraint transform is needed.
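
A minimal sketch of the componentwise version (the name v is just illustrative):

parameters {
    // each component is bounded separately: v is confined to the cube [-1, 1]^3,
    // not to the unit ball {v : |v| < 1}, which needs the custom transform below
    vector<lower=-1, upper=1>[3] v;
}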

The first line is some transform that restricts the length of the vector. The choice is somewhat arbitrary, but the important thing is that any value of raw_vels gives a vels_sph whose length is less than max_vel.

The second line is the log Jacobian determinant of the first line. The change of variables page I linked in the last post should explain the math behind it.
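
To sketch the algebra behind it (writing r for raw_vels[i], s = sqrt(1 + r·r), and v_max for max_vel):

$$
v = v_{\max}\,\frac{r}{s}, \qquad
\frac{\partial v}{\partial r} = \frac{v_{\max}}{s}\left(I - \frac{r r^{\top}}{s^{2}}\right), \qquad
\det\frac{\partial v}{\partial r} = \frac{v_{\max}^{3}}{s^{3}}\left(1 - \frac{r^{\top} r}{s^{2}}\right) = \frac{v_{\max}^{3}}{s^{5}}.
$$

Taking the log of the absolute determinant gives 3*log(max_vel) - 2.5*log1p(dot_self(raw_vels[i])), which is the same as 1.5*log(max_v2) - 2.5*log1p(dot_self(raw_vels[i])). So the un-logged quantity is the Jacobian determinant of the transform, not the transform itself cubed.
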
Now, Jacobians are tricky and I never get them right on the first try, so it’s a good idea to test them. Here’s what I did. The transform-only model is

parameters {
    vector[3] raw;
}
transformed parameters {
    vector[3] vel = raw / sqrt( 1 + dot_self(raw) );
}
model {
    target += -2.5 * log1p(dot_self(raw));
}

This model can be sampled from, and if the transform is correct, every vel vector must have length < 1. Also, because there is no prior or likelihood, the samples must be distributed uniformly in the allowed region.

The upper plot shows the samples in a 3D coordinate system. The lower plot compares the number of samples within r of the center to the volume of an r-radius ball. Because the number of samples is proportional to the volume, the distribution is uniform as expected.

If the Jacobian is wrong then either the distribution will be non-uniform or sampling will just fail.

It can be, but it’s easier to express the constraints on vels_sph. pmra and vels_sph cannot both be free parameters in the same model because that would imply they are independent; in fact they are coordinate transforms of each other.

Yes. In the original model transform_vels() takes the (known) proper motion and calculates the corresponding velocity in spherical coordinates. In this model you hypothesize a value for the velocity and need to calculate the predicted proper motion (which is then compared with the observed proper motion).

Feel free to ask more questions if anything is unclear.

Ok, assuming that I did everything correctly (code posted below), this unfortunately doesn’t seem to work that well. Most of the iterations are divergent and I also end up with a lot of exceeded tree depths even with adapt_delta=0.999. The traceplots are flat. From what I’ve read, a reparameterization seems to be the way to go? I don’t know enough to be comfortable doing this so I’d like to avoid this option if possible.

Is there a way to have the constraint in the probability distribution function itself? As in keep trying new parameters if the function returns negative infinity? Doing this with the previous model (without uncertainties, and taking out the constraint) didn’t seem to work too well when I tried it though, even with “safe” initial values (big p_phi0).

What would you suggest?

Thanks.

functions {
    
    real df_lpdf(real[] y, real phi0, real g, real b, real a) {

        // y[1] = v, y[2] = v_t, y[3] = r;
        real E = -square(y[1]) / 2. + phi0 / pow(y[3], g);
        real L = y[3] * y[2];

        // terms in numerator
        real num_t1 = -2 * b * log(L);
        real num_t2 = (b * (g - 2.) / g + a / g - 3. / 2.) * log(E);
        real num_t3 = lgamma(a / g - 2. * b / g + 1.);
        
        // terms in denominator
        real denom_t1 = 0.5 * log(8. * pow(pi(), 3.) * pow(2., -2. * b));
        real denom_t2 = (-2. * b / g + a / g) * log(phi0);
        real denom_t3 = lgamma(b * (g - 2.) / g + a / g - 1. / 2.);
        
        real numerator = num_t1 + num_t2 + num_t3;
        real denominator = denom_t1 + denom_t2 + denom_t3;

        // this ~should~ never happen
        if (E < 0.) {
            print("E: ", E, " v: ", y);
            return negative_infinity();
        }         

        return numerator - denominator;
    }

    real shifted_gamma_lpdf(real y, real a, real b, real shift) {
        real shifted_y = y - shift;
        return a * log(b) - lgamma(a) + (a - 1) * log(shifted_y) - b * shifted_y;
    }

    vector get_angles(real x, real y, real z) {
        real projR = sqrt(square(x) + square(y));
        real R = sqrt(square(x) + square(y) + square(z));

        real sintheta = y / projR;
        real costheta = x / projR;

        real sinphi = projR / R;
        real cosphi = z / R;

        return to_vector([sintheta, costheta, sinphi, cosphi]);
    }

    vector inv_transform_vels(vector vels_sph, real ra, real dec, real plx, real sintheta, real costheta, real sinphi, real cosphi) {
      matrix[3,3] T = [
          [-0.06699, -0.87276, -0.48354],
          [ 0.49273, -0.45035,  0.74458],
          [-0.86760, -0.18837,  0.46020]
      ];
      real deg2rad = pi() / 180.;
      real sinra = sin(ra * deg2rad);
      real cosra = cos(ra * deg2rad);
      real sindec = sin(dec * deg2rad);
      real cosdec = cos(dec * deg2rad);
      matrix[3,3] A = [
          [cosra * cosdec, -sinra, -cosra * sindec],
          [sinra * cosdec,  cosra, -sinra * sindec],
          [        sindec,      0,          cosdec]
      ];
      matrix[3, 3] B = T * A;
      real k = 4.74057;
      vector[3] solarmotion = to_vector([11.1, 232.24, 7.25]); 

      matrix[3, 3] ptz_mat = [
          [ costheta, sintheta, 0],
          [-sintheta, costheta, 0],
          [        0,        0, 1]
      ];

      matrix[3, 3] rtp_mat = [
          [sinphi, 0,  cosphi],
          [     0, 1,       0],
          [cosphi, 0, -sinphi]
      ];

      vector[3] rtp = vels_sph * 100.;
      vector[3] ptz = rtp_mat \ rtp;
      vector[3] uvw = ptz_mat \ ptz;
      vector[3] dat = B \ (uvw - solarmotion);

      real vlos = dat[1];
      real pmra = dat[2] * plx / k;
      real pmdec = dat[3] * plx / k;

      return to_vector([vlos, pmra, pmdec]);

    }
}

data {

    int<lower=1> N; // total length of data
    
    vector[N] ra;
    vector[N] dec;
    vector[N] plx;

    vector[N] Xgc;
    vector[N] Ygc;
    vector[N] Zgc;

    // measurements
    vector[N] pmra_true;
    vector[N] pmdec_true;
    vector[N] vlos_true;
    vector[N] r_true;
    
    // measurement uncertainties
    vector[N] pmra_err; 
    vector[N] pmdec_err;
    vector[N] vlos_err;
    vector[N] r_err;

}

transformed data {
    vector[N] sintheta;
    vector[N] costheta;
    vector[N] sinphi;
    vector[N] cosphi;

    for (i in 1:N) {
        vector[4] angles = get_angles(Xgc[i], Ygc[i], Zgc[i]);
        sintheta[i] = angles[1];
        costheta[i] = angles[2];
        sinphi[i] = angles[3];
        cosphi[i] = angles[4];
    }
}

parameters {
    real<lower=0> p_gamma;
    real<lower=0> p_phi0;
    real<upper=1> p_beta; 
    real<lower=(p_beta * (2. - p_gamma) + p_gamma / 2.)> p_alpha; 

    vector[3] raw_vels[N];
} 

transformed parameters {
  vector[N] pmra;
  vector[N] pmdec;
  vector[N] vlos;

  vector[3] vels_sph[N];

  
  for (i in 1:N) {
    real max_vel = sqrt(2. * p_phi0 * pow(r_true[i], -p_gamma));
    vels_sph[i] = max_vel * raw_vels[i] / sqrt(1. + dot_self(raw_vels[i])); 
  }

  for (i in 1:N) {
    vector[3] pm = inv_transform_vels(vels_sph[i], ra[i], dec[i], plx[i], sintheta[i], costheta[i], sinphi[i], cosphi[i]);
    vlos[i] = pm[1];
    pmra[i] = pm[2];
    pmdec[i] = pm[3];
  }
   
}


model {

  for (i in 1:N) {
    real max_v2 = 2. * p_phi0 * pow(r_true[i], -p_gamma);
    target += 1.5 * log(max_v2) - 2.5 * log1p(dot_self(raw_vels[i]));
  }

  for (i in 1:N) {
    pmra_true[i] ~ normal(pmra[i], pmra_err[i]); 
    pmdec_true[i] ~ normal(pmdec[i], pmdec_err[i]); 
    vlos_true[i] ~ normal(vlos[i], vlos_err[i]);
  }

  p_gamma ~ normal(0.5, 0.06);
  p_phi0 ~ uniform(1., 200.);
  p_beta ~ uniform(-0.5, 1.);
  p_alpha ~ shifted_gamma(2.99321604, 2.82409927, 3.);

  for (i in 1:N) {
    real y[3];
    y[1] = sqrt(dot_self(vels_sph[i]));
    y[2] = sqrt(dot_self(vels_sph[i,1:2]));
    y[3] = r_true[i];
    target += log(fabs(vels_sph[i, 3])) - log(y[1]) - log(y[2]);
    y ~ df(p_phi0, p_gamma, p_beta, p_alpha);
  }
    
}

You need parameter constraints that match the model’s support. Missing constraints can cause divergences. Since you have

model {
  p_phi0 ~ uniform(1., 200.);
  p_beta ~ uniform(-0.5, 1.);
  p_alpha ~ shifted_gamma(2.99321604, 2.82409927, 3.);
}

then you need

parameters {
    real<lower=1, upper=200> p_phi0;
    real<lower=-0.5, upper=1> p_beta; 
    real<lower=fmax(3., p_beta * (2. - p_gamma) + p_gamma / 2.)> p_alpha;
}

Not sure what’s up with that p_alpha lower bound but shifted_gamma requires p_alpha > 3.

Looks right to me. Note that the inverse of a rotation matrix is also its transpose so these could be

      vector[3] uvw = ptz_mat' * ptz;
      vector[3] dat = B' * (uvw - solarmotion);

which may be slightly faster.
The speed is not that important but I’m concerned that rtp_mat is not a rotation matrix:

      matrix[3, 3] rtp_mat = [
          [sinphi, 0,  cosphi],
          [     0, 1,       0],
          [cosphi, 0, -sinphi]
      ];

Can you double check? I think the cosines are usually on the diagonal and sines in the corners.

Hmm, that seems a little strange to me. If I tell it to only sample between, for example, 1 and 200, why is the constraint also necessary?

Edit: The priors are defined over all of R but have probability 0 outside their range of support, so the sampling statement by itself isn’t telling Stan to sample between 1 and 200; that’s what the declared constraints are there to do. Is that right?

Yeah the matrix also threw me off a bit. In the paper I am trying to replicate the author has

        matrix[3, 3] rtp_mat = [ 
            [sinphi, 0, cosphi], 
            [0, 1, 0],
            [cosphi, 0, -sinphi]
        ];

but her code on GitHub has the matrix with the sines on the diagonal, and I figured I’d follow the code. I’ll run it both ways anyway, just to check.

It can be rather confusing. The so-called sampling statements such as phi0 ~ uniform(1, 200); do not in fact sample anything. The model block only computes the probability of a sample; the samples are generated by the HMC algorithm, based on the observed probabilities of the previous samples and the bounds in the parameters block.
A “divergent transition” occurs when the HMC algorithm fails to follow the contours of the probability density and generates a really bad proposal for the next sample.
If you don’t declare a lower bound for phi0 and the algorithm finds significant posterior probability of phi0 being somewhat near 1, then it will likely also try a value that is slightly below 1 (where the probability drops to zero).

So I added the constraints and the model still isn’t very good. With the defaults most of the iterations exceed the maximum tree depth, and even after increasing it to 15, a lot of iterations (~75%) still exceed it. It is also a lot slower (>2 hours), and there are some divergences.

For the first model without uncertainties, adding the constraints made it work even without the phi_lower constraint. It seemed to go a little faster with the phi_lower, but either way it only took a few seconds.

The model without the uncertainties has four parameters; the model with uncertainties has 3N+4 parameters, so it’s going to be slower, but not by that much, and the treedepth problems indicate the posterior geometry is complicated. There might still be some bug in the Jacobians?

I’m curious, what software did the original paper use? Can you link to the GitHub repo?

Repo: https://github.com/gweneadie/GME
Paper: https://iopscience.iop.org/article/10.3847/1538-4357/ab0f97/pdf

They used R. It’s honestly pretty hard to follow (all over the place). The code is in the RScripts directory. The paper is also a little confusing because there are bits that are taken from previous papers so you kind of have to jump back and forth.

  • The transformation with the matrices is in functions_conversions.r.
  • The distribution function is in functions_deason.r. It’s the logDF.Deason function, which actually isn’t the log of the DF they give in their papers (one of the older papers) but it’s pretty close. I didn’t follow this in my code.
  • The MCMC code is in GME-hierarchical.r.
  • The priors are taken from the paper (Section 2, paragraph 6), and I couldn’t figure out what they used for p_alpha apart from it being a gamma distribution, so I just ran a curve fit to the picture (Fig. 2) they had in the paper and that’s why it’s a little weird.

Earlier I said that the target += statement is there because df() computes the density of y but we need the density for the parameter vels_sph. Now I’m pretty sure that was wrong; df is the phase-space density, so it already gives the density of vels_sph without modification. You should remove the nonsensical Jacobian from the code:

    target += log(fabs(vels_sph[i, 3])) - log(y[1]) - log(y[2]);

Maybe that’s what’s messing up the model.

The other jacobian with raw_vels is still necessary.

Unfortunately removing that doesn’t seem to help. Most of the iterations still exceed a maximum treedepth of 14, the model is pretty slow, and the effective sample sizes are maybe 20 even with 32000 draws (32 chains). The R-hat values are also not close to 1.

It also looks like increasing the tree depth decreases the number of saturated iterations, but increases the number of divergences.

I’m not sure if this is really the best way to do it but I went back to the model without uncertainties and just changed the bits down below. I think this is essentially what the R code does and I get results that are almost exactly the same as what is reported in the paper. I had to put in initial values though, and from comments on the forum it seems like that isn’t really the best way to go. I’ll try to spend some more time figuring out how to improve this.

Thank you so much @nhuurre for all the help and for being so patient with this newbie. :)


parameters {
    real<lower=0., upper=1.> p_gamma; 
    real<lower=1., upper=200.> p_phi0;
    real<lower=-0.5, upper=1.> p_beta;
    real<lower=max(to_vector([3., p_beta * (2. - p_gamma) + p_gamma / 2.])), upper=10.> p_alpha; 

    vector<lower=min(pmra_true) - max(pmra_err), upper=max(pmra_true) + max(pmra_err)>[N] pmra;
    vector<lower=min(pmdec_true) - max(pmdec_err), upper=max(pmdec_true) + max(pmdec_err)>[N] pmdec;
    vector<lower=-300, upper=300>[N] vlos;
    vector<lower=0, upper=max(r_true) + max(r_err)>[N] r;
} 

transformed parameters {

    real<lower=0> y[N, 3]; 

    for (i in 1:N) {
        vector[3] vels_sph = transform_vels(ra[i], dec[i], pmra[i], pmdec[i], vlos[i], plx[i], sintheta[i], costheta[i], sinphi[i], cosphi[i]); 
        y[i, 2] = sqrt(square(vels_sph[2]) + square(vels_sph[3]));
        y[i, 1] = sqrt(square(vels_sph[1]) + square(y[i, 2]));
        y[i, 3] = r[i];
    }

}


model {

    for (i in 1:N) {
        pmra_true[i] ~ normal(pmra[i], pmra_err[i]);
        pmdec_true[i] ~ normal(pmdec[i], pmdec_err[i]);
        vlos_true[i] ~ normal(vlos[i], vlos_err[i]);
        r_true[i] ~ normal(r[i], r_err[i]);
    }

    p_gamma ~ normal(0.5, 0.06);
    p_phi0 ~ uniform(1., 200);
    p_beta ~ uniform(-0.5, 1.);
    p_alpha ~ shifted_gamma(2.99321604, 2.82409927, 3.);
    
    for (i in 1:N) {
        y[i] ~ df(p_phi0, p_gamma, p_beta, p_alpha);
    }
    
}

This takes ~45 minutes to run on 12 cores with a maximum treedepth of 12 and adapt_delta=0.99. I get <1% divergent iterations, ~33% iterations exceeding treedepth, and n_eff of a few thousand out of 12000 on all parameters.

Good to hear it’s (sort of) working.

  1. My concern about rtp_mat was probably unfounded. Looking at it again I see it is an orthogonal matrix, just one whose determinant is negative (a quick check is written out after this list).
  2. The intended bounds on pmra are probably closer to
vector<lower=min(pmra_true - pmra_err), upper=max(pmra_true + pmra_err)>[N] pmra;

which combines pmra value with its corresponding error before taking the minimum/maximum.
Also, CmdStan supports vectorized bounds, so if you’re working with CmdStanPy or CmdStanR you can also do

vector<lower= pmra_true - pmra_err, upper= pmra_true + pmra_err>[N] pmra;

and it sets different bounds for each element of pmra.

  3. Since you have reasonable parameter estimates you could try fixing some of the parameters to the “known” value and see if the other parameters can be inferred without divergence or tree depth problems. (Not sure if knowing which ones cause problems helps, though.)
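
The quick check for point 1, writing s = sin(phi) and c = cos(phi): the rows of rtp_mat are (s, 0, c), (0, 1, 0) and (c, 0, -s), which are unit length and mutually orthogonal, and

$$
\det\begin{pmatrix} s & 0 & c \\ 0 & 1 & 0 \\ c & 0 & -s \end{pmatrix}
  = 1\cdot\det\begin{pmatrix} s & c \\ c & -s \end{pmatrix}
  = -(s^{2} + c^{2}) = -1,
$$

so rtp_mat' is still its inverse, but the matrix is a reflection rather than a proper rotation.
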
  1. Yeah, that makes sense. Also figured that the constraints should be a bit looser, so I used +/- 3x the error instead. I’m still a little skeptical of the bounds I’m putting here though. It seems like cheating by just centering pmra on pmra_true, and then later saying that pmra_true is normally distributed around pmra (even more so with vectorized bounds). I guess it kind of makes sense but it’s still a little fuzzy in my head. How does this end up being different from pmra ~ normal(pmra_true, pmra_err)?
  2. I bumped the maximum tree depth up to 14 and it got rid of all the tree depth problems. From what I’ve read it doesn’t seem to matter too much either way? As in tree depth problems don’t affect the results.

All the bounds do is tell the sampler never to explore outside the region. If the posterior probability outside is so low that the sampler would never go there anyway, the bounds shouldn’t make any difference to the results. They do make a difference during early warmup, when the sampler has not yet found the posterior mode/typical set and explores all kinds of crazy possibilities.

Another way to think about it: the bounds change the model to a truncated distribution

pmra ~ normal(pmra_true, pmra_err)T[pmra_true-3*pmra_err, pmra_true+3*pmra_err]

and this distribution is almost indistinguishable from the untruncated normal so whatever.
Ok, it needs somewhat wider bounds to be “almost indistinguishable”, though. N is a couple of hundred, right? So you’d expect at least one pmra drawn from the normal to land around 3*err away from the center. A safer value for the bound is 5*err, which is definitely not going to happen unless the errors have been underestimated (and if that’s possible, the model needs to change to account for it anyway).
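
A rough back-of-the-envelope check (taking N ≈ 200 as a stand-in for “a couple of hundred”): P(|Z| > 3) ≈ 0.0027, so the expected number of stars whose latent pmra falls more than 3*err from the observed pmra_true is about 200 × 0.0027 ≈ 0.5, i.e. it plausibly happens at least once; P(|Z| > 5) ≈ 5.7e-7, so the expected number beyond 5*err is about 1e-4, i.e. essentially never.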

Sure, it just makes the model slow. The posterior inference is valid as long as effective sample size is large enough.