High magnitude cutpoint proposals

fitting-issues
performance

#1

I am trying to fit the following hierarchical simplex factor analysis model

data {
    int<lower=1> N;
    int<lower=1> S;
    int<lower=1> K;
    int testmin;
    int testmax;
    int<lower=testmin, upper=testmax> P[S,N];
}

parameters {
    simplex[K] W[N];
    matrix[S,K] B0;
    matrix[S,N] noise0;
    ordered[testmax-1] c;
    simplex[K] w0;
    real<lower=0> alpha0;
    real<lower=0> sigma;
    real<lower=0> sigma_noise;
}

transformed parameters {
    matrix[S,N] C;
    matrix[S,N] noise;
    matrix[S,K] B;
    vector[K] alpha;
    noise = sigma_noise*noise0;
    B = sigma*B0;
    for (n in 1:N)
        C[,n] = B*W[n];
    C += noise;
    alpha = 1+alpha0*w0;
}

model {
    //noise
    sigma_noise ~ gamma(2.,10.);
    to_vector(noise0) ~ normal(0,1);
    
    //latent simplex representation
    alpha0 ~ gamma(2.,10.);
    w0 ~ dirichlet(rep_vector(2,K));
    for (n in 1:N)
        W[n] ~ dirichlet(alpha);
    
    //basis vectors
    sigma ~ gamma(2.,10.);
    to_vector(B0) ~ normal(0,1);
    
    //data generation (prior suggested by Carpenter on boards)
    mean(c) ~ normal(0,1);
    (c[2:(testmax-1)]-c[1:(testmax-2)]) ~ gamma(2.,10.);
    for (n in 1:N) {
        for (s in 1:S) {
            P[s,n] ~ ordered_logistic(C[s,n], c);
        }
    }
}

The sampler actually works fairly well, but I am a bit baffled by some of the messages it returns. I get several blocks of output like:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: gamma_lpdf: Random variable[4] is -nan, but must not be nan!  (in 'unknown file name' at line 54)

If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: ordered_logistic: Cut points parameter is 2.76642e+66, but must be greater than 2.76642e+66  (in 'unknown file name' at line 57)


Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: ordered_logistic: Cut points parameter is 704.671, but must be greater than 704.671  (in 'unknown file name' at line 57)


Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: gamma_lpdf: Random variable[3] is -nan, but must not be nan!  (in 'unknown file name' at line 54)


Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: ordered_logistic: Cut points parameter is 20.2726, but must be greater than 20.2726  (in 'unknown file name' at line 57)


Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: ordered_logistic: Cut points parameter is 4.19481e+75, but must be greater than 4.19481e+75  (in 'unknown file name' at line 57)

This occurs for different choices of cutpoint prior, such as the normal and the uniform (the current one was suggested in another post), and for different initializations. Right now I am initializing each chain using the output of an optimizing call with 10 steps, where I initialize the cutpoints to np.linspace(-1,1,testmax-1) (i.e., an equally spaced grid on the [-1,1] interval).
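For reference, the init I pass to each chain looks roughly like this (a minimal sketch; the testmax value is made up, and the parameter name c matches the model above):

```python
import numpy as np

testmax = 7  # hypothetical number of ordinal levels

# Equally spaced cutpoint grid on [-1, 1]
init_c = np.linspace(-1, 1, testmax - 1)

def make_init():
    # Partial init dict in the PyStan style; parameters not listed here
    # fall back to Stan's default random initialization.
    return {"c": init_c}
```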

I must be hitting some massive gradients to get proposals of that magnitude. Is my prior not adequately boundary-avoiding? Or is my initialization off? The sampler seems to hit a stable regime after a while and the final output looks quite fine, but these reports have me worried.

P.S. A complete aside, but is there any way to make simplex factor models identifiable and avoid label switching?


#2

This definitely looks like the model allows some crazy parameter values to occur. E.g., you are getting nan from a subtraction, so it is likely that some elements of c are infinite (after transforming from the unconstrained space to the ordered type). The cut point problems also seem to be numerical (maybe two values that should be strictly increasing got rounded to the same floating-point number because they are so large).
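A quick numpy check supports the rounding theory: at the magnitudes in your error messages, the spacing between adjacent doubles is enormous, so two "increasing" cutpoints can round to the same representable value:

```python
import numpy as np

# One of the magnitudes from the error messages above
big = 2.76642e+66

# Adding 1 leaves the value bit-for-bit unchanged, so a strictly larger
# proposed cutpoint can round onto its neighbor, violating the ordering
# that the ordered_logistic check expects.
assert big + 1.0 == big

# The gap to the next representable double at this magnitude:
print(np.spacing(big))  # on the order of 1e+50
```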

Non-identifiability or multimodality of the posterior may produce large gradients if the sampler adapts its step size in low-curvature regions and then stumbles upon directions where the curvature is very high.

Beyond that I am unable to help you much, as I don’t understand the model. Do I get it correctly that the multimodality/label switching arises from K > 1? Maybe check that the model works correctly (retrieves parameters from simulated data) with K = 1 and K = 2, to be sure there are no programming errors?


#3

Indeed, K is the number of latent factors. It’s basically a factor model with factor matrix B, where the latent representation of each observation n is a set of convex weights W[n] restricted to the simplex, so the latent quantity used in the ordinal regression is reconstructed as B * W[n]. Trying K=1 could be interesting; I will try that.

Sorry if the specification is a bit obtuse. It would be a bit more succinct without all the reparameterizations.

The sampler actually seems quite stable, with the same modes being recovered for K=4 across 10 chains. Of course, this might be because the errors prevent it from mixing further.

Hmm, I think you are on the right track: the differences between the elements of c grow so small that gamma_lpdf returns -nan, because the prior is boundary-avoiding. That would explain both types of warning. But why would c collapse in that manner? It seems bizarre that it is even proposed.
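For intuition, here is a stdlib-only sketch of a Gamma(2, 10) log density (a simplified stand-in for Stan's gamma_lpdf, shape/rate parameterization): a collapsed gap sends it off to -inf, and if two cutpoints both overflow to +inf, their difference is nan, which would match both exceptions:

```python
import math

def gamma_lpdf(x, alpha=2.0, beta=10.0):
    # Log density of Gamma(alpha, beta) with shape alpha and rate beta;
    # minimal sketch, not Stan's actual implementation.
    if x <= 0.0:
        return float("-inf")
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(x) - beta * x)

# A nearly collapsed gap drives the boundary-avoiding prior toward -inf:
print(gamma_lpdf(1e-300))  # hugely negative

# If two cutpoints both overflow to +inf, their difference is nan,
# matching the "Random variable[k] is -nan" exception:
gap = float("inf") - float("inf")
print(math.isnan(gap))
```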

Hypothetically, if the latent values C were suddenly much lower than the current cutpoints, all cutpoints would seek to drop, but the value chosen for c[k] would block all c[k'] with k' > k from dropping below it, which might cause a sort of pile-up. In general, there seems to be some non-identifiability between the scale of C and the cutpoints c?


#4

Tried with K=1; it does not appear to help with the problem.
I’m thinking of just forcing a small fixed gap between the cut points in addition to the random gap.
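A rough numpy sketch of that construction (the helper name and eps value are made up): each cutpoint sits at least eps above its predecessor, so adjacent cutpoints can never collapse closer than eps, even in floating point:

```python
import numpy as np

def cutpoints_with_min_gap(first, raw_gaps, eps=1e-3):
    # c[0] = first; c[k] = c[k-1] + eps + raw_gaps[k-1], raw_gaps >= 0.
    # The cumulative sum keeps the vector strictly ordered by construction.
    raw_gaps = np.asarray(raw_gaps, dtype=float)
    return np.concatenate([[first], first + np.cumsum(raw_gaps + eps)])

c = cutpoints_with_min_gap(-1.0, [0.0, 0.0, 0.5])
print(c)  # gaps are at least eps even when the random gaps are zero
```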


#5

If K=1 doesn’t help, then this IMHO indicates there is some error or misspecification in the model code. Try stripping the model to the bare minimum (e.g., start with a simple ordinal regression), check that it works (recovers parameters from simulated data), and then add the other pieces one at a time, always checking that the model still works. Also note that the brms package supports ordinal regression (https://kevinstadler.github.io/blog/bayesian-ordinal-regression-with-random-effects-using-brms/), so you might want to specify as much of the model as possible with brms and use the Stan code it generates - which tends to be correct :-) - as a starting point.


#6

Ah, I’m unfortunately one of those endangered PyStan users (although I have recently been reconsidering my life choices given the R ecosystem around Stan), but I will see whether I can get something out of brms.

Could it be an initialization problem? I am thinking ordinal regression is somewhat sensitive to initialization, given that a wrong initialization implies near-zero probability for the correct event. I am still quite convinced this is an issue with cut point collapse, no?

edit: for the record, the specification largely follows the Stan reference.


#7

Really hard to tell :-) I remember that on multiple occasions I had a perfect theory for why the model was not working. Then I reduced the model to the bare minimum and started building it up in steps, just to realize that my theory was false and the problem was caused by a wrong sign in the code or bad indexing of vectors.


#8

I have been over it a couple of times, but I will give it another go.

One other theory: for low K, it might be difficult to actually achieve non-zero probability for every observation, at least initially. But that should be a problem it shares with most other binary and ordinal models.


#9

I am truly just guessing - it seems you’ve hit a hard problem. Or maybe someone more knowledgeable has to weigh in. It might also be worthwhile to test the model with P[s,n] ~ normal(C[s,n], 1) or something to that effect, to see whether everything except the ordinal part works as expected. On a similar note, you may want to move the cutpoints c to the data block and provide them as known values. If that helps, the problem is almost certainly with the cutpoints.


#10

Tried both of your suggestions, and both ran fine with none of the usual warnings. Of course, with the normal likelihood the cut points do not influence the likelihood at all.

The order in which Stan outputs its progress messages is a bit confusing, but it appears that each block of errors happens at the start of a chain, which favors this being an initialization problem. I will try initializing even closer to the MAP.


#11

Tried initializing the chains with 2000 steps of BFGS, but I still get the warnings.

All BFGS runs converged to (something close to) the unassuming cutpoint values of:
array([-6.69411269, -4.08202237, 2.49172561, 8.5414547 , 13.83149098, 16.71796858])

which looks reasonable.


#12

Endangered in what way?

It can be. You need sensible priors to avoid collapse if the data doesn’t support all your divisions.

Which spec are you talking about? The only examples in the manual are much simpler. For example, I don’t understand that C += noise statement in the transformed parameters.

We need to get better error messages with indexes listed, but this is indeed signaling a collapse of the cutpoints. Also, seeing cutpoints in the e+75 range means you need to add some priors; presumably you’re never going to want solutions in that range for anything.

It’s worth converting B to a matrix to allow this to be matrix multiplication. Doing it this way leads to a lot of memory issues.


#13

I should’ve also mentioned that if your cutpoints are collapsing, you need priors to solve the problem, not better initialization.


#14

Just a tongue-in-cheek way of pointing out that the Stan Python ecosystem is a bit underdeveloped compared to R, which I take as an indicator that most users use R.

Indeed, I agree completely, but I should think that my choice is already zero-avoiding (as is ironically exemplified by the errors where the gamma returns nan).

You are right; maybe I should have said Stan guidelines from the manual instead of specification. I meant the use of non-centered parameterizations and the like. The noise term is perhaps a bit unorthodox; at the point where I posted this, I hadn’t realized that the ordinal noise is implicitly integrated out in the definition of ordered_logistic. Still, I think my model might be slightly weaker than ordinary ordinal regression, given that I have several ordinal targets, one common set of cutpoints, and a simple simplex factor model fitting the whole thing.

Indeed, the high proposals are what have me worried. But why do they even occur, even given cutpoint collapse? Presumably there is a divergent gradient direction?
The priors are already there, but you think a gamma(2,10) is too weak? It’s pilfered from the manual and the wiki.
An issue with priors for latent ordinal models like this one, which I am not sure how to resolve, is that the prior on the cutpoints interacts with the prior on the latents, in the sense that their scales only make sense relative to each other. So can I really make the gamma on the increments sharper or larger without taking the prior on B into account?

Wait, I am confused: B is already a matrix here. Did you mean W? Would the ideal implementation be to load all of the simplex variables W[n] into the columns of a local matrix Wloc prior to the multiplication? (Or is it better to store them in rows and then use a special product function?)
I assume it’s best if I maintain the simplex constraints explicitly.
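For what it’s worth, here is a small numpy analogy of the packing I have in mind (shapes made up): stacking the simplexes as columns and doing one matrix product matches the per-observation loop in the transformed parameters block:

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, N = 5, 3, 8
B = rng.normal(size=(S, K))

# W[n] are simplex weights; stack them as the columns of a K x N matrix,
# so each column sums to 1.
W = rng.dirichlet(np.ones(K), size=N).T

# Loop version, mirroring the current transformed parameters block
C_loop = np.empty((S, N))
for n in range(N):
    C_loop[:, n] = B @ W[:, n]

# Single matrix product over all observations
C_mat = B @ W

assert np.allclose(C_loop, C_mat)
```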

If the chain eventually mixes well, doesn’t that speak in favor of problems with initialization and adaptation? The errors occur at the beginning of each chain, but then stop. Of course, a good prior might kill these issues.

Thank you very much for taking the time to respond.


#15

It’s a chicken and egg problem in that if our Python interface was more developed, maybe more people would use Python. There is a lot of buy-in around R in the stats world, and it’s what Andrew uses, and it’s where we have the largest concentration of programmers. We’re about to try to hire someone at Columbia to concentrate on building out the Python interfaces to bring them in line with the R interface functionality. So if that sounds like an interesting project to anyone, drop me a line (carp@alias-i.com). We’ll be advertising officially soon, I hope.

It may be. It doesn’t go to zero super rapidly, so it’s still consistent with collapsing to within floating-point error. Compare its rate of approach to zero with a lognormal, for example.

There’s often an identifiability problem here if you have an intercept in the model, as it means the cutpoint set’s location isn’t identified. So one way to mitigate this problem is to assume by fiat that the lowest cutpoint is zero, and then use our positive_ordered type to generate the rest of the cutpoints above zero. Then you can have an intercept in your model.
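As a rough numpy illustration of that parameterization (the helper is hypothetical; in Stan you would just declare a positive_ordered vector of length testmax-2 and prepend the fixed zero):

```python
import numpy as np

def anchored_cutpoints(z):
    # Hypothetical unconstrained-to-constrained map mirroring the idea:
    # c[0] is pinned at 0 and the remaining cutpoints are strictly
    # increasing above it, which frees the location to be absorbed by
    # an intercept in the linear predictor.
    return np.concatenate([[0.0], np.cumsum(np.exp(z))])

c = anchored_cutpoints(np.array([-1.0, 0.0, 1.0]))
print(c)  # starts at exactly 0, strictly increasing
```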

Yes, you absolutely need the simplex constraints.

You can’t actually pack these things with memory locality (an array of simplexes is row major, a matrix is column major), so it doesn’t matter much how you do the conversion. The loops are relatively fast in Stan (unlike in R or Python), since they compile straight down to C++.

Sometimes for hard problems that’s the case, and ordinal logit can sometimes be very hard.

It’s also often symptomatic of weak identifiability or a poorly specified model for the data, both of which lead to slow mixing. Most well-specified models that aren’t too complex fit fine with default inits in Stan. When regressions get big, especially if the predictors are large in magnitude, we can have numerical issues with random inits.