Request for help understanding model behavior

I have coded up a Stan model related to a problem in industrial engineering. The procedure in question is referred to in the field as “Gauge R & R” (Repeatability & Reproducibility). The goal is to estimate the variance components associated with making acceptance measurements. So for example, one might imagine test operators measuring gasket thicknesses with calipers. We would like the measurement variability to be due mostly to the actual part variability, and to minimize the variability resulting from repeated measurements by the same operator on the same part (repeatability), and variability due to different operators inducing varying amounts of bias due to how they operate the equipment (reproducibility). A typical Gauge R & R study will involve several operators selected at random and several parts selected at random, with each operator measuring each part several times.

The model I coded is based largely on WinBugs code from example 2 (Two-Way Random-Effects Model) of a paper entitled “A Bayesian Approach to the Analysis of Gauge R&R Data” (Weaver, B. P., Hamada, M. S., Vardeman, S. B., Wilson, A. G. (2012). Quality Engineering, 24:486–500). The variance components to be estimated are sigmaRep (repeatability), sigmaO (operator variability), and sigmaP (part variability). To keep things simple I have excluded the part/operator interaction component included in the paper.
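In equation form (writing i for parts, j for operators, and k for repeated measurements, and dropping the interaction term), the model is:

$$
y_{ijk} = \mu + P_i + O_j + \epsilon_{ijk}, \qquad
P_i \sim \mathrm{N}(0, \sigma_P^2), \quad
O_j \sim \mathrm{N}(0, \sigma_O^2), \quad
\epsilon_{ijk} \sim \mathrm{N}(0, \sigma_{\mathrm{rep}}^2).
$$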
My Stan code is included below:

gaugeRR_model.txt (1.5 KB)

Typically the model seems to behave fairly well: it recovers the parameters I used to generate simulated data; Rhat values are 1; the trace plots and marginal posterior histograms look good. N_Eff values are decent I guess, typically 10 – 50% of the total number of samples. Time to generate 20,000 samples (4 chains, 5000 iterations) is about 15 seconds.

Where the model struggles is when fitting data generated using variance components that differ significantly from one another. For example, fitting data (dataset1) generated with sigmaRep = 0.1, sigmaO = 1, and sigmaP = 10 resulted in Rhat still 1, but N_Eff as low as 7%, and a sampling time of 50 seconds.

dataset1.txt (741 Bytes)

Taking it a step further (dataset2), if we try sigmaRep = 0.02, sigmaO = 1, and sigmaP = 50, we get Rhat as high as 1.4, N_Eff as low as 0.075%, and a sampling time of 70 seconds.

dataset2.txt (741 Bytes)

So I was hoping someone could explain why the model behaves this way and if anything can be done about it.

BTW, in the paper cited above, in order to obtain decent results with WinBugs, the authors ran 5 million iterations, dropped the first 200,000 as burn-in, and thinned the rest by a factor of 100. So Stan is already doing much better.

Thanks


Some quick ideas (without a deep understanding of your model). Stan works best when all parameters are approximately N(0, 1); having parameters on vastly different scales complicates adaptation of the leapfrog step size. To bring your parameters onto a similar scale, you may try a non-centered parametrization, e.g. instead of

parameters {
 vector[nP] P;
}

model {
  P ~ normal(0,sigmaP);
}

you’ll have

parameters {
 vector[nP] P_raw;
}

transformed parameters {
  vector[nP] P = P_raw * sigmaP;
}

model {
  P_raw ~ normal(0,1); // implies P = P_raw * sigmaP ~ normal(0, sigmaP)
}

This brings all the actual parameters on the same scale, which should help Stan in sampling.

You might want to do something similar for the Cauchy distributions, but I am not sure whether a linear transformation of a Cauchy variable can actually be expressed by transforming its location and scale.

Also the Cauchy priors for muP and the sigmas seem to have some very heavy tails. I don’t know why you chose them, but maybe using normal priors or student-t with >2 degrees of freedom (if you can justify them) could get rid of some of the problems - do you really believe such a big spread in the true values is plausible? Also AFAIK, student-t with >2 DOF can take advantage of the non-centered parametrization safely. You may find https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations helpful.
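For example, a scaled half-Student-t prior can be non-centered the same way. A rough sketch only (reusing the sigmaPest constant from your data block as the scale, with 3 degrees of freedom purely as a placeholder, and sigmaP_raw as my name for the raw parameter):

parameters {
  real<lower=0> sigmaP_raw;
}

transformed parameters {
  // scaling a half-Student-t by a data constant keeps it half-Student-t,
  // so this implies sigmaP ~ student_t(3, 0, sigmaPest), truncated to positive
  real<lower=0> sigmaP = sigmaPest * sigmaP_raw;
}

model {
  sigmaP_raw ~ student_t(3, 0, 1); // half-t because of the <lower=0> constraint
}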

Thanks for your suggestions.

I tried several variations on your non-centered parameterization suggestion, but no luck. The results were either no better than before or failed completely.

Regarding the priors, I got slightly better results with normal priors, but it doesn’t solve the issue.

I’m not hugely concerned about this issue from a practical point of view - the scenario with radically different variance components is probably not very likely to occur in a real application. I’m mainly just interested in understanding this phenomenon and Stan in general, so I’m still open to any further recommendations.

Thanks again

It’d be interesting to see if the model’s mixing slowly, or just failing to mix. If you run it for ten times as many iterations, do the Rhat values go down? It may not have had enough time to adapt or it may just be slowly mixing.

One thing to do is crank up the target acceptance rate (adapt_delta), which forces the Hamiltonian simulation to be more accurate and can help in high-curvature models. The other is to increase adaptation, i.e., run more warmup iterations.

In this case, you should also be reparameterizing a bit. See below.

Thanks for the clear explanation and model. I really like these kinds of models. They’re also very common in epidemiology for diagnostic tests and I’ve been applying the same ideas to crowdsourced data annotation.

There are a bunch of computational issues with your model, which I repeat here.

data {
     int<lower=0> nM;           // # of measurements
     int<lower=0> nP;           // # of parts
     int<lower=0> nO;           // # of operators
     vector[nM] y;              // measurements
     int part[nM];              // part indicator
     int oper[nM];              // operator indicator
     real<lower=0> sigmaPest;   // scale for sigmaP prior
     real<lower=0> sigmaRepEst;  // scale for sigmaRep prior
     real<lower=0> sigmaOest;   // scale for sigmaO prior
}
transformed data {
     real my;               // mean of y
     real sdy;              // st dev of y
     my = mean(y);
     sdy = sd(y);
}
parameters {
     real muP;                   // part mean
     real<lower=0> sigmaP;       // part sigma
     real<lower=0> sigmaRep;     // repeatability sigma
     real<lower=0> sigmaO;       // operator sigma
     vector[nP] P;
     vector[nO] O;
}
model {
     muP ~ cauchy(my,10*sdy/sqrt(nM));                  // muP prior
     sigmaP ~ cauchy(sigmaPest,10*sigmaPest/sqrt(nP)); // sigmaP prior
     sigmaRep ~ cauchy(sigmaRepEst,10*sigmaRepEst/sqrt(nM)); // sigmaRep prior
     sigmaO ~ cauchy(sigmaOest,10*sigmaOest/sqrt(nO));  // sigmaO prior
     P ~ normal(0,sigmaP);
     O ~ normal(0,sigmaO);
     y ~ normal(muP + P[part] + O[oper],sigmaRep);
}
generated quantities {
     real<lower=0> sigmaG;
     real<lower=0> sigmaTot;
     real<lower=0> sigGtoTot;
     sigmaG = sqrt(square(sigmaRep) + square(sigmaO));
     sigmaTot = sqrt(square(sigmaG) + square(sigmaP));
     sigGtoTot = sigmaG / sigmaTot;
}

You have the usual problem of identifiability with three intercepts — muP, P, and O — you get the same thing with IRT models. This is usually manageable with reasonable priors, and I’ve found the hierarchical variance parameters will help fitting. This is what’s really going to cause Gibbs samplers to struggle. NUTS will also struggle if the variables are on different scales, as it only does diagonal adaptation. The place to learn more about all this is Michael Betancourt’s arXiv papers.

Having said that, you probably are going to need non-centered parameterizations for P and O:

parameters {
  vector[nP] P_std;
  vector[nO] O_std;
  ...  
transformed parameters {
  vector[nP] P = P_std * sigmaP;
  vector[nO] O = O_std * sigmaO;
  ...
model {
  P_std ~ normal(0, 1);
  O_std ~ normal(0, 1);

Then if you have wildly different scales for the sigmaX, you want to do the same thing:

parameters {
  real<lower = 0> sigmaRep_std;
transformed parameters {
  real<lower = 0> sigmaRep
     = sigmaRepEst + sigmaRep_std * 10 * sigmaRepEst / sqrt(nM);

and so on. That puts sigmaRep_std on the unit scale, which is where the sampling is happening (assuming your priors are weakly informative). This is what @martinmodrak was suggesting. You should also take his advice on heavy tails, which is rolled into this. Yes, you can do this with Cauchy, but unless you expect huge error scales, there’s no need for it.

For convenience, I’d suggest declare-define for the generated quantities and the hypot() function (hypotenuse):

generated quantities {
  real<lower=0> sigmaG = hypot(sigmaRep, sigmaO);
  ...
}

Bob, thanks for your help on this. Following up on your suggestions:

Running for 10 times as many iterations didn’t help. The chains appear to not be mixing.

I increased adapt_delta from 0.8 to 0.9. Didn’t help.

By “increase adaptation” I assume you mean more warmup iterations? I went to 20,000 - didn’t help.

Here is my updated model using non-centered parameterization. (I also slightly modified the way I’m scaling the priors):

data {
     int<lower=0> nM;         // # of measurements
     int<lower=0> nP;         // # of parts
     int<lower=0> nO;         // # of operators
     vector[nM] y;            // measurements
     int part[nM];            // part indicator
     int oper[nM];            // operator indicator
     real muP_scale;          // scale factor for muP prior
     real<lower=0> sigmaRep_scale;  // scale factor for sigmaRep prior
     real<lower=0> sigmaP_scale;    // scale factor for sigmaP prior
     real<lower=0> sigmaO_scale;    // scale factor for sigmaO prior
}
parameters {
     real muP_std;                   // part mean
     real<lower=0> sigmaP_std;       // part sigma
     real<lower=0> sigmaRep_std;     // repeatability sigma
     real<lower=0> sigmaO_std;       // operator sigma
     vector[nP] P_std;
     vector[nO] O_std;
}
transformed parameters {
     real my = mean(y);
     real muP = my + muP_scale * muP_std;
     real<lower=0> sigmaP = sigmaP_scale * sigmaP_std;
     real<lower=0> sigmaO = sigmaO_scale * sigmaO_std;
     real<lower=0> sigmaRep = sigmaRep_scale * sigmaRep_std;
     vector[nP] P = P_std * sigmaP;
     vector[nO] O = O_std * sigmaO;
}
model {
     muP_std ~ normal(0,1);       
     sigmaP_std ~ normal(0,1);
     sigmaO_std ~ normal(0,1);
     sigmaRep_std ~ normal(0,1);
     P_std ~ normal(0,1);
     O_std ~ normal(0,1);
     y ~ normal(muP + P[part] + O[oper],sigmaRep);
}

And here are my updated data sets (same as previous data sets except with updated prior scale factors):

dataset1a.txt (736 Bytes)
dataset2a.txt (748 Bytes)

Unfortunately, the performance of the re-parameterized model is pretty similar to the original model. To reiterate, data set 1 works OK, data set 2 doesn’t (low N_Eff, high Rhat).

Thanks again

You shouldn’t be defining the mean of the data as a transformed parameter—you can define my as transformed data. But the bigger question is why scale a parameter (here muP) by the data (here my = mean(y))?

I don’t see anything else that’s problematic in here. Other than the usual problem with three intercepts.

What does the sampler actually do? Multiple modes? Get stuck without moving?

My mistake. But that’s not hurting anything, correct?

Actually, the normal prior on muP is scaled by muP_scale and centered at my.

Could you elaborate on what you mean by the usual problem with three intercepts?

See attached trace plots:

Nothing but efficiency.

Oops, I meant: why relocate it by the data?

None of muP, P, or O here are classically identified since they’re additive. You can subtract from muP and add to P and get the same likelihood. So only the prior is providing a real identification of the results. It centers the effects around zero for P and O to make them interpretable and then the work gets done by muP. And it’s a soft form of centering—the values of P won’t quite add up to zero in any given iteration.
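In symbols: muP + P[i] + O[j] = (muP + c) + (P[i] - c) + O[j] for any constant c, so the likelihood alone can’t tell those configurations apart.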

And I’m guessing that’s just what’s happening here. Your muP is probably covarying with P and O. This kind of thing often occurs if the model’s not well specified for the data. Did you actually generate the parameters from the priors and then the data from the parameters?

Just my way of specifying a weakly regularizing prior for muP. If this were real data I’d use a legitimate informative prior.

This seems inconsistent with something I observed, which is that the model will fit pretty well with flat priors, as long as there’s enough data (> ~ 20 points) and the variance components are roughly on the same scale.

I specified the values of parameters muP, sigmaRep, sigmaP, and sigmaO (e.g. for data set 2, muP = 100, sigmaRep = 0.02, sigmaP = 50, and sigmaO = 1). Then, given those parameter values, I drew nP P values from normal(0,sigmaP), nO O values from normal(0,sigmaO), and nM repeatability error values from normal(0,sigmaRep). So a measurement corresponding to the ith part and jth operator would be: meas = muP + P(i) + O(j) + rep(k), where k = 1:nM.

That would be very surprising indeed given that it’s classically non-identified, so the posterior would be improper. I show what happens with Stan in that case in the manual chapter on problematic posteriors. But maybe you’re talking about something else.

And it sounds like you did a proper simulation, so it should be fitting if everything lines up and there’s not some terrible posterior geometry destroying things.

I believe what I’m doing here is essentially analysis of variance, which is classically identified (I think).

There’s some discussion of this in BDA chapter 15, where they mention that variance components near zero can cause problems for Gibbs samplers, and suggest “parameter expansion” as a way to remedy it. Further, there’s some discussion of a transformation useful for HMC, which looks to me like the non-centered parameterization. Anyway, I’m wondering if there’s anything in there that could help here.

The parameters muP, P and O are additive. That means you can add to muP and subtract from P and get the same likelihood. That’s pretty much the definition of non-identified.

Also true for HMC—there is the issue of rounding (not so bad near zero as near one), but also the fact that we transform parameters, so it really turns into overflow. I don’t recall what “parameter expansion” is. We just suggest rescaling if you really have variables with very tiny posterior scales (0 plus or minus some small epsilon or some very large epsilon).

Non-centered parameterization helps for hierarchical parameters which aren’t well identified by the prior or data. You’re doing that with the P_std and O_std thing already.

Now that I look at it, your non-centering of the scale isn’t the usual approach. Usually you’d do something like the following to get a non-centered lognormal distribution:

parameters {
  real sigmaP_unc_std;

transformed parameters {
  real sigmaP_unc = sigmaP_unc_std * sigmaP_scale;
  real sigmaP = exp(sigmaP_unc);


model {
  target += sigmaP_unc;  // Jacobian adjustment
  sigmaP_unc_std ~ normal(0, 1);

What you’re doing is scaling after the fact. If you really want sigmaP_scale to be the lognormal scale, then you need to work on the log scale and recompute as above, or let Stan do the Jacobian and work on the positive scale, with the declarations you have and the definition

real<lower = 0> sigmaP_std;

real<lower = 0> sigmaP = pow(sigmaP_std, sigmaP_scale);

There’s nothing wrong with what you have; there’s just not a standard interpretation for your sigmaP_scale, whereas the version above uses the standard lognormal scale parameter.

Sorry, I never properly learned frequentist stats, so I’m not completely clear on some of the concepts, like identifiability.

That said, I’m interpreting what you’re saying as: y ~ normal(muP + P[part] + O[oper],sigmaRep) by itself is non-identified, but if you include the priors P ~ normal(0,sigmaP) and O ~ normal(0,sigmaO), they force the P’s and O’s to be centered around zero, and muP to be centered around the mean of y.

Correct me if I’m wrong, but I think part of the misunderstanding is that I was considering those priors on P and O to be part of the model. So when I said earlier that the model fits even with flat priors (implying identifiability), I meant flat priors on muP, sigmaRep, sigmaP, and sigmaO.

So a stripped down version of the original model, with flat priors on muP, sigmaRep, sigmaP, and sigmaO would be:

 data {
     int<lower=0> nM;         // # of measurements
     int<lower=0> nP;         // # of parts
     int<lower=0> nO;         // # of operators
     vector[nM] y;            // measurements
     int part[nM];            // part indicator
     int oper[nM];            // operator indicator
}
parameters {
     real muP;                   // part mean
     real<lower=0> sigmaP;       // part sigma 
     real<lower=0> sigmaRep;     // repeatability sigma
     real<lower=0> sigmaO;       // operator sigma
     vector[nP] P;
     vector[nO] O;
}
model {
     P ~ normal(0,sigmaP);
     O ~ normal(0,sigmaO);
     y ~ normal(muP + P[part] + O[oper],sigmaRep);
}

which as I said earlier will fit as long as there’s enough data and sigmaRep, sigmaO, and sigmaP are not too different. E.g. this data set will work: dataset3.txt (660 Bytes)

So what sort of reparameterization would you recommend for the basic model (no sigmaP_scale etc.)?

Me either, but identifiability is important for Bayes, too! Non-identified Bayesian models have improper posteriors. And there is more than one way to remediate—hard identification or soft identification with priors.

Right—if you have proper priors, you’ll have a proper posterior. That’s what I’m calling “soft identification” for lack of a better term.

They are. They’re part of the prior. The likelihood by itself isn’t identified, but the prior gives you soft identification.

Did the centered versions of P and O work better here? They will, in general, when the posterior’s well identified by the data and/or prior.

Yes, in terms of N_Eff and R_hat. However, the non-centered version worked “better” in one respect - it was less likely to draw extreme sample values for sigmaP and sigmaO. Note that neither version could fit data set 2 (but non-centered was worse).

Something that’s always perplexed me about this is that, intuitively, one might think that for data sets generated using very different variance components it would be especially easy to estimate the variance components. E.g. looking at data set 2 below (sigmaRep = 0.02, sigmaO = 1, sigmaP = 50), one can almost estimate the sigmas by inspection:

Repetition 1
          oper 1    oper 2    oper 3
part 1    76.501    75.857    74.532
part 2    81.037    80.338    79.069
part 3   144.322   143.658   142.392
part 4    41.589    40.891    39.650
part 5   160.984   160.241   158.987

Repetition 2
          oper 1    oper 2    oper 3
part 1    76.505    75.853    74.513
part 2    81.048    80.305    79.056
part 3   144.323   143.618   142.372
part 4    41.602    40.893    39.624
part 5   160.966   160.285   158.991

So it seems strange to me that this data set would be especially hard to fit.
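For example, the two repetitions of part 1 by operator 1 differ by only 76.505 - 76.501 = 0.004 (consistent with sigmaRep = 0.02), operators 1 and 3 differ by about 2 on part 1 (consistent with sigmaO = 1), and the five parts span a range of roughly 120 (consistent with sigmaP = 50).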

If the posterior intervals differed, at least one of them was just wrong. If they both draw from the posterior, the extremes should be the same. Often the non-centered version lets you draw more extreme values—it’s unusual to see it go the other way unless you have a lot of data.

If you can figure out a better adaptation algorithm for NUTS, we’d be the first people to implement it!

Well, I finally found something that works. I changed the model block from:

model {
     P ~ normal(0,sigmaP);
     O ~ normal(0,sigmaO);
     y ~ normal(muP + P[part] + O[oper],sigmaRep);
}

to:

model {
     P ~ normal(muP,sigmaP);
     O ~ normal(0,sigmaO);
     y ~ normal(P[part] + O[oper],sigmaRep);
}

which I think you’ll agree is essentially the same model, but that small change resulted in a significant performance improvement.

I’d be interested if you had some insight into why this made a difference.

Thanks

The key is understanding why it’s not the same model from Stan’s perspective. In the first case, the joint distribution is over (P, O, y, muP, sigmaP), and in the second it’s over (Pctr, O, y, muP, sigmaP), where Pctr = P + muP. The fully non-centered version is a distribution over (P_std, O, y, muP, sigmaP), with

P_std ~ normal(0, 1);
y ~ normal(muP + sigmaP * P_std[part] + O[oper], sigmaRep);

Which version is most efficient is determined by how well constrained the posterior is. As the informativeness of the data and prior increase, the centered parameterization becomes better than the non-centered. See the Stan manual (search for “funnel”) and Betancourt and Girolami’s paper on hierarchical models.
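
For reference, a fully non-centered version of the stripped-down model you posted (still with flat priors on muP and the sigmas) would look something like this sketch:

data {
     int<lower=0> nM;         // # of measurements
     int<lower=0> nP;         // # of parts
     int<lower=0> nO;         // # of operators
     vector[nM] y;            // measurements
     int part[nM];            // part indicator
     int oper[nM];            // operator indicator
}
parameters {
     real muP;                   // part mean
     real<lower=0> sigmaP;       // part sigma
     real<lower=0> sigmaRep;     // repeatability sigma
     real<lower=0> sigmaO;       // operator sigma
     vector[nP] P_std;           // standardized part effects
     vector[nO] O_std;           // standardized operator effects
}
model {
     P_std ~ normal(0,1);        // implies sigmaP * P_std ~ normal(0, sigmaP)
     O_std ~ normal(0,1);        // implies sigmaO * O_std ~ normal(0, sigmaO)
     y ~ normal(muP + sigmaP * P_std[part] + sigmaO * O_std[oper], sigmaRep);
}

Whether this or the centered version samples better depends on how strongly the data and priors constrain P and O, as described above.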