Stochastically weighted regression

The following model seems to work and obtain reasonable ESS and Rhat. However, it often exhibits divergences. I’m not surprised about the divergences, because the regression weights are themselves stochastic variables. Is this a sensible way to parameterize the model? Can I tweak things so the divergences go away? Are the divergences a problem? If I remove clarity[px] * then the model runs easily, but then I don’t have stochastic weights.

functions {
  vector calcThresholds(real eta, data array [] real thr) {
    int K = num_elements(thr)+1;
    vector[K] theta;
    theta[1] = 1 - Phi(eta - thr[1]);
    for (k in 2:(K - 1)) {
      theta[k] = Phi(eta - thr[k - 1]) - Phi(eta - thr[k]);
    theta[K] = Phi(eta - thr[K - 1]);
    return theta;
data {
  int<lower=1> N;
  int<lower=1> Days;
  real propCorrect[N];
  int psiTolData[N];
transformed data {
  int K = 3;
  real thresholds[K-1];   // cutpoints for probit
  thresholds[1] = -0.43;  // qnorm(1/3)
  thresholds[2] =  0.43;  // qnorm(2/3)
parameters {
  real<lower=0,upper=1> clarity[N];
  real psiTol;
model {
  psiTol ~ std_normal();

  for (px in 1:N) {
    clarity[px] ~ beta((2+Days) * propCorrect[px], (2+Days) * (1-propCorrect[px]));

  vector[K] theta;
  theta = calcThresholds(psiTol, thresholds);
  for (px in 1:N) target += clarity[px] * categorical_lpmf(psiTolData[px] | theta);

Here’s some data.
stage1.csv (2.6 KB)
stage2.csv (1.2 KB)

N = 40
Days = 31
propCorrect = pmin(.99, apply(stage1, 2, mean))
psiTolData = unclass(stage2$psi.tol)

I would first suggest trying to figure out where they’re coming from. Typically they arise when parameters try to work themselves into high curvature regions, which often arise from hard constraints. Or when the model’s generative process doesn’t match the data well. The only constraints here are the (0, 1)-constrained weights.

We have the bet distribution you’re using built-in, so you don’t need to do the arithmetic. See the beta_proportion distribution here in our Functions Reference here: Continuous Distributions on [0, 1]

The Phi function is also problematic at its edges. Converting to ordered logit and using the built-in should be both more robust and more efficient. You want to make sure that the boundaries don’t collapse in ordered regressions, which can often be achieved by putting a zero-avoiding prior on the distances between the boundaries.

1 Like

Thanks for the suggestion! Divergences are gone.

You want to make sure that the boundaries don’t collapse in ordered regressions, which can often be achieved by putting a zero-avoiding prior on the distances between the boundaries.

That’s only a problem if you’re estimating thresholds. I only have 1 item so my thresholds are fixed.

See the beta_proportion distribution here

Nice! How do I understanding \kappa? For example, if I want to represent an unfair coin and flip it 10 times, do I set the proportion to heads/10 and \kappa=10?

I spoke too soon. When I run the model against a large number of simulated datasets (about 560) then I get divergences in about 25% of the runs. This is after converting the model to logit response.

:-). Sorry for the red herring.

beta_proportion(theta, kappa) =def=  beta(theta * kappa, (1 - theta) * kappa)

So it’s the overall concentration. This is how Andrew Gelman reparameterizes the first hierarchical model example in chapter 5 of Bayesian Data Analysis.. He then gives kappa a Pareto prior (though he doesn’t say it as such that’s what his reparameterization amounts to—I found this example of Andrew’s almost impossible to understand when I first saw it because it has everything from moment matching to implicit changes of variables with Jacobians).

logit is only going to help if they are numerical function divergences. With hierarchical regressions, we don’t know how to fully eliminate them. That is, there are data size/prior combos where both a centered and non-centered parameterization produce divergences. Usually if there’s very little data, non-centered is OK and if there’s a ton of data, centered is OK, but in between remains problematic when there’s just a little bit of data.

1 Like

I’m not even doing hierarchical regressions. My model is literally as simple as the code above. I’m estimating a latent score for a single item with fixed thresholds. I’m doing this twice and calculating the difference between latent scores. Is there some sampler setting that I can try tweaking? With the stochastic weights, maybe a smaller stepsize would help?

You can increase adapt_delta beyond its default of 0.8 to something like 0.95. That usually cuts the step size roughly in half and can eliminate a few divergences.

I’m not sure why that model’s throwing divergences, but my guess would be that it’s trying to push the clarity values to the edges of parameter space, so they’d have values near 0 or 1 when constrained.

How close are those propCorrect to 1 or 0? Are they constrained to fall in that interval? If so, I’d add that declaration to the data block.

Otherwise, I can suggest the usual thing of adding print statements to try to incrementally catch computations. Then you can often trap where things go off the rails. But the error message for the divergence should tell you which line of the program is causing the problem.

1 Like

This idea cut the number of divergent fits to 11% (from 25%). Seems like a big improvement.

Given that propCorrect is fixed, why would it matter whether it is constrained to [0,1] or not?

Oh?! How do I see that? I’m using RStan. Do I need to use cmdstan to get that level of detailed output?

For adapt_delta=0.975, divergent fits are reduced to 6%. Is there any issue with setting adapt_delta to 0.99 or similar? Almost all the failing models have just 1-2 divergent transitions so it seems close to cured.