# What does it mean if inference is only sometimes working?

Consider \vec{v_x}=m\vec{X}+b and \vec{v_y}=m\vec{Y}+b. Then let r=\mathcal{N}(r_{mean},\sigma), where r_{mean}=v_x/v_y. With little algebra, we can see that
r_{mean} = \frac{\vec{X}}{\vec{Y}+b/m} + \frac{1}{m/b\ \vec{Y} +1}. \tag{1}
So we see that we can infer b/m. The stan code is

data {
int N;
vector[N] X;
vector[N] Y;
vector[N] r;   // input ratios
}
parameters {
real b_m;      // b/m value
real<lower=0> sigma;
}
model {
vector[N] ratio;

for (i in 1:N) {
ratio[i] = X[i] / (Y[i] + b_m) + 1 / (Y[i]/b_m + 1);
}

r ~ normal(ratio,sigma);
}


And the generative code is:

N = 20
m = 1.5
b = 3.2
sigma = 0.1
X = np.random.uniform(1,10,size=N)
Y = np.random.uniform(1,10,size=N)
vx = (m*X+b)
vy = (m*Y+b)
r = vx/vy
r = np.random.normal(r,sigma)   # input to Stan


Questions:

1. Model successfully infers b_m and sigma only sometimes. For instance, once out of 5 times of running, I will get convergence with rhat of 1, with the rest having rhat very large.
2. Frequency of successful runs described above seems to decrease with increasing N.

I can’t understand why I observe these behaviors. The model isn’t that difficult, and the parameters are well defined. Moreover, I’m adding noise after division so I don’t run into the problem of ratio distributions. Lastly, the fact that it runs correctly sometimes means the essence is there, I just don’t know what’s causing the seemingly random behavior of working and not working. Any insight would be appreciated.

I think that your parameters need priors else they may never converge even if your model was spot on. So far as I can tell your model is identifiable otherwise … The non-linearity may be throwing HMC off; could you please try adding priors and report back what happens?

2 Likes

So I tried the following priors on b_m:

b_m ~ normal(2.0,0.5);       --> sometimes works
b_m ~ cauchy(2.0,0.5);       --> sometimes works
real<lower=0> b_m;           --> works all the time


Above, I have chosen 2 because my generative model is using m=1.5 and b=3.2, so we expect b/m\approx2.1. Additionally, I know for sure that b/m>0 due to underlying reasons, so that’s good! I guess now I’m interested to understand why a uniform distribution prior is better than a tightly fitted normal-like distribution around the correct value. Maybe the latter is worse because of how tight the prior is?

Update: So I noticed that for larger sample sizes, that is larger N, I run into the issue of convergence sometimes. Does that mean I need a stronger prior the more data there is? It seems counter-intuitive.

I do not completely understand what is going on and I am just a novice user but I did a few Stan runs and I think I have some hints about what is happening.

With X and Y spanning [1, 10] and b_m = 2 (using that so I can to the math in my head) then the value of ratio in the model
ratio[i] = X[i] / (Y[i] + b_m) + 1 / (Y[i]/b_m + 1);

will be in the range [3/12, 12/3]. If the fitting process should try a negative b_m value, say -2, then the situation is a little more complicated. When X and Y are well above 2, say in the range [3, 10], then ratio will span [1/8, 8/1]. Those values are certainly not the same as when b_m = 2 but they are not wildly different. When Y is near 2, the value of ratio blows up because Y + b_m goes to zero. In that case, the only way to increase the calculated log probability is to increase sigma so that those few wild values have some probability density. If N is large, the calculated log probability will be a very strong function of b_m, when it is negative, preferring values as far from any Y value as possible. The lp__ value will be very peaky and a chain will tend to stay near where it starts.

I think another factor that plays into the observed behavior is that Stan randomly initializes parameters in the range [-2, 2] in the unconstrained space unless explicit initial values are set. With multiple chains and large N, it is fairly likely that at least one chain will start in the bad negative b_m range and get stuck. I do not think that the prior specification modifies the initialization behavior but I may be wrong about that. If I am right, setting a prior on b_m will not help as long as b_m values below -1 are supported because a chain can still start in the region where it can get stuck due to Y values where Y + b_m goes to zero.

Here is some edited output from a run with your original model and N = 200. You can see that chain 3 finds b_m near -2 and it has a very small sd. It has an lp__ MUCH worse than the other chains but it is stuck. Also, sigma is very large for that chain because for some points the calculated ratio values are miles away from the actual r values.

, , chains = chain:1

stats
parameter        mean          sd
b_m     2.2119536 0.059135560
sigma   0.1014174 0.005317003
lp__  356.2567261 1.074719669

, , chains = chain:2

stats
parameter        mean          sd
b_m     2.2106468 0.056079842
sigma   0.1014685 0.005187229
lp__  356.3442797 1.073179447

, , chains = chain:3

stats
parameter        mean          sd
b_m     -2.000805 0.000829513
sigma   36.134660 1.901796925
lp__  -813.771634 0.963752625

, , chains = chain:4

stats
parameter        mean          sd
b_m     2.2097303 0.055092151
sigma   0.1009864 0.005210844
lp__  356.3395068 0.990030658


Here are results from another run with your normal(2.0, 0.5) prior placed on b_m. Now two chains get stuck at negative values. Chain 1 below is stuck in a place very close to chain 3 in the previous result.

, , chains = chain:1

stats
parameter        mean           sd
b_m     -2.000752 0.0008670722
sigma   36.086717 1.7148653649
lp__  -845.746086 1.0490657236

, , chains = chain:2

stats
parameter        mean          sd
b_m     -1.816427 0.002699981
sigma   11.106288 0.537259635
lp__  -607.233117 0.865268460

, , chains = chain:3

stats
parameter        mean          sd
b_m     2.2053075 0.053667885
sigma   0.1011639 0.005219636
lp__  356.2724433 1.004095468

, , chains = chain:4

stats
parameter       mean          sd
b_m     2.209487 0.056054877
sigma   0.101199 0.005196984
lp__  356.240830 1.026173929


The solution, as you have observed, is to not allow negative b_m values or at least keep them above -1 so Y + b_m will not be zero. It would still be a good idea to have priors on all parameters.

I hope none of that is too far off base.

This is interesting. I learned something new, thanks!

So I just looked at the traces I get from running the code with prior real<lower=0> b_m, and consistently all my traces are above zero as enforced.

Another interesting, and baffling point, is that today the code is running perfectly fine every time for all values of N. I didn’t do anything different since the posting of this code, and I don’t know why! A one time fluke yesterday perhaps?

I saw the fit failing to converge 1 out of 5 times with N = 20 and most of the time with N = 200, so I do not think the failures you saw are a fluke.

Interesting problem. I had to make some plots, but I think this isn’t true (even though Eq. 1 in your first post can be solved for \frac{b}{m} if everything else is assumed to be a scalar).

For the reasons @FJCC pointed out, the likelihood will blow up if b_m is equal to any -Y[i].

This will create a multimodality.

This is easier to see if you don’t estimate sigma and plot the density (this is looking at the log of the negative log density to keep the y-scales from going crazy; x-axis is the value of the parameter): library(tidyverse)
library(shinystan)
library(rstan)

N = 20
m = 1.5
b = 3.2
sigma = 0.1
X = runif(N, 1, 10)
Y = runif(N, 1, 10)
vx = (m * X + b)
vy = (m * Y + b)
rmean = vx / vy
r = rnorm(N, rmean, sigma)

fit = stan("identify.stan", data = list(N = 20,
X = X,
Y = Y,
r = r), cores = 4, seed = 153)

# Check out where different chains end up
launch_shinystan(fit)

# Plot the log density. There should be spikes around values of -Y
xs = seq(-4, 4, length = 1000)
logp = sapply(xs, function(x) { log_prob(fit, x) })
plot(xs, log(-logp), type = 'l')

data {
int N;
vector[N] X;
vector[N] Y;
vector[N] r;   // input ratios
}
parameters {
real b_m;      // b/m value
}
model {
vector[N] ratio;

for (i in 1:N) {
ratio[i] = X[i] / (Y[i] + b_m) + 1 / (Y[i]/b_m + 1);
}

r ~ normal(ratio, 0.1);
}

3 Likes

This is cool; but multi-modality shouldn’t in itself be a problem for HMC right? Is it jaggedness or discontinuity that is the problem?

HMC isn’t good multimodalities (not that anything is good in an automatic way). It’s hard for it to crawl over the energy barriers. And those barriers go to infinity, so unless the integrator accidentally steps over one, it’s not gonna explore the modes, and that plot is log of log, so there’s probly not gonna be any accidentally hopping over barriers.

2 Likes

I agree with this, but all Y is positive, and <lower=0> prior is ensuring that no b\_m < 0 is ever tested. So the likelihood shouldn’t be diverging in this specific example. Am I missing something in your explanation?

Your explanation makes 100% sense though in the case of divergences I observe when I have no prior, or a prior that does not strictly ignore all negative numbers.

It wasn’t in the original model (which is what I was responding to). I guess it’s working as expected now? Or still getting chains that fail sometimes?

Yeah, all chains converge just fine right now. Thank you!

1 Like