Hi there!
TL;DR I am using MLE to optimize the parameters of a Negative-binomial-2-log generalized linear model. It seems the optimizer is not able to optimize the dispersion parameter \Phi. The result basically always yields my initial guess of \Phi. Is there any reparametrization or similar strategy that may help?
Long version
I am trying to set up a model similar to Facebook’s Prophet. Apart from small changes, I have mostly adapted their Stan code. The only major change is that I allow different distribution models to be used. Binomial and Poisson work out pretty well, but now I am stuck with the negative binomial. Boiled down to the important parts, my Stan code is the following:
parameters {
  real<lower = 0> phi_raw;
}
transformed parameters {
  // inverse of phi_raw, scaled by the external estimate phi_est
  real<lower = 0> phi = phi_est / phi_raw;
}
model {
  phi_raw ~ std_normal();
  y ~ neg_binomial_2_log_glm(
    X_sa,                        // regressor matrix
    trend .* (1 + X_sm * beta),  // offset on the log scale
    beta,
    phi                          // dispersion
  );
}
Brief summary of the parts I left out: X_{sa} and X_{sm} are regressor matrices; trend is a vector of the same length as y that depends on two scalar model parameters with normal priors and one vector model parameter with a double-exponential (Laplace) prior. \beta is another vector model parameter for scaling the regressors. All of this is equivalent to the Prophet code.
You can see I am trying to use the inverse of the overdispersion parameter \Phi, scaled by an estimate \Phi_{est} (which is within an error margin of ~10% of the true value). I am using cmdstanpy’s optimize method to run MLE in order to fit the model to my data. All parameters except \Phi are being properly optimized. For \Phi, the optimizer basically yields my initial guess. I have tried the following things:
- Not initializing \Phi at all, or initializing it at values way below or way above my estimate
- Playing around with the iter and tol_obj parameters of the optimize method
- Trying different parametrizations besides the scaled inverse: (1) a scaled parameter proportional to \Phi and centered around \Phi_{est} (for the scaling I tried both weakly and strongly informed priors), (2) \Phi itself, (3) \log(\Phi)
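For context, these are the optimizer knobs in question expressed as cmdstanpy keyword arguments (a sketch; the file name, data, and values are illustrative placeholders, not my actual settings):

```python
# Illustrative settings for CmdStan's LBFGS optimizer via cmdstanpy,
# mirroring the things tried above: explicit inits, more iterations,
# and a tighter objective-change tolerance.
optimize_kwargs = {
    "inits": {"phi_raw": 1.0},  # or omit to let CmdStan initialize randomly
    "iter": 10_000,             # iteration limit
    "tol_obj": 1e-12,           # convergence tolerance on the objective change
}

# Usage (placeholder names, not run here):
# from cmdstanpy import CmdStanModel
# fit = CmdStanModel(stan_file="model.stan").optimize(data=stan_data, **optimize_kwargs)
```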
None of these approaches had a measurable effect. My current impression is that whatever objective function is used, it is mostly indifferent to \Phi. However, the other parameters do depend on the given \Phi. Roughly put: if the overdispersion is strong, the other regressors are suppressed; if there is little or no overdispersion, the other regressors need to explain the structure of the data and therefore have larger amplitudes.
Is this insensitivity a known issue? Is there a workaround?
I very much appreciate your help. Thanks!
Benjamin
You have phi_est / phi_raw, so small values of phi_raw are going to lead to huge values of phi. This can lead to very high variance in such random variables (e.g., a Cauchy is just a ratio of two normals). You might want to try to control this estimate better.
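To illustrate the point about the ratio parametrization, here is a quick simulation sketch (not part of the thread’s actual code; phi_est = 10 is an arbitrary stand-in for the external estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_est = 10.0  # arbitrary placeholder for the external estimate

# Under the std_normal() prior with the <lower = 0> constraint, phi_raw
# effectively has a half-normal distribution; phi = phi_est / phi_raw.
phi_raw = np.abs(rng.standard_normal(100_000))
phi = phi_est / phi_raw

# The ratio is heavy-tailed: the upper quantiles dwarf the median.
print(np.median(phi), np.percentile(phi, 99))
```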
First, if you fix phi to a large value to imply basically no overdispersion, do you get the same answer as with a Poisson?
Second, I would suggest removing the neg-binomial-2-log-glm function and making sure you can reproduce this problem with a direct implementation. You do realize this version is expecting the mean on the log scale? It’s not inconceivable there’s a bug in our GLM function, though we do try to test very carefully.
Third, what estimates do you get for phi? And I just want to verify that you meant this winds up having an estimate near its initialization no matter what the initialization is. How much data do you have? Does this happen if you simulate data from the model and fit it?
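Simulating from the model is straightforward outside Stan as well; a sketch in Python with made-up mu and phi, using numpy’s (n, p) negative-binomial parametrization (which maps to NB2 via n = phi, p = phi / (phi + mu)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values, just to show the NB2 parametrization:
mu, phi = 50.0, 5.0

# numpy uses the (n, p) parametrization; NB2(mu, phi) corresponds to
# n = phi, p = phi / (phi + mu), giving Var = mu + mu^2 / phi.
y = rng.negative_binomial(phi, phi / (phi + mu), size=200_000)

print(y.mean(), y.var())  # should be near mu and mu + mu**2 / phi
```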
I have no explanation for why phi wouldn’t change from initialization, as it tends to have a big effect.
Warning: with the double-exponential prior, Stan won’t find true zeroes the way a specialized lasso solver would.
Hi Bob,
thanks for your reply
> You have phi_est / phi_raw, so small values of phi_raw are going to lead to huge values of phi. This can lead to very high variance in such random variables
I understand this can be an issue, but I don’t think it’s the root cause of my problem, since I also tried a parameter directly proportional to \Phi.
> First, if you fix phi to a large value to imply basically no overdispersion, do you get the same answer as with a Poisson?
I tried that. I didn’t check whether it’s exactly the same as Poisson, but it was reasonably close.
> Second, I would suggest removing the neg-binomial-2-log-glm function and making sure you can reproduce this problem with a direct implementation. You do realize this version is expecting the mean on the log scale? It’s not inconceivable there’s a bug in our GLM function, though we do try to test very carefully.
I will try the direct implementation and give feedback. But yes, I used the correct link function; after all, the Poisson GLM, which works fine for me, is also on the log scale.
> Third, what estimates do you get for phi? And I just want to verify that you meant this winds up having an estimate near its initialization no matter what the initialization is. How much data do you have? Does this happen if you simulate data from the model and fit it?
The data are actually from a binomial model, and I manually inflated the variance (using (y - y_{mean}) \cdot factor + y_{mean}). In this data set there are 400 data points. By estimate you mean the result of the optimizer? Yes, it is indeed always close to the initial value, no matter what the initial value is. Once I found a setup where the estimate was actually different from the initial value, but I forgot what the settings were and couldn’t reproduce it. I believe it was with a very narrow prior around the real value of \Phi. Anyway, even though \Phi changed in this case, it still didn’t converge close to the real value, but somewhere in no man’s land.
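For reference, the inflation trick does the following (a minimal sketch with made-up binomial counts; note that the inflated values are generally no longer integers and would need rounding before serving as negative-binomial observations):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(100, 0.3, size=400).astype(float)  # placeholder binomial data
factor = 2.0

# Shift-scale around the sample mean: the mean stays put,
# while the sample variance is multiplied by factor**2.
y_infl = (y - y.mean()) * factor + y.mean()

print(y.var(), y_infl.var())
```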
> Second, I would suggest removing the neg-binomial-2-log-glm function and making sure you can reproduce this problem with a direct implementation.
I did that with the following code:
parameters {
  real<lower = 0> phi_raw;
}
transformed parameters {
  real<lower = 0> phi = phi_raw * phi_est;
}
model {
  phi_raw ~ std_normal();
  for (n in 1:num_elements(y)) {
    // same linear predictor as the GLM call, on the log scale
    real eta_n = trend[n] * (1 + X_sm[n] * beta) + X_sa[n] * beta;
    y[n] ~ neg_binomial_2_log(eta_n, phi);
  }
}
That indeed helped, i.e., \Phi finally started moving away from the initial guess. However, it moved only a bit, and the result was very dependent on the initial guess. So I started inspecting the CmdStan output files and saw that the optimization stopped because the relative gradient dropped below its tolerance. Reducing tol_rel_grad from 1e7 to 1e6 helped a great deal. The results now seem reasonable and no longer depend much on the initial guess. That’s great already!
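For anyone else hitting this: the tolerance in question can be passed straight through cmdstanpy as tol_rel_grad (a sketch with placeholder file and data names):

```python
# CmdStan's LBFGS optimizer stops when the relative gradient falls below
# tol_rel_grad; its default is 1e7, and tightening it to 1e6 is what
# helped in my case.
optimize_kwargs = {"algorithm": "lbfgs", "tol_rel_grad": 1e6}

# Usage (placeholders, not run here):
# from cmdstanpy import CmdStanModel
# fit = CmdStanModel(stan_file="model.stan").optimize(data=stan_data, **optimize_kwargs)
```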
I tried the same change of the relative tolerance for the GLM, but unfortunately it didn’t help at all. I wish I could use the GLM for the extra bit of efficiency.