Failure to start because of initial values

I’m trying to fit a relatively simple hierarchical logistic regression model, but I keep running into

Rejecting initial value:
Error evaluating the log probability at the initial value. [repeated 100 times]

Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
[1] "error occurred during calling the sampler; sampling not done"

I’ve tried every trick I can think of: I am scaling the polynomial terms to lie between 0 and 1, and I’ve used starting values from a logistic regression fit. As far as I can tell, my code matches the examples given in the Stan documentation. This model also fits in SAS using a standard frequentist mixed model approach. The data size is large – about 100 clusters and 60K observations.

Any obvious stupidities?

stanmodelcode <- "
data {
  int<lower=0> n;
  int<lower=0> N;
  int<lower=0> P;
  int<lower=0> Q;
  int<lower=0> cumidcount[n+1];
  int<lower=0> cumidcountl[n+1];
  int<lower=1,upper=n> jj[N];
  matrix[N,P] x;
  matrix[N,Q] z;
  vector[Q] zero;
  vector[P] mbeta;
  matrix[P,P] vbeta;
  int<lower=0,upper=1> y[N];
}
parameters {
  real alpha;
  vector[P] beta;
  row_vector[Q] b;
  corr_matrix[Q] Omega;   // prior correlation
  vector<lower=0>[Q] tau; // prior scale
}
model {
  tau ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(5);
  beta ~ multi_normal(mbeta, vbeta);
  for (i in 1:n)
    b ~ multi_normal(zero, quad_form_diag(Omega, tau));
  for (j in 1:N)
    y[j] ~ bernoulli_logit(alpha + x[j] * beta + z[j] * b[jj[j]]);
}
"

Why are you repeating the distribution for b a total of n times? Are you sure that vbeta is positive definite?

I put it into the loop because I thought I would need to define it as a
prior on each of the cluster units. Taking out the loop statement had no
impact on the problem, nor did increasing the eta hyperparameter of the
prior on the correlation matrix to extreme values (i.e., 1000).

I think vbeta is positive definite, because I can get an ML estimator for
it using Laplace approximations.

Mike

Michael Elliott

Professor of Biostatistics
Dept. of Biostatistics
M4124 SPH II
1415 Washington Heights
Ann Arbor, MI 48109
(734) 647-5160

Research Professor of Survey Methodology
Survey Methodology Program
Rm. 4068, 426 Thompson Street
Ann Arbor, MI 48109
(734) 647-5563

The following are a couple of things that might help you debug. They are not all directly related to your specific problem, but they might help you find it or narrow it down.

  • You have some data that are not used in the model, like cumidcount. Maybe comment these out as long as you don’t need them.
  • You could validate your vbeta by using cov_matrix[P] vbeta. Stan might not accept vbeta as a covariance matrix if you precalculated vbeta in another program and there is some rounding along the way.
  • You do not have a prior on alpha, which means you have an improper uniform prior here. I wonder whether that leads to overflow problems in the bernoulli_logit.
  • Depending on whether the z matrix has a column of 1s or not, you either have no group intercepts but group slopes (no 1s) or an identifiability issue between alpha and the group intercepts (1s present).
  • It probably would be instructive to see whether the Stan program gets initialized when you change the last line to y[j] ~ bernoulli_logit(alpha + x[j] * beta) and comment out all the lines related to the group effects (a sketch follows this list).
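For the last item, a minimal sketch of that stripped-down model with the group effects removed:

parameters {
  real alpha;
  vector[P] beta;
}
model {
  beta ~ multi_normal(mbeta, vbeta);
  // alpha keeps its improper uniform prior, as in the original post
  for (j in 1:N)
    y[j] ~ bernoulli_logit(alpha + x[j] * beta);
}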

Thank you for your comments; unfortunately they didn’t help.

I’m not sure why having a random intercept and a fixed intercept should
cause a problem with identifiability. In any event, the model works fine
in Stan when I drop the random effects altogether (as you suggest) or
include a random intercept only. Even one random slope, however, leads to
this failure to start.

Again, this model (random intercept together with a random slope) works
fine in a frequentist setting (e.g., using GLIMMIX in SAS). I moved to
Stan because I wanted to try adding additional random effects, and I
thought I would have more stability using stronger proper priors, but it
appears that I can’t even get the base model to work.

Mike


For simple hierarchical models like this one, I would start with the stan_glmer function in the rstanarm package or the brm function in the brms package. Both estimate models like this (if you specify family = binomial(link = "logit")) using the syntax of the lme4 package, whose glmer function usually produces results similar to SAS. If stan_glmer and/or brm won’t initialize, then it is likely that your data are very unfavorable in some way. In the likely event that they yield good results, you have a benchmark to compare against if and when you decide to write your own Stan program for this model. If you do eventually want to write your own Stan program, I would recommend starting from the output of make_stancode in the brms package.
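For example, a minimal sketch of those calls, with a hypothetical data frame dat, predictors x1 and x2, and grouping variable cluster:

library(rstanarm)
# random intercept and random slope for x1, varying by cluster
fit_arm <- stan_glmer(y ~ x1 + x2 + (1 + x1 | cluster),
                      data = dat, family = binomial(link = "logit"))

library(brms)
fit_brm <- brm(y ~ x1 + x2 + (1 + x1 | cluster),
               data = dat, family = bernoulli(link = "logit"))

# inspect the Stan program that brms generates, as a starting point
make_stancode(y ~ x1 + x2 + (1 + x1 | cluster),
              data = dat, family = bernoulli(link = "logit"))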


Then you probably wanted to say

for (n in 1:N)
  b[n] ~ multi_normal(...);

rather than

for (n in 1:N)
  b ~ multi_normal(...);

The latter applies a prior to each b[n] a total of N times, because each use of b ~ multi_normal(...) applies the prior to each member of b.

But what you really want to do is have a look at the manual and use a Cholesky factor for correlation matrices. It’s much more arithmetically stable. And you want to get rid of the loop and just use the vectorized form; again, much more efficient, as we only have to factor the covariance matrix (or its Cholesky factor) once.
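For instance, a minimal sketch of the vectorized Cholesky form (assuming the random effects are declared as an array of vectors, one per cluster):

parameters {
  vector[Q] b[n];                  // one vector of varying effects per cluster
  cholesky_factor_corr[Q] L_Omega; // Cholesky factor of the correlation matrix
  vector<lower=0>[Q] tau;          // scales
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);
  // one vectorized statement; the scaled Cholesky factor is built once
  b ~ multi_normal_cholesky(rep_vector(0, Q), diag_pre_multiply(tau, L_Omega));
}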

When initial values are rejected, it’s because the log density doesn’t evaluate to a finite value. That can come about for any number of reasons:

  • because supplied initial values violate one of the constraints (e.g., Omega isn’t positive definite or tau isn’t positive), or because valid parameters (such as those initialized in (-2, 2) on the unconstrained scale) trigger exceptions elsewhere
  • because arithmetic underflows or overflows and you get a divide-by-zero or the like somewhere
  • because an index is out of bounds
  • because you add negative infinity or a NaN value to the target log density (directly or indirectly through sampling statements)

If you haven’t updated to Stan 2.16, I would highly recommend it; I believe it now prints out the source of the error during warmup. This is our fault for not letting you know in that message exactly where things failed. That might not trigger, though, if the problem is that you add negative infinity to the target either directly or indirectly.

One possible cause of exceptions is arithmetic underflow, overflow, asymmetry, etc.; that’s why I was recommending the Cholesky parameterization (it’s also much more efficient). The other thing that can happen is that indexing gets messed up, for instance if jj[j] isn’t defined or b[jj[j]] isn’t defined. For example, you have b defined to be size Q, but then constrain the elements of jj to lie between 1 and n rather than 1 and Q as I’d have expected.

It is still failing. I dropped the loop some time ago, and have now tried
the Cholesky decomposition. Here is my code (updated based on various
suggestions):

stanmodelcode <- "
data {
  int<lower=0> n;
  int<lower=0> N;
  int<lower=0> P;
  int<lower=0> Q;
  int<lower=1,upper=n> jj[N];
  matrix[N,P] x;
  matrix[N,Q] z;
  vector[Q] zero;
  vector[P] mbeta;
  cov_matrix[P] vbeta;
  int malpha;
  int<lower=0> valpha;
  int<lower=0,upper=1> y[N];
}
parameters {
  real alpha;
  vector[P] beta;
  row_vector[Q] b;
  cholesky_factor_corr[Q] lOmega; // prior correlation
  vector<lower=0>[Q] tau;         // prior scale
}
model {
  tau ~ cauchy(0, 2.5);
  lOmega ~ lkj_corr_cholesky(Q);
  alpha ~ normal(malpha, valpha);
  beta ~ multi_normal(mbeta, vbeta);
  b ~ multi_normal(zero, quad_form_diag(lOmega * lOmega, tau));
  for (j in 1:N)
    y[j] ~ bernoulli_logit(alpha + x[j] * beta + z[j] * b[jj[j]]);
}
"

The jj index seems correct: it is a vector of size N whose jth element
gives the cluster in which the jth observation is nested. malpha and
valpha are 0 and 3, and mbeta and vbeta are a vector of 0s and a diagonal
matrix of 3s, respectively. Since the latter are fixed, I didn’t describe
them in tau and Cholesky form, though I could if you think that could
possibly matter. I’ve tried different priors for alpha, beta, and tau,
including some extremely informative and extremely uninformative ones,
with no change to the problem.

It does appear that the problem is in the formation of the correlation
matrix, since a model with Q independent priors for the components of b
seems to be working fine.

Finally, I am using start values for alpha, beta, and tau that correspond
to the ML estimates for this model obtained using a Laplace approximation.

I just downloaded Stan, but it does appear that I have a 2.15 version.
I’ll update to the new version as soon as I can (when other Stan stuff finishes).

Mike


You can use

b ~ multi_normal_cholesky(zero, lOmega)

If this solves your problem, it’s probably because something went wrong in constructing the correlation matrix. I suspect you need lOmega * lOmega’, and multi_normal_cholesky will do that for you internally.

Did you mean to use the Q in the following?

lOmega ~ lkj_corr_cholesky(Q)

If Q is a large number, this implies a pretty narrow prior around the identity matrix for the correlation matrix lOmega * lOmega’. The following is a more reasonable default prior:

lOmega ~ lkj_corr_cholesky(2)

I am still uneasy about the indices. You have to make sure that n <= Q, otherwise the loop will give you trouble. The following declaration would check that this is the case:

int<lower=n> Q;

But then it is not clear to me what happens with the elements b[n+1:Q] that are (I think) not constrained by the data, or why you need those.

Sorry. That first line should be

b ~ multi_normal_cholesky(zero, diag_pre_multiply(tau, lOmega))

I tried these fixes and they also failed, but I think we are getting to
the core of my confusion.

I am using b to denote the random effects. Thus b_i is a vector of Q=6
random effects, and b is a matrix of size n x Q, where n=952 is the number
of clusters.

Exactly how should I specify this in Stan? I’m not allowed to specify it
as an n x Q matrix, which would seem natural. The documentation and
examples seemed to indicate that I should specify it as a row vector of
length Q, which is what I have been doing.

Mike


The manual chapter on regression has an example of how to code a multivariate prior for vectors of random effects. There’s both a centered and non-centered version, with the non-centered one being better when n is small. At n = 952, you’ll probably be best off using the centered version, which is the one explained in the regression chapter. The basic idea is to scale a Cholesky factor and use a vectorized form of multi_normal_cholesky.

You should also be able to use print statements to try to figure out where your log density goes off the rails (a sketch follows). Also, I’d fix the constraints on the indexes to match what they have to be, in case that’s the source of the problem.
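For instance, a couple of print statements dropped into the model block (a sketch using the names from the code above):

model {
  tau ~ cauchy(0, 2.5);
  print("tau = ", tau);                      // parameter values at initialization
  print("lp after tau prior = ", target());  // running log density so far
  // ... the last value printed before "Rejecting initial value"
  // brackets the statement that produced the non-finite value
}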

I think I’ve solved the problem.

The example in the documentation (I assume you mean 8.13, pp. 144-150)
assumes that I want a random regression coefficient for each of the
predictors in the model. I do not want that; rather, I want a mixed
model, where some of the coefficients are fixed and some are random.
There are no “group” predictors; rather, the random coefficients are
associated with predictors that vary by observation.

I have tried to create this here by following traditional statistical
terminology for mixed models: an x matrix (N x P) containing the
predictors for all P=13 coefficients, and a z matrix (N x Q) consisting of
the subset of Q=6 predictors whose coefficients I wanted to be random. In
the standard mixed model literature, that looks something like

(1) logit(P(y_i = 1)) = x_i' beta + z_i' b_i, where b_i ~ MVN(0, Sigma)

However, it appears that Stan won’t take this model; rather, it requires
that the P-Q fixed effects be defined in their own matrix, say W, and that
the random effects then have a non-zero mean gamma:

(2) logit(P(y_i = 1)) = w_i' alpha + z_i' b_i, where b_i ~ MVN(gamma, Sigma)

Models (1) and (2) are equivalent, with beta = (gamma', alpha')', if the
first Q covariates in X are the covariates that also appear in Z. But
Stan doesn’t see it that way.

Thanks for your help with this.

Mike


From the line:

y[j] ~ bernoulli_logit(alpha + x[j] * beta + z[j] * b[jj[j]]);

x[j] * beta is a scalar (because x[j] is a row_vector and beta is a vector), but z[j] * b[jj[j]] is a row_vector (z[j] is a row_vector and b[jj[j]] is a scalar), so the whole argument ends up a row_vector instead of the scalar that I think you want?

Is this possibly the source of the problem?

edit: Oops… I got distracted reading the rest of the thread and forgot that you said the problem was solved. Either way, it still seems weird to me. Leaving this here.

Also, here is the working model code:

stanmodelcode <- "
data {
  int<lower=1> N;
  int<lower=1> n;
  int<lower=1> P;
  int<lower=1> Q;
  int<lower=1,upper=n> jj[N];
  matrix[N,P] newx;
  matrix[N,Q] z;
  int<lower=0,upper=1> y[N];
}
parameters {
  cholesky_factor_corr[Q] lOmega; // prior correlation
  vector<lower=0>[Q] tau;         // prior scale
  vector[Q] gamma;
  vector[Q] beta[n];
  vector[P] alpha;
}
model {
  tau ~ cauchy(0, 2.5);
  lOmega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 5);
  to_vector(alpha) ~ normal(0, 5);
  beta ~ multi_normal_cholesky(gamma, diag_pre_multiply(tau, lOmega));
  for (i in 1:N)
    y[i] ~ bernoulli_logit(newx[i] * alpha + z[i] * beta[jj[i]]);
}
"

Mike

All Stan “sees” is the log density. As long as you can formulate well-formed arguments to the multivariate normal densities, you can formulate that log density however you want. You may be trying to literally provide the argument 0 rather than a vector of zero values to multi_normal, but that’s just a guess in the absence of the code you thought should work.
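For instance, a minimal sketch of model (1), reusing the declarations from earlier in the thread and passing an explicit zero mean vector:

parameters {
  vector[P] beta;                 // all P fixed effects, including those whose
                                  // covariates also appear in z
  vector[Q] b[n];                 // zero-mean random effects, one per cluster
  cholesky_factor_corr[Q] lOmega;
  vector<lower=0>[Q] tau;
}
model {
  tau ~ cauchy(0, 2.5);
  lOmega ~ lkj_corr_cholesky(2);
  beta ~ normal(0, 5);
  b ~ multi_normal_cholesky(rep_vector(0, Q), diag_pre_multiply(tau, lOmega));
  for (j in 1:N)
    y[j] ~ bernoulli_logit(x[j] * beta + z[j] * b[jj[j]]);
}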

You cannot write logit(x) = y in Stan; it’s not part of Stan’s syntax. That was a BUGS/JAGS thing, and I found it confusing, as it implies the inverse applied to the right-hand side, so I never included it in Stan.

The to_vector on alpha is redundant, and there’s no prior on beta. You could vectorize the likelihood as

y ~ bernoulli_logit(newx * alpha + rows_dot_product(z, beta[jj]));

if you declared beta as

matrix[n, Q] beta;

It’d be clearer and more efficient, and it’d also make it easier to put a prior on to_vector(beta).
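Putting those pieces together, a hedged sketch of the reworked blocks (assuming the data block from the working model above; rows_dot_product forms the per-observation product z[i] * beta[jj[i]]):

parameters {
  matrix[n, Q] beta;   // one row of varying effects per cluster
  vector[P] alpha;     // fixed effects
}
model {
  alpha ~ normal(0, 5);
  // a simple independent prior for illustration; a multivariate prior
  // would instead be applied to the rows of beta
  to_vector(beta) ~ normal(0, 5);
  // beta[jj] selects each observation's cluster row
  y ~ bernoulli_logit(newx * alpha + rows_dot_product(z, beta[jj]));
}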