Infinite location parameter warning and correlated parameters


#1

I’m new to both Bayesian Inference and Stan, I’m experimenting a bit and I ran into an error message I don’t understand.

After a successful linear regression I’m now trying to fit a power function: y = a*x^b + Normal(0,\sigma)

I have a very strong reason to believe that b should be around 3. A good value for a would be 4.
\sigma in my case is expected to be around 2000, but I read in the manual that it is better when parameters are of the same scale so I normalize so the expected variance is around 1.

My Stan model is:

  data {
      int<lower=1> N;
      vector[N] x;
      vector[N] y;
      real nrm_factor;
  }
  transformed data {
      vector[N] y_nrm;
      y_nrm <- y/nrm_factor;
  }
  parameters {
      real<lower=0> a;
      real<lower=0> b;
      real<lower=0> sigma;
  }
  model {
      vector[N] y_pred;
      for (i in 1:N)
          y_pred[i] <- (a*x[i]^b)/nrm_factor;
      //priors
      a ~ normal(4, 1);
      b ~ normal(3, 0.1);
      sigma ~ gamma(2, 2);
      //likelihood
      y_nrm ~ normal(y_pred, sigma);
  }

When I run this (with MatlabStan’s defaults) I get around 20 of the following warnings:

output-4.csv: Exception: normal_lpdf: Location parameter[1] is 1.#INF, but must be finite!  (in 'H:/Projects/stan/powercurvetest/anon_model.stan' at line 31)
output-4.csv: If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
output-4.csv: but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

However the model does provide answers that I expect based on the data:

  4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
  Warmup took (13, 13, 14, 12) seconds, 53 seconds total
  Sampling took (12, 14, 14, 13) seconds, 53 seconds total
                   Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s    R_hat
  lp__             -846 3.6e-002 1.3e+000   -849   -846   -845   1266       24 1.0e+000
  accept_stat__    0.93 1.7e-003 1.1e-001   0.70   0.97   1.00   4000       75 1.0e+000
  stepsize__      0.048 5.2e-005 3.3e-003  0.045  0.048  0.053   4000       75 1.$e+000
  treedepth__       4.8 2.4e-002 1.5e+000    2.0    5.0    6.0   3597       68 1.0e+000
  n_leapfrog__       54 6.3e-001 3.6e+001    3.0     59    127   3305       62 1.0e+000
  divergent__      0.00 0.0e+000 0.0e+000   0.00   0.00   0.00   4000       75-1.$e+000
  energy__          848 4.9e-002 1.8e+000    845    847    851   1312       25 1.0e+000
  a                  12 2.2e-002 7.8e-001     10     12     13   1277       24 1.0e+000
  b                 2.6 6.9e-004 2.4e-002    2.5    2.6    2.6   1269       24 1.0e+000
  sigma             1.1 5.5e-004 2.1e-002    1.0    1.1    1.1   1514       28 1.0e+000
  Samples were drawn using hmc with nuts.

My question is; can you provide me some insight where the infinity comes from? The parameters a and b are of course strongly correlated. Is this what is causing the problem with the infinite location parameter?

Also, can you give some pointers on how to resolve the correlation between a and b?
Some googling pointed me into the direction of transforming this into a linear problem: log(y) = log(a) + b*log(x) and then do a QR-decomposition on that linear problem. I’m interested in an explanation of that strategy, but also generally in strategies for uncorrelating parameters in similar parameter estimation problems.

Thanks in advance!


#2

Hi Torben,

usually these exceptions are not problematic. Note that if you sample b, the resulting values can become extremely large/small very quickly. It seems you have prior information which you already put into the model, so everything should be fine.

You can explicitly model the correlation of a and b (and maybe you should, because we’d expect them to be correlated, right?). Basically, you put a bi-variate normal prior on the two parameters. Then you can re-parametrize this prior to it’s non-centered form. Check out the chapter in the manual (28.6 a bit further down there is an example for the multivariate normal), it is really good.

The log-approach seems reasonable, too.

I tried so simulate some data in R that would match your description (additionally assuming x>0). (I know I shouldn’t write the model code in the R script, but it was easier this time.)

First the with the log specification. (Note that I simulate the data so that a log-normal is fine.)

library(tidyverse)
library(rstan)

N <- 500
x <- exp(rnorm(N, 0, 1))
a <- rnorm(N, 4, 1)
b <- rgamma(N, 100, 100/3)

standata <- list(
  N = N,
  y = a*x^(b)*exp(rnorm(N, 0, 3))
)

stancode <- "
data{
  int N;
  vector<lower=0>[N] x;
  vector<lower=0>[N] y;
}
transformed data{
  vector[N] log_x = log(x); 
}
parameters{
  vector[2] beta_tilde;
  real<lower=0> sigma;
  vector<lower=0>[2] sigma_b;
  cholesky_factor_corr[2] L;
}
transformed parameters{
  vector[2] beta;
  beta = [1.38, 3]' + sigma_b .* (L * beta_tilde);
}
model{

  beta_tilde ~ normal(0,1);
  sigma_b ~ student_t(7, 0, 1);
  L ~ lkj_corr_cholesky(2);
  sigma ~ student_t(3, 0, 1);

  y ~ lognormal(beta[1] + beta[2]*log_x, sigma);
}
"

s <- stan(model_code = stancode, data = standata, control = list(adapt_delta = 0.99))
s

This is the output.

Inference for Stan model: 5e39c7f688cc5860681611c3175d3a73.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta_tilde[1]    0.24    0.02 0.54   -0.84   -0.02    0.18    0.48    1.42  1202 1.00
beta_tilde[2]    0.11    0.02 0.65   -1.18   -0.22    0.08    0.41    1.59  1607 1.01
sigma            3.00    0.00 0.10    2.81    2.93    3.00    3.06    3.19  2146 1.00
sigma_b[1]       0.48    0.02 0.54    0.01    0.13    0.29    0.64    1.86  1097 1.00
sigma_b[2]       0.43    0.01 0.47    0.01    0.11    0.28    0.58    1.75  1890 1.00
L[1,1]           1.00    0.00 0.00    1.00    1.00    1.00    1.00    1.00  4000  NaN
L[1,2]           0.00    0.00 0.00    0.00    0.00    0.00    0.00    0.00  4000  NaN
L[2,1]           0.02    0.01 0.46   -0.83   -0.34    0.02    0.37    0.84  2408 1.00
L[2,2]           0.87    0.00 0.15    0.46    0.81    0.93    0.99    1.00  1443 1.01
beta[1]          1.46    0.00 0.12    1.24    1.38    1.44    1.53    1.71  2880 1.00
beta[2]          3.03    0.00 0.11    2.82    2.97    3.02    3.09    3.25  4000 1.00
lp__          -805.38    0.07 2.19 -810.57 -806.64 -805.07 -803.77 -802.09   936 1.01

Samples were drawn using NUTS(diag_e) at Thu Apr 12 20:26:33 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

So, correlation between beta[1] (that is \log(a)) and beta[2] (that is b) is essentially zero. Nice.

rstan::extract(s, pars = c("beta"), permuted = FALSE) %>% 
  reshape2::melt() %>% 
  filter(!str_detect(parameters, "tilde")) %>%
  spread(parameters, value) %>% 
  ggplot(aes(x=`beta[1]`, y=`beta[2]`)) + 
    geom_point()

betas_log

Now with slightly different data (error term changed) and the non-log approach.

standata2 <- list(
  N = N,
  y = a*x^(b) + (rnorm(N, 0, 2000))
)

stancode2 <- "
data{
  int N;
  vector[N] x;
  vector[N] y;
}
parameters{
  vector[2] beta_tilde;
  real<lower=0> sigma;
  vector<lower=0>[2] sigma_b;
  cholesky_factor_corr[2] L;
}
transformed parameters{
  vector[2] beta;
  beta = [4, 3]' + sigma_b .* (L * beta_tilde);
}
model{

  beta_tilde ~ normal(0,1);
  sigma_b ~ student_t(7, 0, 1);
  L ~ lkj_corr_cholesky(2);
  sigma ~ student_t(3, 0, 1);

  for(n in 1:N)
    y[n] ~ normal(beta[1]*x[n]^beta[2], sigma);
}
"

s2 <- stan(model_code = stancode2, data = standata2, control = list(adapt_delta = 0.99))
s2

With output:

Inference for Stan model: 9b7576b91cb7ef926e090a2450382b58.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                  mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
beta_tilde[1]     0.83    0.03  1.00    -1.16     0.14     0.84     1.51     2.68   885 1.00
beta_tilde[2]    -0.39    0.02  0.65    -1.71    -0.78    -0.35    -0.03     0.89  1056 1.01
sigma          2129.50    1.54 67.01  2003.89  2082.51  2127.76  2176.17  2264.45  1884 1.00
sigma_b[1]        1.49    0.03  1.38     0.06     0.51     1.12     2.06     5.08  1924 1.00
sigma_b[2]        0.54    0.02  0.49     0.04     0.20     0.38     0.73     1.87   926 1.00
L[1,1]            1.00    0.00  0.00     1.00     1.00     1.00     1.00     1.00  4000  NaN
L[1,2]            0.00    0.00  0.00     0.00     0.00     0.00     0.00     0.00  4000  NaN
L[2,1]           -0.10    0.01  0.42    -0.84    -0.44    -0.11     0.21     0.73  1463 1.00
L[2,2]            0.89    0.00  0.14     0.50     0.84     0.94     0.99     1.00  1218 1.00
beta[1]           5.83    0.08  2.85     3.29     4.04     4.79     6.52    13.66  1224 1.00
beta[2]           2.82    0.00  0.14     2.49     2.75     2.86     2.92     3.00  1007 1.00
Omega[1,1]        1.00    0.00  0.00     1.00     1.00     1.00     1.00     1.00  4000  NaN
Omega[1,2]       -0.10    0.01  0.42    -0.84    -0.44    -0.11     0.21     0.73  1463 1.00
Omega[2,1]       -0.10    0.01  0.42    -0.84    -0.44    -0.11     0.21     0.73  1463 1.00
Omega[2,2]        1.00    0.00  0.00     1.00     1.00     1.00     1.00     1.00  4000 1.00
lp__          -4108.86    0.08  2.19 -4114.19 -4110.05 -4108.47 -4107.28 -4105.80   691 1.01

Samples were drawn using NUTS(diag_e) at Thu Apr 12 20:26:11 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Note that beta[1] (that is a) and beta[2] (that is b) are correlated:

rstan::extract(s2, pars = c("beta"), permuted = FALSE) %>% 
  reshape2::melt() %>% 
  filter(!str_detect(parameters, "tilde")) %>%
  spread(parameters, value) %>% 
  ggplot(aes(x=`beta[1]`, y=`beta[2]`)) + 
    geom_point()

betas_non_log

But beta_tilde[1] and beta_tilde[2], the parameters that are actually sampled, are not:

rstan::extract(s2, pars = c("beta_tilde"), permuted = FALSE) %>% 
  reshape2::melt() %>%
  spread(parameters, value) %>% 
  ggplot(aes(x=`beta_tilde[1]`, y=`beta_tilde[2]`)) + 
    geom_point()

beta_tildes_non_log

You can recover the correlation matrix like this:

generated quantities{
 corr_matrix[2] Omega = L * L';
}

For the data that I simulated the correlation between a and b is around -0.1. But you can actually also read that off the non-zero off-diagonal of the Cholesky decomposed correlation matrix L

Hope this helps!
Max


#3

You can also make these plots easily with our bayesplot package:

draws <- as.matrix(stanfit, pars = "beta")
mcmc_scatter(draws) # if beta is two parameters, otherwise pick two 

That said, nothing wrong with doing the coding yourself like @Max_Mantei (in fact probably good to do it a few times to familiarize yourself with stanfit object structure).


#4

Good point! I’m just way too familiar with the extract>melt>tidyverse>ggplot approach.


#5

Max! Thank you very much for your elaborate and well written answer. This is extremely helpful.

There is so much information in your post that it is taking me a while to digest it all. I’ll try to go through it step by step.

You say that the 1#.INF warning could result from an exponent being sampled. This makes sense to me. However, your model with the log approach is still giving me warnings that the location parameter of the lognormal distribution is 1#.INF. That can’t be right can it?

In your examples you are drawing a and b from distributions. This confuses me a bit. In my mind they were fixed but unknown parameters, in which case the variation of a and b in the posterior reflects the uncertainty of their predicted value. By letting a and b actually be random I no longer understand what the posterior tells me about them. Any thoughts on this?

Explicitly modelling the correlation of a and b seems like an excellent idea. Thanks for your examples and the reference to the relevant section in the manual.

Am I right in saying that when the original data has a normally distributed error term the log approach will not work? In this case we would be taking the logarithm of a normal distribution centred around 0, which is problematic.

What I don’t quite understand though, are the large differences between the two models. In the first model log(a) and b are uncorrelated, but surely if a and b are correlated (as expected, and apparent in the results from the second model) then log(a) and b must be correlated? Also exp(log(a)) = 4 in the first model and a = 5.8 in the second model. Did you really only change the error term?

Yesterday I got slightly different results with your code rewritten to MatlabStan. I’m going to install RStan on my laptop now so that I can copy your code and try again. I’ll report back if I figure out the differences between these two models.

Thanks once again for your incredibly instructive/informative answer.


#6

Dear Torben,

apparently my answer seemed less quick and dirty than it actually was.

The Inf warning in the lognormal could maybe come from very small values in log_x times small sampled value in b (going to minus Inf)? I don’t know to be honest. But like I said, this usually is not problematic. The draw will simply be rejected and if this happens only a few times everything should be fine.

The draws of a and b come from a bivariate prior distribution, yes. (Technically I think this is not correct, since with HMC/NUTS we are only adding the priors to the log-likelihood, so that we’re not actually drawing from the priors). In your original model the priors were

a ~ normal(4, 1);
b ~ normal(3, 0.1);

with a and b being the parameters from the parameter block. Note that you could mathematically (and in Stan) re-write these to a = 4 + 1*Z1 and b = 3 + 0.1*Z2, where Z1 ~ normal(0,1) and Z2 ~ normal(0,1). Now Z1 and Z2 are in the parameters block in Stan and a and b would be assigned as transformed parameters. Mathematically these two approaches are the same, but in the latter Stan “draws” from two standard normals. This is called the non-centered parametrization and it is usually (not always!) much better for sampling.

This is exactly what’s going on in the correlated parameters model that I wrote down. I could have written something like

beta ~ multi_normal([4, 3]', Sigma);

and fix the diagonal elements of Sigma to 1^2 and 0.1^2, something like this:

Sigma[1,1] = 1^2; //or just 1...
Sigma[2,2] = 0.1^2; //or 0.0001...
Sigma[1,2] = rho*1*0.1; //or just rho*0.01...
Sigma[2,1] = rho*0.1*1; //or just rho*0.01...

and then sample real<lower=-1,upper=1> rho, which implies a uniform prior \rho \sim \text{Uniform}(-1,1). This would actually be closer to your original model, since not the priors for the marginal standard deviations for a and b are fixed to 1 and 0.1, instead in my approach I assigned a vague prior on both SDs (sigma_b ~ student_t(7, 0, 1);). However, when we apply the non-centered parametrization like we did in the case above, we can actually reformulate the multivariate normal prior in such a way that we draw from two independent standard normals. This is what’s happening in the model (along with a decomposition of Sigma in to a correlation matrix—or rather the Cholesky decomposition of it L—and a vector of SDs— in the model that is sigma_b, which I think I could have fixed to sigma_b = [4, 3] to match your model more closely).

If you think that the error is normally distributed, then you assume that the variance is fixed over all estimated expected values (homosecedasticity). The Log-normal assumes homoscedasticity on the model for \log(y), but the error term with respect to the original scale is heteroscedastic (check out the “Varaince” of the Log-normal on its wikipedia page). Usually (but not always) non-negative values in y come along with some form of heteroscedasticity—think of a count models, like the Poisson or Negative-binomial. Then you are in GLM land, and you could (if y>0) for example also use a Gamma distribution to fit your model. (Note that linear models are just a special case of GLMs.) The second model I wrote down was also a GLM with (homoskedastic!) Gaussian error.

The difference in the two models could be due to two reasons: 1) I did a bad job coding it up, or 2) I did not think about the simulated data enough… As I said, this was done quick and dirty and I basically included it to show the non-centered multivariate prior on a and b, so I would suggest not to spend too much time replicating this, but more time on your actual data.

Maybe you could also try something like y \sim \text{Normal}(exp(\log(a)+b*\log(x)), \sigma). This would be a Gaussian GLM with a log link-function, \log(\mu)=\log(a)+b*\log(x). Like in the Log-normal case, this only makes sense if y>0. But for low values of \hat\mu and a large value of \hat\sigma this model can imply \hat{y}<0

I hope that this was not confusing…


#7

Pretty much what bayesplot is doing internally!


#8

I can imagine this feeling quick and dirty to you Max, but for a forum post it exceeded my expectations :)
I feel like I’m starting to understand the reason and mechanism behind the re-parametrisations now.

As you recommended I focused on my own data and found that your models worked as expected. Both the log and linear model gave consistent results with reduced correlation in the beta_tilde parameters. I did observe that the posterior looked more like a butterfly or sandglass, rather than an equal multinormal distribution.

Unfortunately some more pressing projects have come up, and I currently have little time to investigate further. I hope to add some of my own results here later.