Double-Pareto Lognormal Distribution in Stan

Wow, this is all amazing, thank you for doing such a deep dive into this, Mike! That second model is what I was reaching for with the idea of a reparameterisation, but I just could not see how to make it work before.

I had to make some tweaks to the second model that you wrote in order for it to sample. Eventually I was able to run this one:

data {
  int N;
  vector[N] y;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> epsilon[N];
  real nu;
  real<lower=0> tau;
}
transformed parameters {
  real eps_over_alpha[N];
  for (n in 1:N) {
    eps_over_alpha[n] = epsilon[n] * inv(alpha);
  }
}
model {
  alpha ~ normal(0, 10);
  nu ~ normal(0, 10);
  tau ~ normal(0, 10);

  epsilon ~ exponential(1);
  for (n in 1:N) {
    y[n] ~ normal(nu + eps_over_alpha[n], tau);
  }
}

It samples well and fits well for larger values of alpha, but for the smaller values of alpha I tried, the inference isn't very good. I took a closer look, and I'm not sure what is going wrong here, but both this model and the first one with the improved log(1 - Phi(x)) function seem to always return inference for alpha centered around 8.5 no matter what the true alpha is. I'm attaching the code and the two models: the basic one with your updated function and the "betan" one with your cool reparameterisation. Do these fit well on your machine, or did I make an obvious silly mistake in implementing them? Alternatively, does anyone here see the problem? I feel we are so close to solving this now. Thank you all again for your work on this!

normalhalflaplace-betan-fit.stan (425 Bytes), normalhalflaplace-fit.stan (1.0 KB), single-pareto-lognormal-test.R (5.4 KB)
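For reference, here is a minimal R sketch of the data-generating process these models assume (my own reconstruction rather than the attached script; the parameter values and the stan_data name are purely illustrative):

# Simulate from the normal-half-Laplace model: y = nu + tau * z + epsilon / alpha,
# with z ~ Normal(0, 1) and epsilon ~ Exponential(1).
set.seed(1234)

N     <- 1000
alpha <- 1.5   # try both small and large values here
nu    <- 0
tau   <- 1

z       <- rnorm(N)
epsilon <- rexp(N, rate = 1)
y       <- nu + tau * z + epsilon / alpha

stan_data <- list(N = N, y = y)   # data list passed to the Stan fits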

Sorry, I was being sloppy. You have to use a vector type for the epsilon to get all of the vectorized algebra to work out.

data {
  int N;
  vector[N] y;
}

parameters {
  real<lower=0> alpha;
  vector<lower=0>[N] epsilon;
  real nu;
  real<lower=0> tau;
}

model {
  alpha ~ normal(0, 10);
  nu ~ normal(0, 10);
  tau  ~ normal(0, 10);
  epsilon ~ exponential(1);
  
  y ~ normal(nu + epsilon / alpha, tau);
}

generated quantities {
  vector[N] y_ppc;
  for (n in 1:N) {
    real e1 = exponential_rng(1);
    real z = normal_rng(0, 1);
    y_ppc[n] = nu + tau * z + e1 / alpha;
  }
}

I don't think one can say that the inference isn't good. Running both the integrated and expanded models yields the same fit, with posterior predictive checks that look lovely.

params = extract(fit)

# Ribbon colours for the plots below (assumed from the usual plotting setup;
# any dark-to-light palette works here)
c_light <- "#DCBCBC"; c_light_highlight <- "#C79999"
c_mid <- "#B97C7C"; c_mid_highlight <- "#A25050"
c_dark <- "#8F2727"

par(mfrow=c(1, 1))

B <- 25

min_y <- min(params$y_ppc)
max_y <- max(params$y_ppc)
breaks <- seq(min_y, max_y, (max_y - min_y) / B)

idx <- rep(1:B, each=2)
x_idx <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 1 else idx[b])
xs <- breaks[x_idx]

obs <- hist(data$y, breaks=breaks, plot=FALSE)$counts
pad_obs <- do.call(cbind, lapply(idx, function(n) obs[n]))

ppc <- sapply(1:4000, function(n) hist(params$y_ppc[n,], breaks=breaks, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:B, function(b) quantile(ppc[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))

plot(1, type="n", main="Posterior Predictive Distribution",
     xlim=c(head(breaks, 1), tail(breaks, 1)), xlab="y",
     ylim=c(0, max(c(obs, cred[9,]))), ylab="")

polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
        col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
        col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
        col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
        col = c_mid_highlight, border = NA)
lines(xs, pad_cred[5,], col=c_dark, lwd=2)

lines(xs, pad_obs, col="white", lty=1, lwd=2.5)
lines(xs, pad_obs, col="black", lty=1, lwd=2)

I think what you're seeing here is a fundamental nonidentifiability of the model. Data simulated from large alpha are able to exclude small alpha in the fit and avoid the alpha/nu ridge, but data simulated from small alpha are consistent with both small alpha and large alpha, which forces the fit into that nasty alpha/nu ridge. Look at all of those lovely divergences!

Because the non-identifiability isn't purely in the latent part of the model, there is no parameterization that will fix the problem. The only way forward is with informative priors that might be able to constrain the fit to a region where the model is better identified, as we've done with Gaussian processes for certain kernels.
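(As an illustration of what such a prior has to rule out, here is a quick prior predictive check in R. This is my own sketch; the lognormal(1, 0.5) alternative is purely hypothetical.)

# Prior predictive check: a diffuse half-normal(0, 10) prior on alpha puts mass
# near alpha = 0, where the epsilon / alpha term produces absurdly heavy tails.
set.seed(42)
S <- 5000

alpha_diffuse <- abs(rnorm(S, 0, 10))                 # half-normal(0, 10), as in the program
alpha_inform  <- rlnorm(S, meanlog = 1, sdlog = 0.5)  # hypothetical tighter alternative

prior_pred <- function(alpha) rnorm(S, 0, 1) + rexp(S, rate = 1) / alpha

quantile(prior_pred(alpha_diffuse), c(0.5, 0.9, 0.99, 0.999))
quantile(prior_pred(alpha_inform),  c(0.5, 0.9, 0.99, 0.999))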


Isn't 1 - Phi(x) = Phi(-x)? Then log(1 - Phi(x)) could be computed as normal_lcdf of -x. That should be stable.

No, it is not. All of the log CDFs are currently implemented as the literal log of the CDF functions and hence share the same numerical inaccuracy problems.
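(For the curious, this is the flavour of upper-tail expansion being discussed in the thread. A quick R sketch, using R's stable pnorm(..., lower.tail = FALSE, log.p = TRUE) as the reference:)

# log(1 - Phi(x)) for large x via the asymptotic series
#   1 - Phi(x) ~ phi(x) / x * (1 - 1/x^2 + 3/x^4 - ...),
# truncated here after the 3/x^4 term.
log1m_Phi_approx <- function(x) {
  dnorm(x, log = TRUE) - log(x) + log1p(-1 / x^2 + 3 / x^4)
}

x <- c(5, 10, 20, 38)
naive  <- log(1 - pnorm(x))                           # underflows to -Inf for large x
stable <- pnorm(x, lower.tail = FALSE, log.p = TRUE)  # stable reference
series <- log1m_Phi_approx(x)

cbind(x, naive, stable, series)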


Thank you again for such a detailed explanation, Mike!

My concern is that the original normal-halflaplace model I posted is much tighter around the true parameters when alpha is small than the new models with the better sampling behaviour are. Of course these new models work better when alpha is big, and the old model works badly when alpha is big, but the trouble with big alpha sort of made sense, as Reed and Jorgensen say they have identification problems when alpha is big (because the model shouldn't actually have the extra Laplace tail in that case). So this situation with the new model is confusing to me, because it seems like the new, improved model has trouble in exactly the case where Reed and Jorgensen don't have trouble.

Of course the original model (which had the Reed and Jorgensen type of trouble) also had much worse sampling behaviour, in terms of divergences, in all cases. But still, the posterior draws were much more closely clustered around the actual truth when alpha was small, and the results were appropriately responsive to changes in the generating alpha, so it's not just that the prior was directing them to the region near zero or anything. So I am somewhat reluctant to adopt the new model in all cases, especially because in the application I'm considering I suspect alpha is in fact small.

Sure, but this was entirely due to the model fit not capturing the actual posterior. The new Stan programs are the same mathematically; the only difference is the more numerically robust implementation. In other words, the results you saw in your initial Stan program weren't the result of the model, but rather the result of the model marred by the numerical issues.

The old model appears to fit small alpha better only because the numerical issues coincidentally removed the parameter configurations far from the ground truth. Just like a broken clock is correct twice a day, a broken model will appear to be correct when the truth is far away from pathological neighborhoods.

I can't comment on the Reed and Jorgensen results all that much, but one does have to be careful about the exact meaning of identification. If they're referring to classic identifiability (the likelihood converging to an increasingly small neighborhood around the truth as N increases), then that won't necessarily be relevant here, as the diffuse likelihood functions indicate that such convergence would be really, really slow.

It may be worth exploring the practical identifiability (the shape of the likelihood function for finite N) more thoroughly to validate these results or not. In my cursory analyses the posterior predictive checks appeared to show that the large alpha configurations predicted data entirely consistent with small alpha configurations, which would be consistent with weak identifiability. You can also compare the predictive distributions from different fits to see if there are any observable differences.
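(A minimal R sketch of such a comparison, assuming two hypothetical rstan fits, fit_small and fit_large, each with the y_ppc generated quantities from the program above:)

library(rstan)

# fit_small and fit_large are hypothetical fits to data simulated with small and
# large alpha, respectively, from the Stan program with y_ppc in generated quantities.
ppc_small <- as.vector(extract(fit_small)$y_ppc)
ppc_large <- as.vector(extract(fit_large)$y_ppc)

# Overlay the predictive quantile functions; near-identical curves indicate the
# two alpha regimes are hard to distinguish from the observable data.
probs <- seq(0.01, 0.99, by = 0.01)
plot(probs, quantile(ppc_small, probs), type = "l", lwd = 2,
     xlab = "probability", ylab = "predictive quantile")
lines(probs, quantile(ppc_large, probs), lwd = 2, lty = 2)
legend("topleft", legend = c("small alpha fit", "large alpha fit"),
       lty = c(1, 2), lwd = 2)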

I wouldn't be surprised if this model was indeed that weakly identified for finite N; trying to match tails is a notoriously hard problem.


I'm reviving this thread much later because the problem continues to be of applied interest, and I was able, with some minor modifications of results in this thread and one new sampler, to make some progress.

The tl;dr is that while the posterior distributions remain multimodal and highly curved, and posterior summaries do a poor job of recovering parameter values from simulations, the posteriors do seem to be accurately computed when sampling uses the Variational Parallel Tempering algorithm in Pigeons.jl, which can sample from Stan programs via BridgeStan.

For the implementation, see my GitHub repo: doubleparetolognormal.jl, which calls DPLN-fit.stan, fits the Normal-Skew Laplace model; singlaparetolognormalvariational.jl, which calls normalhalflaplace-fit.stan, fits the Half-Normal-Laplace model. Other files represent alternative methods or attempts that did not go so well (e.g., parallel tempering starting from the prior instead of a variational approximation, and the reparameterized approach that fits the noise variables; neither seems to sample well). The Stan code is a mix of the suggestions above. It uses the asymptotic expansion for log1m(Phi(p)) and some of the bug fixes that were mentioned but not given as code in the thread.

Diagnostics, including Rhat, ESS, and eyeballing the trace plots, suggest that the posterior is being traversed well despite its unpleasant shape. NUTS diagnostics are unfortunately not available with this sampler, which runs several adaptive MALA chains targeting tempered posteriors in parallel, with a swapping scheme. My understanding is that one could in principle use NUTS for the base samplers instead, but it is not done by default, perhaps due to issues with choosing initializations.

Anyway, here are some plots and tables for 2000 observations of a simulated DPLN (actually Normal-Skew Laplace) with α = 4.0, β = 2.0, ν = 1.0, τ = 4.0.



I'd welcome any feedback, as well as bug checks on the Stan code.

Hi @dchilders, is that fit good? Your simulation parameters and mean estimates are

True:           α = 4.0, β = 2.0, ν = 1.0, τ = 4.0
Mean estimates: α = 5.8, β = 9.1, ν = 0.5, τ = 4.0

Our Pyro friends have an implementation of what they call the "Soft Asymmetric Laplace". The data generating process is very similar to a normal-Laplace. You can see it in their docs at Distributions - Pyro documentation.

This is the data generating process for it

z ~ Normal(loc, scale * softness)
u ~ Exponential(1 / (scale * asymmetry))
v ~ Exponential(asymmetry / scale)

It seems to sample alright. I didn't go through to check whether you can reformulate the DPLN into this or not.

Here's the R code. The Laplace approximation also does a decent job on this simulation.

n <- 1000
Z <- rnorm(n, mean = 0, sd = 1)

loc <- 1 
scale <- 3 
asymmetry <- 2 
softness <- 1.5 

left_scale <- scale * asymmetry
right_scale <- scale / asymmetry
soft_scale <- scale * softness

u <- rexp(n, rate = 1)
v <- rexp(n, rate = 1)

Y <- loc + soft_scale * Z - u * left_scale + v * right_scale 

soft_asym_laplace_mod <- cmdstan_model("soft_asym_laplace.stan")

soft_asym_laplace_fit <- soft_asym_laplace_mod$sample(
  data = list(N = n,
              log_x = Y),
  parallel_chains = 4,
  init = 1
)
soft_asym_laplace_fit$summary(c("loc", "scale", "asymmetry", "soft"))

soft_asym_laplace_laplace <- soft_asym_laplace_mod$laplace(
  data = list(N = n,
              log_x = Y)
)
soft_asym_laplace_laplace$summary(c("loc", "scale", "asymmetry", "soft"))

And the Stan code. Note that std_normal_lcdf is more precise than the erfc function, so I used it here.

functions {
  real soft_asym_laplace_lpdf (vector x, real loc, real scale, real asymmetry, real soft) {
    int N = rows(x);
    vector[N] x_std = (x - loc) / scale;
    real L = asymmetry;
    real R = 1 / L;
    real S = soft;
    real SS = square(soft);
    real S2 = S * sqrt2();
    vector[N] Lx = L * x_std;
    vector[N] Rx = R * x_std;
    real half_ss = 0.5 * SS;
    real L_sq = square(L);
    real R_sq = square(R);
    real L_times_S2 = L * S2;
    real R_times_S2 = R * S2;
    
    real out = -N * (log(L + R) + log(scale) + log2());
    for (n in 1:N) {
      out += log_sum_exp( 
      (half_ss + Lx[n]) / L_sq + log2() + std_normal_lcdf(-sqrt2() * (SS + Lx[n]) / L_times_S2),
      (half_ss - Rx[n]) / R_sq + log2() + std_normal_lcdf(-sqrt2() * (SS - Rx[n]) / R_times_S2)
      );
    }
    return out;
  }
}
data {
  int N;
  vector[N] log_x;
}
parameters {
  real loc;
  real<lower=0> scale;
  real<lower=0> asymmetry;
  real<lower=0> soft;
}
model {
  // priors
  loc ~ normal(0, 2);
  scale  ~ normal(0, 10);
  asymmetry ~ normal(0, 10);
  soft  ~ normal(1, 2);

  log_x ~ soft_asym_laplace(loc, scale, asymmetry, soft);
}

Oh, to be clear, the mean statistics are absolutely terrible from the perspective of recovering the parameters.

With the sample size and parameters chosen, it appears that there is very little information to distinguish the tails, so I don't really expect the posterior to concentrate near the truth, and with multimodality there's no reason why the mean should be a particularly good point estimate. So my sense that it sampled fine, in terms of accurately characterizing the posterior, is based purely on the sampling diagnostics. If you look at the density plots, you'll see that the tallest modes for tau and nu are quite close, but with long-tailed secondary modes that pull the posterior mean away. Alpha and beta, the tail parameters, do worse, though they are highly dispersed. If that's right, the Laplace approximation should do much better as a point estimator, but it would be a bad description of posterior uncertainty.

That said, there absolutely could still be issues. I used Michael's expansions for the CDF tails because log1m(Phi) was giving me trouble. The soft asymmetric Laplace model you wrote down is identical to the double Pareto lognormal, up to reparameterization: you split into a scale, an asymmetry, and a normal-vs-Laplace weight rather than two tail exponents and a normal scale, which might be one thing that helps. You are also using tighter priors and simulating from another part of the parameter space, and you coded it independently, so either the use of std_normal_lcdf or just my having some bug could also be sources of discrepancy. If the reparameterization is what does it, then I'd be perfectly happy to just use that.

I have an experiment with a very large sample size, to see if it does concentrate in large samples, taking up my stone-age CPU at the moment, but I will run ablations with your code to test it out and report back, maybe tomorrow, with results.

Excellent! Looking forward to seeing the comparison.

On the asymptotic expansion of the normal CDF: the post by @betanalpha happened a few months before it was implemented in Stan. See Bugfix/issue #1284 numerical precision of normal lcdf by PhilClemson · Pull Request #1411 · stan-dev/math · GitHub.

However, this hasn't been translated over to normal_lccdf nor to the Phi() functions yet :( See:


OK, I didn't run all the experiments I meant to run, but the weekend is approaching so I'll report what I have, which is suggestive but not 100% conclusive.

A rough assessment is: your parameterization seems to help when the parameter values are benign, but not when the values are those that I used (taken from Rachael's simulations, which may or may not be reasonable values). Tight priors like yours improve sampling quality, but are highly concentrated on values far away from the simulation parameters, so the posterior means are confidently very far from the simulated truth. When the priors are loosened, the posteriors become highly curved and multimodal even in the new parameterization. It's a bit hard to isolate prior-shape effects from parameterization effects, though, because the transforms are nonlinear.

Variational parallel tempering improves a lot over parallel tempering starting from the prior in all cases: if I wanted to be confident in my results I would run it regardless of the parameterization. I would have liked to also run NUTS and Laplace approximations for comparison but didn't get to it. I ran your Stan code but not the Stan samplers, because I don't have cmdstanr installed and my (possibly outdated) rstan installation apparently runs it differently.

Experiments are in the GitHub repo in soft_asym_laplace.jl. It calls your code as soft_asym_laplace.stan, and a version with more dispersed priors and with the old parameterization in the generated quantities as soft_asym_laplacev2.stan.

A few tables and plots:

Sampling (Variational PT) with the parameter values you tried (n=1000 data points, loc = 1.0, scale = 3.0, asymmetry = 2.0, softness = 1.5)

Iterations        = 1:1:1024
Number of chains  = 2
Samples per chain = 1024
parameters        = loc, scale, asymmetry, soft
internals         = log_density

Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ess_per_sec 
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64       Missing 

         loc    1.1165    0.7419    0.0191   1488.5445   1354.6194    1.0032       missing
       scale    2.3418    0.9291    0.0236   1531.9079   1259.3160    1.0050       missing
   asymmetry    3.0882    1.5139    0.0416   1488.6721   1303.6528    1.0084       missing
        soft    2.2568    1.2134    0.0331   1539.6592   1184.9357    1.0086       missing

Parameter recovery here is okay but not great. Plots suggest there are still secondary modes, but they are quite muted.

To test whether the difference came from parameter values:
The parameterization conversion from α = 4, β = 2, ν = 1, τ = 4 is

loc = ν                   # this one is unchanged: 1.0
asymmetry = sqrt(α / β)   # √2 ≈ 1.4142135623730951
scale = asymmetry / α     # 0.3535533905932738
soft = τ / scale          # 11.31370849898476
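(As a quick sanity check on this mapping, not part of the original experiments, one can simulate from both parameterizations in R and compare quantiles; they should agree up to Monte Carlo error.)

set.seed(7)
n <- 2e5
alpha <- 4; beta <- 2; nu <- 1; tau <- 4

loc       <- nu
asymmetry <- sqrt(alpha / beta)
scale     <- asymmetry / alpha
soft      <- tau / scale

# Normal-Skew-Laplace: nu + tau * Z + E1 / alpha - E2 / beta
y1 <- nu + tau * rnorm(n) + rexp(n) / alpha - rexp(n) / beta

# Soft asymmetric Laplace, following the R data-generating code earlier in the thread
y2 <- loc + scale * soft * rnorm(n) - rexp(n) * scale * asymmetry +
      rexp(n) * scale / asymmetry

probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(dpln = quantile(y1, probs), sal = quantile(y2, probs)), 2)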

Trying variational PT with 2000 samples from this, the parameter estimates are much further off:

Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ess_per_sec 
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64       Missing 

         loc    0.2098    0.3953    0.0088   2044.9533   1633.8895    0.9998       missing
       scale    1.1455    0.2441    0.0063   1482.9371   1469.9517    1.0037       missing
   asymmetry    0.7319    0.1497    0.0033   2037.7326   1596.0299    1.0030       missing
        soft    3.3618    0.8951    0.0218   1492.3261   1422.4557    1.0018       missing

I thought maybe the bias could be because the priors are too tight, so I tried changing the priors:

loc ~ normal(0, 10);
scale  ~ normal(0, 10);
asymmetry ~ normal(0, 10);
soft  ~ normal(0, 10);

These priors led to multimodality, decreasing sampling quality even with variational PT:

Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ess_per_sec 
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64       Missing 

         loc    0.2430    0.5834    0.0148   1536.9162   1265.4302    1.0046       missing
       scale    0.7891    0.3396    0.0123    332.8498   1156.6664    1.0565       missing
   asymmetry    0.7387    0.4395    0.0121   1324.3414   1267.0997    1.0130       missing
        soft    6.0125    3.2720    0.2814    602.6881   1127.2040    1.0547       missing
       alpha    1.2225    1.4818    0.0436   1536.0009   1251.9407    1.0061       missing
        beta    2.8278    2.4350    0.1383    311.2141    911.9796    1.0516       missing
         tau    3.7885    0.1797    0.0049   1389.4848   1345.6215    1.0159       missing
   log_alpha   -0.0927    0.6301    0.0178   1536.0009   1251.9407    1.0061       missing
    log_beta    0.7652    0.7018    0.0474    311.2141    911.9796    1.0516       missing
     log_tau    1.3308    0.0480    0.0013   1389.4848   1345.6215    1.0159       missing

The pairs plot reveals that this weird univariate behavior is coming from high multivariate curvature. An excerpt:

At these values, the old parameterization (previous post) did better, though in all the other cases the new parameterization does seem like an improvement. Given the parameter-dependent geometry, I would be quite surprised if default NUTS did well uniformly here when even a specialized sampler for multimodal densities is struggling, so this seems like a case where people should try out multiple options.

I ran several more experiments, checked for bugs, and so on, but that's enough for now.
