Model with product of priors on a parameter does not converge as expected, new to Stan

I would like to get a posterior distribution for the shape k and scale lambda parameters of a two-parameter Weibull distribution, but my prior on k is not behaving as expected. I want to combine two prior beliefs about k by multiplying them together, but when I use the prior k~two_normals(2.25,0.2,2.0,0.2); the fit returns k=0.24, which is way off from the shape of 2.25 I used to generate the data. Why? Shouldn't this prior bias samples towards 2.1 or so?

I am new to Stan and to Bayesian statistics so any advice, comment and/or discussion is welcome.

functions{
    // the "two_normals" must end in lpdf
    real two_normals_lpdf(real x,real mu1,real sigma1, real mu2, real sigma2){
        real prob;
        prob = normal_lpdf(x|mu1,sigma1)*normal_lpdf(x|mu2,sigma2);
        return(prob);
    }
}
data {
    int<lower=0> N;
    real<lower=0> y[N];
}
parameters {
    real k;
    real lambda;
}
model {
    // shorthand for target+=two_normals_lpdf(k | 2.25,0.2,2.0,0.2)
    k~two_normals(2.25,0.2,2.0,0.2);
    lambda~normal(30,3);
    y~weibull(k,lambda);
}

Here is the data generation and model sampling

import numpy as np
import pystan

N_SAMPLES = 10000  # iterations per chain (matches iter=10000 in the output below)
N_OBSERVATIONS = 1000
w_shape = 2.25
w_scale = 30
observation = w_scale*np.random.weibull(w_shape,(N_OBSERVATIONS,))
observation[-10:]

data = {'N': len(observation), 'y':observation}

sm = pystan.StanModel(model_code=model)

fit = sm.sampling(data=data, iter=N_SAMPLES, chains=4, warmup=500, thin=1, seed=101,verbose=True)
print(fit)

Output…

4 chains, each with iter=10000; warmup=500; thin=1; 
post-warmup draws per chain=9500, total post-warmup draws=38000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
k        0.24  5.3e-5 9.5e-3   0.22   0.23   0.24   0.25   0.26  32394    1.0   <-- BAD
lambda  27.58    0.01   2.36  23.09  25.97  27.55  29.15  32.28  29979    1.0
lp__    -3681  7.7e-3    1.0  -3684  -3681  -3681  -3680  -3680  16836    1.0

I thought I was specifying something like the following hand-rolled Metropolis-Hastings sampler (adapted from MCMC/MCMC.ipynb at master · Joseph94m/MCMC · GitHub):

import numpy as np
import scipy.stats

transition_model = lambda x: np.random.normal(x,[0.2,5],(2,))

def log_prior(w):
    if(w[0]<=0 or w[1]<=0):
        return -np.inf  # zero prior probability outside the support (avoids the np.log(0) warning)
    else:
        # Two priors for k=w[0], one prior for lambda=w[1], all multiplied together (and log taken)
        return scipy.stats.norm.logpdf(w[0],2.25,.2)+scipy.stats.norm.logpdf(w[0],2,.2)+scipy.stats.norm.logpdf(w[1],30,3)
		
def log_lik(x,data):
    return np.sum(scipy.stats.weibull_min(c=x[0],scale=x[1],loc=0).logpdf(data))
	
def acceptance(x, x_new):
    if x_new > x:
        return True
    else:
        accept = np.random.uniform(0,1)
        return (accept < (np.exp(x_new-x)))
		
def metropolis_hastings(likelihood_computer,log_prior,transition_model,param_init,iterations,data,acceptance_rule):
    x = param_init
    accepted = []
    rejected =[]
    for i in range(iterations):
        x_new = transition_model(x)
        x_lik = likelihood_computer(x,data)
        x_new_lik = likelihood_computer(x_new,data)
        if(acceptance_rule(x_lik + log_prior(x),x_new_lik+log_prior(x_new))):
            x=x_new
            accepted.append(x_new)
        else:
            rejected.append(x_new)
    return np.array(accepted),np.array(rejected)
	
accepted,rejected = metropolis_hastings(log_lik,log_prior,transition_model,[0.1,0.1],5000,observation,acceptance)

The last 10 accepted values are

array([[ 2.23895553, 29.34343911],   <-- GOOD
       [ 2.26532006, 29.54858588],
       [ 2.1739458 , 29.28039943],
       [ 2.17658563, 29.65725183],
       [ 2.18846247, 29.37056121],
       [ 2.17330239, 30.10245334],
       [ 2.26786791, 30.83535897],
       [ 2.30913441, 30.71661784],
       [ 2.16745388, 29.93292969],
       [ 2.16873363, 29.98811978]])

Hello,

I am very new to Stan, so I might be wrong. But normal_lpdf returns the log of the normal density, so the product of two (independent) normal densities corresponds to the sum of their log densities, not the product.

Here is what I get if I replace prob = normal_lpdf(x|mu1,sigma1)*normal_lpdf(x|mu2,sigma2);
with: prob = normal_lpdf(x|mu1,sigma1) + normal_lpdf(x|mu2,sigma2);
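The identity being used here can be checked numerically: multiplying densities on the natural scale is the same as adding them on the log scale.

```python
import numpy as np
from scipy.stats import norm

x = 2.1  # any test point in the support

# log of the product of densities vs. sum of log densities
log_product = np.log(norm.pdf(x, 2.25, 0.2) * norm.pdf(x, 2.0, 0.2))
log_sum = norm.logpdf(x, 2.25, 0.2) + norm.logpdf(x, 2.0, 0.2)
print(np.isclose(log_product, log_sum))  # True
```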

4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
Warmup took (0.25, 0.27, 0.32, 0.26) seconds, 1.1 seconds total
Sampling took (0.22, 0.24, 0.21, 0.26) seconds, 0.94 seconds total

                 Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s    R_hat
lp__            -3872  2.2e-02     0.98  -3874  -3872  -3871   1918     2051     1.00
accept_stat__    0.91  1.7e-03     0.11   0.67   0.95    1.0   4434     4742  1.0e+00
stepsize__       0.86  5.5e-02    0.078   0.77   0.92   0.95    2.0      2.1  2.0e+13
treedepth__       2.0  2.0e-02     0.57    1.0    2.0    3.0    824      881  1.0e+00
n_leapfrog__      3.9  2.5e-01      1.9    1.0    3.0    7.0     56       60  1.0e+00
divergent__      0.00      nan     0.00   0.00   0.00   0.00    nan      nan      nan
energy__         3873  3.2e-02      1.4   3871   3873   3876   1800     1926  1.0e+00
k                 2.2  9.5e-04    0.051    2.1    2.2    2.3   2955     3161      1.0
lambda             29  8.1e-03     0.41     28     29     29   2645     2829      1.0

and diagnoses seem to be ok.

Edit: Now that I think about it, I am not sure what this prior on k means (the product of two Gaussians), so you might want to do a prior predictive check to make sure the prior is reasonable (I am afraid it may be a very informative prior): 26.5 Example of prior predictive checks | Stan User’s Guide


Note also that the product of two Gaussian PDFs is guaranteed to be proportional to another Gaussian PDF, so the two normals can always be replaced with a single normal. Here’s a link (not vouching for the site, I just googled it) for the mean and variance you can use:
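As a quick sketch of that equivalence, the precision-weighted formulas give the mean and standard deviation of the single normal proportional to the product; the numbers below use the two priors from the original post:

```python
import numpy as np

def product_of_normals(mu1, sigma1, mu2, sigma2):
    """Mean and sd of the normal proportional to N(mu1, sigma1) * N(mu2, sigma2)."""
    tau1, tau2 = 1/sigma1**2, 1/sigma2**2   # precisions
    mu = (mu1*tau1 + mu2*tau2) / (tau1 + tau2)
    sigma = np.sqrt(1 / (tau1 + tau2))
    return mu, sigma

mu, sigma = product_of_normals(2.25, 0.2, 2.0, 0.2)
print(mu, sigma)  # 2.125, ~0.141 -- an equivalent prior is k ~ normal(2.125, 0.141)
```

Note the shrunken standard deviation (0.2/sqrt(2)); that variance reduction is the issue discussed below.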


Thanks for pointing out the log pdf issue, I should have realized that!

As for my prior, the idea is to combine multiple priors in a simple way to reflect the elicitation and use of prior knowledge from multiple experts. If you know a more sophisticated way to do this I will gladly read about it!

Multiplying the PDFs does not combine multiple priors in a useful way. For example, if two experts give the exact same prior, then taking the product of the PDFs yields a new prior with smaller variance! The more experts you have, the tighter your prior will get, regardless of how uncertain any individual expert is. If the experts were exclusively leveraging independent private information to make their judgments, then this could be desirable. But it’s unlikely to be the real-world case. Adding the densities (on the identity scale, not the log scale) avoids this particular problem, but raises new problems. For example, if your experts give well separated priors, then the normalized sum of the PDFs will be strongly bimodal, but this seems like not what you want. Maybe @Ara_Winter has faced the challenge of combining multiple priors from different experts before?
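To illustrate the additive-pooling caveat, here is a small sketch (the two component priors below are made-up, well-separated normals, not anything from this thread): the normalized sum of the densities has a mode at each expert's center and a trough between them.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical, well-separated expert priors: N(-2, 0.5) and N(2, 0.5)
def linear_pool(x):
    """Equal-weight additive (linear) pool of the two expert densities."""
    return 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.5)

# The pooled density is bimodal: much higher at each expert's mean than in between
print(linear_pool(-2.0), linear_pool(0.0), linear_pool(2.0))
```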


Unless I am misreading things, multiplying densities (or adding their logs, as I should have done) is proportional to logarithmic pooling, which is a recognized method of pooling expert opinions. See here for an example description: https://arxiv.org/pdf/2202.11219.pdf

I do not understand why it matters whether the experts leverage independent information or not.
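For reference, logarithmic pooling in its general form raises each expert's density to a weight w_i with the weights summing to 1, then renormalizes; the unweighted product of PDFs is the special case of weights all equal to 1. For Gaussian components the pool is itself Gaussian, and with equal weights summing to 1 it keeps the product's mean without the variance shrinkage. A sketch (the equal weights here are illustrative):

```python
import numpy as np

def log_pool_normals(mus, sigmas, weights):
    """Logarithmic pool prod_i N(mu_i, sigma_i)^w_i, renormalized.
    For Gaussian components the pool is itself Gaussian."""
    mus, sigmas, weights = map(np.asarray, (mus, sigmas, weights))
    assert np.isclose(weights.sum(), 1.0)
    taus = weights / sigmas**2              # weighted precisions
    mu = (taus * mus).sum() / taus.sum()
    sigma = np.sqrt(1 / taus.sum())
    return mu, sigma

# Equal weights over the two priors from the original post:
mu, sigma = log_pool_normals([2.25, 2.0], [0.2, 0.2], [0.5, 0.5])
print(mu, sigma)  # 2.125, 0.2 -- same center as the plain product, but no shrinkage
```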


That approach is fancy!

Typically I do one of two things:

Multi-verse of models - each expert gets their own model with its own priors. This works if the priors are wildly different.

Single model - but the priors are wide enough to capture every expert’s prior. This can cause some issues in modeling if the priors end up being too wide.

For either approach I find it helpful for each expert (stakeholders in my case) to draw out the distribution of their prior.


Some things to check out. FYI these are in my wheelhouse but provide some guidance that crosses disciplines.

Bélisle AC, Asselin H, LeBlanc P, Gauthier S. Local knowledge in ecological modeling. Ecology and Society. 2018 Jun 1;23(2).

Knapp CN, Fernandez-Gimenez M, Kachergis E, Rudeen A. Using participatory workshops to integrate state-and-transition models created with local knowledge and ecological data. Rangeland Ecology & Management. 2011 Mar 1;64(2):158-70.


The problem with taking the product of nonindependent priors is fully general, but it is easiest to see in the event that each expert has access to the same public information and formulates the exact same prior. Suppose that all your experts have read the latest analysis which leads them to all formulate identical standard normal priors. They’re all in agreement, based on the same shared information, that the appropriate prior is a standard normal.

The multiplicative procedure does not reproduce that standard normal; instead it yields different priors for different numbers of experts. Again, no expert is bringing any new information; they are all just confirming that everybody is using the same correct prior based on shared public information.

df <- data.frame(x = seq(-3, 3, .01))

# the one-expert prior:
df$p1 <- dnorm(df$x)
plot(p1 ~ x, data = df, type = "l")

# the multi-expert prior:
dn <- function(x, n){dnorm(x)^n}
multiply_priors <- function(x, n) {
  dn(x, n)/integrate(function(z){dn(z, n)}, -Inf, Inf)$value
}

# 2 experts
df$p2 <- multiply_priors(df$x, 2)
plot(p2 ~ x, data = df, type = "l")

# 10 experts
df$p10 <- multiply_priors(df$x, 10)
plot(p10 ~ x, data = df, type = "l")

# 500 experts
df$p500 <- multiply_priors(df$x, 500)
plot(p500 ~ x, data = df, type = "l")

Here’s that 500 expert prior

If all 500 experts were leveraging independent info, then this would be correct (like if there were 500 previous analyses of different data and they each came up with a standard-normal posterior, and each expert knew of 1 previous analysis and no two experts knew of the same previous analysis). This would be our appropriate prior. But if there is just one previous analysis and every expert knows about it (the expert info is not independent), then this is wrong.


That’s right. But the “proportional” bit only works if you’re using equal weights. Also, if you are logarithmically pooling a set of Gaussians, the pool is also Gaussian. I can say more later.


Thanks for those. I was reading one of the papers they cite, Kuhnert, P. M., T. G. Martin, and S. P. Griffiths. 2010. A guide to eliciting and using expert knowledge in Bayesian ecological models. Ecology Letters 13(7):900-914. http://dx.doi.org/10.1111/j.1461-0248.2010.01477.x

Since we are on the topic of elicitation, I find little general information on how to convert expert opinions into prior distributions on parameters. How do you approach this in ecology in situations where you want to estimate a parameter that experts (especially non-scientific ones) are not familiar with? In that case you need to construct a prior indirectly from the expert responses. Is there a general approach, or is a bespoke method devised for each case?


I usually take their narrative and write it down as closely as possible.

A brief aside: my job often involves working with sovereign nations and underrepresented or exploited folks within the United States that have their own data sovereignty, or a well-founded and healthy mistrust of how their data will be used, so you gotta check to make sure you can use that kind of stuff. There is a whole other community-building aspect to elicitation that’s super key.

Once I have the narrative (goes into supplemental) and permission to use it, I can build out a prior and then show the expert and ask does this make sense. That’s probably the more general approach. I think I have some examples from a paper we are working on. I ended up with huge multi-generational, multi-family narrative centered around the system I work in. Let me get it cleaned up and maybe into a workflow? flow chart?

Other times we try and work towards a consensus or consensus minus 1 model to arrive at some shared narrative/prior.

And you really gotta keep an eye on what @jsocolar pointed out. That is so common. I usually try and suss out where their source is from. 'cause if it’s all the same place then you have another problem.


Another FYI here is an overview of a formal method for elicitation

Not that you really want to wargame priors ;) but it’s a thought provoking way to learn about priors and how they might shift.


For some additional discussion on eliciting domain expertise for the development of prior models see also Prior Modeling, especially Section 4.1.4 and Section 5.

As @jsocolar mentioned, naively multiplying density functions together does not yield sensible results even when the component density functions are consistent with each other, and certainly not when they are inconsistent (the product ends up concentrating in the overlap, resulting in a very informative density function). In some sense what we often want in situations like this is a linear mixture of density functions, which incorporates all of the neighborhoods of high probability from all of the component density functions and not just their intersection, although in higher dimensions mixture priors can still lead to counter-intuitive behavior.

Personally I prefer to elicit thresholds and then convert those into useful prior models, as discussed in the note linked above. See for example the first few paragraphs of Section 3 for some discussion as to why this can facilitate merging elicitation from multiple parties.
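As a sketch of the threshold-elicitation idea (the thresholds below are made up for illustration, not taken from the linked note): ask an expert for an interval they believe contains the parameter with, say, 98% probability, then solve for the normal prior matching those two quantiles.

```python
import numpy as np
from scipy.stats import norm

def normal_from_thresholds(lower, upper, tail_prob=0.01):
    """Normal prior with P(theta < lower) = P(theta > upper) = tail_prob."""
    z = norm.ppf(1 - tail_prob)        # z-score of the upper threshold
    mu = (lower + upper) / 2
    sigma = (upper - lower) / (2 * z)
    return mu, sigma

# Hypothetical elicited thresholds for the Weibull shape k:
mu, sigma = normal_from_thresholds(1.5, 3.0)
print(mu, sigma)
print(norm.cdf(1.5, mu, sigma))  # ~0.01, by construction
```

Thresholds from multiple experts can then be merged (e.g. by widening to cover everyone's interval) before converting to a single prior.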


FYI for folks gathering priors from community members, from Andrew Gelman’s blog:
https://statmodeling.stat.columbia.edu/2022/06/20/webinar-on-using-expert-information-in-bayesian-statistics/