# Bayesian likelihood - mu~N(data, sigma)?

Hi,
I’m trying to understand better the difference between likelihood and probability in Bayesian framework. As I’m weak in maths I’m looking for an intuitive approach.
McElreath in this video makes a distinction between the likelihood or prior probability.
In Stan an anova model is the same no matter if the likelihood looks like,

1. data~N(mu, sigma) i.e. p(data|mu,sigma) - looks like the frequents likelihood, right? Or,
2. mu~N(data, sigma) ie p(mu|data,sigma) - Now, what’s that? Is it still likelihood? (since it’s condition over data and parameter (sigma))

I’ve used this code to play with,

 set.seed(123)
ngroups <- 5  #number of populations
nsample <- 10  #number of reps in each
pop.means <- c(40, 45, 55, 40, 30)  #population mean length
sigma <- 3  #residual standard deviation
n <- ngroups * nsample  #total sample size
eps <- rnorm(n, 0, sigma)  #residuals
x <- gl(ngroups, nsample, n, lab = LETTERS[1:5])  #factor
means <- rep(pop.means, rep(nsample, ngroups))
X <- model.matrix(~x - 1)  #create a design matrix
y <- as.numeric(X %*% pop.means  eps)
data <- data.frame(y, x)

Xmat <- model.matrix(~x, data)
data.list <- with(data, list(y = y, X = Xmat, nX = ncol(Xmat), n = nrow(data)))

# Model with data~N(mu, sigma)
modelString_A = "
data {
int<lower=1> n;
int<lower=1> nX;
vector [n] y;
matrix [n,nX] X;
}
parameters {
vector[nX] beta;
real<lower=0> sigma;
}
transformed parameters {
vector[n] mu;

mu = X*beta;
}
model {
//Likelihood
y~normal(mu,sigma);

//Priors
beta ~ normal(0,1000);
sigma~cauchy(0,5);
}

"

fitA <- stan(data = data.list, model_code = modelString_A , chains = 4, cores = 4, seed = 1)

# Model with mu~N(data, sigma)
modelString_B = "
data {
int<lower=1> n;
int<lower=1> nX;
vector [n] y;
matrix [n,nX] X;
}
parameters {
vector[nX] beta;
real<lower=0> sigma;
}
transformed parameters {
vector[n] mu;

mu = X*beta;
}
model {
//Likelihood
mu~normal(y,sigma);

//Priors
beta ~ normal(0,1000);
sigma~cauchy(0,5);
}

"

fitB <- stan(data = data.list, model_code = modelString_B , chains = 4, cores = 4, seed = 1)



Hi.

I would recommend reading some statistics textbooks. “Pattern recognition and Machine learning” by Bishop is a good place to start in my opinion. To give you a brief rundown however, in a frequentist sense we can view Bayesian models as being some loss function which we want to minimize. The likelihood can be seen as the typical loss function without any regularization, whilst the prior densities enforces some regularization into the model, i.e. favoring certain settings of parameters more then others. In typically frequentist approaches however we are only interest in the expected value, thus no incorporation of any type of uncertainty in our parameters. Bayesian approaches on the other focuses on the error propagation by specifying our model through probability densities. That is instead of our parameters just taking on a single value, we can obtain a range of values which may be suitable for the model.

If we have some noisy data (\boldsymbol{x},\boldsymbol{y}) which we know follows the linear model f(x)=\boldsymbol{X}\boldsymbol{\beta} where we want to estimate the parameters \boldsymbol{\beta}. The frequentist approach would be to set up the objective function as:

J(\boldsymbol{\beta}) = (\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})

and solve for \boldsymbol{\beta} by just minimizing J(\boldsymbol{\beta})

In a Bayesian setting, instead of now having the loss function above, we define the likelihood (defined as the probability of observing the data conditioned on the parameters) P(\boldsymbol{y}|\boldsymbol{\beta}). In the case of l_{2}-norms (as above), the equivalent likelihood function would be a normal distribution. That is:

P(\boldsymbol{y}|\boldsymbol{\beta})=\mathcal{N}(\boldsymbol{y}|\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I})

Where \sigma is the standard deviation of the noise on y.

We can also include some regularization terms in our model. Typical regularization is that of ridge regression (l_2-norm penalization). That is we now penalize the model if the values of \boldsymbol{\beta} are large by including additional terms in the loss function:

J(\boldsymbol{\beta}) = (\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})+\lambda \boldsymbol{\beta}^T\boldsymbol{\beta}

The equivalent way to incorporate regularization in a Bayesian setting would be to define what is referred to as prior densities. The prior is our beliefs about the parameter values before seeing the data. In this specific case we would have normal priors on \boldsymbol{\beta} with mean of zero. That is:

P(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{\beta}|\boldsymbol{0},\sigma_{\beta}^2\boldsymbol{I})

By specifying both a prior and a likelihood we can obtain the posterior density, P(\boldsymbol{\beta}|\boldsymbol{y}). This is defined as the probability of our parameters after observing the data. We relate the likelihood and priors to the posterior distribution through the use of Bayes’ rule:

P(\boldsymbol{\beta}|\boldsymbol{y})=P(\boldsymbol{y}|\boldsymbol{\beta})P(\boldsymbol{\beta})/P(\boldsymbol{y})

Where P(y)=\int_{\boldsymbol{\beta}}{P(\boldsymbol{y}|\boldsymbol{\beta})P(\boldsymbol{\beta}) \space d\boldsymbol{\beta}} is known as the evidence term (a constant which makes sure that the posterior is a valid distribution by integrating to unity).

By evaluating the negative log posterior density we would obtain a similar form as that of the objective function specified. That is:

-ln(P(\boldsymbol{\beta}|\boldsymbol{y}))=1/2(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\sigma^2\boldsymbol{I})^{-1}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) + 1/2\boldsymbol{\beta}^T(\sigma_{\beta}^{2}\boldsymbol{I})^{-1}\boldsymbol{\beta} + constant

This would be the correct form of the likelihood function:

This is not the likelihood nor the posterior, nor the priors. Since you are using a normal distribution you would still get the same answers as the normal distribution is symmetric but this is statistically incorrect.

2 Likes