So we have a single-level hierarchical model with Gaussian priors: one global prior and p local ones. To be explicit:
p(\theta | x) \propto p(x | \theta) p(\theta)
In this case, for simplicity, we're assuming the variances are known (not something we'd really want in practice):
\beta_0 \sim \text{normal}(0, \sigma^2_0)
\beta_j \mid \beta_0, \sigma_j^2 \sim \text{normal}(\beta_0, \sigma_j^2)
y \mid \beta_j, X \sim \text{normal}(X \beta_j, \sigma^2)
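(In case it helps ground the notation, here's a minimal simulation sketch of that generative story, reading \beta_j collectively as the vector of p local coefficients; the sizes and scales below are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 5                                   # hypothetical sizes
sigma0, sigma_j, sigma = 2.0, 1.0, 0.5          # assumed-known scales

beta0 = rng.normal(0.0, sigma0)                 # global:  beta_0 ~ N(0, sigma_0^2)
beta = rng.normal(beta0, sigma_j, size=p)       # local:   beta_j ~ N(beta_0, sigma_j^2)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(0.0, sigma, size=n)   # data:    y ~ N(X beta, sigma^2 I)
```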
p(x|\theta)p(\theta) becomes:
p(y \mid \beta_j, X)\, p(\beta_j \mid \beta_0)\, p(\beta_0)
Taking the negative log and dropping normalization constants to get our objective function, we have:
\frac{\sum(y - X\beta_j)^2}{2\sigma^2} + \frac{(\beta_j - \beta_0)^2}{2\sigma_j^2} + \frac{\beta_0^2}{2\sigma_0^2}
We'd want to minimize this negative log-posterior (equivalently, maximize the log-posterior).
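(Written out as code, the thing to minimize would, I think, look like the sketch below; sigma_j can be a scalar or a length-p vector of per-coefficient prior scales.)

```python
import numpy as np

def neg_log_posterior(beta0, beta, X, y, sigma, sigma_j, sigma0):
    """Negative log-posterior up to additive constants; all variances assumed known."""
    resid = y - X @ beta
    return (resid @ resid / (2 * sigma**2)                  # likelihood term
            + np.sum((beta - beta0)**2 / (2 * sigma_j**2))  # local priors centred at beta_0
            + beta0**2 / (2 * sigma0**2))                   # global prior on beta_0
```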
And then I want to take a quick look at convexity with respect to each parameter. This is where I get fuzzy about how to set up the inequalities.
A few questions:
- I've done Gibbs sampler derivations a few times, and in that case it's easy to see how the conditional posterior mean ends up as a precision-weighted average of the global and local information. Here, it was appealing to just take the log, because the result looks so much like a standard regularized regression problem. What am I doing wrong?
- I'm given this inequality to check whether something is convex, which makes sense when we have some arbitrary function of a few parameters, but it's not so clear what to do when I'm looking at a Bayesian model with lots of parameters. Check that f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) for \lambda \in [0, 1] (using \lambda for the mixing weight so it doesn't clash with the model parameters \theta).
Ok, and, for simplicity ignoring the hierarchical prior in the objective function, the left-hand side would turn out to be… ok, we need x, y \in \operatorname{dom} f, so I'm guessing I hold everything else constant and only look at \beta… how do I unpack this inequality exactly? It's not so clear once the function gets more complex.
I’ll think a bit more about it.
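One mechanical way I can see to unpack it: stack everything into a single vector \theta = (\beta_0, \beta_1, \dots, \beta_p), so \operatorname{dom} f = \mathbb{R}^{p+1}, and check the inequality for pairs of points u, v and weights \lambda. A brute-force numerical version of that check (made-up data, small tolerance for floating point):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(theta, X, y, sigma=0.5, sigma_j=1.0, sigma0=2.0):
    """Objective as a function of the stacked parameters theta = (beta_0, beta_1, ..., beta_p)."""
    beta0, beta = theta[0], theta[1:]
    resid = y - X @ beta
    return (resid @ resid / (2 * sigma**2)
            + np.sum((beta - beta0)**2) / (2 * sigma_j**2)
            + beta0**2 / (2 * sigma0**2))

n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Convexity would mean f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) for all u, v, lam in [0, 1].
for _ in range(1000):
    u, v = rng.normal(size=p + 1), rng.normal(size=p + 1)
    lam = rng.uniform()
    lhs = f(lam * u + (1 - lam) * v, X, y)
    rhs = lam * f(u, X, y) + (1 - lam) * f(v, X, y)
    assert lhs <= rhs + 1e-9
```

Of course this can only fail to falsify convexity; it doesn't prove it, which is what the line-substitution idea below is about.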
Any recasting of this problem to make it clearer would be much appreciated. For example, one exercise just substituted an arbitrary line into a quadratic, and it was clear from "generalized" high-school algebra/calculus intuition that the result was convex (positive leading coefficient on the even-power term). That was very easy to see with some basic matrix algebra.
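Here is my best attempt at applying that trick to this objective (a sketch, reading the local-prior part as a sum over j and stacking \theta = (\beta_0, \beta_1, \dots, \beta_p)). Every term is a squared affine function of \theta divided by a positive constant:

f(\theta) = \frac{(y - X\beta)^\top (y - X\beta)}{2\sigma^2} + \sum_j \frac{(\beta_j - \beta_0)^2}{2\sigma_j^2} + \frac{\beta_0^2}{2\sigma_0^2} = \sum_k \frac{(a_k^\top \theta - c_k)^2}{2 s_k^2}

for suitable vectors a_k and constants c_k, s_k (one k per observation and per prior term). Substituting an arbitrary line \theta(t) = u + t v gives

g(t) = f(u + t v) = \sum_k \frac{\left(a_k^\top u - c_k + t\, a_k^\top v\right)^2}{2 s_k^2}

which is a quadratic in t with leading coefficient \sum_k (a_k^\top v)^2 / (2 s_k^2) \geq 0, so the restriction to every line is convex; if I have it right, that is exactly the defining inequality, and the MAP objective would be jointly convex in (\beta_0, \beta).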
What am I missing?
Edit:
The first term of the objective should be, excluding constants, (y - X\beta)^\top (y - X\beta).