Combining/summing hyperpriors in a hierarchical model, pathologies?

Hello,

I am currently working on a model to estimate clicks based on a combination of ads and audiences. The dependent variable is the number of clicks per ad and audience, while the independent variable is the associated cost per ad and audience. Given the large number of ads and audiences, and limited data points, I am concerned about data sparsity and would like to pool the estimates to shrink them toward the mean. Specifically, I want to have one hyperprior for each audience and one for each ad. These hyperpriors should then be combined in some way to form the actual prior for the ad-audience pair (at least that is my idea).

In the context I am modeling, the coefficient b_{ad, audience} should be strictly positive, and the exponent sat_{ad, audience} should be bounded between 0 and 1, which leaves several choices for how to combine the hyperpriors. Currently, I multiply each hyperprior by 0.5 and sum them (i.e., average them), though I could also multiply them, or potentially use a Dirichlet hyperprior to distribute weight across the individual priors. As for the likelihood, I have not yet settled on one. The output should be strictly positive and discrete unless I scale it, so I am considering a Poisson likelihood.

I am looking for feedback from more experienced Bayesian modelers about any potential issues with this approach. While I plan to check R-hats, divergences, and energy plots, I’m curious whether there are any structural features of this setup that might cause problems for an HMC sampler, or any inherent pathologies I might be overlooking. Any intuition would be greatly appreciated.

The model is as follows:

y_{ad, audience} = b_{ad, audience} \cdot \text{cost}_{ad, audience}^{sat_{ad, audience}}

Where:

  • y_{ad, audience} is the number of outcomes for a given ad and audience
  • \text{cost}_{ad, audience} is the associated cost for that ad and audience
  • b_{ad, audience} is a parameter for the relationship between the outcomes and cost
  • sat_{ad, audience} is the saturation parameter for each ad and audience

The priors I am using are:

\text{hyperprior}_{coef-ad}, \text{hyperprior}_{coef-audience} \sim \mathcal{N^+}(0, 1)
b_{ad, audience} \sim \mathcal{N^+}\left(\frac{\text{hyperprior}_{coef-ad} + \text{hyperprior}_{coef-audience}}{2}, \sigma \right)
\text{hyperprior}_{sat-ad}, \text{hyperprior}_{sat-audience} \sim \text{Gamma}(2, 1)
\text{hyperprior}_{sat-ad-beta}, \text{hyperprior}_{sat-audience-beta} \sim \text{Gamma}(2, 1)
sat_{ad, audience} \sim \text{Beta}\left( \frac{\text{hyperprior}_{sat-ad} + \text{hyperprior}_{sat-audience}}{2}, \frac{\text{hyperprior}_{sat-ad-beta} + \text{hyperprior}_{sat-audience-beta}}{2} \right)

where \mathcal{N^+} denotes a normal distribution truncated below at zero.

This is very standard. I think it’s helpful to talk about combining prior parameters (not combining prior distributions), so that it’s clear that you’re using a kind of regression for the prior. You’ll then find that parameterizing on the unconstrained scale, as in a generalized linear model (GLM), is very flexible. So rather than taking two positive parameters and averaging them for the beta, just add them and use a logistic regression. Then for a beta, it’s really helpful to reparameterize as a mean \theta = \alpha / (\alpha + \beta) \in (0, 1) and total concentration \kappa = \alpha + \beta \in (0, \infty), i.e., \text{betaMean}(\phi \mid \theta, \kappa) = \text{beta}(\phi \mid \theta \cdot \kappa, (1 - \theta) \cdot \kappa). With that reparameterization, the saturation prior looks like this:

\textit{sat}_{\textit{ad, aud}} \sim \text{betaMean}\left(\text{logit}^{-1}(\alpha_{i,j} + \beta_{j, k}), \exp(\gamma_{i, j} + \delta_{j, k})\right)
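As an aside, Stan ships this mean/concentration parameterization built in as beta_proportion, so the line above could be coded roughly as follows. This is just a sketch: I and J (the numbers of ads and audiences) and all the parameter names are illustrative, and I’ve collapsed the indexing to two indexes per effect.

parameters {
  vector[I] mu_ad;      // ad effects on the logit-mean scale
  vector[J] mu_aud;     // audience effects on the logit-mean scale
  vector[I] kappa_ad;   // ad effects on the log-concentration scale
  vector[J] kappa_aud;  // audience effects on the log-concentration scale
  array[I] vector<lower=0, upper=1>[J] sat;
}
model {
  for (i in 1:I) {
    // beta_proportion(theta, kappa) == beta(theta * kappa, (1 - theta) * kappa)
    sat[i] ~ beta_proportion(inv_logit(mu_ad[i] + mu_aud),
                             exp(kappa_ad[i] + kappa_aud));
  }
}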

The indexes don’t line up because I wasn’t entirely sure what your intent was with indexing using \text{hyperprior}_\textit{sat-ad-beta}—is that three indexes? In any case, once you have all these parameters, you can give them normal or multivariate normal priors. This lets you use the non-centered parameterization for partial pooling directly using an offset/multiplier specification, as

vector<offset=mu, multiplier=sigma>[N] alpha;  // parameters block
...
alpha ~ normal(mu, sigma);                     // model block
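
If it helps to see what that offset/multiplier declaration is doing, it’s equivalent to writing the non-centered parameterization out by hand, roughly like this (with mu and sigma assumed to be declared elsewhere):

parameters {
  vector[N] alpha_std;                       // standardized effects
}
transformed parameters {
  vector[N] alpha = mu + sigma * alpha_std;  // shift and rescale
}
model {
  alpha_std ~ std_normal();                  // implies alpha ~ normal(mu, sigma)
}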

I then see you wrote y = b \cdot c (modulo some subscripts and superscripts). This is not the usual way to specify a regression, because you haven’t added noise. The conventional notation is y_n = \alpha + \beta \cdot x_n + \epsilon_n with \epsilon_n \sim \text{normal}(0, \sigma). In sampling notation, y_n \sim \text{normal}(\alpha + \beta \cdot x_n, \sigma). You want to do all of your sampling with uncertainty or you won’t be able to fit (the system will be overconstrained, just like in a linear regression with more observations than dimensions). In your case, you might draw y from a Poisson distribution whose mean is given by b \cdot c. You will see this written more typically with additive parameters b and c on the log scale, something like this:

y_{i, j} \sim \text{Poisson}(\exp(\alpha_i + \beta_j)).

This is just a GLM with a log link (the inverse of the link function shows up in the sampling notation).
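Putting the pieces together for your clicks model, a minimal sketch might look like the following. All names here are illustrative, it uses additive ad and audience effects on unconstrained scales (per the logit-scale regression with normal priors above) rather than your averaged hyperpriors, and it assumes cost > 0 so the log is defined.

data {
  int<lower=0> N;                      // number of ad-audience observations
  int<lower=1> I;                      // number of ads
  int<lower=1> J;                      // number of audiences
  array[N] int<lower=1, upper=I> ad;   // ad index per observation
  array[N] int<lower=1, upper=J> aud;  // audience index per observation
  vector<lower=0>[N] cost;             // must be positive for log(cost)
  array[N] int<lower=0> clicks;
}
parameters {
  real mu_b;                           // population location, log-coefficient scale
  real<lower=0> sigma_b;               // population scale
  vector<offset=mu_b, multiplier=sigma_b>[I] b_ad;  // non-centered ad effects
  vector[J] b_aud;                     // audience effects, log scale
  vector[I] s_ad;                      // ad effects, logit-saturation scale
  vector[J] s_aud;                     // audience effects, logit-saturation scale
}
model {
  vector[N] sat = inv_logit(s_ad[ad] + s_aud[aud]);               // in (0, 1)
  vector[N] log_mean = b_ad[ad] + b_aud[aud] + sat .* log(cost);  // log(b * cost^sat)
  mu_b ~ normal(0, 1);
  sigma_b ~ exponential(1);
  b_ad ~ normal(mu_b, sigma_b);
  b_aud ~ normal(0, 1);
  s_ad ~ std_normal();
  s_aud ~ std_normal();
  clicks ~ poisson_log(log_mean);      // Poisson with log link
}

Note that poisson_log applies the log link internally (poisson_log(u) is poisson(exp(u))), so the whole linear predictor stays on the unconstrained scale.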

This is a lot, but we discuss all of these issues in the first chapter of the User’s Guide, which is on regression.

P.S. I would strongly recommend switching to more conventional notation following Bayesian Data Analysis, which is what Stan’s documentation tries to follow. There, unknown parameters are Greek letters; observed and modeled data, as well as covariates, are Roman letters. As usual, bound variables are italicized, as in most math. Indexes are single letters. The notation evolved as a compact way of expressing models, and we get so used to it that it’s hard to grasp a model’s structure when the notation is unconventional.