Maximum likelihood estimation for a random effects model for meta-analysis

Hi everyone, I am trying to run a simple Rubin random effects model for a meta-analysis of this sort:


data {
  int<lower=0> N;              // num observations
  vector[N] y;                 // outcome
  vector[N] se;                // vector of standard error estimates
}

parameters {
  vector[N] theta;             // per-study effects
  real beta;                   // population mean effect
  real<lower=0> sigma_theta;   // between-study standard deviation
}

model {
  // Likelihood or 1st hierarchical layer
  y ~ normal(theta, se);
   
  // 2nd hierarchical layer
  theta ~ normal(beta, sigma_theta);
  
  // not specifying priors for both beta and sigma_theta
}

My question is how best to estimate this model using maximum likelihood. My first approach would be not to specify priors for beta and sigma_theta, so that Stan uses improper flat priors. In theory, this should closely approximate ML estimation, am I correct?

Another approach I tried is to use the stan_glmer() function, but I get a very different result.

fit <- stan_glmer(
  y ~ 1 + (1 | obs),              # Intercept-only model with a random effect per observation
  data = dt,                      # Assuming `dt` is the dataset containing y and se
  weights = 1 / se^2,             # Use inverse-variance weighting
  prior = NULL,                   # No priors specified for a flat (improper) prior equivalent
  prior_intercept = NULL,         # Flat prior on the intercept
  prior_covariance = decov(regularization = 1e-10),  # Very small regularization for weak prior
  chains = 4,                     # Number of Markov Chains
  iter = 2000,                    # Number of iterations per chain
  seed = 11397                    # Set a seed for reproducibility
)
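To verify which priors were actually applied in a fit like this, rstanarm provides prior_summary() (a quick check, assuming the fit above ran):

library(rstanarm)
prior_summary(fit)   # reports the priors rstanarm actually used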

Thank you in advance for the help.


In theory and in practice the MAP will be exactly the MLE; if somehow it’s not, something is very likely wrong. Nevertheless, if instead of looking only at the MAP you look at the entire chain, or at averages of any parameter or model output, it will be different from the MLE, which is (almost invariably) a point estimate.

I’m not familiar with stan_glmer(), but it’s not clear to me that it specifies that the intercepts (theta) come from a shared normal distribution whose mean and scale are themselves estimated – there’s no theta, beta, or sigma_theta in that code, right?
So maybe you are getting different results because it is just letting the intercepts be estimated independently, instead of being constrained by the higher hierarchical level (again, I may be off, since I don’t know stan_glmer’s syntax).

Hi @caesoma, do you have any reference for this that I can look at? (And also cite in the paper)

I am initially replicating the results from a previous meta-analysis, where the authors use the metafor package with the following code:

# Random Effects Model
eff_average <- dt %>%
  rma.mv(data=.,
         yi = effect_size_percent,
         V = SE_percent^2,
         random = ~ idEffect | idStudy,
         method = "REML")

The results are very close to each other (-12.5 vs -11.7), so I guess the discrepancy is because of different algorithms/approximations?

I do not have a specific reference because it is a very general statement; I guess you can find it in many books on Bayesian statistics, although I suspect most won’t bother writing a proof, as it is quite simple.

In a nutshell, MLE will get a (usually point) estimate of the likelihood of the parameters given the data, \mathcal{L}(\theta|D), by computing the likelihood of the data given the parameters, \mathcal{L}(D|\theta), and using whatever optimization algorithm to maximize \log(\mathcal{L}(\theta|D)) (even a while ago, many implementations actually used MCMC under the hood and spat out the point value with the highest likelihood, but that’s only marginally important). The posterior probability, on the other hand, adds the prior probability and becomes \mathcal{P}(\theta|D) = \mathcal{L}(\theta|D)\ p(\theta). Therefore, if priors are improper, p(\theta) is constant and doesn’t affect the relative likelihood, so you will get the same result, i.e. \mathcal{P}(\theta|D) = \mathcal{L}(\theta|D)\ c up to a constant c, which in practice is probably omitted for calculation purposes anyway.
If that is estimated using MCMC or any other stochastic algorithm, the precise result will differ, but it will probably be very close; and if a regular optimization method is used, it may still be affected by machine precision (and, depending on the complexity of the model, these methods may fail in spectacular ways).

I don’t know that package, but depending on the range of the parameter values, maybe that is the case. You would have to check exactly what the shorthand notation stands for to make sure you are dealing with the exact same model. An easier way of checking is using a simpler model with simulated data, so you know what to expect and may be able to spot mistakes and inconsistencies more easily.

@caesoma thank you very much for the detailed explanation. I agree that from a theoretical point of view it is straightforward; I was thinking of papers in the context of the Stan algorithm, or any paper that used a Stan model with improper priors to estimate the MLE.

It looks like you specified method = "REML", which will perform restricted maximum likelihood, where the fixed effects are marginalized out of the likelihood and optimization is performed only over the random effect parameters. This should in general give a different answer than full ML; however, it should also be equivalent to maximizing the marginal posterior of the random effects assuming an improper flat prior on the fixed effects.

Ref. section 1.2.3 of Jim Hodges’ “Richly Parameterized Linear Models”.
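For an apples-to-apples comparison against full ML, metafor’s method argument also accepts "ML"; a minimal sketch, reusing the variable names from the rma.mv() call quoted above:

# Refit the same metafor model with full ML instead of REML
library(metafor)
eff_average_ml <- rma.mv(yi = effect_size_percent,
                         V = SE_percent^2,
                         random = ~ idEffect | idStudy,
                         method = "ML",   # full maximum likelihood
                         data = dt)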

I don’t remember anything off the top of my head, since improper priors are generally not recommended (and often frowned upon), but if you look up “maximum likelihood” here on Discourse a number of topics will pop up.

What’s happening in Stan?

In Stan, the MAP estimate and MLE will differ if there are constrained parameters, because the MAP estimate optimizes a target that includes the change of variables adjustment for any constraints.

For example, suppose I write the following Stan program

parameters {
  real<lower=0> alpha;
}
model {
  alpha ~ exponential(1);
}

This defines a density \text{exponential}(\alpha \mid 1) on the constrained scale where \alpha \in (0, \infty). But Stan’s algorithms all operate over unconstrained densities, so we apply a log transform to \alpha and define \beta = \log \alpha. Now we can use the change-of-variables formula to derive

p(\beta) = \text{exponential}(\exp(\beta) \mid 1) \cdot \left| \frac{\text{d}}{\text{d}\beta} \exp(\beta) \right| = \text{exponential}(\exp(\beta) \mid 1) \cdot \exp(\beta).

Stan’s MAP estimate is going to be \exp(\widehat{\beta}), where

\widehat{\beta} = \text{argmax}_\beta \ \text{exponential}(\exp(\beta) \mid 1) \cdot \exp(\beta).

Stan’s MLE estimate is going to be \exp(\beta^*), where

\beta^* = \text{argmax}_\beta \ \text{exponential}(\exp(\beta) \mid 1).
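Working this example through makes the difference concrete: with the Jacobian adjustment, the log target is \log \text{exponential}(\exp(\beta) \mid 1) + \beta = \beta - \exp(\beta), which is maximized at \beta = 0, so the MAP estimate is \exp(0) = 1. Without the adjustment, the log target -\exp(\beta) increases as \beta \rightarrow -\infty, so the MLE sits at the boundary \alpha = 0.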

Maximum likelihood estimates

I hope you don’t mind if I try to sharpen this up a bit. Suppose we have a Bayesian model specified by a prior p(\theta) and sampling distribution p(y \mid \theta) for unknown parameters \theta and observed data y. Given a fixed data set y,

\mathcal{L}(\theta) = p(y \mid \theta)

is called the likelihood function. Note that it’s just a function, not a density, because

\int \mathcal{L}(\theta) \, \text{d}\theta \neq 1.

Also, the frequentists don’t think of \theta as being random, so you wouldn’t want a density for it in that context. And you wouldn’t typically write it as \mathcal{L}(\theta \mid D) because that makes it look like a conditional density.

The maximum likelihood estimate is defined by

\theta^* = \text{arg max}_\theta \ \mathcal{L}(\theta).

To be a little more precise, suppose we have a sampling distribution p(y \mid \theta) (we call it a “likelihood” when considering it as a function of the parameters for a fixed data set) and prior p(\theta). If we define \mathcal{L}(\theta) as above, then we have p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta) by Bayes’s rule and hence p(\theta \mid y) \propto \mathcal{L}(\theta) \cdot p(\theta).

You can’t use MCMC for optimization

This is wrong. Don’t do this. The highest likelihood point from a sample will be nowhere near the maximum likelihood estimate due to concentration of measure. I discuss this in my intro to Stan and in the curse of dimensionality/typical set case study.

The stated relation only holds up to a proportion, so should be written as

p(\theta \mid y) = \dfrac{p(y \mid \theta) \cdot p(\theta)}{p(y)} \propto p(y \mid \theta) \cdot p(\theta).

This is not true in general. First, a prior is supposed to be a density, so it can’t technically be improper. People are sloppy about this all the time and talk about “flat priors”. Those don’t fit properly into the above framework because p(\theta) is not a density, but we can plug it into Bayes’s rule anyway and run with the result. Secondly, you can have a constant prior that is not improper, such as p(\theta) = \text{uniform}(\theta \mid 0, 1). It is true that you cannot have a proper constant prior over an unbounded random variable.

Now let’s suppose p(\theta) is a constant (whether proper or not). Because p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), we conclude p(\theta \mid y) \propto p(y \mid \theta) = \mathcal{L}(\theta). That is, the posterior is proportional to the likelihood when there is a flat prior. In this case, it’s obvious that

\text{argmax}_\theta \ p(\theta \mid y) = \text{argmax}_\theta \ \mathcal{L}(\theta).
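A tiny numeric illustration (a sketch in R with made-up data): a flat prior adds a constant to the log posterior, which cannot move the argmax.

# Made-up data: a constant log prior shifts the objective by a constant,
# so the maximizing value is unchanged.
y <- c(1.2, 0.7, 2.1)
neg_log_lik  <- function(mu) -sum(dnorm(y, mu, 1, log = TRUE))
neg_log_post <- function(mu) neg_log_lik(mu) + 3.7  # any constant -log p(mu)
optimize(neg_log_lik,  c(-10, 10))$minimum  # equals mean(y)
optimize(neg_log_post, c(-10, 10))$minimum  # same argmax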

How close is this?

This is not very close for optimization results.

Penalized MLE

In frequentist stats, people often fit by penalized maximum likelihood, where the penalty looks a lot like a prior. They typically work on the log scale: if the penalty is the negative log prior, MAP and penalized MLE again give the same results (unless there’s an implicit change of variables hiding).


Right. Thanks for correcting me on that, @Bob_Carpenter. @Tommaso11397, that is an important point for Stan users to remember: constrained parameters in Stan are transformed via exponential- or logistic-type functions, and it is the estimate of this transformed variable that is computed. Maybe the details are beyond your interests and needs at this time, but that is the essential part.
I am not sure if the same holds for brms in general or in this particular case; maybe Bob or someone else can give you this information.

I don’t mind at all, but really \mathcal{L}(\theta | D), p(\theta | D), or f(\theta | D) is mostly arbitrary notation that will need to be specified explicitly most times. I personally like the \mathcal{L} notation when I am teaching, given the central importance of The Likelihood. If we want to be pedantic about it, \mathcal{L}(\theta) is only correct if we assume the data is fixed for all purposes – in general I would write \mathcal{L}(\theta, D) (although it may be a slight abuse of notation if we are using \mathcal{L} as the sum/product over all data points – but so is \mathcal{L}(\theta) = p(y| \theta)). I think it is useful to think of it as a joint distribution in both parameter and data space, and of \mathcal{L}(\theta | D) as the probability of the parameters given the data – maybe it’s not typical in some circles, but it’s typical enough to be in the Wikipedia definition of the likelihood function.
Also, \mathcal{L}(\theta | D) is not a density when defined over several data points, for which it is shorthand for a sum/product of densities, but for any given data point d_i, \mathcal{L}(\theta | d_i) is indeed a density, so all of the above applies; we only need to sum the log-likelihoods afterwards for the purposes of MCMC algorithms.

It may not be the best method, but it is wrong to say it is wrong. Deterministic optimization in general is mostly hill-climbing algorithms applied to the likelihood surface, so you can absolutely use MCMC to get a MAP or MLE; in fact, that is exactly what many simulated annealing algorithms do (although I agree it may jump around the mode too much, I think “nowhere near” is probably an exaggeration).

Of course that is true, but also of course p(y) is constant, so it doesn’t affect the inference.

No, but if the priors are over an unbounded interval, the only way of implementing them without assigning an infinitesimal density is to acknowledge that p(\theta) will be the same for any value and just ignore it. For “flat priors” over a bounded range it will not be constant everywhere: the probability will be zero outside of that range. In both cases it would be a density, either because you divide by the size of the interval, or by using the shortcut above for infinite intervals.

I agree, some of the issues mentioned here (particularly variable transformation, if I had to guess) are causing the discrepancy. I also agree there is some nuance, abuse of notation, and confusion that may need to be treated more strictly, but for the purpose of @Tommaso11397’s problem, most of that is irrelevant. I’d say try:

  1. Unconstraining the variable in the Stan code (the more correct way would be making the Jacobian correction, but if the net effect is just to get more rejections due to errors in the support of the Gaussian variance parameter, this is easier);
  2. Making sure that brms reproduces what you are doing in the MLE with the other package (or that what you are doing there is actually what you want).

Hi @Bob_Carpenter and @caesoma, thank you both for your very helpful replies. I am pretty OK with the theoretical considerations, but since I am not very well versed in how Stan works under the hood, I wanted to make sure I was estimating what I intended to.

Hence, for my very applied goal of estimating the MLE of the model I sent (and potentially more complicated ones), what should I do?

  1. Unconstraining the parameters seems to be a first solution, as @caesoma suggested; if I do not specify priors, Stan by default gives them a “flat improper prior” (I understand this might not be the best terminology).
  2. Are there any other possible solutions that remain within the Stan package universe? I ask because I would like to avoid any other package where I do not know exactly what is going on, and end up comparing things that should not be compared.

Thank you very much for all the help.

Hi @Bob_Carpenter and @caesoma, I wanted to know if you had any closing remarks on this. Long story short: unconstrain the parameters and do not specify priors (i.e. improper constant priors) to get MLE in Stan.

In my assessment, the most practical alternatives are the following:

  1. Using Stan (not brms), keeping variables unconstrained, and using improper/no priors;
  2. Alternatively, if you have the time and resources for it, keeping the constraints but figuring out the correct correction to the likelihood;
  3. Figuring out what options in brms are equivalent to the above, and therefore to MLE.

I just wanted to share that, when unconstraining the parameters, I get practically the same result as the package. So this seems to be the way to do it. Thank you both for your help.


I didn’t even realize that brms can do maximum likelihood!

Good point. Stats notation is an absolute mess! I’ll stand by the notation being inappropriate for a frequentist, because the bar makes it look like we’re conditioning on \theta, and frequentists don’t allow distributions over parameters \theta.

I’m not sure what you mean here. The likelihood is defined as a function over parameters for a fixed data set; in that sense, it doesn’t define a density over \theta. Rather, it’s defined the other way around as \mathcal{L}(\theta | y) = p(y \mid \theta) (note that’s equality, not proportionality). Given that y is fixed, it’s not even a function over y.

I prefer to think of p(y \mid \theta) as the sampling distribution—it tells you how to generate y given \theta.

Let me quantify. Let’s say I have a 100-dimensional standard normal and I take draws y \sim \text{normal}(0, \text{I}). In N dimensions, the squared distance of a draw y to the mode is y^\top y = \sum_{n=1}^{N} y_n^2, which follows a \text{chiSq}(N) distribution (chi-square with N degrees of freedom is just the distribution of the sum of squares of N standard normals). For instance, in 100 dimensions, the central 99.99% interval for sampling from \text{chiSq}(100) is (55, 161).
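A quick simulation makes this concrete (a sketch in R; the seed and number of draws are arbitrary):

# Draws from a 100-dimensional standard normal never land near the mode
# at the origin; their squared distances follow a chiSq(100) distribution.
set.seed(1)
N <- 100
draws <- matrix(rnorm(1e4 * N), ncol = N)   # 10,000 draws
sq_dist <- rowSums(draws^2)                 # squared distance to the mode
range(sq_dist)                              # roughly 55 to 165, never near 0
qchisq(c(5e-5, 1 - 5e-5), df = N)           # central 99.99% interval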

Another case where this will fail is with something like a \text{beta}(y \mid 0.5, 0.5), which does not have a mode. Sampling is fine, but all the draws will be strictly within the open interval (0, 1). Over that interval, there is no maximum—the density grows without bound as y \rightarrow 0 or y \rightarrow 1.
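You can check the unbounded growth directly; a one-line sketch in R:

# The beta(0.5, 0.5) density grows without bound toward the endpoints:
dbeta(c(1e-3, 1e-6, 1e-9), shape1 = 0.5, shape2 = 0.5)
# roughly 10, 318, 10066; increasing without bound as y -> 0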

In simulated annealing you are not sampling from the posterior. It works with a series of posteriors p(\theta | y)^{w_i}, which concentrate around the mode as w_i \rightarrow \infty. So what you do is shrink the target density toward a delta function around the mode. Assuming your sampler can keep up, this approach will get the right answer. But you won’t get it from running Stan’s sampler and taking the maximum value.

I would recommend (a) keeping the parameters constrained, and (b) turning off the Jacobian adjustment. That gives you a frequentist maximum likelihood estimate if that’s what you want. Depending on which interface you’re in, this may be your only optimization option as we just added MAP optimization pretty recently. I know MAP is available in CmdStanPy and thus probably in CmdStanR, but I doubt it’s in either PyStan or RStan.

In a maximum likelihood setting, priors can be flipped around and used as so-called penalty functions, and you will get a penalized maximum likelihood estimate. This is pretty standard in frequentist stats, with (a) lasso or ridge regression for shrinkage, or (b) empirical Bayes for regularization (which is a super confusing name with multiple meanings, including finding a regularization target for frequentist stats; see, for example, Efron and Morris’s classic empirical Bayes papers: “Stein’s Estimation Rule and Its Competitors – An Empirical Bayes Approach” (JSTOR) and http://www.med.mcgill.ca/epidemiology/Hanley/bios602/MultiLevelData/EfronMorrisJASA1975.pdf).
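As a small illustration of the penalty/prior correspondence (a sketch with simulated data; all names here are made up): an L2 (ridge) penalty is, up to constants, the negative log of a zero-mean normal prior, so the penalized MLE coincides with the MAP estimate under that prior.

# Minimizing -log likelihood + lambda * ||b||^2 is the same as maximizing
# the posterior under a zero-mean normal prior on the coefficients.
set.seed(2)
X <- matrix(rnorm(200), nrow = 100, ncol = 2)
y <- X %*% c(1, -2) + rnorm(100)
lambda <- 0.5
penalized_nll <- function(b) {
  sum((y - X %*% b)^2) / 2 + lambda * sum(b^2)  # -log lik - log prior + const
}
optim(c(0, 0), penalized_nll)$par  # shrunk toward zero relative to OLS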


Hi, just a couple comments:

  1. “Maximum likelihood” for hierarchical models has multiple meanings. If we use the notation y for data, \alpha for parameters in the data model, and \phi for hyperparameters, so that the posterior density is p(\alpha, \phi \mid y) \propto p(y \mid \alpha) \, p(\alpha \mid \phi) \, p(\phi), then “maximum likelihood” could refer to three different things:
    (a) literally maximizing the likelihood, that is, maximizing p(y \mid \alpha), which is just the maximum likelihood estimate of \alpha and doesn’t use the hierarchical part of the model at all;
    (b) jointly maximizing the posterior density p(\alpha, \phi \mid y) assuming a flat prior on \phi. This joint mode is in fact undefined, because the joint distribution has a pole (i.e., it goes to infinity, and its local integral goes to infinity) in the limit of the group-level variance going to zero;
    (c) maximizing the marginal posterior density p(\phi \mid y) assuming a flat prior on \phi. This marginal mode is defined, although sometimes the estimated group-level variance parameter can be zero, which is one reason I recommend using a zero-avoiding prior for the group-level variance parameters (a concrete form is sketched after this list), as explained in our papers from 2013 (http://stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf) and 2014 (http://stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf).

  2. If you ever want to do straight maximum likelihood, you can write your model in Stan without a prior and run Stan in its optimize setting. This can be done in Stan directly, and also when calling Stan using cmdstanr or rstan. For the reason discussed in point (b) above, you would not want to do this with a hierarchical model, as the mode is degenerate at the group-level variance parameter being zero.
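To make the zero-avoiding prior in point 1(c) concrete (as I understand the 2013 paper linked above): the recommended boundary-avoiding class is a gamma(2, \lambda) prior on the group-level standard deviation,

p(\sigma) = \lambda^2 \, \sigma \, e^{-\lambda \sigma}, \qquad \log p(\sigma) = \log \sigma - \lambda \sigma + \text{const},

whose log density goes to -\infty as \sigma \to 0, so the marginal posterior mode is kept off the boundary for any \lambda > 0.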


I completely agree, and while I also think some level of strict formalism/pedantry is necessary to make sure things remain objective, there’s a point where it gets counterproductive.

I mean that if you take your “fixed data” to be a single data point, you would be able to say that \mathcal{L}(\theta|D) and \mathcal{L}(D|\theta) are densities. But more importantly, I am convinced that in most practical contexts it is safe to use “likelihood” \mathcal{L} interchangeably with “probability” p, even if most times the former actually means a product of probabilities, so those become equivalent to p(\theta|y) and p(y|\theta) in Bayes’s theorem – you could even get pedantic and show that this is true for each individual observation (I’m sure you could also get equally pedantic and show some other context where that interchange is not desirable, but does it matter that much?). So it’s also fairly safe to talk about the likelihood and probability of parameters given the data, of data given parameters, and even the joint probability of parameters and data – i.e. \mathcal{L}(\theta, D) or p(\theta, y).

I find “sampling distribution” a confusing term; when teaching introductory classes (probably not to future statisticians), I find it more intuitive to say it’s the probability of the data given a fixed set of parameters under some model. Even with the caveats above, that may not be exactly what it is, but it is what it’s actually doing for the inference. That’s a personal choice, though, and I take responsibility for its consequences.

Agreed, but you are describing what the algorithm is able to do, and whether it can or will do it well; that is not to say that it won’t get you an answer you could use (e.g., you could get a very large number of samples, or make MCMC adapt to the acceptance rate). But I guess that is beside the main point here, so I’ll just drop it.

Hi @andrewgelman, thank you very much for your comment, very helpful. I understand that your preferred approach would be (c); how would you go about implementing this in Stan?

And, secondly, though you do not like (a), what if I check ex post that there is indeed group-level variance? In the case of the classic Rubin model for meta-analyses, every study has its own random effect, so unless the common distribution of the REs has variance zero, the condition should hold, correct?

@Bob_Carpenter, thank you again for the tip: when you say “turning off the Jacobian adjustment”, do you mean something of this sort, i.e.:

parameters {
  real x;  // Unconstrained parameter
}
model {
  real y = exp(x); // Manual transformation, no Jacobian adjustment
  target += normal_lpdf(y | 0, 1); // Specify the likelihood without adjustment
}

instead of:

parameters {
  real<lower=0> y; // y is declared in its transformed space
}
model {
  y ~ normal(0, 1); // Implicit Jacobian adjustment applied
}
  • If you were planning to compute the marginal maximum likelihood estimate of the hyperparameters in a hierarchical model, I would instead recommend you perform full Bayesian inference, or, if that is too computationally expensive, I would recommend you compute the marginal posterior mode using a zero-avoiding prior, for reasons given in my 2013 and 2014 papers linked above.
  • How to do that in Stan? I don’t think it can currently be done in Stan, but I think that Charles Margossian developed a nested Laplace algorithm which would do it, and that this algorithm could be in Stan – maybe it’s currently implemented on a branch?
  • You ask about checking that group-level variance is nonzero. I’d just recommend including this as a parameter in the model and fitting it. You can also compare to a fit with zero group-level variance. In any case, this problem will not be solved with a joint maximum likelihood estimate of data-model parameters and hyperparameters, as this joint maximum likelihood estimate does not exist: the joint likelihood function has a pole in the limit that group-level variance goes to zero.
  • To put it another way, there’s no principle that would recommend that joint maximum likelihood estimate. It’s just a bad idea for hierarchical models.

There’s actually a jacobian argument for the optimize() method in CmdStanR and CmdStanPy, although the default is already FALSE.
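For concreteness, a minimal CmdStanR sketch (the file name and data list are placeholders for your model and data):

# jacobian = FALSE (the default) drops the adjustment, giving the (penalized)
# MLE; jacobian = TRUE gives the MAP estimate on the constrained scale.
library(cmdstanr)
mod <- cmdstan_model("meta_analysis.stan")   # placeholder file name
mle <- mod$optimize(data = stan_data, jacobian = FALSE)
map <- mod$optimize(data = stan_data, jacobian = TRUE)
mle$summary()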

I don’t think there’s a fully functional branch for that quite yet, although I could be wrong. I think @charlesm93 and @stevebronder are working on it, so they would know for sure. If there is a working implementation on a branch, then it’s probably possible to get it to run through CmdStanR or CmdStanPy, but not RStan or PyStan.