Hi everyone, I am trying to run a simple Rubin random-effects model for a meta-analysis of this sort:
data {
  int<lower=0> N;             // num observations
  vector[N] y;                // outcome
  vector[N] se;               // vector of standard error estimates
}
parameters {
  vector[N] theta;
  real beta;
  real<lower=0> sigma_theta;
}
model {
  // likelihood, or 1st hierarchical layer
  y ~ normal(theta, se);
  // 2nd hierarchical layer
  theta ~ normal(beta, sigma_theta);
  // not specifying priors for either beta or sigma_theta
}
My question is how best to estimate this model using maximum likelihood. My first approach would be to not specify priors for beta and sigma_theta, so that Stan uses improper flat priors. This should, in theory, closely approximate ML estimation, am I correct?
Another approach I tried is to use the stan_glmer() function, but I get a very different result.
fit <- stan_glmer(
  y ~ 1 + (1 | obs),       # intercept-only model with a random effect per observation
  data = dt,               # assuming `dt` is the dataset containing y and se
  weights = 1 / se^2,      # use inverse-variance weighting
  prior = NULL,            # no prior specified, for a flat (improper) prior equivalent
  prior_intercept = NULL,  # flat prior on the intercept
  prior_covariance = decov(regularization = 1e-10),  # very small regularization for a weak prior
  chains = 4,              # number of Markov chains
  iter = 2000,             # number of iterations per chain
  seed = 11397             # set a seed for reproducibility
)
In theory and in practice the MAP will be exactly the MLE; if somehow it's not, something is very likely wrong. Nevertheless, if instead of looking only at the MAP you look at the entire chain, or at averages of any parameter or model output, it will be different from the MLE, which is (almost invariably) a point estimate.
I'm not familiar with stan_glmer(), but it's not clear to me that it specifies that the intercepts (theta) come from a common normal distribution with estimated mean and scale; there's no theta, beta, or sigma_theta in that code, right?
So maybe you are getting different results because it's just letting the intercepts be independently estimated, instead of being constrained by the higher hierarchical level. (Again, I may be off, since I don't know stan_glmer's syntax.)
I do not have a specific reference because it is a very general statement; I guess you can find it in many books on Bayesian statistics, although I suspect most won't bother writing a proof, as it is quite simple.
In a nutshell, MLE gets a (usually point) estimate of the parameters by taking the probability of the data given the parameters as a function of the parameters, the likelihood \mathcal{L}(\theta|D), and using whatever optimization algorithm to maximize \log(\mathcal{L}(\theta|D)) (even a while ago, many implementations actually used MCMC under the hood and spit out the point value with the highest likelihood, but that's only marginally important). The posterior probability, on the other hand, adds the prior probability and becomes \mathcal{P}(\theta|D) = \mathcal{L}(\theta|D)\ p(\theta). Therefore, if priors are improper, p(\theta) is constant and doesn't affect the relative likelihood, so you will get the same result, i.e. \mathcal{P}(\theta|D) = \mathcal{L}(\theta|D)\ c up to a constant c, which in practice is probably omitted for calculation purposes anyway.
If that is estimated using MCMC or any other stochastic algorithm, the precise result will differ, but will probably be very close, and if a regular optimization method is used, it may still be affected by machine precision (and depending on the complexity of the model, these methods may fail in spectacular ways).
I don't know that package, but depending on the range of the parameter values, maybe that is the case. You would have to check exactly what the shorthand notation stands for to make sure you are dealing with the exact same model. An easier way of checking is using a simpler model with simulated data, so you know what to expect and may be able to spot mistakes and inconsistencies more easily.
@caesoma thank you very much for the detailed explanation. I agree that from a theoretical point of view it is straightforward; I was thinking of papers in the context of the Stan algorithm, or any paper that used a Stan model with improper priors to estimate ML.
It looks like you specified method = "REML", which will perform restricted maximum likelihood, where the fixed effects are marginalized out of the likelihood and optimization is performed only over the random effect parameters. This should in general give a different answer than full ML; however, it should also be equivalent to maximizing the marginal posterior of the random effects assuming an improper flat prior on the fixed effects.
Ref: Section 1.2.3 of Jim Hodges' "Richly Parameterized Linear Models".
I don't remember anything off the top of my head, since improper priors are generally not recommended (and often frowned upon), but if you look up "maximum likelihood" here on Discourse a number of topics will pop up.
In Stan, the MAP estimate and MLE will differ if there are constrained parameters, because the MAP estimate optimizes a target that includes the change of variables adjustment for any constraints.
For example, suppose I write the following Stan program
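parameters {
  real<lower=0> alpha;
}
model {
  alpha ~ exponential(1);
}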
This defines a density \text{exponential}(\alpha \mid 1) on the constrained scale, where \alpha \in (0, \infty). But Stan's algorithms all operate over unconstrained densities, so we apply a log transform to \alpha and define \beta = \log \alpha. Now we can use the change-of-variables formula to derive
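p(\beta) = \text{exponential}(\exp(\beta) \mid 1) \cdot \left| \frac{d}{d\beta} \exp(\beta) \right| = \text{exponential}(\exp(\beta) \mid 1) \cdot \exp(\beta).
The constrained density \text{exponential}(\alpha \mid 1) is maximized at the boundary \alpha = 0, whereas the unconstrained density above is maximized at \beta = 0, i.e. \alpha = 1, so the two optima do not match.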
I hope you don't mind if I try to sharpen this up a bit. Suppose we have a Bayesian model specified by a prior p(\theta) and sampling distribution p(y \mid \theta) for unknown parameters \theta and observed data y. Given a fixed data set y,
\mathcal{L}(\theta) = p(y \mid \theta)
is called the likelihood function. Note that it's just a function, not a density, because it need not integrate to one over \theta.
Also, the frequentists don't think of \theta as being random, so you wouldn't want a density for it in that context. And you wouldn't typically write it as \mathcal{L}(\theta \mid D), because that makes it look like a conditional density.
To be a little more precise, suppose we have a sampling distribution p(y \mid \theta) (we call it a "likelihood" when considering it as a function of the parameters for a fixed data set) and a prior p(\theta). If we define \mathcal{L}(\theta) as above, then we have p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta) by Bayes's rule and hence p(\theta \mid y) \propto \mathcal{L}(\theta) \cdot p(\theta).
You can't use MCMC for optimization
This is wrong. Don't do this. The highest-likelihood point from a sample will be nowhere near the maximum likelihood estimate due to concentration of measure. I discuss this in my intro to Stan and in the curse of dimensionality/typical set case study.
The stated relation only holds up to a proportion, so it should be written as \mathcal{P}(\theta \mid D) \propto \mathcal{L}(\theta \mid D) \, p(\theta).
This is not true in general. First, a prior is supposed to be a density, so it can't technically be improper. People are sloppy about this all the time and talk about "flat priors". Those don't fit properly into the above framework because p(\theta) is not a density, but we can plug it into Bayes's rule anyway and run with the result. Secondly, you can have a constant prior that is not improper, such as p(\theta) = \text{uniform}(\theta \mid 0, 1). It is true that you cannot have a proper constant prior over an unbounded random variable.
Now let's suppose p(\theta) is a constant (whether proper or not). Because p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), we conclude p(\theta \mid y) \propto p(y \mid \theta) = \mathcal{L}(\theta). That is, the posterior is proportional to the likelihood when there is a flat prior. In this case, it's obvious that the posterior mode (MAP) and the maximum likelihood estimate coincide.
In frequentist stats, people often fit penalized maximum likelihood, where the penalty looks a lot like a prior. They typically work on the log scale, where if the penalty is the negative log prior, we again get the same results for MAP and penalized MLE (unless there's an implicit change of variables hiding).
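In symbols, with penalty \text{pen}(\theta) = -\log p(\theta), the two log-scale objectives match:
\arg\max_\theta \left[ \log p(y \mid \theta) + \log p(\theta) \right] = \arg\max_\theta \left[ \log p(y \mid \theta) - \text{pen}(\theta) \right],
so the MAP estimate and the penalized MLE coincide.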
Right. Thanks for correcting me on that, @Bob_Carpenter. @Tommaso11397, that is an important point for Stan users to remember: constrained parameters in Stan are transformed via exponential or logistic functions, and it is the estimate of this transformed variable that is computed. Maybe the details are beyond your interests and needs at this time, but that is the essential part.
I am not sure if that is the same for brms in general or in this particular case; maybe Bob or someone else can give you this information.
I don't mind at all, but really \mathcal{L}(\theta | D), p(\theta | D), or f(\theta | D) is mostly arbitrary notation that will need to be specified explicitly most times. I personally like the \mathcal{L} notation when I am teaching, given the central importance of The Likelihood. If we want to be pedantic about it, \mathcal{L}(\theta) is only correct if we assume the data is fixed for all purposes; in general I would write \mathcal{L}(\theta, D) (although it may be a slight abuse of the notation if we are using \mathcal{L} as the sum/product over all data points, but so is \mathcal{L}(\theta) = p(y| \theta)). I think it is useful to think of it as a joint distribution in both parameter and data space, and of \mathcal{L}(\theta | D) as the probability of the parameters given the data; maybe it's not typical in some circles, but it's typical enough to be in the Wikipedia definition of the likelihood function.
Also, \mathcal{L}(\theta | D) is not a density when defined over several data points, for which it is shorthand for a sum/product of densities; but for any given data point d_i, \mathcal{L}(\theta | d_i) is indeed a density, so all of the above applies, and we only need to sum the log-likelihoods afterwards for the purposes of MCMC algorithms.
It may not be the best method, but it is wrong to say it is wrong. Deterministic optimization is, in general, mostly hill-climbing algorithms applied to the likelihood surface, so you can absolutely use MCMC to get a MAP or MLE; in fact, that is exactly what many simulated annealing algorithms are doing (although I agree it may jump around the mode too much, I think "nowhere near" is probably an exaggeration).
Of course that is true, but also of course p(y) is constant, so it doesn't affect the inference.
No, but if the priors are over an unbounded interval, the only way of implementing it without assigning an infinitesimal density is to acknowledge that p(\theta) will be the same for any value and just ignore it. For "flat priors" over a bounded range it will not be constant everywhere, since the probability will be zero outside of that range; in both cases it would be a density, either because you divide by the size of the interval, or by using the shortcut above for infinite intervals.
I agree, some of the issues mentioned here (particularly variable transformation, if I had to guess) are causing the discrepancy. I also agree there is some nuance, abuse of notation, and confusion that may need to be treated more strictly, but for the purpose of @Tommaso11397's problem, most of that is irrelevant. I'd say try:
Unconstraining the variable in the Stan code, as in the sketch after this list (the more correct way would be making the Jacobian correction, but if the net effect is just to get more rejections due to errors in the support of the Gaussian variance parameter, it's easier);
Making sure that brms reproduces what you are doing in the MLE with the other package (or that what you are doing there is actually what you want).
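For the first suggestion, a minimal sketch of the original model with the scale parameter moved to the log scale (only the sigma_theta declaration changes; there is deliberately no Jacobian term here, which is the point):

parameters {
  vector[N] theta;
  real beta;
  real log_sigma_theta;                     // unconstrained scale parameter
}
model {
  real sigma_theta = exp(log_sigma_theta);  // manual transform, no Jacobian adjustment
  y ~ normal(theta, se);                    // 1st hierarchical layer
  theta ~ normal(beta, sigma_theta);        // 2nd hierarchical layer
}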
Hi @Bob_Carpenter and @caesoma, thank you both for your very helpful replies. I am pretty OK with the theoretical considerations, but since I am not super well versed in how Stan works under the hood, I wanted to make sure I was estimating what I intended to.
Hence, for my very applied goal of estimating the MLE of the model I sent (and potentially more complicated ones), what should I do?
Unconstraining the parameters seems to be a first solution, as @caesoma suggested; especially since, if I do not specify the prior, Stan by default gives them a "flat improper prior" (I understand this might not be the best terminology).
Any other possible solutions to do this while remaining in the Stan package universe? I say this because I would like to avoid any other package where I do not know exactly what is going on, and I end up comparing things that should not be compared.
Hi @Bob_Carpenter and @caesoma, I wanted to know if you had any closing remarks on this. Long story short: unconstrain the parameters and do not specify priors (i.e. improper constant priors) to get MLE in Stan.
I just wanted to share that, when unconstraining the parameters I get practically the same result as the package. So this seems to be the way to do it. Thank you both for your help.
I didn't even realize that brms can do maximum likelihood!
Good point. Stats notation is an absolute mess! I'll stand by the notation being inappropriate for a frequentist, because it looks like conditioning on \theta, and frequentists don't allow distributions over parameters \theta.
I'm not sure what you mean here. The likelihood is defined as a function over parameters for a fixed data set; in that sense, it doesn't define a density over \theta. Rather, it's defined the other way around as \mathcal{L}(\theta | y) = p(y \mid \theta) (note that's not proportional to, but equal), so it's not a density over \theta. Given that y is fixed, it's not even a function over y.
I prefer to think of p(y \mid \theta) as the sampling distribution: it tells you how to generate y given \theta.
Let me quantify. Let's say I have a 100-dimensional standard normal and I take samples y \sim \text{normal}(0, \text{I}). In N dimensions, the squared distance of a draw y to the mode is y^\top y = \sum_{n=1}^{N} y_n^2, which follows a \text{chiSq}(N) distribution (chi-square with N degrees of freedom is just the distribution of the sum of squares of N standard normals). For instance, in 100 dimensions, the central 99.99% interval for sampling from \textrm{chiSq}(100) is (55, 161).
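A quick simulation sketch in R (exact numbers will vary with the seed) makes the point: none of the draws land anywhere near the mode at the origin.

set.seed(1)
N <- 100                                   # dimensions
draws <- matrix(rnorm(1e4 * N), ncol = N)  # 10,000 draws from normal(0, I)
r2 <- rowSums(draws^2)                     # squared distance of each draw to the mode
min(r2)                                    # smallest squared distance; typically > 50
qchisq(c(0.00005, 0.99995), df = N)        # central 99.99% interval, roughly (55, 161)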
Another case where this will fail is with something like a \text{beta}(y \mid 0.5, 0.5), which does not have a mode. Sampling is fine, but all the draws will be strictly within the open interval (0, 1). Over that interval, there is no maximum: the density grows without bound as y \rightarrow 0 or y \rightarrow 1.
In simulated annealing you are not sampling from the posterior. It works with a series of densities p(\theta | y)^{w_i}, which concentrate around the mode as w_i \rightarrow \infty. So what you do is shrink the target density toward a delta function around the mode. Assuming your sampler can keep up, this approach will get the right answer. But you won't get it from running Stan's sampler and taking the maximum value.
I would recommend (a) keeping the parameters constrained, and (b) turning off the Jacobian adjustment. That gives you a frequentist maximum likelihood estimate if that's what you want. Depending on which interface you're in, this may be your only optimization option, as we just added MAP optimization pretty recently. I know MAP is available in CmdStanPy and thus probably in CmdStanR, but I doubt it's in either PyStan or RStan.
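In CmdStanR that might look like the following sketch (assuming the model above is saved as model.stan and dt is the data frame from earlier; the jacobian argument is discussed further down the thread):

library(cmdstanr)

mod <- cmdstan_model("model.stan")  # compile the Stan program
stan_data <- list(N = nrow(dt), y = dt$y, se = dt$se)

# jacobian = FALSE (the default) omits the change-of-variables
# adjustment, so with flat priors the optimum is the MLE;
# jacobian = TRUE would give the MAP estimate instead.
fit_mle <- mod$optimize(data = stan_data, jacobian = FALSE)
fit_mle$mle()  # named vector of point estimates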
"Maximum likelihood" for hierarchical models has multiple meanings. If we use notation y for data, alpha for parameters in the data model, and phi for hyperparameters, so that the posterior density is p(alpha,phi|y) proportional to p(y|alpha)p(alpha|phi)p(phi), then "maximum likelihood" could refer to three different things:
(a) literally maximizing the likelihood, that is, maximizing p(y|alpha), which is just the maximum likelihood estimate of alpha and doesn't use the hierarchical part of the model at all;
(b) jointly maximizing the posterior density p(alpha,phi|y) assuming a flat prior on phi. This joint mode is in fact undefined, because the joint distribution has a pole (i.e., it goes to infinity, and its local integral goes to infinity) in the limit of group-level variance going to zero;
(c) maximizing the marginal posterior density p(phi|y) assuming a flat prior on phi. This marginal mode is defined, although sometimes the estimated group-level variance parameter can be zero, which is one reason I recommend using a zero-avoiding prior for the group-level variance parameters, as explained in our papers from 2013 (http://stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf) and 2014 (http://stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf).
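As an illustration, a zero-avoiding prior on the group-level scale in the model above could look like this (the gamma shape of 2 follows the 2013 paper; the rate of 0.1 is just an illustrative value):

model {
  y ~ normal(theta, se);
  theta ~ normal(beta, sigma_theta);
  sigma_theta ~ gamma(2, 0.1);  // zero-avoiding: density vanishes at sigma_theta = 0
}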
If you ever want to do straight maximum likelihood, you can write your model in Stan without a prior and run Stan in its optimization setting. This can be done in Stan directly and also when calling Stan using cmdstanr or rstan. For the reason discussed in point (b) above, you would not want to do this with a hierarchical model, as the mode is degenerate at the group-level variance parameter being zero.
I completely agree, and while I also think some level of strict formalism/pedantry is necessary to make sure things remain objective, there's a point where things get counterproductive.
I mean that if you take your "fixed data" to be a single data point, you would be able to say that \mathcal{L}(\theta|D) and \mathcal{L}(D|\theta) are densities. But more importantly, I am convinced that in most practical contexts it is safe to use "likelihood" \mathcal{L} interchangeably with "probability" p, even if most times the former actually means a product of probabilities, so those become equivalent to p(\theta|y) and p(y|\theta) in Bayes's theorem; you could even get pedantic and show that is true for each individual observation (I'm sure you could also get equally pedantic and show some other context where that interchange is not desirable, but does it matter that much?). So it's also fairly safe to talk about the likelihood and probability of parameters given the data, of data given parameters, and even the joint probability of parameters and data, i.e. \mathcal{L}(\theta, D) or p(\theta, y).
I find "sampling distribution" a confusing term; when teaching introductory classes (probably not to future statisticians), I find it more intuitive to say it's the probability of the data given a fixed set of parameters under some model. Given the caveats above, that may not be exactly what it is, but that's what it's actually doing for the inference. That's a personal choice, though, and I take responsibility for its consequences.
Agreed, but you are describing what the algorithm is able to do, and whether it can or will do it well; that is not to say that it won't get you an answer you could use (e.g. you could get a very large number of samples, or make MCMC adapt to the acceptance rate). But I guess that is beside the main point here, so I'll just drop it.
Hi @andrewgelman, thank you very much for your comment, very helpful. I understand that your preferred approach would be (c); how would you go about implementing this in Stan?
And, secondly, though you do not like (a), what if I check ex post that there is indeed group-level variance? In the case of the classic Rubin model for meta-analyses, every study has its own random effect, so unless the common distribution of the REs has variance zero, the condition should hold, correct?
@Bob_Carpenter, thank you again for the tip: when you say "turning off the Jacobian adjustment", do you mean something of this sort, i.e.:
parameters {
  real x;                           // unconstrained parameter
}
model {
  real y = exp(x);                  // manual transformation, no Jacobian adjustment
  target += normal_lpdf(y | 0, 1);  // specify the density without the adjustment
}
instead of:
parameters {
  real<lower=0> y;    // y is declared in its constrained space
}
model {
  y ~ normal(0, 1);   // implicit Jacobian adjustment applied
}
If you were planning to compute the marginal maximum likelihood estimate of the hyperparameters in a hierarchical model, I would instead recommend you perform full Bayesian inference, or if that is too computationally expensive, I would recommend you compute the marginal posterior mode using a zero-avoiding prior, for reasons given in my 2013 and 2014 papers linked above.
How to do that in Stan? I don't think it can currently be done in Stan, but I think that Charles Margossian developed a nested Laplace algorithm which would do it, and that this algorithm could be in Stan; maybe it's currently implemented on a branch?
You ask about checking that group-level variance is nonzero. I'd just recommend including this as a parameter in the model and fitting it. You can also compare to a fit with zero group-level variance. In any case, this problem will not be solved with a joint maximum likelihood estimate of data-model parameters and hyperparameters, as this joint maximum likelihood estimate does not exist: the joint likelihood function has a pole in the limit that group-level variance goes to zero.
To put it another way, there's no principle that would recommend that joint maximum likelihood estimate. It's just a bad idea for hierarchical models.
There's actually a jacobian argument for the optimize() method in CmdStanR and CmdStanPy, although the default is already FALSE.
I don't think there's a fully functional branch for that quite yet, although I could be wrong. I think @charlesm93 and @stevebronder are working on it, so they would know for sure. If there is a working implementation on a branch, then it's probably possible to get it to run through CmdStanR or CmdStanPy, but not RStan or PyStan.