Right. Thanks for correcting me on that, @Bob_Carpenter. @Tommaso11397, that is an important point for Stan users to remember: constrained parameters in Stan are transformed to an unconstrained scale (via exponential or logistic functions), and it is the estimate of this transformed variable which is computed. Maybe the details are beyond your interests and needs at this time, but that is the essential part.
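To make that concrete, here is a rough sketch in plain Python of the kind of transform pairs involved (my own illustration, not Stan's actual implementation):

```python
import math

# A lower-bounded parameter, e.g. sigma > 0, is worked with on the
# unconstrained scale u = log(sigma); exp() is the inverse transform.
def to_unconstrained(sigma):
    return math.log(sigma)   # log transform for a positive parameter

def to_constrained(u):
    return math.exp(u)       # exponential inverse transform

# An interval-bounded parameter, e.g. p in (0, 1), uses logit/logistic.
def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Round trips recover the original constrained values (up to rounding).
print(to_constrained(to_unconstrained(2.5)))  # approx. 2.5
print(inv_logit(logit(0.3)))                  # approx. 0.3
```

The point is that the sampler (and any point estimate it produces) lives on the `u` scale, not the constrained scale.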
I am not sure whether the same holds for brms in general, or in this particular case; maybe Bob or someone else can give you that information.
I don’t mind at all, but really \mathcal{L}(\theta | D), p(\theta | D), or f(\theta | D) is mostly arbitrary notation that will need to be specified explicitly most of the time. I personally like the \mathcal{L} notation when I am teaching, given the central importance of The Likelihood. If we want to be pedantic about it, \mathcal{L}(\theta) is only correct if we assume the data is fixed for all purposes – in general I would write \mathcal{L}(\theta, D) (although it may be a slight abuse of notation if we are using \mathcal{L} as the sum/product over all data points – but so is \mathcal{L}(\theta) = p(y | \theta)). I think it is useful to think of it as a joint distribution in both parameter and data space, and of \mathcal{L}(\theta | D) as the probability of the parameters given the data. Maybe it’s not typical in some circles, but it’s typical enough to be in the Wikipedia definition of the likelihood function.
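To spell the notation out for i.i.d. data (this is just the standard textbook definition, nothing specific to this thread):

```latex
\mathcal{L}(\theta \mid D) = p(D \mid \theta) = \prod_{i=1}^{n} p(d_i \mid \theta),
\qquad
\log \mathcal{L}(\theta \mid D) = \sum_{i=1}^{n} \log p(d_i \mid \theta).
```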
Also, \mathcal{L}(\theta | D) is not a density when defined over several data points, in which case it is shorthand for a sum/product of densities; but for any given data point d_i, \mathcal{L}(\theta | d_i) is indeed a density, so all of the above applies, and we only need to sum the log-likelihoods afterwards for the purposes of MCMC algorithms.
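A minimal numerical sketch of that summation, using an assumed normal model (not the model from this thread):

```python
import math

# Per-point log density: this is the density L(theta | d_i) on the log scale.
def normal_lpdf(y, mu, sigma):
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

# The dataset log-likelihood is just the sum of per-point log densities.
def log_likelihood(data, mu, sigma):
    return sum(normal_lpdf(y, mu, sigma) for y in data)

data = [0.1, -0.4, 0.7]
print(log_likelihood(data, 0.0, 1.0))
```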
It may not be the best method, but it is wrong to say it is wrong. Deterministic optimization is, in general, mostly hill-climbing algorithms applied to the likelihood surface, so you can absolutely use MCMC to get a MAP or MLE; in fact, that is exactly what many simulated annealing algorithms are doing (and although I agree it may jump around the mode too much, I think “nowhere near” is probably an exaggeration).
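A toy sketch of that idea (my own illustration with a made-up quadratic objective, not any package's implementation): MCMC-style random-walk proposals with a temperature that is cooled over time, so the chain settles near the mode.

```python
import math
import random

# Negative log-likelihood of a toy model; the MLE is at theta = 3.0.
def neg_log_lik(theta):
    return (theta - 3.0) ** 2

def anneal(start, steps=20000, seed=0):
    rng = random.Random(seed)
    theta, energy = start, neg_log_lik(start)
    best, best_energy = theta, energy
    for i in range(steps):
        temp = max(1e-3, 1.0 - i / steps)      # simple linear cooling schedule
        prop = theta + rng.gauss(0, 0.5)       # random-walk proposal
        e = neg_log_lik(prop)
        # Metropolis acceptance at the current temperature:
        if e < energy or rng.random() < math.exp((energy - e) / temp):
            theta, energy = prop, e
            if e < best_energy:
                best, best_energy = prop, e
    return best

print(anneal(-5.0))  # typically lands close to 3.0
```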
Of course that is true, but also of course p(y) is constant with respect to \theta, so it doesn’t affect the inference.
No, but if the priors are over an unbounded interval, the only way of implementing them without assigning an infinitesimal density is to acknowledge that p(\theta) will be the same for any value and just ignore it. For “flat priors” over a bounded range the density will not be constant everywhere, since the probability will be zero outside of that range; in both cases it would be a density, either because you divide by the size of the interval, or by using the shortcut above for infinite intervals.
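A small sketch of why ignoring a constant p(\theta) works (hypothetical one-parameter example, not the OP's model): in a Metropolis-style acceptance ratio, any constant log prior cancels between numerator and denominator.

```python
import math

# Toy log-likelihood with a single parameter.
def log_lik(theta):
    return -(theta - 1.0) ** 2

# Acceptance probability for a proposal, with an optional log prior.
def accept_prob(theta, prop, log_prior=lambda t: 0.0):
    num = log_lik(prop) + log_prior(prop)
    den = log_lik(theta) + log_prior(theta)
    return min(1.0, math.exp(num - den))

# Any constant c gives the same acceptance probability as ignoring the prior,
# because the constant cancels in num - den.
print(accept_prob(0.0, 0.5))
print(accept_prob(0.0, 0.5, log_prior=lambda t: 7.3))
```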
I agree, some of the issues mentioned here (particularly variable transformation, if I had to guess) are causing the discrepancy. I also agree there is some nuance, abuse of notation, and confusion that may need to be treated more strictly, but for the purpose of @Tommaso11397’s problem, most of that is irrelevant. I’d say try:
- Unconstraining the variable in the Stan code (the more correct way would be making the Jacobian correction, but if the net effect of the constraint is just to get more rejections due to errors in the support of the Gaussian variance parameter, unconstraining is easier);
- Making sure that `brms` reproduces what you are doing in the MLE with the other package (or that what you are doing there is actually what you want).
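For what the Jacobian correction amounts to, here is a sketch in plain Python (with an assumed Exponential(1) density on a scale parameter sigma, not the OP's model): if sigma > 0 is reparameterized as u = log(sigma), the log density evaluated in u needs the log Jacobian term log|d sigma / d u| = u added, otherwise the implied distribution on sigma is silently changed.

```python
import math

# Log density defined on the constrained scale: Exponential(1) on sigma > 0.
def target_lpdf_sigma(sigma):
    return -sigma if sigma > 0 else float("-inf")

# Same target evaluated on the unconstrained scale u = log(sigma):
# the "+ u" term is the log Jacobian of the exp() transform.
def target_lpdf_u(u):
    sigma = math.exp(u)
    return target_lpdf_sigma(sigma) + u

# Quick numerical check: with the Jacobian term, the density in u still
# integrates to roughly 1 (it would diverge without it).
step = 0.001
total = sum(math.exp(target_lpdf_u(-10 + k * step)) * step for k in range(20000))
print(total)  # close to 1.0
```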