Marginalizing Continuous Variables


This relates to a previous thread, but it’s a separate questions so I wanted to put it somewhere else.

The original I have is below. In this model I treat X as exogenous. However, it’s not really exogenous. The mean and covariance are exogenous and then I simulate X from a multivariate normal and pass that X to Stan. This isn’t so much an issue when T is small, but for larger T Stan tends to get slower.

Naively, I think I want something like the second model where I simulate X given some mu and sigma. The problem with this is that X is not identified.

Is the original model already effectively marginalizing out X? Do I have any any other options to remove X when T is large?

Original Model

data {
     int<lower=0> N;
     int<lower=0> T;
     matrix[T, N] X;
transformed data {
     vector[T] ones_T;

     ones_T = rep_vector(1, T);
parameters {
    simplex[N] w;
    real<lower=0> sigma;
model {
    ones_T ~ normal(X * w, sigma);

Naive model that I ideally would want

data {
     int<lower=0> N;

     vector[N] mu_X;
     cov_matrix[N] sigma_X;
parameters {
    vector[N] X;
    simplex[N] w;
    real<lower=0> sigma;
model {
    X ~ multi_normal(mu_X, sigma_X);
    1 ~ normal(X * w, sigma);


I can never remember what “exogenous” means, but I don’t think that matters here.

What you’re doing here builds in measurement uncertainty on the X conditioned on mu_X and sigma_X. It will presumably lead to more uncertainty in the posterior.

For efficiency, you just want to use 1 ~ in both models—scalars automatically get broadcast (replicated) as many times as necessary.

Why are the weights a simplex?


On exogenous, I suppose what I mean is that it is something from outside the model.

I hadn’t tried 1, thanks for the tip.

On weights being a simplex, I should probably explain a bit more what I’m doing. In finance, there’s a paper [1] on the sampling error of mean-variance portfolio weights. It shows that a mean-variance optimization in portfolio theory is equivalent to a regression. You can then use regression to get the sample distributions of the weights. The weights in this case are the allocations to each asset. Typically in finance, the weights are constrained to be positive and sum to 1. Hence, a simplex prior made the most sense. I had tried other options later as a check, but the results tended to not make as much sense.

In the original paper, the X would be something like the returns to some stocks as they occurred in the past. However, that’s not my preference. There’s some quirks to finance data, so it’s generally better to do analysis on a forward-looking basis rather than backward-looking.

So I used Stan to estimate the mean and covariance. I then simulated the posterior predictive distribution. To get the results to make sense, I needed to scale the likelihood function. For instance, if Stan outputs 1200 simulations from the posterior predictive of the original stock returns, then the model statement would be
target += (T / 1200) * normal_lpdf(1 | X * w, sigma);
though I had replace (T/1200) with a real using transformed data. I couldn’t get it down to the case where it only fits 1 observation, but I was getting sensible results this way.

So, I think that had resolved my issues, though I’m still not entirely sure if I can say that I have marginalized out mu_X and sigma_X.



As an aside (and prefacing that I’m not that knowledgeable about Bayesian optimization), Bayesian optimization typically focuses on putting a prior over the objective function. It then uses function evaluations to update the distribution of the objective function. This way you don’t need to know the gradients. In some sense, I’m doing a quadratic optimization here, even though it’s formulated as a regression problem. However, the difference is that I’m putting a prior is on the distribution of the arguments to the optimization, i.e. the weights vector above. As far as I can tell, this is different from the typical approach in Bayesian decision theory (which is how Bayesian portfolio theory is usually framed). In other words, I am treating the decision as a random variable.


Yes, you have. MCMC samples jointly over all the parameters in the posterior, so you can just drop the ones you don’t care about to marginalize.

Decisions are treated as random varibles in Bayesian analyses, as they’re functions of parameters, which are random variables.


Decisions are treated as random varibles in Bayesian analyses, as they’re functions of parameters, which are random variables.

I’m not an expert, but the way I usually see it presented is that you generate the posterior predictive distribution and effectively treat that as fixed. You then use get some utility function that is a function of the posterior predictive and your decision variables and optimize.

For instance, see Equation 8 in:


That’s the function of parameters. It may even be random (require an RNG)

That function is just another function of random variables.

The result is a distribution over the quantities of interest—like some utility. Then you have to decide what to optimize—often it’s some kind of posterior expectation, which has to be done outside of Stan’s language as you only have access to parameters, not things like their expectations inside a program.


Yes, you would calculate the expected utility using the posterior predictive distribution. That would mean that given weights, I could use the posterior predictive distribution to compute the distribution of utility.

In addition, if I use a quadratic optimization with linear equality constraints, then there is a well-known analytic solution. Given the posterior distribution, I can compute the weight vectors and get the distribution of weights. In this sense, I see your point that the decisions are a random variable. The expected value of this weight distribution should be more-or-less equal the normal optimization result in the limit.

However, if you have inequality constraints, such as constraining the weights to be positive, then you can’t get an analytic solution for meaningfully large problems. Further, if you do the same procedure as above (use posterior distribution and optimize to get weights for each simulation), then the expected value of this weight distribution will not equal the normal optimization result because of non-linearity in the relationship between the weights and the posterior distributions.

So the idea here is to re-code the optimization as a regression (what the other guy already did) and then put a prior on the decision directly (what I added). In my tests, the results weren’t exactly like the procedure in the above paragraph, but it was about as close as you could expect.


Thanks—that was the piece of the puzzle I was missing. I’m used to simple hello-world type comparisons, like lambda[1] > lambda[2] for a couple of Poisson parameters or something. If choosing optimal behavior given parameters is itself hard, then you need to do soemthing more complicated.

Stan doesn’t actually let you code optimizations over sampler output directly, but we do that internally all the time with nested autodiff. Given that derivatives distribute through expectations, we can do Monte Carlo calculations for the gradients and use gradient-based optimizers.