# Alternatives to marginalization

Hi!

I have a model that I am struggling to implement well in Stan, and would be grateful for any advice.

Theoretically, this is a case where it could make sense to condition a part of the model on a function of some parameters (which actually seems to work quite well in JAGS). However, this does not work with Hamiltonian Monte Carlo (and I know it may generally be a bad idea). I think your general advice is to aim for marginalization in such circumstances, but that does not seem possible here, as I would typically have about 5000 binary conditions/parameters to marginalize. Thus, I am wondering if there is a better alternative: is there an efficient way of achieving something akin to marginalizing over a large number of latent binary parameters?

My model (see code below) is a mixture model with two components, and my issue arises in the component p[i,j]. p[i,j] should itself be taken from a different function depending on the binary condition l[i,j], which could be operationalized and modeled in several ways. l[i,j] can essentially be observed as V[i] > Y[i,j], but the data will inevitably contain error, so setting l[i,j] = V[i] > Y[i,j] results in endogeneity/post-treatment bias. l[i,j] could plausibly be estimated as the conditional statement V[i] > (alpha[i] + beta[i] * theta[j]), but that would not work with HMC. I have been experimenting with a logistic transformation, l[i,j] = 1 / (1 + exp(-(V[i] - (alpha[i] + beta[i] * theta[j])) * 10)), which works but is theoretically unattractive and tends to fit certain patterns of noise when using real data.

I am attaching some downsized fake data and an example model. The true gamma parameters in the attached data are all zero. (The model is doing some very explicit transformations of the parameters to clarify the issues at hand. It can probably also be sped up by vectorization(?).)

To sum up: Is there a way to treat l as a vector of latent binary parameters in Stan?

Any help would be greatly appreciated!

``````data {
int<lower=0> J; 					// n of clusters
int<lower=0> N; 					// n of individuals
int B;							// scale bound
int<lower=-B,upper=B> Y[N,J]; 	// outcome of interest
real<lower=-B,upper=B> U[N,J];	// predictive info
int<lower=-B,upper=B> V[N]; 		// predictive info
}

parameters {
real theta[J];					// latent factor
real alpha[N];					// individual intercept (in the first mixture component)
real beta[N];						// individual coef (in the first mixture component)
real<lower=0> phi; 				// sd of alpha
real<lower=0> bhi;				// sd of beta
real<lower=0> sigma;				// sd of errors in y
real<lower=0,upper=1> gamma[N];	// weight between two mixture components
real<lower=0> gammaa;				// hyperparameter
real<lower=0> gammab;				// hyperparameter
}

transformed parameters {
real p[N,J];
real<lower=0,upper=1> l[N,J]; 	// This should be an integer condition
for (i in 1:N) {
  for (j in 1:J) {
    // It could make sense with:
    // l[i,j] = V[i] > (alpha[i] + beta[i] * theta[j]);
    // Or even some version of this (though it would give post-treatment bias):
    // l[i,j] = V[i] > Y[i,j];
    // This works, but it is theoretically unattractive (and fits noise in real data):
    l[i,j] = 1 / (1 + exp(-(V[i] - (alpha[i] + beta[i] * theta[j])) * 10));
    p[i,j] = l[i,j] * (U[i,j]*V[i] + B*U[i,j] - B) + (1-l[i,j]) * (U[i,j]*V[i] - B*U[i,j] + B);
  }
}
}

model {
alpha ~ normal(0, phi);
phi ~  cauchy(0,B);
beta ~ normal(1, bhi);
bhi ~  cauchy(0,1);
gamma ~ beta(gammaa, gammab);
gammaa ~ gamma(1.5,.5);
gammab ~ gamma(1.5,.5);
theta ~ normal(0,10);
sigma ~ cauchy(0,10);

for (i in 1:N)
  for (j in 1:J)
    Y[i,j] ~ normal((1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*p[i,j], sigma);
}
``````

model.stan (1.7 KB)
data.R (1.8 KB)
fit model.R (445 Bytes)

Update:
I may be about to solve this, using some mixture tricks from the manual. Will post any further questions.


> I may be about to solve this, using some mixture tricks from the manual.

I’m curious what you end up doing. Lemme know.

Not directly. If they are all dependent on each other, the marginalization will be intractable. You see this in problems like variable selection.

And no, you can’t fake it by doing selection like this on a continuous parameter. It breaks the continuous differentiability assumption in the HMC sampler.

I should’ve added that you can look at the latent discrete parameter chapter of the manual. Even if you can’t have discrete parameters, you can work with them in expectation, which is much more efficient statistically and computationally.
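
In the two-component case, working in expectation just means summing the likelihood over both values of the indicator. A minimal Stan sketch, with illustrative placeholder names (`z` is the marginalized indicator with Pr(z = 1) = `lambda`; `mu1`, `mu0`, `y`, and `sigma` are not from the model above):

```stan
// Marginalizing a single binary indicator z in expectation:
// p(y) = lambda * normal(y | mu1, sigma) + (1 - lambda) * normal(y | mu0, sigma),
// computed stably on the log scale.
target += log_sum_exp(log(lambda)   + normal_lpdf(y | mu1, sigma),
                      log1m(lambda) + normal_lpdf(y | mu0, sigma));
```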

Looking more at the manual and doing a bit more thinking, I think marginalization (as a finite mixture with two components) may be a feasible option here after all. At least I’ve specified a model that seems to work well both with fake and real data.

To recap the problem, I have a lot of binary parameters/conditions to estimate or marginalize (actually one per observation). The key question seems to be how much structure to impose on these and how, i.e., how many mixing proportions to estimate, and what to do with the priors. I’ve landed on 2*B+1 (which here typically means 11) as the number of mixing proportions for about 4000 observations. Then I’ve just left the mixing-proportion priors at beta(1,1), as they should be swamped by the data. This model seems to capture the main patterns in the data, sample fast, converge nicely, and recover the latent parameters in fake data.

This means I may go with this model, so the next step is to do any possible optimizing of the code. I would be very grateful for any comments on the code below. Does it look reasonable? Can it be improved?

``````data {
int<lower=1> N; 					// n of individuals
int<lower=1> J; 					// n of items
int<lower=1> B;					// scale bound
int<lower=-B,upper=B> Y[N,J]; 	// outcome of interest
int<lower=-B,upper=B> V[N]; 		// predictive info
real<lower=0,upper=1> U[N,J];		// predictive info
}

parameters {
real theta[J];					// latent factor
real alpha[N];					// individual intercept
real beta[N];						// individual coef
real<lower=0> phi; 				// sd of alpha
real<lower=0> psi;				// sd of beta
real<lower=0> sigma;				// sd of error in y
real<lower=0,upper=1> gamma[N];	// weight
real<lower=1> gammaa;				// hyperparameter
real<lower=1> gammab;				// hyperparameter
real<lower=0,upper=1> lambda[2*B+1];  // mixing proportions
}

model {
alpha ~ normal(0, phi);
phi ~  cauchy(0,B);
beta ~ normal(1, psi);
psi ~  cauchy(0,B);
gamma ~ beta(gammaa, gammab);
gammaa ~ gamma(1.5,.5);
gammab ~ gamma(1.5,.5);
theta ~ normal(0,10);
sigma ~ cauchy(0,10);
lambda ~ beta(1, 1);

for (i in 1:N)
  for (j in 1:J)
    target += log_sum_exp(
      log(lambda[V[i]+B+1])   + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
      log1m(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
}

generated quantities {
matrix[N,J] log_lik;
for (i in 1:N)
  for (j in 1:J)
    log_lik[i,j] = log_sum_exp(
      log(lambda[V[i]+B+1])   + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
      log1m(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
}``````

That’s the key to actually being able to marginalize efficiently. You can move the sums in algebraically so that there’s just a single sum per data point, which is efficient enough.

For efficiency, you usually want to use non-centered parameterizations for parameters like `alpha` and `gamma` (I’d suggest an alternative name for `gamma`, since it’s easily confused with the `gamma` distribution used for its hyperparameters).
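
A non-centered `alpha`, for example, might look like this (a sketch, assuming each piece goes in the corresponding block of the model above):

```stan
parameters {
  vector[N] alpha_raw;               // standard-normal innovations
  real<lower=0> phi;
}
transformed parameters {
  vector[N] alpha = phi * alpha_raw; // implies alpha ~ normal(0, phi)
}
model {
  alpha_raw ~ normal(0, 1);          // prior on the raw scale
  phi ~ cauchy(0, B);
}
```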

The manual suggests alternative parameterizations for the beta distribution which may be more natural in terms of a mean (between 0 and 1) and a total count: essentially a / (a + b) and a + b.
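
Sketched for `gamma` here, with hypothetical names `gamma_mean` and `gamma_count` replacing `gammaa` and `gammab`:

```stan
parameters {
  real<lower=0, upper=1> gamma_mean;  // a / (a + b)
  real<lower=0> gamma_count;          // a + b
  real<lower=0, upper=1> gamma[N];
}
model {
  // equivalent to gamma ~ beta(a, b) with a = mean * count, b = (1 - mean) * count
  gamma ~ beta(gamma_mean * gamma_count,
               (1 - gamma_mean) * gamma_count);
}
```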

For efficiency, you should compute terms like `(1-gamma[i])*(alpha[i] + beta[i] * theta[j])` only once and reuse them.
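
For instance, the likelihood loop in the model above could hoist the shared terms like this (a sketch; `mu`, `base`, and `dev` are local names I’m introducing):

```stan
for (i in 1:N)
  for (j in 1:J) {
    real mu   = (1 - gamma[i]) * (alpha[i] + beta[i] * theta[j]);
    real base = gamma[i] * U[i,j] * V[i];
    real dev  = gamma[i] * (B * U[i,j] - B);  // only the sign flips between components
    target += log_sum_exp(
      log(lambda[V[i]+B+1])   + normal_lpdf(Y[i,j] | mu + base + dev, sigma),
      log1m(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | mu + base - dev, sigma));
  }
```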

It’d probably be worthwhile to compute `B * U` as a scalar-matrix multiply and save it. Anything that involves a scalar and a matrix is more efficient than a loop.
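
Since both `B` and `U` are data, this product can even go in `transformed data`, so it’s computed once per fit rather than once per leapfrog step (sketch):

```stan
transformed data {
  matrix[N, J] BU = B * to_matrix(U);  // BU[i,j] then replaces every B * U[i,j]
}
```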

We have a `log_mix` function that’ll simplify what you’re doing to

``````target += log_mix(lambda[V[i] + B + 1], normal_lpdf( ... ), normal_lpdf(...));
``````

Oh, and you can also define `log_lik` as a transformed parameter and then use

``````target += sum(log_lik);
``````

in the model block. Then you don’t have to compute it all twice.
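
Put together with the model above, that would look roughly like this (a sketch; priors elided):

```stan
transformed parameters {
  matrix[N, J] log_lik;
  for (i in 1:N)
    for (j in 1:J)
      log_lik[i,j] = log_mix(lambda[V[i]+B+1],
        normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i]*theta[j])
                             + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
        normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i]*theta[j])
                             + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
}

model {
  // ... priors as in the model above ...
  target += sum(log_lik);  // the generated quantities block is no longer needed
}
```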

You can also compute the probability of each mixture component if you want. It’s those terms in the log likelihood normalized rather than log-sum-exp-ed.
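
Concretely, for the two components above (a sketch; `prob1` is a name I’m introducing for the posterior probability of the first component):

```stan
generated quantities {
  matrix[N, J] prob1;  // Pr(first component | Y, parameters)
  for (i in 1:N)
    for (j in 1:J) {
      real lp1 = log(lambda[V[i]+B+1])
                 + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i]*theta[j])
                                        + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma);
      real lp2 = log1m(lambda[V[i]+B+1])
                 + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i]*theta[j])
                                        + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma);
      prob1[i,j] = exp(lp1 - log_sum_exp(lp1, lp2));  // normalize on the log scale
    }
}
```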
