Alternatives to marginalization

Hi!

I have a model that I am struggling to implement well in Stan, and would be grateful for any advice.

Theoretically, this is a case where it could make sense to condition a part of the model on a function of some parameters (which actually seems to work quite well in JAGS). However, this does not work with Hamiltonian MC (and I know it may generally be a bad idea). I think your general advice is to aim for marginalization in such circumstances, but that does not seem feasible here, as I would typically have about 5000 binary conditions/parameters to marginalize over. Thus, I am wondering if there is a better alternative: is there an efficient way of achieving something akin to marginalizing a large number of latent binary parameters?

My model (see code below) is a mixture model with two components, and my issue arises in the component p[i,j]. p[i,j] should itself be taken from one of two different functions depending on the binary condition l[i,j], which could be operationalized and modeled in several ways. l[i,j] can essentially be observed as V[i] > Y[i,j], but the data will inevitably contain error, so setting l[i,j] = V[i] > Y[i,j] results in endogeneity/post-treatment bias. l[i,j] could plausibly be estimated as the conditional statement V[i] > (alpha[i] + beta[i] * theta[j]), but that would not work with HMC. I have been experimenting with a logistic approximation, l[i,j] = 1 / (1 + exp(-(V[i] - (alpha[i] + beta[i] * theta[j])) * 10)), which works, but it is theoretically unattractive and tends to fit certain patterns of noise when using real data.

I am attaching some downsized fake data and an example model. The true gamma parameters in the attached data are all zero. (The model is doing some very explicit transformations of the parameters to clarify the issues at hand. It can probably also be sped up by vectorization(?).)

To sum up: Is there a way to treat l as a vector of latent binary parameters in Stan?

Any help would be greatly appreciated!

data {
  int<lower=0> J; 					// n of clusters
  int<lower=0> N; 					// n of individuals
  int B;							// scale bound
  int<lower=-B,upper=B> Y[N,J]; 	// outcome of interest
  real<lower=-B,upper=B> U[N,J];	// predictive info
  int<lower=-B,upper=B> V[N]; 		// predictive info
}

parameters {
  real theta[J];					// latent factor
  real alpha[N];					// individual intercept (in the first mixture component)
  real beta[N];						// individual coef (in the first mixture component)
  real<lower=0> phi; 				// sd of alpha
  real<lower=0> bhi;				// sd of beta
  real<lower=0> sigma;				// sd of errors in y
  real<lower=0,upper=1> gamma[N];	// weight between two mixture components
  real<lower=0> gammaa;				// hyperparameter
  real<lower=0> gammab;				// hyperparameter
}

transformed parameters { 
  real p[N,J];
  real<lower=0,upper=1> l[N,J]; 	// This should be an integer condition
  for (i in 1:N) {
    for (j in 1:J) {
      // It could make sense with:
      //   l[i,j] = V[i] > (alpha[i] + beta[i] * theta[j]);
      // Or even some version of this (though it would give post-treatment bias):
      //   l[i,j] = V[i] > Y[i,j];
      // This works, but it is theoretically unattractive (and fits noise in real data):
      l[i,j] = 1 / (1 + exp(-(V[i] - (alpha[i] + beta[i] * theta[j])) * 10));
      p[i,j] = l[i,j] * (U[i,j]*V[i] + B*U[i,j] - B) + (1 - l[i,j]) * (U[i,j]*V[i] - B*U[i,j] + B);
    }
  }
}

model {
  alpha ~ normal(0, phi); 
  phi ~  cauchy(0,B);
  beta ~ normal(1, bhi); 	
  bhi ~  cauchy(0,1);  
  gamma ~ beta(gammaa, gammab);
  gammaa ~ gamma(1.5,.5);
  gammab ~ gamma(1.5,.5);
  theta ~ normal(0,10);  
  sigma ~ cauchy(0,10);  

  for (i in 1:N)
    for (j in 1:J)     
      Y[i,j] ~ normal((1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*p[i,j], sigma); 
}

model.stan (1.7 KB)
data.R (1.8 KB)
fit model.R (445 Bytes)

Update:
I may be about to solve this, using some mixture tricks from the manual. Will post any further questions.

I may be about to solve this, using some mixture tricks from the manual

I’m curious what you end up doing. Lemme know

Not directly. If they are all dependent on each other, the marginalization will be intractable. You see this in problems like variable selection.

And no, you can’t fake it by doing selection like this on a continuous parameter. It breaks the continuous differentiability assumption in the HMC sampler.

I should’ve added that you can look at the latent discrete parameter chapter of the manual. Even if you can’t have discrete parameters, you can work with them in expectation, which is much more efficient statistically and computationally.
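
For a single binary indicator per observation, the pattern from that chapter looks roughly like this minimal sketch (illustrative names, not your model): z[n] is never sampled; it is summed out of the likelihood.

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> lambda;   // Pr(z[n] == 1)
  real mu1;                        // location when z[n] == 1
  real mu0;                        // location when z[n] == 0
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    // p(y[n]) = lambda * normal(y[n] | mu1, sigma) + (1 - lambda) * normal(y[n] | mu0, sigma)
    target += log_sum_exp(log(lambda) + normal_lpdf(y[n] | mu1, sigma),
                          log1m(lambda) + normal_lpdf(y[n] | mu0, sigma));
}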

Thanks for your answers!

Looking more at the manual and doing a bit more thinking, I think marginalization (as a finite mixture with two components) may be a feasible option here after all. At least I’ve specified a model that seems to work well both with fake and real data.

To recap the problem, I have a lot of binary parameters/conditions to estimate or marginalize (actually one per observation). The key question seems to be how much structure to impose on these and how, i.e. how many mixing proportions to estimate and what to do with their priors. I've landed on 2*B+1 mixing proportions (which typically means 11) for about 4000 observations, and I've left the priors on the mixing proportions at beta(1,1), as they should be swamped by the data. This model seems to capture the main patterns in the data, sample fast, converge nicely, and recover the latent parameters in fake data.

This means I may go with this model, so the next step is to do any possible optimizing of the code. I would be very grateful for any comments on the code below. Does it look reasonable? Can it be improved?

data {
  int<lower=1> N; 					// n of individuals
  int<lower=1> J; 					// n of items
  int<lower=1> B;					// scale bound
  int<lower=-B,upper=B> Y[N,J]; 	// outcome of interest
  int<lower=-B,upper=B> V[N]; 		// predictive info
  real<lower=0,upper=1> U[N,J];		// predictive info
}

parameters {
  real theta[J];					// latent factor
  real alpha[N];					// individual intercept 
  real beta[N];						// individual coef
  real<lower=0> phi; 				// sd of alpha
  real<lower=0> psi;				// sd of beta
  real<lower=0> sigma;				// sd of error in y
  real<lower=0,upper=1> gamma[N];	// weight
  real<lower=1> gammaa;				// hyperparameter
  real<lower=1> gammab;				// hyperparameter
  real<lower=0,upper=1> lambda[2*B+1];  // mixing proportions
}

model {
  alpha ~ normal(0, phi); 
  phi ~  cauchy(0,B);
  beta ~ normal(1, psi); 	
  psi ~  cauchy(0,B);  
  gamma ~ beta(gammaa, gammab);
  gammaa ~ gamma(1.5,.5);
  gammab ~ gamma(1.5,.5);
  theta ~ normal(0,10);  
  sigma ~ cauchy(0,10);  
  lambda ~ beta(1, 1);
  
  for (i in 1:N)
    for (j in 1:J) {
      target += log_sum_exp(
        log(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
        log1m(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
    }
}

generated quantities {
  matrix[N,J] log_lik;
  for (i in 1:N)
    for (j in 1:J) {
      log_lik[i,j] = log_sum_exp(
        log(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
        log1m(lambda[V[i]+B+1]) + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
    }
}

That's the key to actually being able to marginalize efficiently. You can move the sums in algebraically so there's just a single sum per data point, which is efficient enough.
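
Spelled out, the indicator for each observation is summed out inside the likelihood, so

  p(Y[i,j]) = lambda * normal(Y[i,j] | mu1, sigma) + (1 - lambda) * normal(Y[i,j] | mu2, sigma)

where lambda is shorthand for lambda[V[i]+B+1] and mu1, mu2 stand for the two component means in your code; taking logs turns the right-hand side into exactly the single log_sum_exp per data point in your loop.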

For efficiency, you usually want to use non-centered parameterizations for parameters like alpha and gamma (I’d suggest an alternative name for “gamma”—it’s confusing here).
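
For example, a non-centered version of the alpha hierarchy might look like the sketch below (names ending in _raw are illustrative; the same pattern works for beta around its mean of 1):

parameters {
  real alpha_raw[N];                 // standard-normal scale
  real<lower=0> phi;                 // sd of alpha
}
transformed parameters {
  real alpha[N];
  for (i in 1:N)
    alpha[i] = phi * alpha_raw[i];   // implies alpha[i] ~ normal(0, phi)
}
model {
  alpha_raw ~ normal(0, 1);          // the hierarchical prior moves to the raw scale
  phi ~ cauchy(0, B);
}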

The manual suggests an alternative parameterization for the beta that may be more natural: in terms of a mean (between 0 and 1) and a total count, essentially a / (a + b) and a + b.
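
A sketch of that reparameterization for the weights (gamma_mu and gamma_kappa are illustrative names, and the priors are placeholders rather than recommendations):

parameters {
  real<lower=0, upper=1> gamma_mu;     // mean: gammaa / (gammaa + gammab)
  real<lower=0> gamma_kappa;           // total count: gammaa + gammab
  real<lower=0, upper=1> gamma[N];
}
model {
  gamma ~ beta(gamma_mu * gamma_kappa, (1 - gamma_mu) * gamma_kappa);
  gamma_mu ~ beta(1, 1);               // placeholder prior on the mean
  gamma_kappa ~ gamma(1.5, 0.5);       // placeholder prior on the total count
}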

For efficiency, you should compute terms like (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) only once and reuse them.
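
For instance, the likelihood loop can hoist the pieces that do not vary (same likelihood as in your model, just restructured; lp1, lp0, and mu are local variables):

  for (i in 1:N) {
    real lp1 = log(lambda[V[i] + B + 1]);    // depends only on i
    real lp0 = log1m(lambda[V[i] + B + 1]);
    for (j in 1:J) {
      real mu = (1 - gamma[i]) * (alpha[i] + beta[i] * theta[j]);   // computed once, used twice
      target += log_sum_exp(
        lp1 + normal_lpdf(Y[i,j] | mu + gamma[i] * (U[i,j]*V[i] + B*U[i,j] - B), sigma),
        lp0 + normal_lpdf(Y[i,j] | mu + gamma[i] * (U[i,j]*V[i] - B*U[i,j] + B), sigma));
    }
  }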

It'd probably be worthwhile to compute B * U as a scalar-matrix multiply and save it. Anything that involves a scalar and a matrix is more efficient as a single matrix operation than as a loop.
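
That assumes declaring U as a matrix rather than a 2-D real array, for example (BU is an illustrative name):

data {
  matrix<lower=0, upper=1>[N, J] U;   // instead of real<lower=0,upper=1> U[N,J]
}
transformed data {
  matrix[N, J] BU;
  BU = B * U;                         // scalar-matrix multiply, done once
}

The B*U[i,j] pieces in the likelihood then become BU[i,j], read from transformed data rather than recomputed.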

We have a log_mix function that’ll simplify what you’re doing to

target += log_mix(lambda[V[i] + B + 1], normal_lpdf( ... ), normal_lpdf(...));

Oh, and you can also define log_lik as a transformed parameter and then use

target += sum(log_lik);

in the model block. Then you don’t have to compute it all twice.
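
Structurally, that looks like the sketch below (it just reuses the expressions already in your code and moves them between blocks):

transformed parameters {
  matrix[N, J] log_lik;
  for (i in 1:N)
    for (j in 1:J)
      log_lik[i,j] = log_mix(lambda[V[i]+B+1],
        normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma),
        normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma));
}
model {
  // priors as in your model block, then a single likelihood statement:
  target += sum(log_lik);
}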

You can also compute the probability of each mixture component if you want. It’s those terms in the log likelihood normalized rather than log-sum-exp-ed.
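
In code, that could go in generated quantities along these lines (Pr1 is an illustrative name for the posterior probability that observation (i, j) comes from the first component):

generated quantities {
  matrix[N, J] Pr1;
  for (i in 1:N)
    for (j in 1:J) {
      real lp1 = log(lambda[V[i]+B+1])
        + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] + B*U[i,j] - B), sigma);
      real lp2 = log1m(lambda[V[i]+B+1])
        + normal_lpdf(Y[i,j] | (1-gamma[i])*(alpha[i] + beta[i] * theta[j]) + gamma[i]*(U[i,j]*V[i] - B*U[i,j] + B), sigma);
      Pr1[i,j] = exp(lp1 - log_sum_exp(lp1, lp2));   // normalize the two weighted terms
    }
}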
