Calculate Entropy, Conditional Entropy, or Other Information-Theoretic Metrics from a Stan Model

Hi all!

I have a question about applying Stan results to downstream information-theoretic metrics, including entropy, conditional entropy, and information gain.

Here’s the background: I have a vector of response variables y, say measurements of an individual’s stature, and my target or conditioning variable x is, say, the country that individual comes from. My ultimate goal is to learn the information gain, i.e., how much more certain we are of an individual’s country of origin given their stature.

To be precise, IG = H(x) - H(x \mid y), the entropy of the target variable minus the conditional entropy of x given y. For a continuous variable the entropy is H(x) = -\int f(x) \log f(x) \, \mathrm{d}x, and so the conditional entropy is H(x \mid y) = -\int \int f(x, y) \log f(x \mid y) \, \mathrm{d}x \, \mathrm{d}y.
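Equivalently, this information gain is the mutual information between x and y (if the target x is categorical, the integrals over x become sums, but the identity is the same):

IG = H(x) - H(x \mid y) = \int f(x, y) \log \frac{f(x \mid y)}{f(x)} \, \mathrm{d}x \, \mathrm{d}y = \int f(x, y) \log \frac{f(x, y)}{f(x) \, f(y)} \, \mathrm{d}x \, \mathrm{d}y.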

Math aside, does anyone have suggestions on how the Stan log probability density could be used here, either by extracting results after fitting or by embedding the calculation in the generated quantities block?

Below I include a very general model that treats stature as normally distributed with a mean function and an sd function. Note that X here is not the target variable from above; it is a covariate, age. I assume I’d have to include the target somewhere. Eventually I’d also like to get IG from a multivariate normal (MVN) model of more than one trait (e.g., stature and weight).

data{
  int<lower=1> N;   // # of individuals
  vector[N] y;      // response (stature) per individual
  // predictors
  vector[N] X;      // age
}
parameters{
  real a;
  real r;           // not used below
  real b;
  real<lower=0> s_scale;
  real kappa;
}
transformed parameters{
  vector[N] mu;
  vector[N] Sigma;
  for(i in 1:N){
    mu[i] = a * X[i]^(1 + b);
    Sigma[i] = s_scale * (1 + kappa * X[i]);  // note: not guaranteed positive
  }
}
model{
  a ~ normal(0, 10);
  r ~ normal(0, 1);
  b ~ normal(0, 10);
  kappa ~ normal(0, 1);
  s_scale ~ cauchy(0, 5);

  y ~ normal(mu, Sigma);
}

To even begin to talk about joint and conditional entropies one will need to construct the joint probability density function over all of the variables of interest. In this case that would mean over both y and x to give \pi(y, x), but your Stan program instead defines the awkward conditional model \pi(y, \theta \mid x) where \theta are the parameters.

Consequently the first step would be to move beyond a regression model and model the covariate distribution explicitly. This would give the joint density function

\pi(y, x, \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta)

which marginalizes to

\begin{align*} \pi(y, x) &= \int \mathrm{d} \theta \, \pi(y, x, \theta) \\ &= \int \mathrm{d} \theta \, \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta). \end{align*}

At this point you could generate exact samples with ancestral sampling,

\tilde{\theta} \sim \pi(\theta) \\ \tilde{x} \sim \pi(x \mid \tilde{\theta}) \\ \tilde{y} \sim \pi(y \mid \tilde{x}, \tilde{\theta}),

and try to construct Monte Carlo estimators of the relevant entropies. That said, these Monte Carlo estimators are notoriously unstable; the variance of \log \pi(y, x) is often very large, if not infinite.
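For concreteness, here is a minimal Stan sketch of that ancestral sampling scheme; because it has no parameters block it would be run with the fixed_param sampler, and every iteration is an independent draw from \pi(y, x, \theta). The lognormal age model, the hyperparameters mu_x and sigma_x, and all of the hyperprior values are purely illustrative stand-ins for \pi(x \mid \theta), and the floor on the scale is only there to keep the sketch well-defined, since the linear scale model in the original program is not guaranteed to be positive.

generated quantities {
  // theta-tilde ~ pi(theta): the priors from the original program, plus
  // illustrative hyperpriors for the covariate model
  real a = normal_rng(0, 10);
  real b = normal_rng(0, 10);
  real kappa = normal_rng(0, 1);
  real s_scale = cauchy_rng(0, 5);
  real mu_x = normal_rng(0, 1);        // hypothetical location of the age model
  real sigma_x = lognormal_rng(0, 1);  // hypothetical scale of the age model
  real x_sim;
  real y_sim;

  // half-Cauchy prior on s_scale via rejection of negative draws
  while (s_scale <= 0)
    s_scale = cauchy_rng(0, 5);

  // x-tilde ~ pi(x | theta-tilde): illustrative lognormal age model
  x_sim = lognormal_rng(mu_x, sigma_x);

  // y-tilde ~ pi(y | x-tilde, theta-tilde): the original regression; the
  // linear scale model is not guaranteed positive, so floor it to keep
  // the sketch well-defined
  y_sim = normal_rng(a * x_sim^(1 + b),
                     fmax(s_scale * (1 + kappa * x_sim), 1e-8));
}

Each saved (x_sim, y_sim) pair is then an exact draw from \pi(y, x), which is what the Monte Carlo entropy estimators would consume.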

Notice, however, that in this construction observed data are never used. This kind of analysis really just considers information gained relative to the prior model. If you wanted to consider the information gained relative to a posterior distribution informed by the data \tilde{y} and \tilde{x}, then you’d need

\begin{align*} \pi(y, x \mid \tilde{y}, \tilde{x}) &= \int \mathrm{d} \theta \, \pi(y, x, \theta \mid \tilde{y}, \tilde{x}) \\ &= \int \mathrm{d} \theta \, \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta \mid \tilde{y}, \tilde{x}). \end{align*}

In Stan this would require building the joint density function

\pi(y, x, \tilde{y}, \tilde{x}, \theta)

where (\tilde{y}, \tilde{x}) are defined in the data block and y, x are predictive variables defined in the generated quantities block. One could then run Stan and use the posterior predictive samples for y and x to construct Markov chain Monte Carlo estimators of the entropies.
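A minimal sketch of such a program, reusing the illustrative lognormal age model from above for \pi(x \mid \theta), might look as follows; if the variable of interest is instead the categorical target (country), the same structure carries over with a categorical model for \pi(x \mid \theta). As before, the floor in the generated quantities block is only there to keep the sketch well-defined.

data {
  int<lower=1> N;
  vector[N] y_obs;           // observed stature, the y-tilde above
  vector<lower=0>[N] x_obs;  // observed age, the x-tilde above
}
parameters {
  real a;
  real b;
  real<lower=0> s_scale;
  real kappa;
  real mu_x;                 // hypothetical location of the age model
  real<lower=0> sigma_x;     // hypothetical scale of the age model
}
model {
  // pi(theta)
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  s_scale ~ cauchy(0, 5);
  kappa ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma_x ~ lognormal(0, 1);

  // pi(x-tilde | theta): the covariate model, now explicit
  x_obs ~ lognormal(mu_x, sigma_x);

  // pi(y-tilde | x-tilde, theta): the original regression
  // (the scale is not guaranteed positive; proposals that make it
  // negative will be rejected during sampling)
  for (n in 1:N)
    y_obs[n] ~ normal(a * x_obs[n]^(1 + b),
                      s_scale * (1 + kappa * x_obs[n]));
}
generated quantities {
  // one posterior predictive draw (x, y) ~ pi(y, x | y-tilde, x-tilde)
  // per posterior draw of theta
  real x_pred = lognormal_rng(mu_x, sigma_x);
  real y_pred = normal_rng(a * x_pred^(1 + b),
                           fmax(s_scale * (1 + kappa * x_pred), 1e-8));
}

Each iteration then yields one posterior predictive draw (x_pred, y_pred) from \pi(y, x \mid \tilde{y}, \tilde{x}).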

But again these estimators are notoriously finicky and hard to get right even if the posterior samples are well-behaved!
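For concreteness, with S posterior draws \theta^{(s)} and corresponding predictive draws (y^{(s)}, x^{(s)}), one common way to assemble such an estimator, shown here for the joint entropy, is the nested Monte Carlo form

\begin{align*} H(y, x \mid \tilde{y}, \tilde{x}) &\approx -\frac{1}{S} \sum_{s=1}^{S} \log \hat{\pi}(y^{(s)}, x^{(s)} \mid \tilde{y}, \tilde{x}), \\ \hat{\pi}(y^{(s)}, x^{(s)} \mid \tilde{y}, \tilde{x}) &= \frac{1}{S} \sum_{s'=1}^{S} \pi(y^{(s)} \mid x^{(s)}, \theta^{(s')}) \, \pi(x^{(s)} \mid \theta^{(s')}), \end{align*}

where the inner average sits inside a logarithm; that nesting is one source of the bias and instability mentioned above.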