Marginalising out many discrete latent variables / modelling augmented count data

Hello all

I am a big fan of marginalising out latent variables as in the change point example in the manual. But this time, I have approx 100 discrete latent variables to marginalise out, which is generating computational issues. I am looking for either a computational trick that I missed in the manual or an alternative model approach. Any help would be great!

I am considering a simple augmented count model with augmented likelihood

n+m \sim Multinomial(\pi).

The observed counts n are from 100 locations, and we currently model missing counts m via a Negative Binomial

m \sim NB(n, p)

(in base R parameterisation) where p is a vector of observation probabilities. The expected value of the augmented counts, E(n+m), is n/p.
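Spelling out the expectation (just unpacking the size/prob parameterisation above):

E(m_i) = n_i \frac{1-p_i}{p_i}, \qquad E(n_i+m_i) = n_i + n_i \frac{1-p_i}{p_i} = \frac{n_i}{p_i}.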

To marginalise out all 100 latent variables, I could put a cap m_{max} on m. But the joint latent space is huge: it has \prod_i m_{i,max} \approx 1e135 possible configurations. I think this precludes marginalisation as in the change point example in the manual (15.2 p214).
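Written out, the brute-force marginalisation would be the nested sum

\sum_{m_1=0}^{m_{1,max}} \cdots \sum_{m_{100}=0}^{m_{100,max}} \left[ \prod_{i=1}^{100} \mathsf{NB}(m_i \mid n_i, p_i) \right] \mathsf{Multinomial}(n+m \mid \pi),

which has roughly \prod_i (m_{i,max}+1) terms, because the Multinomial does not factor over locations.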

Is there another trick that I could exploit for the augmented count model above?

Or maybe there is an approach for modelling missing counts under strong information on the observation probabilities p, that is more tractable in Stan?

Thank you very much for your input,
olli

Just for completeness, here is the full Stan model with no attempt at marginalising the latent variables (i.e. this won't compile, but may clarify the problem).

data {
  // counts
  int<lower=1> N;                         // number of locations, approx 100
  int COUNTS[N];                          // observed counts
  // sampling
  real<lower=0> P_ALPHA[N];               // sampling probability prior hyperparameter alpha
  real<lower=0> P_BETA[N];                // sampling probability prior hyperparameter beta
}
transformed data {
  // Dirichlet hyperparameters depending on size of data
  real<lower=0> prior_weight;
  vector<lower=0>[N] dirprior;
  prior_weight = N;                       // poor man's cast from int to real
  prior_weight = 0.8 / prior_weight;
  dirprior = rep_vector(prior_weight, N);
}
parameters {
  simplex[N] props;                       // target parameter
  vector<lower=0, upper=1>[N] prob_obs;   // sampling parameter
  int MISSING[N];                         // 100 latent variables, to be integrated out
                                          // (illegal: Stan does not allow integer parameters)
}
transformed parameters {
  vector<lower=0>[N] nb2_mu;              // transformation to NegBin2 mu parameter
  vector<lower=0>[N] nb2_phi;             // transformation to NegBin2 phi parameter
  int AUX_COUNTS[N];                      // augmented data (also illegal here)
  nb2_mu = to_vector(COUNTS) .* (1 - prob_obs) ./ prob_obs;
  nb2_phi = to_vector(COUNTS);
  AUX_COUNTS = COUNTS + MISSING;
}
model {
  prob_obs ~ beta(P_ALPHA, P_BETA);            // prior on sampling parameter
  props ~ dirichlet(dirprior);                 // Dirichlet prior on target parameter
  MISSING ~ neg_binomial_2(nb2_mu, nb2_phi);   // model for missing counts
  AUX_COUNTS ~ multinomial(props);             // augmented likelihood
}

There are a couple things going on here.

One, missing counts are hard, because you need to sum over a large plausible range, which is very inefficient.

Here, it looks like the missingness is per observation N.

An alternative is to replace the counts with continuous values. I don't quite understand your model, though: the multinomial in Stan is for an array of counts, so I don't see how m can be both negative binomial and multinomial as defined.

Hi Bob,

Thanks for your reply. Yes, I have the same intuition that summing lpmf’s over a large plausible range of missing values (and all their combinations in particular) is likely too inefficient.

A bit more on our model, which may help to come up with a solution that might work well in Stan:

We have count data n_i from i=1,...,100 locations, and the counts range from 1 to 100 although the median is small, around 10. (This makes me a bit cautious about continuous approximations, but I may be wrong.)

We also have a good understanding of how many events per location were observed. We can put an informative prior on the sampling proportion p_i. The prior means of the p_i are not too small, typically around 0.2 (range 0.04-0.4). This information allows us to model how many events m_i we may have missed per location, and we currently do this with a Negative Binomial model, m_i \sim NB(n_i, p_i). This allows us to define augmented counts z_i=m_i+n_i, and we can account for sampling differences when we base inference on the augmented likelihood z \sim Multinomial( \pi ). The proportions \pi are our target parameter of interest.

Following the change point example, we need an upper bound m_{i,max}, and I simply took the upper 95% quantile of each Negative Binomial under the prior mean of p_i. This gives values of m_{i,max} in the range 6-250, which is not too bad in itself. It's all the combinations of missing data values across all locations that make the change point approach in Stan inefficient. Potentially we could model the missing data in a way that is more tractable in Stan, which is why I posted under modelling techniques…
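In symbols, the cap I used is (in my notation)

m_{i,max} = F^{-1}_{\mathsf{NB}(n_i, \bar{p}_i)}(0.95),

where \bar{p}_i is the prior mean of p_i and F^{-1} is the Negative Binomial quantile function in the size/prob parameterisation above.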

Thanks again in advance for any pointers :-)

100 sums of 100 terms is no problem if that’s the size you’re talking about. Does the model account for the truncation at 100? If there’s a max, it’s not natural count data.

The continuous value is latent. Are you worried the shape of something like a gamma distribution won’t be Poisson-like enough? I wouldn’t worry about that too much—it’s a second- or third-order effect here.

That's a lot of overdispersion built in for z. There's the addition of m that's going to add overdispersion, and then m itself is overdispersed with a negative binomial.

I see that you’re inflating each m_i independently.

Upper 95% isn't enough of a bound; by definition that misses 5% of the probability mass, and it's in the tails. But you could easily go from 0 to 200 if you only have N = 100 total.

Discrete sampling isn’t a very efficient way to do this kind of thing either.

You can break that multinomial apart. If you have

m + n ~ multinomial(theta);

that’s the same as

m ~ multinomial(theta);
n ~ multinomial(theta);

So you can separate the missing and observed data. In fact, you can do this component by component with the m[i]. I don’t know how to marginalize out the m[i], though, and I also don’t have any suggestions for replacement distributions.

The continuous generalization is easy, because the m[i] appears in the context m[i] * log theta[i] in the log density. So I’d just try something like a gamma distribution.
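Roughly, something like the following sketch (only a hint, not a worked-out model: the gamma here is matched to the negative binomial mean only, the "missing" parameter name is mine, and the multinomial normalising constant is dropped):

parameters {
  simplex[N] props;                        // target parameter, as in the model above
  vector<lower=0, upper=1>[N] prob_obs;    // sampling parameter, as above
  vector<lower=0>[N] missing;              // continuous stand-in for the integer MISSING
}
model {
  prob_obs ~ beta(P_ALPHA, P_BETA);
  props ~ dirichlet(dirprior);
  // gamma with mean COUNTS[i] * (1 - prob_obs[i]) / prob_obs[i],
  // i.e. matched to the negative binomial mean used above
  missing ~ gamma(to_vector(COUNTS), prob_obs ./ (1 - prob_obs));
  // multinomial kernel only: sum_i (COUNTS[i] + missing[i]) * log(props[i]);
  // the multinomial normalising constant is dropped here
  target += dot_product(to_vector(COUNTS) + missing, log(props));
}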

Hi Bob

Ah, here is something that I may misunderstand:

I have augmented counts z=n+m where n is the count data from the 100 locations, and m is the vector of latent parameters (one for each location) that I am hoping to integrate out. Let's consider one of the possible combinations of latent variables and call this m^r. The corresponding augmented count is z^r=n+m^r. The usual constant in the Multinomial depends on m^r. So I think I cannot drop the constant, and therefore cannot separate the likelihood contributions as you suggest. Maybe I am just being thick at this point because I don't fully understand how Stan works in all details, sorry if that's the case!

That sounds interesting, but I can't really follow it through. Could you expand a little?

Thanks again for all your help,

olli

The key here is constant. Since we have

\displaystyle \mathsf{Multinomial}(y \mid \theta) \ \propto \ \prod_{n=1}^N (\theta_n)^{y_n},

it follows that

\mathsf{Multinomial}(y + z \mid \theta) \ \propto \ \mathsf{Multinomial}(y \mid \theta) \cdot \mathsf{Multinomial}(z \mid \theta).

Another way of saying this is that the counts are themselves sufficient statistics—the order of collection doesn’t matter.

It might be easier to think about in the binomial case (where y and z are counts and theta is a probability); the multinomial is then a natural generalization.

Hmmm. Maybe I am thick here, but the data are augmented by a random variable (the missing counts m), i.e.

Multinomial(z|\pi)=Multinomial(n+m|\pi)= \frac{(N+M)!}{(n_1+m_1)! \dots (n_{100}+m_{100})!} \prod_{i=1}^{100} \pi_i^{n_i+m_i},

where N=\sum_{i=1}^{100} n_i , M=\sum_{i=1}^{100} m_i. So the usual constant in the Multinomial depends on the auxiliary parameter and I think cannot be dropped; I really have \theta=(\pi, m, p). Does this make sense?
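To make the dependence explicit, on the log scale

\log \mathsf{Multinomial}(n+m \mid \pi) = \log (N+M)! - \sum_{i=1}^{100} \log (n_i+m_i)! + \sum_{i=1}^{100} (n_i+m_i) \log \pi_i,

and the first two terms depend on m, so they are not constant in the latent variables.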

Perhaps the Gamma is the way forward in Stan, but I did not really follow what you meant. Could you explain in a bit more detail?

Thanks again for all your help,
olli

Thanks for following up! You’re right—I forgot that one of the terms was random. And I can’t see how to get it to drop out. This is very much breaking my intuition about these problems.

Thanks, Bob.

Good we agree on this!

I will see if there is a workaround – the only way forward that I can see in Stan is to reduce the number of discrete latent variables, since 1e135 summations for the log probability of the model aren't feasible. If I try to reduce the dimensionality of the latent space, roughly how many summations at most should I aim for, based on your experience (if that's a well-posed question at all)?

If anyone else has a different idea, please let me know :-)

Cheers,
olli

I don’t know if you’ve seen this, but I tried to do pretty much exactly this not so long ago. See Data augmentation model possible in Stan? and the comments, which are really useful (although I still need to actually work more on the model!)

Hi Sam

No, I had not; thanks for linking the two threads :-). In your case, how large are K and m_{max}?

Cheers,
olli

Just saw that I missed this reply. If you're still interested, I think I used k = 3 or 4 (like 0.01 and 0.05 cutpoints) and m_max anything up to ten. But it likely depends on context. I didn't get much change in the estimates by varying them.

Thank you very much, Sam, that's helpful and puts things into perspective. Good luck with your Stan code! olli