Model with individual coefficients

I am new to stan and I have a model with individual coefficients (probably a better name exists). A simple version of it is bellow.

Let there are customers that visit our company from time to time. We are interested in modeling the inter-visit time intervals.

The intervals follow the gamma distribution:
t_{ij} \sim \mathtt{gamma}(k,\theta),
where i is the customer number and j the time interval number for the customer i
with k \sim N(0,10).

However we also assume that customers are heterogeneous, which means that every customer has its own individual parameter \theta_i and, in fact, we have the following model:
t_{ij} \sim \mathtt{gamma}(k,\theta_i),
where \theta_i comes from the inverse gamma distribution:
\frac{1}{\theta_i} \sim \mathtt{gamma}(\nu,\sigma),
with \nu,\sigma \sim N(0,10). Thus, the model has 3 parameters: k,\nu, \sigma.

The question is how to encode this model (in particular in stan) for efficient inference? The number of observations is high, e.g. 10000+.

I see two options. First addition all of \theta_i to the sampling space. It is slow and looks like it requires a large number of iteration, is not it? The second option is to numerically integrate over possible \theta's for every customers and then there are only k, \nu, and \sigma in the sampling space. Which option should I prefer, or probably, they all are incorrect? Is there any numerical-integration primitives in STAN? Probably there are some other options that I cannot see right now? Thank you!

The first option is implemented bellow:

data {
 int NCust; // Number of customers
 int Size; // Number of inter-visit intervals for all customers

 int Customers[Size]; // The customer corresponding to every interval
 real Times[Size]; // The interpurchase interval
}
parameters{
  real<lower=0> k;
  real<lower=0> nu;
  real<lower=0> sigma;
  
  vector<lower=0>[NCust] lInv;
}
model{
 k~normal(0,10);
 nu~normal(0,10);
 sigma~normal(0,10);
 lInv ~ gamma(nu,sigma);
 for(i in 1:Size) {
   Times[i] ~ gamma(k,1/lInv[Customers[i]]);
 }
}

And an incorrect but still surprisingly slow numerical integration method. For the correct integration I do need to sample randomly from a distribution, but *_rng functions are not available in the model part. Is there a way to do that? But in any case why the code below is 20 times slower than the first version even if the number of samples is 10 times less than the number of customers?

data {
 int NCust; // Number of customers
 int Size; // Number of inter-visit intervals for all customers

 int Customers[Size]; // The customer corresponding to every interval
 real Times[Size]; // The interpurchase interval
 
 int IntegralSamples; // The number of samples to be drawn from a distribution for numerical integration
}
parameters{
  real<lower=0> k;
  real<lower=0> nu;
  real<lower=0> sigma;
  
  vector<lower=0>[IntegralSamples] lInv;
}
model{
  k~normal(0,10);
  nu~normal(0,10);
  sigma~normal(0,10);
  
  lInv ~ gamma(nu,sigma);
  
  for(i in 1:Size) {
    vector[IntegralSamples] sampleValues;
    for(j in 1:IntegralSamples) {
      sampleValues[j] = gamma_lpdf(Times[i] | k, 1/lInv[j]);
    }
    target += log_sum_exp(sampleValues);
  }
}

The Stan Math Library has a good numerical integration function, but it is not exposed to the Stan language. So, currently you would have to write your own C++ to access it, but that will become unnecessary in 2.19. The way you wrote it is fine but could be accelerated by changing the last loop to

Times ~ gamma(k, inv(lInv[Customers]));
1 Like

@bgoodri, thank you very much for your helpful answer. You approach is 50% faster than my. However I am not sure if I can use this trick in my real case. I have a mixture of two distributions with individuals weights (computed based on covariates of an observation) as well. Let say

t_{ij} \sim p_i \times gamma(k_1,\theta^{(1)}_i) + (1-p_i) \times gamma(k_2,\theta^{(2)}_i),

is it possible to use your trick in my case? Thank you!

Yeah, you have to use log_mix in a loop.

2 Likes