I am new to stan and I have a model with individual coefficients (probably a better name exists). A simple version of it is bellow.
Let there are customers that visit our company from time to time. We are interested in modeling the inter-visit time intervals.
The intervals follow the gamma distribution:
t_{ij} \sim \mathtt{gamma}(k,\theta),
where i is the customer number and j the time interval number for the customer i
with k \sim N(0,10).
However we also assume that customers are heterogeneous, which means that every customer has its own individual parameter \theta_i and, in fact, we have the following model:
t_{ij} \sim \mathtt{gamma}(k,\theta_i),
where \theta_i comes from the inverse gamma distribution:
\frac{1}{\theta_i} \sim \mathtt{gamma}(\nu,\sigma),
with \nu,\sigma \sim N(0,10). Thus, the model has 3 parameters: k,\nu, \sigma.
The question is how to encode this model (in particular in stan) for efficient inference? The number of observations is high, e.g. 10000+.
I see two options. First addition all of \theta_i to the sampling space. It is slow and looks like it requires a large number of iteration, is not it? The second option is to numerically integrate over possible \theta's for every customers and then there are only k, \nu, and \sigma in the sampling space. Which option should I prefer, or probably, they all are incorrect? Is there any numerical-integration primitives in STAN? Probably there are some other options that I cannot see right now? Thank you!
The first option is implemented bellow:
data {
int NCust; // Number of customers
int Size; // Number of inter-visit intervals for all customers
int Customers[Size]; // The customer corresponding to every interval
real Times[Size]; // The interpurchase interval
}
parameters{
real<lower=0> k;
real<lower=0> nu;
real<lower=0> sigma;
vector<lower=0>[NCust] lInv;
}
model{
k~normal(0,10);
nu~normal(0,10);
sigma~normal(0,10);
lInv ~ gamma(nu,sigma);
for(i in 1:Size) {
Times[i] ~ gamma(k,1/lInv[Customers[i]]);
}
}
And an incorrect but still surprisingly slow numerical integration method. For the correct integration I do need to sample randomly from a distribution, but *_rng functions are not available in the model part. Is there a way to do that? But in any case why the code below is 20 times slower than the first version even if the number of samples is 10 times less than the number of customers?
data {
int NCust; // Number of customers
int Size; // Number of inter-visit intervals for all customers
int Customers[Size]; // The customer corresponding to every interval
real Times[Size]; // The interpurchase interval
int IntegralSamples; // The number of samples to be drawn from a distribution for numerical integration
}
parameters{
real<lower=0> k;
real<lower=0> nu;
real<lower=0> sigma;
vector<lower=0>[IntegralSamples] lInv;
}
model{
k~normal(0,10);
nu~normal(0,10);
sigma~normal(0,10);
lInv ~ gamma(nu,sigma);
for(i in 1:Size) {
vector[IntegralSamples] sampleValues;
for(j in 1:IntegralSamples) {
sampleValues[j] = gamma_lpdf(Times[i] | k, 1/lInv[j]);
}
target += log_sum_exp(sampleValues);
}
}