I have two data sets describing two related perspectives on the same phenomenon. The first contains daily counts of articles on a topic in a region. The second contains daily ratios: the article count divided by the total regional news coverage (i.e., the total article count) for that region. I use a hierarchical model to estimate news coverage in W periods (survey waves). I model the former (count data) with a Poisson distribution and the latter (ratio data) with a beta distribution (alternatively, I have considered an exponential distribution). I get the expected results when estimating them in separate Stan models.
[Figure: wave means of data set 1 (left) and data set 2 (right)]
Since the two data sets use slightly different methodologies and draw on different data sources, yet both contain valuable (and potentially competing) information, I would now like to combine them. I want to estimate a latent variable fed by information from data sets 1 and 2, ideally in a way that lets me quantify how the two pieces of information are balanced.
I have tried multiple parametrisations and approaches, but I always end up with data set 1 dominating the result, i.e., the latent variable is (almost) identical to what I get from a model that uses data set 1 alone.
Eventually, I would like to use the latent variable as a predictor in a final regression, which I would include as a third likelihood in the model.
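Roughly, I imagine that third likelihood looking something like the following sketch (the outcome `z`, its wave mapping `w_z`, and the coefficients are placeholders, not parts of the model below):

```stan
// Hypothetical third likelihood: a wave-level outcome regressed on the latent coverage.
// z, w_z, beta0, beta1, and sigma are placeholders, not part of the current model.
z ~ normal(beta0 + beta1 * alpha[w_z], sigma);
```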
As a Stan and Bayesian statistics beginner, I realise I may be on the wrong path overall, but I couldn't get further than this. I would be happy about any hint on where to go!
```stan
data {
  int<lower=0> N;                      // number of article counts (data set 1)
  int<lower=0> M;                      // number of article ratios (data set 2)
  int<lower=0> W;                      // number of waves (time steps)
  array[N] int<lower=1, upper=W> w_n;  // wave of each daily count observation
  array[M] int<lower=1, upper=W> w_m;  // wave of each daily ratio observation
  array[N] int<lower=0> y_n;           // outcome: article counts (data set 1)
  vector<lower=0, upper=1>[M] y_m;     // outcome: article ratios (data set 2)
}
parameters {
  vector<lower=0>[W] alpha;            // latent variable: wave-level news coverage
  real<lower=0> rho;                   // gamma rate for the prior on alpha
  real<lower=0, upper=1> delta;        // scaling factor linking alpha to the counts
}
model {
  rho ~ gamma(1, 0.01);
  alpha ~ gamma(1.5, rho);
  delta ~ gamma(1, 0.001);             // nearly flat given the (0, 1) bounds

  y_n ~ poisson(alpha[w_n] / delta);   // counts: latent coverage scaled up by 1/delta
  y_m ~ beta(1, inv(alpha[w_m]));      // ratios: implied mean alpha / (1 + alpha)
}
```
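One direction I have been wondering about (a minimal sketch, not something I have working): giving the ratio likelihood an explicit mean/precision parametrisation, so that a precision parameter controls how strongly data set 2 pulls on `alpha`. Here `kappa` is a new parameter I am introducing, with a placeholder prior, and `beta_proportion` requires a reasonably recent Stan version:

```stan
parameters {
  vector<lower=0>[W] alpha;
  real<lower=0> rho;
  real<lower=0, upper=1> delta;
  real<lower=0> kappa;                 // new: beta precision (larger = ratios more informative)
}
model {
  rho ~ gamma(1, 0.01);
  alpha ~ gamma(1.5, rho);
  delta ~ gamma(1, 0.001);
  kappa ~ gamma(2, 0.1);               // weakly informative; just a placeholder choice

  y_n ~ poisson(alpha[w_n] / delta);
  // Same implied mean as beta(1, 1/alpha), but the spread around it is now free:
  y_m ~ beta_proportion(alpha[w_m] ./ (1 + alpha[w_m]), kappa);
}
```

With the original `beta(1, 1/alpha)`, the spread of the ratios is fully determined by `alpha` itself; with a free `kappa`, the model could in principle learn how informative data set 2 actually is.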
I think my general questions behind this are:
- How can information coming from the same data-generating process be combined when it is available in different shapes (i.e., different distributions), here Poisson vs. beta?
- More generally, how would I "combine" two competing pieces of information: with an additive model with weights, by simply multiplying their distributions, or with the kind of latent-variable approach above that models the data-generating process? And how can I determine the balance ("weights") of the mix of information?
- If my approach is not entirely wrong, is it possible that something about how the log likelihood is accumulated when the model is evaluated gives one of the two likelihood statements less weight? If so, how would I influence this? (See the sketch below.)
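To make the last point concrete: as I understand it, in the joint model the two data sets enter as a plain sum of log-likelihood terms,

$$
\log p(\alpha, \rho, \delta \mid y) = \log p(\alpha, \rho, \delta)
+ \sum_{n=1}^{N} \log \operatorname{Poisson}\!\left(y_n \mid \alpha_{w_n}/\delta\right)
+ \sum_{m=1}^{M} \log \operatorname{Beta}\!\left(y_m \mid 1,\, 1/\alpha_{w_m}\right)
+ \text{const},
$$

so the balance is set by the number of observations and by how sharply each likelihood constrains `alpha`, not by any explicit weight. If I wanted to change that balance by hand, would something like a power-scaled likelihood be legitimate? A sketch, where `lambda` would be a new constant passed in via the data block:

```stan
model {
  rho ~ gamma(1, 0.01);
  alpha ~ gamma(1.5, rho);
  delta ~ gamma(1, 0.001);

  target += poisson_lpmf(y_n | alpha[w_n] / delta);
  // lambda (declared in the data block, e.g. real<lower=0> lambda) down-weights
  // data set 2 for lambda < 1 and up-weights it for lambda > 1.
  target += lambda * beta_lpdf(y_m | 1, inv(alpha[w_m]));
}
```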
I would be very thankful to be pointed to the right section of the literature; I didn't quite know what to search for.