Performance of two models that use the same but transformed data

Hello everyone, I wanted to ask for your help if possible.
I want to compare the performance of two models that assume a different distribution for the data.

These are two models that treat the data differently. The first analyzes the raw data with an assumed distribution; the second uses another distribution to analyze the differences (since they are longitudinal data). Obviously, the latter uses a smaller data set than the first (half the size), since the second only takes the differences, while the first uses both points (beginning and end).

Here, using LOO-CV does not seem appropriate: with different N, the values will not be comparable, even though the original data are the same. My question is how I can compare the performance of the two models.

I have thought about constructing the predicted endpoint by adding the predicted differences to the initial point, and using an MSE-type metric to compare those predictions with the ones from the model that uses the raw data; or, vice versa, constructing differences from the values predicted by the raw data model.
However, my doubt is whether there is a better metric than MSE, or another solution altogether.

Thank you so much.

Here is one way to think about it. Conditioning on the location of the first data point, the differences model provides a model for the location of the second data point. Thus, one way to frame a model comparison would be to ask “does conditioning (in a particular way) on the location of the first data point improve the predictive performance for the second data point?” If this is the question that you wish to answer with your model comparison, then you can use the first model to form the log-likelihood matrix just for the second data points, and use the second model to form the log-likelihood for the differences, and ask which yields better predictive performance.
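For concreteness, here is a minimal sketch of how that log-likelihood matrix for the second data points could be computed in generated quantities, assuming (purely for illustration) that the raw data model is a bivariate normal; y, mu, and Sigma are placeholder names, not anything from your actual models:

generated quantities{
    // log-likelihood of the second data point, conditional on the first:
    // log p(y2 | y1) = log p(y1, y2) - log p(y1)
    vector[N] log_lik;

    for(n in 1:N){
        log_lik[n] = multi_normal_lpdf(y[n] | mu, Sigma)
                     - normal_lpdf(y[n][1] | mu[1], sqrt(Sigma[1,1]));
    }
}

This vector has one entry per case, so it lines up with the N pointwise log-likelihoods you would save from the differences model.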


To make explicit what you and @jsocolar have implied: your raw data model probably has an implied difference model embedded in it, while your difference model lacks an implied raw data model. So inferring the fit of the differences from the raw data model is the way to go.

Along those lines, you may be able to calculate the implied log-likelihood for the differences from the raw data model. For example, imagine your raw data model is a bivariate normal. The sum of the two (correlated) normal variables is distributed as

(y_1 + y_2) \sim N \big( \mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2 + 2 \rho \sigma_1 \sigma_2} \big)

So the difference is distributed as

(y_1 - y_2) \sim N \big( \mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2 - 2 \rho \sigma_1 \sigma_2} \big)

You could then calculate the log-likelihood of the difference, conditional on the multivariate parameters, and compare it to the (univariate) difference model. I think that would be statistically appropriate, though I admit I’ve never seen it done.
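As a rough sketch of what that could look like in generated quantities (assuming, purely for illustration, a bivariate normal raw data model with means mu, standard deviations sigma, and correlation rho; the names are placeholders):

generated quantities{
    // implied log-likelihood of the differences under the raw data model,
    // using the difference formula above
    vector[N] log_lik_diff;

    for(n in 1:N){
        log_lik_diff[n] = normal_lpdf(y[n][1] - y[n][2] | mu[1] - mu[2],
                              sqrt(square(sigma[1]) + square(sigma[2])
                                   - 2 * rho * sigma[1] * sigma[2]));
    }
}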


In my case, I need to compare the performance of a Poisson model for each time point against a Skellam model (the distribution of the difference of two independent Poisson random variables). The problem with your approach is that I need to compute the total likelihood for the data, not for individual points, and by definition the differences have half as many points as the total number of timepoints. So the total likelihood will very likely be higher for the Skellam model simply because it has fewer timepoints.

I think your solution is appropriate. So comparing the likelihood only for the endpoints with the likelihood for the differences should be enough? What I did was save only the pointwise log-likelihoods for the endpoints and estimate LOO-CV from those. The estimated score looks reasonable, as it is very similar to the one from the differences model.

I might be misunderstanding something, but let me clarify how my suggestion would apply to what I think you’re describing. You have longitudinal count data, observed at two time points per case (N = number of cases; T = number of time points = 2). You have a model for the raw count data (N*2 observations), which you model as Poisson-distributed.

Y_1 \sim Poisson(\lambda_1) \\
Y_2 \sim Poisson(\lambda_2)

You also have a model for the change (K) from time 1 to time 2 (N observations), which you model as Skellam-distributed.

(Y_1 - Y_2) = K \sim Skellam(\lambda_1, \lambda_2)

As you note, the log-likelihoods of the two models will not be comparable. The Poisson model has twice as many observations. But more importantly, the Skellam distribution is a collapsed version of the independent bivariate Poisson distribution. That is, the bivariate Poisson distribution assigns probabilities to each unique combination of Y_1 and Y_2, whereas the Skellam distribution assigns probabilities to sets of combinations that share the same difference. For example, the Skellam distribution assigns a probability to

PR(Y_1 - Y_2 = -1) = PR(Y_1 = 0, Y_2 = 1) + PR(Y_1 = 1, Y_2 = 2) + PR(Y_1 = 2, Y_2 = 3) + \dots

whereas the Poisson distribution assigns a probability to

PR(Y_1 = 1, Y_2=2)

Since PR(Y_1 = 1, Y_2 = 2) is just one of the terms in PR(Y_1 - Y_2 = -1), the Skellam distribution will by definition generate a larger log-likelihood than the bivariate Poisson distribution, assuming the same data and parameter values.
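In general, for two independent Poissons and any observed difference k,

PR(Y_1 - Y_2 = k) = \sum_{y_2 = \max(0, -k)}^{\infty} PR(Y_1 = y_2 + k) \, PR(Y_2 = y_2)

so each Skellam probability is a sum over infinitely many bivariate Poisson probabilities.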

In this case, the two models share the same parameters (\lambda_1 and \lambda_2), so you can easily calculate the log-likelihood of both models simultaneously. You can run the model twice, selecting one and then the other likelihood for the purpose of inference, and then compare the appropriate log-likelihoods. See the rough, incomplete model code below as an example.

data{
    int N;
    int<lower = 0, upper = 1> P; // Use poisson model?
    array[N,2] int<lower = 0> Y;
}
transformed data{
    array[N] int K;

    for(n in 1:N){
        K[n] = Y[n,1] - Y[n,2];
    }
}
parameters{
    real<lower = 0> lambda1;
    real<lower = 0> lambda2;
}
transformed parameters{
    matrix[N,2] log_lik_poisson;
    vector[N] log_lik_skellam;

    for(n in 1:N){
        log_lik_poisson[n,1] = poisson_lpmf(Y[n,1] | lambda1);
        log_lik_poisson[n,2] = poisson_lpmf(Y[n,2] | lambda2);
        log_lik_skellam[n] = skellam_lpmf(K[n] | lambda1, lambda2); // skellam_lpmf is a user-defined function, assumed to be declared in the functions block
    }
}
model{
    if(P == 1){
        target += sum(log_lik_poisson);
    } else{
        target += sum(log_lik_skellam);
    }
}

I see that in the code you calculate a bivariate Poisson likelihood and a Skellam likelihood. However, it is not fully adequate for comparing my models.

First of all, for my Poisson model I had two settings. The first was a Poisson mixed model, so that the correlation between the timepoints of the same individual is taken into account. The second was to put the initial timepoint as a covariate, thus having a single datapoint per individual and a Poisson regression with a single parameter \lambda_2.

Taking this into account, the second setting should be comparable with the Skellam model, since it has the same number of datapoints, right?
For the first setting, following @jsocolar’s comment, the conditioned likelihood would simply be the sum of the log-likelihoods for the endpoints given the initial points, so this likelihood would also be comparable with the Skellam model, right? In fact, having tested it, I get reasonable values. Thank you for your valuable help.

I’m not sure. It all sounds about right, but if you could share your models then that might clarify things (for me at least).

Edit: I think your motivation for wanting to use the difference model might matter here too. My assumption was that you’d prefer to use the difference model either because it’s what you’re primarily interested in, or because it makes things simpler by predicting one independent outcome rather than two correlated outcomes (though I’m not sure the Skellam distribution makes it easier, since it still depends on the means/variances of each observation in the pair). In that case, the question would be “am I losing anything by just modelling the differences?” Again, for the Skellam case, I think the answer is yes, you do lose something by predicting differences rather than the raw scores, because the raw scores tell you something about the probability of the differences. And, since the parameters for the Skellam and the bivariate Poisson are the same, it should actually be easier to fit the Poisson model.

On the other hand, if your motivation is something like, “If I only have differences, how much worse will the model fit?” then directly comparing different scenarios seems more important.

So, to put things in context, the aim of the thesis proposed by my supervisor was to compare different Bayesian models for longitudinal data in a medical context, using a specific data set just as an example; for instance, what is the estimated effect of a treatment over time in decreasing the number of seizures due to epilepsy. I presented my work with different models, such as mixed Poisson, mixed NB, Skellam… But he wanted me to say clearly which is the best. It was not enough to explain and fit all those models; he wanted me to choose which model should be used by any researcher in the field.

That was the motivation to use LOO-CV to compare all the settings and say which has the best fit, but the problem arose when comparing the count models with the differenced one (Skellam). I think the conditioned likelihood is the solution to this issue.

Reading the original paper by Ntzoufras, the assumption is that each timepoint is the sum of a Poisson random variable and a random variable with unknown distribution that is constant through time. The Skellam distribution has the advantage of getting rid of a nuisance random variable that causes overdispersion or other undesirable effects, thanks to differencing. So I don’t think this model can be worse than the others, at least for predicting differences.
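In symbols, the assumption is roughly

Y_{it} = Z_{it} + W_i, \qquad Z_{it} \sim Poisson(\lambda_t), \qquad t = 1, 2

where W_i is the individual-specific nuisance term, with unknown distribution but constant over time, so that it cancels in the differences:

Y_{i1} - Y_{i2} = Z_{i1} - Z_{i2} \sim Skellam(\lambda_1, \lambda_2)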

That’s really helpful. Reading through Karlis and Ntzoufras 2006, I can see how the Skellam distribution has some desirable properties in certain situations, particularly when you expect the observations to be correlated. Specifically, when the paired observations are seen as two independent Poissons, each added to a shared Poisson, then the difference between the two becomes Skellam-distributed around the independent pieces. If you model the two as independent Poissons (even conditional on some shared mean, as in a mixed model), the implied variance of the differences will be too large.

I could be wrong, but I’m not convinced that the Skellam distribution will always perform better than the independent bivariate Poisson distribution, though maybe it does for the particular context you’re working in. Given that this is part of your thesis work, I’ll hold off on elaborating further. But I think you’ll want to look at different values for \theta_1, \theta_2, and \theta_3 in Equation 1 of Karlis and Ntzoufras 2006. I also think you could code Equation 1 directly in Stan for the distribution of positively correlated Poissons, which may be a helpful comparison to the independent bivariate Poisson and the Skellam.
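If it helps, here is a rough, untested sketch of how Equation 1 could be coded as a user-defined lpmf in Stan, marginalizing over the shared component in the trivariate reduction Y_1 = X_1 + X_3, Y_2 = X_2 + X_3 with X_j \sim Poisson(\theta_j):

functions{
    // bivariate Poisson log-pmf via the trivariate reduction,
    // marginalizing over the shared component X3 = k
    real bivariate_poisson_lpmf(array[] int y, real theta1, real theta2, real theta3){
        int m = min(y[1], y[2]);
        vector[m + 1] terms;

        for(k in 0:m){
            terms[k + 1] = poisson_lpmf(y[1] - k | theta1)
                           + poisson_lpmf(y[2] - k | theta2)
                           + poisson_lpmf(k | theta3);
        }
        return log_sum_exp(terms);
    }
}

You could then write Y[n] ~ bivariate_poisson(theta1, theta2, theta3); in the model block and compare its fit with the independent Poisson and Skellam versions.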

But back to your actual, original question.

The Poisson mixed model implies a model for the differences. So you should still be able to generate the log-likelihood for the Skellam distribution simultaneously. See the modified (still incomplete) model below.

data{
    int N;
    int<lower = 0, upper = 1> P; // Use poisson model?
    array[N,2] int<lower = 0> Y;
}
transformed data{
    array[N] int K;

    for(n in 1:N){
        K[n] = Y[n,1] - Y[n,2];
    }
}
parameters{
    real log_lambda0;
    vector[N] log_lambda_i;
    real delta;
    real<lower = 0> gamma;
}
transformed parameters{
    matrix[N,2] log_lik_poisson;
    vector[N] log_lik_skellam;
    vector[N] lambda1 = exp(log_lambda0 + log_lambda_i - 0.5 * delta);
    vector[N] lambda2 = exp(log_lambda0 + log_lambda_i + 0.5 * delta);

    for(n in 1:N){
        log_lik_poisson[n,1] = poisson_lpmf(Y[n,1] | lambda1[n]);
        log_lik_poisson[n,2] = poisson_lpmf(Y[n,2] | lambda2[n]);
        log_lik_skellam[n] = skellam_lpmf(K[n] | lambda1[n], lambda2[n]);
    }
}
model{
    log_lambda_i ~ normal(0, gamma);

    if(P == 1){
        target += sum(log_lik_poisson);
    } else{
        target += sum(log_lik_skellam);
    }
}

Given the paper above, I need to revise what I said before…

The parameters only mean the same thing when the covariance is 0 (i.e. when \lambda_3 = 0), see the remark at the bottom of page 1887. In other words, you can generate the implied Skellam distribution from an independent, bivariate Poisson distribution. And you can generate the implied independent, bivariate Poisson distribution from a Skellam distribution. But you can’t generate the correlated, bivariate Poisson distribution from the Skellam distribution. So if you are assuming the model described by Equation 1, then you can’t go both ways when comparing log-likelihoods. You can only go from independent, bivariate Poisson to Skellam.
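One way to see why the parameters only line up when the covariance is zero: under the trivariate reduction in Equation 1 (writing the shared component's rate as \theta_3), the shared piece cancels in the differences, so

Var(Y_1 - Y_2) = \theta_1 + \theta_2

while fitting independent Poissons to the raw counts recovers the marginal rates \theta_1 + \theta_3 and \theta_2 + \theta_3, and the Skellam implied by those rates has variance

(\theta_1 + \theta_3) + (\theta_2 + \theta_3) = \theta_1 + \theta_2 + 2\theta_3

which is too large whenever \theta_3 > 0.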

Just noticing this. So does your model look something like this, where the count at time 2 is modelled as Poisson and predicted by the count at time 1?

data{
    int N;
    array[N] int<lower = 0> Y1;
    array[N] int<lower = 0> Y2;
}
transformed data{
    vector[N] y1 = to_vector(Y1); // counts at time 1 as a vector, so a real shift can be added
}
parameters{
    real delta;
}
model{
    // identity link: assumes y1 + delta stays positive for the observed data
    Y2 ~ poisson(y1 + delta);
}

If so, then you are correct that you now only have N observations being predicted. But I’d be curious to see if this is equivalent to the Skellam model, especially given the point about correlated Poissons. It might be, but I haven’t seen that derivation.