LOO and "LOGO" combination: two data types, one model

Here is a simple model that mimics the problem I am struggling with. NegBin denotes a negative binomial distribution parameterized by its mean and by the shape parameter of the associated gamma distribution.

$$
\begin{aligned}
k_i &\sim \mathrm{NegBin}\!\left(\mu_{g(i)},\ \beta\mu_{g(i)}\right) \\
x_g &\sim \mathrm{Poisson}\!\left(q\mu_g^c\right) \\
\mu_g &= \exp(M + E_g) \\
E_g &\sim \mathrm{Normal}(0, S) \\
&\text{plus priors on the remaining parameters}
\end{aligned}
$$

Here, $g(i)$ indicates the group to which observation $i$ belongs. For each group $g$ there is a single observation of type $x$ and several observations of type $k$. The relationship between the expected values of the two data types is potentially non-linear, modelled by the scaling parameter $q$ and the power parameter $c$.

Say I wish to test whether $c \neq 1$ through model comparison. A straightforward approach would be to treat all observations as equal and compare models by PSIS-LOO-CV. There are two reasons why I don't want to do this. First, in my real example there are ~1,000,000 observations, leading to very large files and potentially a large number of re-runs for exact LOO, even if, say, only 0.1% of observations are flagged as unreliable by the Pareto Gods ($k > 0.7$). Second, I am not interested in model performance at the level of individual observations of $k$.

In the focal model, I can derive an expression for the sum of all $k_i$ belonging to a specific group $g$ with $n_g$ observations:

$$\sum_{i \in g} k_i \sim \mathrm{NegBin}\!\left(n_g\mu_g,\ n_g\beta\mu_g\right)$$
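(For completeness, a quick sketch of why the sum has this form: writing each NegBin as a gamma–Poisson mixture, the latent rate of every $k_i$ with $i \in g$ is $\lambda_i \sim \mathrm{Gamma}(\beta\mu_g,\ \beta)$ in shape–rate form, and the $n_g$ independent latent rates within a group add up,

$$\sum_{i \in g} \lambda_i \sim \mathrm{Gamma}\!\left(n_g\beta\mu_g,\ \beta\right),$$

which has mean $n_g\mu_g$ and shape $n_g\beta\mu_g$, giving the NegBin above.)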
So, an alternative could be to instead include the likelihood of $\sum_{i \in g} k_i$. But is this any better? And does it violate any assumptions of the PSIS approximation of leave-one-(group)-out? I am a bit at a loss for how to think about this problem and would appreciate any pointers.
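For reference, the raw importance ratios that PSIS has to smooth when a whole group is left out are products over the group's members (written here for the $k$ part only):

$$r_g^{(s)} \propto \frac{1}{\prod_{i \in g} p\!\left(k_i \mid \theta^{(s)}\right)} = \exp\!\left(-\sum_{i \in g} \log p\!\left(k_i \mid \theta^{(s)}\right)\right),$$

so their variability grows with group size; this becomes relevant for the Pareto $k$ diagnostics discussed further down.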

Here are two figures (sorry about the ugly legend) plotting the pointwise log predictive difference for the two models. Left of the dashed lines are the $k$ and $\sum_{i \in g} k_i$ observations, respectively, and right of the dashed lines are the corresponding $x$ observations. As you can see, the two data types differ in the scale at which the models differ in predictive performance. I think this should matter, at least for the SE of the elpd difference, but I am not quite sure how…


Below are the Stan code and R script to generate data and run the analysis.


Stan code:

data {
  int<lower=0> N;                  // number of groups
  int<lower=0> n[N];               // number of k observations per group
  int<lower=0> nobs;               // total number of k observations
  int<lower=1, upper=N> G[nobs];   // group membership of each k observation
  int<lower=0> k[nobs];            // individual count observations
  int<lower=0> x[N];               // one x observation per group
  int<lower=0, upper=1> incc;      // estimate the power parameter c?
  int<lower=0> ksum[N];            // per-group sums of k
}

parameters {
  vector[N] E_raw;                 // non-centered group effects
  real<lower=0> beta;              // NegBin shape factor
  real<lower=0> S;                 // between-group SD
  real M;                          // log-mean of mu
  real<lower=0> c[incc ? 1 : 0];   // power parameter (present only if incc)
  real<lower=0> q;                 // scaling parameter
}

transformed parameters {
  real mu[N];
  vector[N] E;
  real<lower=0> cc;

  cc = incc ? c[1] : 1;            // fix c = 1 in the linear model
  E = E_raw * S;
  for (g in 1:N)
    mu[g] = exp(M + E[g]);
}

model {
  for (g in 1:N)
    x[g] ~ poisson(mu[g]^cc * q);

  for (i in 1:nobs)
    k[i] ~ neg_binomial_2(mu[G[i]], mu[G[i]] * beta);

  E_raw ~ normal(0, 1);
  S ~ cauchy(0, 2.5);
  q ~ lognormal(0, 5);
  if (incc)
    c[1] ~ lognormal(0, 5);
}

generated quantities {
  real loglik_k[nobs];             // pointwise log-lik, individual k
  real loglik_ksum[N];             // log-lik of the per-group sums of k
  real loglik_x[N];                // pointwise log-lik, x

  for (i in 1:nobs)
    loglik_k[i] = neg_binomial_2_lpmf(k[i] | mu[G[i]], mu[G[i]] * beta);

  for (g in 1:N) {
    loglik_x[g] = poisson_lpmf(x[g] | mu[g]^cc * q);
    loglik_ksum[g] = neg_binomial_2_lpmf(ksum[g] | n[g] * mu[g], mu[g] * beta * n[g]);
  }
}


R script:

library(rstan)
library(tidyverse)
library(abind)
library(loo)

set.seed(1)
N=50      #number of groups
meann=10  #average number of observations per group
M=5       #log-mean of mu
S=1       #standard deviation between groups

beta=1.5  #NegBin shape factor
c=.7      #power parameter
q=.5      #scaling parameter

logmu=rnorm(N,M,S)
mu=exp(logmu)

n=rpois(N,meann)  #observations per group
nobs=sum(n)
G=rep(1:N,n)      #group index for each observation
k=rnbinom(n=nobs,mu=mu[G],size=beta*mu[G])

x=rpois(N,mu^c*q)

#per-group sums of k
ksum=rep(0,N)
for(i in 1:nobs){
  ksum[G[i]]=ksum[G[i]]+k[i]
}




#settings to use parallel computing
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)



#make data list
data=list(N=N,n=n,ksum=ksum,G=G,nobs=nobs,k=k,x=x,incc=TRUE)

#run stan for non-linear response
fit <- stan("lgo_examle_stan.stan",data=data,iter = 10000,warmup = 2000)

#loo stuff: extract pointwise log-likelihoods as iterations x chains x observations arrays
ll_k=rstan::extract(fit, "loglik_k", permuted = FALSE, inc_warmup = FALSE, include = TRUE)
ll_x=rstan::extract(fit, "loglik_x", permuted = FALSE, inc_warmup = FALSE, include = TRUE)
ll_ksum=rstan::extract(fit, "loglik_ksum", permuted = FALSE, inc_warmup = FALSE, include = TRUE)
ll=abind(ll_k,ll_x,along=3)      #individual k followed by x
llg=abind(ll_ksum,ll_x,along=3)  #per-group sums of k followed by x
LOOlist=list()
LOOlistg=list()
LOOlist[[1]]=loo(ll)
LOOlistg[[1]]=loo(llg)
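#optional aside (not in the original post): loo() warns when r_eff is not
#supplied for a bare array; chain-aware relative efficiencies can be passed
#via the loo helper relative_eff(), computed on the likelihood values
r_eff <- relative_eff(exp(ll))            #ll is iterations x chains x obs
loo_with_reff <- loo(ll, r_eff = r_eff)   #same elpd, better MCSE estimates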

#make new data list
data=list(N=N,n=n,ksum=ksum,G=G,nobs=nobs,k=k,x=x,incc=FALSE)

#run stan for linear response
fit2 <- stan("lgo_examle_stan.stan",data=data,iter = 10000,warmup = 2000)

#loo stuff
ll_k2=rstan::extract(fit2, "loglik_k", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_x2=rstan::extract(fit2, "loglik_x", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_ksum2=rstan::extract(fit2, "loglik_ksum", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll2=abind(ll_k2,ll_x2,along=3)

llg2=abind(ll_ksum2,ll_x2,along=3)

LOOlist[[2]]=loo(ll2)
LOOlistg[[2]]=loo(llg2)



LOODiff=LOOlist[[1]]$pointwise[,1]-LOOlist[[2]]$pointwise[,1]
Letters=c(LETTERS,letters)
df=tibble(Diff=LOODiff,
          group=Letters[c(G,1:N)],
          ind=1:(nobs+N))

p<- ggplot(df, aes(y=Diff, x=ind, color=group)) + 
  geom_point() +
  xlab("Obs") +
  ylab("Pointwise Differnce") +
  geom_hline(yintercept=0) +
  geom_vline(xintercept = nobs+0.5,linetype="dashed") + 
  scale_colour_discrete()
print(p)
ggsave("PWDiff1.png", plot = last_plot(),
       scale = 1, width = 16, height = 10, units = "cm",
       dpi = 600)

LOODiffg=LOOlistg[[1]]$pointwise[,1]-LOOlistg[[2]]$pointwise[,1]

df=tibble(Diff=LOODiffg,
          group=Letters[c(1:N,1:N)],
          ind=1:(N+N))

p2<- ggplot(df, aes(y=Diff, x=ind, color=group)) + 
  geom_point() +
  xlab("Obs") +
  ylab("Pointwise Differnce") +
  geom_hline(yintercept=0) +
  geom_vline(xintercept = N+0.5,linetype="dashed") + 
  scale_colour_discrete()
print(p2)
ggsave("PWDiff2.png", plot = last_plot(),
       scale = 1, width = 16, height = 10, units = "cm",
       dpi = 600)

You could examine the posterior of c directly.

I hope you can examine the posterior of c directly, but yes, you can think of each log_lik value, or a sum of some of the values, as corresponding to what you are leaving out.

I didn’t have time to check the code to see whether you are plotting something sensible, but if you are, then there doesn’t seem to be much difference. What does the posterior of c look like?

Thanks, Aki! So, if I understand you correctly, I should be able to simply treat
$$\sum_{i \in g} \log \mathrm{NegBin}\!\left(k_i \mid \mu_{g(i)},\ \beta\mu_{g(i)}\right)$$
as the log predictive density of the left-out observations. Is that correct? If I do that, I get similar results using loo_compare to when I treat individual $k_i$ as observations:

loo_compare(LOOlist)
       elpd_diff se_diff
model1   0.0       0.0  
model2 -31.4      12.5  

vs.

loo_compare(LOOlistgind)
       elpd_diff se_diff
model1   0.0       0.0  
model2 -36.2      12.3  
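(For reference, the group-level log-likelihoods used in this second comparison can also be computed directly in R from the extracted draws, rather than in Stan's generated quantities as done in the code below; a minimal sketch, assuming ll_k is the iterations x chains x nobs array and G the group index from the script:)

ll_grp <- array(0, dim = c(dim(ll_k)[1], dim(ll_k)[2], N))
for (g in 1:N) {
  #sum the pointwise log-likelihood draws over the members of group g
  ll_grp[, , g] <- apply(ll_k[, , which(G == g), drop = FALSE], c(1, 2), sum)
}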

However, the latter LOO-CV, based on grouped observations, suffers from very bad Pareto $k$ diagnostics. In fact, all pointwise estimates that take the sum over several $k_i$ are flagged as either bad or very bad. Conversely, if I treat the $k$ part of the model as individual observations, I get no Pareto $k$ problems. Here are figures with the pointwise loo difference, where colour corresponds to Pareto $k$. Again, left of the dashed line are the observations (grouped or individual) derived from the $k$ part of the model; the right side contains the corresponding $x$ part.

k_i as individual observations

groups of k_i as observations
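(For anyone reproducing this: the flagged points can be listed with the loo package's diagnostic helpers, e.g.)

print(LOOlistgind[[1]])                          #summary including the Pareto k table
pareto_k_ids(LOOlistgind[[1]], threshold = 0.7)  #indices of the flagged observations
plot(LOOlistgind[[1]])                           #plot of the Pareto k values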

R code to generate data and run the models:

#Generating some data for N groups
library(rstan)
library(tidyverse)
library(abind)
library(loo)

set.seed(1)
N=50
meann=10
M=5
S=1

beta=1.5
c=.7
q=.5

logmu=rnorm(N,M,S)
mu=exp(logmu)

n=rpois(N,meann)
nobs=sum(n)
G=rep(1:N,n)
k=rnbinom(n=nobs,mu=mu[G],size=beta*mu[G])

x=rpois(N,mu^c*q)

ksum=rep(0,N)

for(i in 1:nobs){
  
  ksum[G[i]]=ksum[G[i]]+k[i]
}




#settings to use parallel computing
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)



#make data list
data=list(N=N,n=n,ksum=ksum,G=G,nobs=nobs,k=k,x=x,incc=TRUE)

#run stan for non-linear response
fit <- stan("lgo_examle_stan.stan",data=data,iter = 10000,warmup = 2000)
#make new data list
data=list(N=N,n=n,ksum=ksum,G=G,nobs=nobs,k=k,x=x,incc=FALSE)

#run stan for linear response
fit2 <- stan("lgo_examle_stan.stan",data=data,iter = 10000,warmup = 2000)

#loo stuff
ll_k=rstan::extract(fit, "loglik_k", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_x=rstan::extract(fit, "loglik_x", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_ksum=rstan::extract(fit, "loglik_ksum", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_ksumind=rstan::extract(fit, "loglik_ksumind", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll=abind(ll_k,ll_x,along=3)
llg=abind(ll_ksum,ll_x,along=3)
llgind=abind(ll_ksumind,ll_x,along=3)
LOOlist=list()
LOOlistg=list()
LOOlistgind=list()
LOOlist[[1]]=loo(ll)
LOOlistg[[1]]=loo(llg)
LOOlistgind[[1]]=loo(llgind)


ll_k2=rstan::extract(fit2, "loglik_k", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_x2=rstan::extract(fit2, "loglik_x", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_ksum2=rstan::extract(fit2, "loglik_ksum", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll_ksumind2=rstan::extract(fit2, "loglik_ksumind", permuted = FALSE, inc_warmup = FALSE,include = TRUE)
ll2=abind(ll_k2,ll_x2,along=3)

llg2=abind(ll_ksum2,ll_x2,along=3)

llgind2=abind(ll_ksumind2,ll_x2,along=3)
LOOlist[[2]]=loo(ll2)
LOOlistg[[2]]=loo(llg2)
LOOlistgind[[2]]=loo(llgind2)


#plotting
LOODiff=LOOlist[[1]]$pointwise[,1]-LOOlist[[2]]$pointwise[,1]
Letters=c(LETTERS,letters)
df=tibble(Diff=LOODiff,
          group=Letters[c(G,1:N)],
          mean_influence_pareto_k=(LOOlist[[2]]$pointwise[,"influence_pareto_k"]+LOOlist[[1]]$pointwise[,"influence_pareto_k"])/2,
          ind=1:(nobs+N))
limits=c(min(df$mean_influence_pareto_k), max(df$mean_influence_pareto_k))
limits[1]=min(c(limits[1],0))
limits[2]=max(c(limits[2],1))


p<- ggplot(df, aes(y=Diff, x=ind, color=mean_influence_pareto_k)) + 
  geom_point() +
  xlab("Obs") +
  ylab("Pointwise Difference") +
  geom_hline(yintercept=0) +
  geom_vline(xintercept = nobs+0.5,linetype="dashed") + 
  scale_colour_gradientn(limits=limits,colours = c("red","red","yellow","green","darkblue","darkblue"),
                         values = (c(limits[2],1.0,0.7,0.5,0,limits[1])-limits[1])/(limits[2]-limits[1]))
print(p)
ggsave("PWDiff1.png", plot = last_plot(),
       scale = 1, width = 16, height = 10, units = "cm",
       dpi = 600)




LOODiffgind=LOOlistgind[[1]]$pointwise[,1]-LOOlistgind[[2]]$pointwise[,1]

df=tibble(Diff=LOODiffgind,
          group=Letters[c(1:N,1:N)],
          mean_influence_pareto_k=(LOOlistgind[[2]]$pointwise[,"influence_pareto_k"]+LOOlistgind[[1]]$pointwise[,"influence_pareto_k"])/2,
          ind=1:(N+N))

limits=c(min(df$mean_influence_pareto_k), max(df$mean_influence_pareto_k))
limits[1]=min(c(limits[1],0))
limits[2]=max(c(limits[2],1))

p3<- ggplot(df, aes(y=Diff, x=ind, color=mean_influence_pareto_k)) + 
  geom_point() +
  xlab("Obs") +
  ylab("Pointwise Difference") +
  geom_hline(yintercept=0) +
  geom_vline(xintercept = N+0.5,linetype="dashed") + 
  scale_colour_gradientn(limits=limits,colours = c("red","red","yellow","green","darkblue","darkblue"),
                         values = (c(limits[2],1.0,0.7,0.5,0,limits[1])-limits[1])/(limits[2]-limits[1]))

print(p3)
ggsave("PWDiffind2.png", plot = last_plot(),
       scale = 1, width = 16, height = 10, units = "cm",
       dpi = 600)

Stan code (updated with loglik_ksumind, the per-group sums of the individual log-likelihoods):

data {
  int<lower=0> N;                  // number of groups
  int<lower=0> n[N];               // number of k observations per group
  int<lower=0> nobs;               // total number of k observations
  int<lower=1, upper=N> G[nobs];   // group membership of each k observation
  int<lower=0> k[nobs];            // individual count observations
  int<lower=0> x[N];               // one x observation per group
  int<lower=0, upper=1> incc;      // estimate the power parameter c?
  int<lower=0> ksum[N];            // per-group sums of k
}

parameters {
  vector[N] E_raw;                 // non-centered group effects
  real<lower=0> beta;
  real<lower=0> S;
  real M;
  real<lower=0> c[incc ? 1 : 0];   // power parameter (present only if incc)
  real<lower=0> q;
}

transformed parameters {
  real mu[N];
  vector[N] E;
  real<lower=0> cc;

  cc = incc ? c[1] : 1;            // fix c = 1 in the linear model
  E = E_raw * S;
  for (g in 1:N)
    mu[g] = exp(M + E[g]);
}

model {
  for (g in 1:N)
    x[g] ~ poisson(mu[g]^cc * q);

  for (i in 1:nobs)
    k[i] ~ neg_binomial_2(mu[G[i]], mu[G[i]] * beta);

  E_raw ~ normal(0, 1);
  S ~ cauchy(0, 2.5);
  q ~ lognormal(0, 5);
  if (incc)
    c[1] ~ lognormal(0, 5);
}

generated quantities {
  real loglik_k[nobs];             // pointwise log-lik, individual k
  real loglik_ksum[N];             // log-lik of the per-group sums of k
  real loglik_ksumind[N];          // per-group sums of individual log-liks
  real loglik_x[N];                // pointwise log-lik, x

  for (g in 1:N) {
    loglik_x[g] = poisson_lpmf(x[g] | mu[g]^cc * q);
    loglik_ksum[g] = neg_binomial_2_lpmf(ksum[g] | n[g] * mu[g], mu[g] * beta * n[g]);
    loglik_ksumind[g] = 0;
  }
  for (i in 1:nobs) {
    loglik_k[i] = neg_binomial_2_lpmf(k[i] | mu[G[i]], mu[G[i]] * beta);
    loglik_ksumind[G[i]] += loglik_k[i];
  }
}

Also, I agree that in this case it would perhaps be sufficient to just examine the posterior distribution of $c$, but the real problem I'm mimicking doesn't always contain one model as a special case of the other.

Looks like it, but as your code is complex and I only had a few minutes, I was not able to confirm that this is what happens in the code.

That is possible.

That is likely to happen. The more you remove, the more the posterior changes, which makes importance sampling difficult. This is even more likely to happen if you remove all observations directly influencing some parameter (e.g. removing a whole group when there are one or more parameters for each group). See e.g. the Roaches cross-validation demo.

Thanks, Aki, for the input. I suppose I will have to try with the real model and see how PSIS-LOGO-CV (not sure if that's the technical term) behaves. I fear it will be similar.

In your experience, do you think it is worth trying k-fold CV with PSIS, where each "observation" is the sum of the log-likelihood over e.g. 1% of the data, in the example above selected from both $k$ and $x$? Or is it safer to only include some manageable proportion of the full data for LOO? I've stayed away from the latter approach because it should reduce the chances of demonstrating a clear difference in the predictive performance of the models.

Edit: I tried the k-fold approach on the above toy example and found that the analysis gave many Pareto k warnings with few (say 10) folds, but these warnings disappear if I increase the number of folds (e.g. 100). I think this is expected given Aki's comment above. The results from loo_compare are very similar to the results obtained with individual observations. A sketch of the fold construction is below.
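(A minimal sketch of that fold-summing approach, assuming ll is the iterations x chains x (nobs + N) log-likelihood array built earlier; the random fold assignment here is a hypothetical illustration, not the exact code I used:)

K <- 100                                           #number of folds
fold <- sample(rep(1:K, length.out = dim(ll)[3]))  #random fold assignment
ll_fold <- array(0, dim = c(dim(ll)[1], dim(ll)[2], K))
for (f in 1:K) {
  #sum the pointwise log-likelihood draws within each fold
  ll_fold[, , f] <- apply(ll[, , fold == f, drop = FALSE], c(1, 2), sum)
}
loo_fold <- loo(ll_fold)                           #PSIS leave-one-fold-out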

It completely depends on how much the posterior changes, and that is likely to depend on the structure of the model and data.

Thanks for reporting this.