Increasing Parameters Decreases Likelihood?

Hiya folks,
I ran a set of CJS models (more or less from Yackulic et al. 2020), including one with a time-independent phi and one with a time-dependent phi. I stored the pointwise log-likelihoods for model comparison with loo, but I also wanted to calculate the overall log-likelihoods, ultimately to calculate a % variance explained by some annual climate variables (similar to the ANODEV in Program MARK). I summed the pointwise log-likelihoods from the time-independent model to get an overall log-likelihood per MCMC iteration, repeated this for the time-dependent model, and discovered that the likelihood for the time-independent model was often higher than that for the time-dependent model for many of the species I had fit. This seems odd to me: all else equal, I would think that estimating a phi per year rather than a single phi should increase the likelihood. Could anyone explain how this could happen, or point out where in my code I've messed up to produce this result?
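For concreteness, here is essentially how I'm computing the per-iteration totals. This is a numpy sketch, with random placeholder arrays standing in for the log_lik draws extracted from each fit (the real arrays have one row per MCMC draw and one column per pointwise term):

import numpy as np

rng = np.random.default_rng(0)
n_draws, n_lik = 1000, 250
# placeholders for the pointwise log-likelihood draws extracted from each fit
log_lik_constant = rng.normal(-1.0, 0.1, size=(n_draws, n_lik))  # phi(.)p(.)
log_lik_timedep = rng.normal(-1.0, 0.1, size=(n_draws, n_lik))   # phi(t)p(.)

total_constant = log_lik_constant.sum(axis=1)  # overall log-likelihood per draw
total_timedep = log_lik_timedep.sum(axis=1)

# fraction of draws where the simpler model has the higher total
print(np.mean(total_constant > total_timedep))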

Here is the time-independent model [phi(.)p(.)]:


data{
  int<lower = 1> NsumCH; // number of capture histories
  int<lower = 1> n_occasions; // number of capture occasions
  int<lower = 1, upper = 2> sumCH[NsumCH, n_occasions]; // capture histories (1 = seen, 2 = not seen)
  int<lower = 1> sumf[NsumCH]; // first capture occasion
  int<lower = 1> sumFR[NsumCH]; // frequency of each capture history
  int<lower = 1> n_missing; // number of unsampled years
  int<lower = 1> missing[n_missing]; // unsampled years
  int<lower = 1> n_observed; // number of sampled years
  int<lower = 1> observed[n_observed]; // sampled years
  vector[n_occasions-1] temp; // temperature each year (not used in this model)
  vector[n_occasions-1] prec; // precipitation each year (not used in this model)
  int n_lik; // number of pointwise log-likelihood terms
  int<lower = 0, upper = 1> CH[NsumCH, n_occasions]; // capture histories (1 = seen, 0 = not seen)
}

parameters{
  real<lower = 0, upper = 1> phi; // survival probability
  real<lower = 0, upper = 1> real_p; // shared detection probability for sampled years
}

transformed parameters{
  simplex[2] tr[2, n_occasions - 1]; // alive/dead transition probabilities for marginalization
  simplex[2] pmat[2, n_occasions - 1]; // detection probabilities for marginalization
  real<lower = 0, upper = 1> p[n_occasions - 1]; // detection probability, with unsampled years set to 0

  for(i in missing){
    p[i] = 0; // detection is impossible in unsampled years
  }

  for(i in observed){
    p[i] = real_p; // sampled years share the single estimated detection probability
  }

  for(k in 1:(n_occasions - 1)){
    tr[1,k,1] = phi;     // alive -> alive
    tr[1,k,2] = 1 - phi; // alive -> dead
    tr[2,k,1] = 0;       // dead -> alive (impossible)
    tr[2,k,2] = 1;       // dead -> dead

    pmat[1,k,1] = p[k];     // alive, detected
    pmat[1,k,2] = 1 - p[k]; // alive, not detected
    pmat[2,k,1] = 0;        // dead, detected (impossible)
    pmat[2,k,2] = 1;        // dead, not detected
  }
}
  
model{
  vector[2] pz[n_occasions]; // marginalized state probabilities: [alive, dead]

  for(i in 1:NsumCH){
    pz[sumf[i],1] = 1; // known alive at first capture
    pz[sumf[i],2] = 0;

    for(k in (sumf[i] + 1):n_occasions){
      pz[k,1] = pz[k-1,1] * tr[1,k-1,1] * pmat[1,k-1,sumCH[i,k]];
      pz[k,2] = (pz[k-1,1] * tr[1,k-1,2] + pz[k-1,2]) * pmat[2,k-1,sumCH[i,k]];
    }

    target += sumFR[i] * log(sum(pz[n_occasions])); // weight by history frequency
  }
}


generated quantities {
  vector[n_lik] log_lik;
  int counter;
  counter = 1;
  for(i in 1:NsumCH){ // skips critters first observed on the last occasion
    if (sumf[i] < n_occasions){
      for(j in (sumf[i] + 1):n_occasions){
        for(k in 1:sumFR[i]){
          log_lik[counter] = bernoulli_lpmf(CH[i,j] | phi * p[j-1]);
          counter += 1;
        }
      }
    }
  }
}
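
For context, the model block implements the standard forward recursion that marginalizes over the latent alive/dead state. Here's a minimal numpy sketch of what it computes for a single capture history (the phi, p, and history values are hypothetical):

import numpy as np

phi, p = 0.8, 0.6          # hypothetical survival and detection probabilities
ch = [1, 1, 0, 1, 0]       # one capture history (1 = seen), first captured at occasion 1
f = 0                      # index of the first capture

pz = np.array([1.0, 0.0])  # Pr(alive), Pr(dead) at first capture
for k in range(f + 1, len(ch)):
    obs_alive = p if ch[k] == 1 else 1 - p  # detection term if alive
    obs_dead = 0.0 if ch[k] == 1 else 1.0   # dead animals are never detected
    pz = np.array([
        pz[0] * phi * obs_alive,                 # survived to k and (not) seen
        (pz[0] * (1 - phi) + pz[1]) * obs_dead,  # died before k, or already dead
    ])
print(pz.sum())  # likelihood of this capture history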

and here is the time-dependent model [phi(t)p(.)]:


data{
  int<lower = 1> NsumCH; // number of capture histories
  int<lower = 1> n_occasions; // number of capture occasions
  int<lower = 1, upper = 2> sumCH[NsumCH, n_occasions]; // capture histories (1 = seen, 2 = not seen)
  int<lower = 1> sumf[NsumCH]; // first capture occasion
  int<lower = 1> sumFR[NsumCH]; // frequency of each capture history
  int<lower = 1> n_missing; // number of unsampled years
  int<lower = 1> missing[n_missing]; // unsampled years
  int<lower = 1> n_observed; // number of sampled years
  int<lower = 1> observed[n_observed]; // sampled years
  vector[n_occasions-1] temp; // temperature each year (not used in this model)
  vector[n_occasions-1] prec; // precipitation each year (not used in this model)
  int n_lik; // number of pointwise log-likelihood terms
  int<lower = 0, upper = 1> CH[NsumCH, n_occasions]; // capture histories (1 = seen, 0 = not seen)
}

parameters{
  real<lower = 0, upper = 1> phi[n_occasions-1]; // survival probability for each interval
  real<lower = 0, upper = 1> real_p; // shared detection probability for sampled years
}

transformed parameters{
  simplex[2] tr[2, n_occasions - 1]; // alive/dead transition probabilities for marginalization
  simplex[2] pmat[2, n_occasions - 1]; // detection probabilities for marginalization
  real<lower = 0, upper = 1> p[n_occasions - 1]; // detection probability, with unsampled years set to 0

  for(i in missing){
    p[i] = 0; // detection is impossible in unsampled years
  }

  for(i in observed){
    p[i] = real_p; // sampled years share the single estimated detection probability
  }

  for(k in 1:(n_occasions - 1)){
    tr[1,k,1] = phi[k];     // alive -> alive
    tr[1,k,2] = 1 - phi[k]; // alive -> dead
    tr[2,k,1] = 0;          // dead -> alive (impossible)
    tr[2,k,2] = 1;          // dead -> dead

    pmat[1,k,1] = p[k];     // alive, detected
    pmat[1,k,2] = 1 - p[k]; // alive, not detected
    pmat[2,k,1] = 0;        // dead, detected (impossible)
    pmat[2,k,2] = 1;        // dead, not detected
  }
}
  
model{
  vector[2] pz[n_occasions]; // marginalized state probabilities: [alive, dead]

  for(i in 1:NsumCH){
    pz[sumf[i],1] = 1; // known alive at first capture
    pz[sumf[i],2] = 0;

    for(k in (sumf[i] + 1):n_occasions){
      pz[k,1] = pz[k-1,1] * tr[1,k-1,1] * pmat[1,k-1,sumCH[i,k]];
      pz[k,2] = (pz[k-1,1] * tr[1,k-1,2] + pz[k-1,2]) * pmat[2,k-1,sumCH[i,k]];
    }

    target += sumFR[i] * log(sum(pz[n_occasions])); // weight by history frequency
  }
}


generated quantities {
  vector[n_lik] log_lik;
  int counter;
  counter = 1;
  for(i in 1:NsumCH){ // skips critters first observed on the last occasion
    if (sumf[i] < n_occasions){
      for(j in (sumf[i] + 1):n_occasions){
        for(k in 1:sumFR[i]){
          log_lik[counter] = bernoulli_lpmf(CH[i,j] | phi[j-1] * p[j-1]);
          counter += 1;
        }
      }
    }
  }
}

Thanks in advance!
-Jeremy

If the difference in likelihood is modest rather than extreme, then results like this are possible even in the absence of errors or bugs, due to the influence of the prior. In maximum likelihood inference, assuming the fitting routine is indeed finding the mode, increasing model flexibility while keeping the original model as a special case can only increase the likelihood at the maximum likelihood estimate. I imagine this is the intuition leading you to expect the time-dependent model to win. But posterior draws are not maximum likelihood estimates.

Suppose the maximum likelihood in your flexible model is not much greater than the likelihood in your simple model, but the prior in the flexible model also covers a lot of parameter configurations where the likelihood is lower. Assuming the prior isn't extremely strong, configurations where the likelihood is much lower get ruled out by the likelihood term, but configurations where it is only a bit lower do not. Now suppose it just so happens that, among the configurations whose likelihood is not much lower than the simple model's, the prior in the flexible model places 99% of its mass on the configurations with the lower likelihood. In this setting, you might well find that the flexible model tends to yield lower likelihoods than the simple one.
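
Here is a toy numerical sketch of that mechanism. It is Python rather than Stan, and everything in it is made up (a single Bernoulli rate, fit once with a pooled rate and once with a separate rate per arbitrary block, both under a Beta(5,5) prior); conjugate posterior draws stand in for MCMC output:

import numpy as np

rng = np.random.default_rng(42)
n, n_groups = 120, 12
y = rng.binomial(1, 0.85, size=n)  # one true rate, 0.85
groups = y.reshape(n_groups, -1)   # arbitrary blocks for the flexible model

a0, b0 = 5.0, 5.0                  # Beta(5,5) prior, centered at 0.5
n_draws = 4000

def loglik(p, y):
    # Bernoulli log-likelihood of data y at rate p
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# pooled model: one rate for all observations
p_simple = rng.beta(a0 + y.sum(), b0 + (1 - y).sum(), size=n_draws)
ll_simple = np.array([loglik(p, y) for p in p_simple])

# flexible model: an independent rate per block
p_flex = np.column_stack([
    rng.beta(a0 + g.sum(), b0 + (1 - g).sum(), size=n_draws) for g in groups
])
ll_flex = np.array([
    sum(loglik(p_flex[s, j], groups[j]) for j in range(n_groups))
    for s in range(n_draws)
])

print("mean log-lik, simple:  ", ll_simple.mean())
print("mean log-lik, flexible:", ll_flex.mean())
print("P(flexible < simple):  ", np.mean(ll_flex < ll_simple))

With only ten observations per block, each block's posterior is pulled well toward the prior's 0.5, so the flexible model's posterior log-likelihood draws sit below the pooled model's on most draws, even though its maximum likelihood is at least as high. The same logic can apply to a per-year phi estimated from sparse annual data.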