Discontinuity with circular smooth (GAM)

For a somewhat sensitive project I am trying to model the concentrations of several unnamed contaminants in a number of samples taken at various times over the course of about two years. I am currently using a log-normal distribution for the concentrations (I’ve also tried gamma and Weibull; gamma yields similar results, and with Weibull I can’t eliminate various sampling problems). There is reason to expect strong seasonal fluctuations in these concentrations, so I am including a circular smooth for day_of_year/365 (i.e., range 1/365-1). The full (partially anonymized) model formula is:

my.form <- bf(concentration ~ 1 + 
                      s(var1, bs = 'cc', by = a.cont) +
                      s(var2, by = var3, bs = 'cs') +
                      s(day_of_year, bs = 'cc') +
                      var3 + var4 + (1 | a/b/c/d/e) +
                      (1 | contaminant),
                 sigma ~ 0 + (1 | contaminant) +
                      s(day_of_year, bs = 'cc') +
                      s(var1, bs = 'cc', by = contaminant),
                 family = lognormal())

I would expect the smooth to be circular, so that the end of one yearly cycle connects smoothly to the beginning of the next. However, I am instead seeing a considerable discontinuity in the posterior predictions (posterior_epred) at this point (the line represents the median; the ribbon spans the 2.5th-97.5th percentile):

[Figure: posterior predictions with a visible discontinuity (median line, 2.5th-97.5th percentile ribbon)]

Note that the variable on the x-axis of this plot is not day_of_year: several predictor variables vary along with the x-axis, but day_of_year is the only one that has a discontinuity at the jump (going from 1 back to 1/365).

The (anonymized and rather convoluted) code to produce this plot is:

library(brms)
library(ggplot2)
library(posterior)
library(lubridate)
library(bayesplot)

plotcase <- function(mod = all.full.7, target_case = 1, identifier = "all.full.7"){
    # select the target case from the data
    tarcase <- cases[target_case,]
    tarcase$som <- NULL
    
    # drop the sum of the contaminants
    nosumcases <- longcases[longcases$contaminant != "sum",]
    nosumcases$contaminant <- droplevels(nosumcases$contaminant)
    
    # define the range of days (of the year) to plot for and create the corresponding data.frame
    dayrange <- seq(from = 0.01, to = 0.99, length.out = 50)
    varyday <- tarcase[rep(1, each = length(dayrange)), ]
    varyday$day_of_year <- dayrange
    varyday$sim <- 'day'
    nday <- nrow(varyday)
    varyday <- varyday[rep(1:nday, each = length(unique(nosumcases$contaminant))),]
    varyday$contaminant <- rep(unique(nosumcases$contaminant), times = nday)
    varyday$a.cont <- paste(varyday$a, varyday$contaminant, sep = ".")
    testday <- posterior_epred(object = mod, newdata = varyday)
    varyday$median <- apply(testday, 2, median) # note that this is now the median, not the mean
    varyday$low <- apply(testday, 2, quantile, 0.025)
    varyday$high <- apply(testday, 2, quantile, 0.975)
    varyday$x <- 365*varyday$day_of_year 
    
    # define the var2 range to plot for and create the corresponding data.frame
    if(tarcase$var3 == 'A'){
        var2range <- seq(from = (0 - meanvar2)/sdvar2, to = (15*365 - meanvar2)/sdvar2, by = 30/sdvar2)
    } else if(tarcase$var3 == 'B'){
        var2range <- seq(from = (0 - meanvar2)/sdvar2, to = (5*365 - meanvar2)/sdvar2, by = 30/sdvar2)
    }
    varyvar2 <- tarcase[rep(1, each = length(var2range)), ]
    varyvar2$st.var2 <- var2range
    varyvar2$sim <- 'var2'
    nvar2 <- nrow(varyvar2)
    varyvar2 <- varyvar2[rep(1:nvar2, each = length(unique(nosumcases$contaminant))),]
    varyvar2$contaminant <- rep(unique(nosumcases$contaminant), times = nvar2)
    varyvar2$a.cont <- paste(varyvar2$a, varyvar2$contaminant, sep = ".")
    testvar2 <- posterior_epred(object = mod, newdata = varyvar2)
    varyvar2$median <- apply(testvar2, 2, median) # note that this is now the median, not the mean
    varyvar2$low <- apply(testvar2, 2, quantile, 0.025)
    varyvar2$high <- apply(testvar2, 2, quantile, 0.975)
    varyvar2$x <- (meanvar2 + sdvar2 * varyvar2$st.var2)/365
    
    # define the range of var5 to plot for and create the corresponding data.frame
	# THIS VARIABLE IS NOT IN THE CURRENT MODEL (but it is in the data)
    var5range <- seq(from = min(cases$st.var5), to = max(cases$st.var5), length.out = 50)
    varyvar5 <- tarcase[rep(1, each = length(var5range)), ]
    varyvar5$st.var5 <- var5range
    varyvar5$sim <- 'var5'
    nvar5 <- nrow(varyvar5)
    varyvar5 <- varyvar5[rep(1:nvar5, each = length(unique(nosumcases$contaminant))),]
    varyvar5$contaminant <- rep(unique(nosumcases$contaminant), times = nvar5)
    varyvar5$a.cont <- paste(varyvar5$a, varyvar5$contaminant, sep = ".")
    testvar5 <- posterior_epred(object = mod, newdata = varyvar5)
    varyvar5$median <- apply(testvar5, 2, median) # note that this is now the median, not the mean
    varyvar5$low <- apply(testvar5, 2, quantile, 0.025)
    varyvar5$high <- apply(testvar5, 2, quantile, 0.975)
    varyvar5$x <- meanvar5 + varyvar5$st.var5 * sdvar5

    # define the range of var1 to plot for and create the corresponding data.frame
    relrange <- seq(from = min(cases$var1), to = max(cases$var1), length.out = 50)
    print(range(cases$var1))
    varyrel <- tarcase[rep(1, each = length(relrange)), ]
    varyrel$var1 <- relrange
    #print(varyrel$var1)
    varyrel$sim <- 'rel'
    nrel <- nrow(varyrel)
    varyrel <- varyrel[rep(1:nrel, each = length(unique(nosumcases$contaminant))),]
    varyrel$contaminant <- rep(unique(nosumcases$contaminant), times = nrel)
    varyrel$a.cont <- paste(varyrel$a, varyrel$contaminant, sep = ".")
    testrel <- posterior_epred(object = mod, newdata = varyrel)
    varyrel$median <- apply(testrel, 2, median) # note that this is now the median, not the mean
    varyrel$low <- apply(testrel, 2, quantile, 0.025)
    varyrel$high <- apply(testrel, 2, quantile, 0.975)
    varyrel$x <- meanrel + varyrel$var1 * sdrel
    print(range(varyrel$x))
    print(summary(varyrel))
    
	# define the range around the actual date to plot
    stablerange <- seq(-180, 180, by = 1)
    varystable <- tarcase[rep(1, each = length(stablerange)), ]
    varystable$enddate <- varystable$enddate + stablerange
    varystable$day_of_year <- yday(varystable$enddate)/365
    varystable$var2 <- as.numeric(varystable$enddate - varystable$startdate)
    varystable$st.var2 <- (varystable$var2 - meanvar2)/sdvar2
    fds <- var5dates[[grep(tarcase$locatie[1], names(var5dates))]]
    fds <- fds[!is.na(fds)]
    for(k in 1:nrow(varystable)){
        # work on a per-iteration copy so the full date vector is retained for the next k
        fds_k <- fds
        varystable$closevar5[k] <- (varystable$enddate[k] - fds_k)[which.min(abs(fds_k - varystable$enddate[k]))]
        fds_k <- fds_k[fds_k < varystable$enddate[k]] # keep only dates before the current end date
        varystable$days_since_var5[k] <- as.numeric(varystable$enddate[k] - max(fds_k, na.rm = TRUE))
    }
    varystable$closevar5[varystable$closevar5 > max(cases$closevar5)] <- max(cases$closevar5)
    varystable$closevar5[varystable$closevar5 < min(cases$closevar5)] <- min(cases$closevar5)
    varystable$enddate <- tarcase$enddate
    varystable$st.var5 <- (varystable$days_since_var5 - meanvar5)/sdvar5
    varystable$var1 <- (varystable$closevar5 - meanrel)/sdrel
    varystable$sim <- 'stable'
    nstable <- nrow(varystable)
    varystable <- varystable[rep(1:nstable, each = length(unique(nosumcases$contaminant))),]
    varystable$contaminant <- rep(unique(nosumcases$contaminant), times = nstable)
    varystable$a.cont <- paste(varystable$a, varystable$contaminant, sep = ".")
    teststable <- posterior_epred(object = mod, newdata = varystable)
    varystable$median <- apply(teststable, 2, median) # note that this is now the median, not the mean
    varystable$low <- apply(teststable, 2, quantile, 0.025)
    varystable$high <- apply(teststable, 2, quantile, 0.975)
    varystable$x <- (meanvar2 + sdvar2 * varystable$st.var2)/365
    
    daycase <- nosumcases[nosumcases$id == tarcase$id,]
    var2case <- nosumcases[nosumcases$id == tarcase$id,]
    var5case <- nosumcases[nosumcases$id == tarcase$id,]
    relcase <- nosumcases[nosumcases$id == tarcase$id,]
    stablecase <- nosumcases[nosumcases$id == tarcase$id,]
    daycase$sim <- 'day' 
    daycase$x <- 365*daycase$day_of_year
    var2case$sim <- 'var2'
    var2case$x <- (meanvar2 + sdvar2 * var2case$st.var2)/365
    var5case$sim <- 'var5'
    var5case$x <- meanvar5 + var5case$st.var5 * sdvar5
    relcase$sim <- 'rel'
    relcase$x <- meanrel + relcase$var1 * sdrel
    stablecase$sim <- 'stable'
    stablecase$x <- (meanvar2 + sdvar2 * stablecase$st.var2)/365
    pc <- rbind(daycase, var2case, var5case, relcase, stablecase)
    
    varydf <- rbind(varyday, varyvar2, varyvar5, varyrel, varystable)
    caseplot <- ggplot(data = varydf, aes(x = x, y = median, color = sim, fill = sim)) +
        geom_ribbon(aes(ymin = low, ymax = high), color = NA, alpha = 0.3) +
        geom_line(color = 'black') + scale_y_log10() +
        facet_wrap(contaminant ~ sim, scales = 'free', ncol = 5) +
        geom_point(data = pc, aes(y = conc)) + 
        guides(color ='none', fill = 'none') +
        labs(title = paste(tarcase$var3,
                           round(tarcase$var2/365, 1),
                           tarcase$d,
                           tarcase$enddate,
                           tarcase$var1,
                           tarcase$var5, sep = ' - ')) +
        theme(
            text = element_text(size = 10),
            axis.line = element_line(colour = "black"),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.background = element_rect(fill = "transparent",colour = NA),
            plot.background = element_rect(fill = "transparent",colour = NA),
            plot.margin = margin(1, 2, 1, 1, "cm"))
    ggsave(filename = paste("/home/plots/cases/", identifier, "case", target_case, ".png", sep = ""),
           plot = caseplot, device = "png", width = 30, height = 20, units = 'cm')
    return(varyday)
}

This creates a 4x5 grid of plots for a single case from the data, where each row corresponds to a different contaminant. For each contaminant, the five plots correspond to:

    • Variation in var2 (keeping the remaining variables constant)
    • Variation in day_of_year (keeping the remaining variables constant)
    • Variation in var5 (keeping the remaining variables constant; not in the current model)
    • Variation in var1 (keeping the remaining variables constant)
    • Variation in all of the above

So a single row looks like this:

[Figure: one row of the grid, i.e. the five panels for a single contaminant]

As can be deduced from the code, var1 and var5 are different ways of modeling time relative to a particular event, and var2 is essentially an age variable. Since var1, var2, var5 and day_of_year are all temporal variables, no one of them can be held fully constant while the others vary. However, if I vary only day_of_year, I still observe a jump between the beginning and end of the smooth (colors are different contaminants). This can be seen in the second panel of the figure above, but it is perhaps more evident when plotted separately:

for(k in 1:1){ # 1:nrow(cases) to loop over all cases
    vary_day_of_year <- plotcase(mod = rel2.mod, target_case = k, identifier = "rel2")
}
ggplot(data = vary_day_of_year, aes(x = x, y = median, group = contaminant, color = contaminant)) +
 geom_line() + scale_y_log10()

[Figure: median posterior prediction across day_of_year for each contaminant, with a jump at the cycle boundary]

I initially did not have a separate smooth for day_of_year in the formula for sigma, but adding it appears to make no difference (which is understandable to me in hindsight).

I would be most grateful for any pointers to understanding what causes this discontinuity and/or how to eliminate it.

If you’re able to show us the code you used to produce the plots (or an anonymized version/sketch), it would help rule out a variety of potential sources of an issue like this one.

Thanks for being willing to look at this. I’ve added code for the plots to the post above.

In brief, I’m taking a single case from the data and plotting (counterfactual) estimates of what would have happened if we had had a different end date. I hope this makes some sense…

I don’t know for sure what’s happening, but I’ll take a guess and you can tell me if I might be right or if I’m off-base (I haven’t worked with cyclic smooths before).

My guess is that there’s some window in late December and/or early January when you don’t have any data, and when you pass bs = "cc", what you’re getting is a basis that is cyclic not by joining 31 December to 1 January, but rather by joining the latest day-of-year in your data to the earliest day-of-year in your data. That would explain the small overshoot you observe.
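To illustrate what I mean with plain mgcv and made-up data (everything below is hypothetical, just a sketch of the mechanism):

library(mgcv)

# fake day-of-year data with a gap in late December / early January,
# on the same 1/365-1 scale as in the model above
set.seed(1)
doy <- sample(5:344, 200, replace = TRUE) / 365
y <- sin(2 * pi * doy) + rnorm(200, sd = 0.3)

m <- gam(y ~ s(doy, bs = "cc"))
range(m$smooth[[1]]$xp) # boundary knots sit at min(doy) and max(doy), not at 1/365 and 1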

Perhaps @ucfagls would be willing to chime in about how to do this properly. I note that in his post here Modelling seasonal data with GAMs, the cyclical constraint appears to enforce equality between January and December temps and derivatives (if I’m squinting at the figures right), and I’m not exactly sure what the best way would be to tell a model like this to leave a month’s worth of wiggle between month 12 and month 1.

Perhaps we should just split our month 1 data and assign some of it to month 13? Or if we only have one month 1 observation, we could assign it to both month 1 and month 13 with weight 0.5? Or maybe we can leave our data alone but manually specify that there should be a knot at 13?

Thanks a lot, of course that’s it exactly! There is no data between December 10 and January 5.

I’ve tested this now by defining another date (one where I have multiple observations) as day_of_year = 1 and assigning some of those observations to day_of_year = 0. This does indeed eliminate the discontinuity.
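In sketch form (with hypothetical names, since the real data are anonymized; d stands for the modeling data frame, ref_doy for the chosen well-sampled day), the shift looks something like:

library(lubridate)

ref_doy <- 166 # hypothetical well-sampled day used as the new cycle boundary
d$day_of_year <- (((yday(d$date) - ref_doy - 1) %% 365) + 1) / 365 # the reference date maps to 1
at_ref <- which(yday(d$date) == ref_doy)
to_zero <- sample(at_ref, length(at_ref) %/% 2)
d$day_of_year[to_zero] <- 0 # move half of the reference-day observations to the other end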

Of course I’d still be interested to hear whether @ucfagls has suggestions as to what would be the best way to do this!

Yes; if you don’t specify knots, then mgcv will place the boundary knots at the limits of the data, and that will force the value and the first couple of derivatives to be equal at that point, which likely isn’t useful if your data don’t cover the full range of the covariate. If you want to, you can force the smooth to join beyond the data limits (I describe below a way that typically works well for me), but beware of doing too much extrapolation beyond the support of the data. Or you could just not use the cyclic smooth, although that isn’t an option if you need estimates for the entire year.

What you can do is to specify the end points of the smooth through the knots argument, and extend the boundary knots slightly beyond the data, rather than modify or weight the data.

I tend to leave day of year unscaled, so I would use knots = list(day_of_year = c(0.5, 365.5)) (assuming 365 days per year, as that’s what the OP specified explicitly). The way I think of this is that day_of_year = 1 is noon on Jan 1, such that day_of_year = 0.5 is the start of Jan 1 and day_of_year = 365.5 is midnight at the end of Dec 31, which is where we want the two ends of the smooth to meet.
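In brms this goes through the knots argument of brm(), just as in mgcv::gam(). A sketch, reusing the formula from the top of the thread (mydata is a hypothetical stand-in for the real data, with day_of_year left on the 1-365 scale):

fit <- brm(my.form,
           data = mydata,
           knots = list(day_of_year = c(0.5, 365.5)))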

For monthly data we’d do something similar: knots = list(day_of_year = c(0.5, 12.5)), where we think of the integer values 1, 2, etc. as representing the month midpoints, with the join point at the end of December / beginning of January.

Thanks a lot for the additional info. Unfortunately my brms installation is currently down for updates. I’ll try these suggestions as soon as it’s back up.

PS: I have been told that in Stan (and by extension brms), it’s preferable to have all variables on approximately unit scale, which is why I did rescale day_of_year. But I suppose mgcv::s (or the wrapper function brms::s) may already rescale the variable internally?

mgcv won’t do any rescaling for you. If you do rescale, you’ll also need to adjust the knot locations I mentioned so that they lie very slightly outside the data range, by the equivalent of 0.5 days on the new scale. I don’t know if brms does this internally as I haven’t looked too closely at the generated Stan code.
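On the 1/365-1 scale used in the original post, half a day is 0.5/365, so the equivalent call would be (again with a hypothetical mydata):

fit <- brm(my.form,
           data = mydata,
           knots = list(day_of_year = c(0.5/365, 365.5/365)))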

Just to clarify for @JMeekes: mgcv won’t rescale, but it will select the positions of the internal knots to be appropriate to your scale, so that the wiggliness gets rescaled in a data-dependent way, as you’d expect. There are two main reasons why it’s advantageous to rescale the data in Stan. One is to enable easier prior-setting and better behavior under default priors. I don’t think that matters here, but it would be a good idea to ensure that your prior model looks reasonable in any case. For more on priors in brms GAMs, see Better priors (non-flat) for GAMs (brms) - #4 by ucfagls
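One way to do that check in brms is to sample from the prior alone and look at the resulting predictive draws (a sketch; mydata is hypothetical):

prior_fit <- brm(my.form, data = mydata, sample_prior = "only")
pp_check(prior_fit, ndraws = 100) # prior predictive draws vs. the observed data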

The second reason is that it is mildly advantageous to have a posterior whose scale is close to that of a standard normal, since that’s the starting point for Stan’s metric adaptation. I simply don’t know enough about how these basis functions and random effects work to say off the top of my head whether rescaling matters here or not. However, this is not a hugely important point, since Stan’s metric adaptation should be able to update to alternative scales reasonably rapidly.
