For a somewhat sensitive project I am trying to model concentrations of several unnamed contaminants in a number of samples that have been taken at various times over the course of about two years. I am currently using a log-normal distribution for the concentrations (I’ve also tried gamma and Weibull; gamma yields similar results, with Weibull I can’t eliminate various sampling problems). There is reason to expect strong seasonal fluctuations in these concentrations so I am including a circular smooth for day_of_the_year/365 (i.e., range 1/365-1). The full (partially anonymized) model formula is:
my.form <- bf(concentration ~ 1 +
s(var1, bs ='cc', by = a.cont) +
s(var2, by = var3, bs = 'cs') +
s(day_of_year, bs = 'cc') +
var3 + var4 + (1 | a/b/c/d/e) +
(1 | contaminant),
sigma ~ 0 + (1 | contaminant) +
s(day_of_year, bs = 'cc') +
s(var1, bs = 'cc', by = contaminant),
family = lognormal())
I would expect the smooth to be circular, so that the end of a yearly cycle connects smoothly to the beginning of the next cycle. However, I am seeing a considerable discontinuity in the posterior predictions (posterior_epred) at this point instead (the line represents the median, the ribbon is 2.5th-97.5th percentile):
Note that the variable on the x-axis in this plot is not day_of_year: in this plot, several predictor variables vary along with the x-axis, but day_of_year is the only one that has a discontinuity at the jump (going from 1 to 1/365).
The (anonymized and rather convoluted) code to produce this plot is:
library(brms)
library(ggplot2)
library(posterior)
library(lubridate)
library(bayesplot)
plotcase <- function(mod = all.full.7, target_case = 1, identifier = "all.full.7"){
# select the target case from the data
tarcase <- cases[target_case,]
tarcase$som <- NULL
# drop the sum of the contaminants
nosumcases <- longcases[longcases$contaminant != "sum",]
nosumcases$contaminant <- droplevels(nosumcases$contaminant)
# define the range of days (of the year) to plot for and create the corresponding data.frame
dayrange <- seq(from = 0.01, to = 0.99, length.out = 50)
varyday <- tarcase[rep(1, each = length(dayrange)), ]
varyday$day_of_year <- dayrange
varyday$sim <- 'day'
nday <- nrow(varyday)
varyday <- varyday[rep(1:nday, each = length(unique(nosumcases$contaminant))),]
varyday$contaminant <- rep(unique(nosumcases$contaminant), times = nday)
varyday$a.cont <- paste(varyday$a, varyday$contaminant, sep = ".")
testday <- posterior_epred(object = mod, newdata = varyday)
varyday$median <- apply(testday, 2, median) # note that this is now the median, not the mean
varyday$low <- apply(testday, 2, quantile, 0.025)
varyday$high <- apply(testday, 2, quantile, 0.975)
varyday$x <- 365*varyday$day_of_year
# define the var2 range to plot for and create the corresponding data.frame
if(tarcase$var3 == 'A'){
var2range <- seq(from = (0 - meanvar2)/sdvar2, to = (15*365 - meanvar2)/sdvar2, by = 30/sdvar2)
} else if(tarcase$var3 == 'B'){
var2range <- seq(from = (0 - meanvar2)/sdvar2, to = (5*365 - meanvar2)/sdvar2, by = 30/sdvar2)
}
varyvar2 <-tarcase[rep(1, each = length(var2range)), ]
varyvar2$st.var2 <- var2range
varyvar2$sim <- 'var2'
nvar2 <- nrow(varyvar2)
varyvar2 <- varyvar2[rep(1:nvar2, each = length(unique(nosumcases$contaminant))),]
varyvar2$contaminant <- rep(unique(nosumcases$contaminant), times = nvar2)
varyvar2$a.cont <- paste(varyvar2$a, varyvar2$contaminant, sep = ".")
testvar2 <- posterior_epred(object = mod, newdata = varyvar2)
varyvar2$median <- apply(testvar2, 2, median) # note that this is now the median, not the mean
varyvar2$low <- apply(testvar2, 2, quantile, 0.025)
varyvar2$high <- apply(testvar2, 2, quantile, 0.975)
varyvar2$x <- (meanvar2 + sdvar2 * varyvar2$st.var2)/365
# define the range of var5 to plot for and create the corresponding data.frame
# THIS VARIABLE IS NOT IN THE CURRENT MODEL (but it is in the data)
var5range <- seq(from = min(cases$st.var5), to = max(cases$st.var5), length.out = 50)
varyvar5 <- tarcase[rep(1, each = length(var5range)), ]
varyvar5$st.var5 <- var5range
varyvar5$sim <- 'var5'
nvar5 <- nrow(varyvar5)
varyvar5 <- varyvar5[rep(1:nvar5, each = length(unique(nosumcases$contaminant))),]
varyvar5$contaminant <- rep(unique(nosumcases$contaminant), times = nvar5)
varyvar5$a.cont <- paste(varyvar5$a, varyvar5$contaminant, sep = ".")
testvar5 <- posterior_epred(object = mod, newdata = varyvar5)
varyvar5$median <- apply(testvar5, 2, median) # note that this is now the median, not the mean
varyvar5$low <- apply(testvar5, 2, quantile, 0.025)
varyvar5$high <- apply(testvar5, 2, quantile, 0.975)
varyvar5$x <- meanvar5 + varyvar5$st.var5 * sdvar5
# define the range of var1 to plot for and create the corresponding data.frame
relrange <- seq(from = min(cases$var1), to = max(cases$var1), length.out = 50)
print(range(cases$var1))
varyrel <-tarcase[rep(1, each = length(relrange)), ]
varyrel$var1 <- relrange
#print(varyrel$var1)
varyrel$sim <- 'rel'
nrel <- nrow(varyrel)
varyrel <- varyrel[rep(1:nrel, each = length(unique(nosumcases$contaminant))),]
varyrel$contaminant <- rep(unique(nosumcases$contaminant), times = nrel)
varyrel$a.cont <- paste(varyrel$a, varyrel$contaminant, sep = ".")
testrel <- posterior_epred(object = mod, newdata = varyrel)
varyrel$median <- apply(testrel, 2, median) # note that this is now the median, not the mean
varyrel$low <- apply(testrel, 2, quantile, 0.025)
varyrel$high <- apply(testrel, 2, quantile, 0.975)
varyrel$x <- meanrel + varyrel$var1 * sdrel
print(range(varyrel$x))
print(summary(varyrel))
# define the range around the actual date to plot
stablerange <- seq(-180, 180, by = 1)
varystable <- tarcase[rep(1, each = length(stablerange)), ]
varystable$enddate <- varystable$enddate + stablerange
varystable$day_of_year <- yday(varystable$enddate)/365
varystable$var2 <- as.numeric(varystable$enddate - varystable$startdate)
varystable$st.var2 <- (varystable$var2 - meanvar2)/sdvar2
fds <- var5dates[[grep(tarcase$locatie[1], names(var5dates))]]
for(k in 1:nrow(varystable)){
fds <- fds[!is.na(fds)]
varystable$closevar5[k] <- (varystable$enddate[k] - fds)[which.min(abs(fds - varystable$enddate[k]))]
fds <- fds[fds < varystable$enddate[k]]
fds <- max(fds, na.rm = TRUE)
varystable$days_since_var5[k] <- as.numeric(varystable$enddate[k] - fds)
}
varystable$closevar5[varystable$closevar5 > max(cases$closevar5)] <- max(cases$closevar5)
varystable$closevar5[varystable$closevar5 < min(cases$closevar5)] <- min(cases$closevar5)
varystable$enddate <- tarcase$enddate
varystable$st.var5 <- (varystable$days_since_var5 - meanvar5)/sdvar5
varystable$var1 <- (varystable$closevar5 - meanrel)/sdrel
varystable$sim <- 'stable'
nstable <- nrow(varystable)
varystable <- varystable[rep(1:nstable, each = length(unique(nosumcases$contaminant))),]
varystable$contaminant <- rep(unique(nosumcases$contaminant), times = nstable)
varystable$a.cont <- paste(varystable$a, varystable$contaminant, sep = ".")
teststable <- posterior_epred(object = mod, newdata = varystable)
varystable$median <- apply(teststable, 2, median) # note that this is now the median, not the mean
varystable$low <- apply(teststable, 2, quantile, 0.025)
varystable$high <- apply(teststable, 2, quantile, 0.975)
varystable$x <- (meanvar2 + sdvar2 * varystable$st.var2)/365
daycase <- nosumcases[nosumcases$id == tarcase$id,]
var2case <- nosumcases[nosumcases$id == tarcase$id,]
var5case <- nosumcases[nosumcases$id == tarcase$id,]
relcase <- nosumcases[nosumcases$id == tarcase$id,]
stablecase <- nosumcases[nosumcases$id == tarcase$id,]
daycase$sim <- 'day'
daycase$x <- 365*daycase$day_of_year
var2case$sim <- 'var2'
var2case$x <- (meanvar2 + sdvar2 * var2case$st.var2)/365
var5case$sim <- 'var5'
var5case$x <- meanvar5 + var5case$st.var5 * sdvar5
relcase$sim <- 'rel'
relcase$x <- meanrel + var5case$var1 * sdrel
stablecase$sim <- 'stable'
stablecase$x <- (meanvar2 + sdvar2 * stablecase$st.var2)/365
pc <- rbind(daycase, var2case, var5case, relcase, stablecase)
varydf <- rbind(varyday, varyvar2, varyvar5, varyrel, varystable)
caseplot <- ggplot(data = varydf, aes(x = x, y = median, color = sim, fill = sim)) +
geom_ribbon(aes(ymin = low, ymax = high), color = NA, alpha = 0.3) +
geom_line(color = 'black') + scale_y_log10() +
facet_wrap(contaminant ~ sim, scales = 'free', ncol = 5) +
geom_point(data = pc, aes(y = conc)) +
guides(color ='none', fill = 'none') +
labs(title = paste(tarcase$var3,
round(tarcase$var2/365, 1),
tarcase$d,
tarcase$enddate,
tarcase$var1,
tarcase$var5, sep = ' - ')) +
theme(
text = element_text(size = 10),
axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
plot.margin = margin(1, 2, 1, 1, "cm"))
ggsave(filename = paste("/home/plots/cases/", identifier, "case", target_case, ".png", sep = ""),
plot = caseplot, device = png(), width = 30, height = 20, units = 'cm')
dev.off()
return(varyday)
}
This creates a 4x5 grid of plots for a single case from the data, where each row corresponds to a different contaminant. For each contaminant, the five plots correspond to:
-
- Variation in var2 (keeping the remaining variables constant)
-
- Variation in day_of_year (keeping the remaining variables constant)
-
- Variation in var5 (keeping the remaining variables constant) # not relevant in this model
-
- Variation in var1 (keeping the remaining variables constant)
-
- Variation in all of the above
So a single row looks like this:
As can be deduced from the code, var1 and var5 are different ways of modeling time relative to a particular event. Var2 is essentially an age variable. Since var1, var2, var5 and day_of_year are all temporal variables, one cannot quite stay constant while the others vary. However, if I vary only day_of_year, I still observe a jump between the beginning and end of the smooth (colors are different contaminants). This can be seen from the second panel of the above figure, but it is perhaps more evident when plotted separately:
for(k in 1:1){#nrow(cases)){
vary_day_of_year <- plotcase(mod = rel2.mod, target_case = k, identifier = "rel2")
}
ggplot(data = vary_day_of_year, aes(x = x, y = median, group = contaminant, color = contaminant)) +
geom_line() + scale_y_log10()
I initially did not have a separate smooth for the day_of_year in the formula for sigma, but this appears to make no difference (which is understandable to me in hindsight).
I would be most grateful for any pointers to understanding what causes this discontinuity and/or how to eliminate it.