Hello,
I am trying to make a longitudinal model of color histograms and I am running into a couple problems that I hope someone can provide expertise on.
I have image data from plants where color is a phenotype of interest. The color values I am most interested in are Hue values (HSV colorspace), so they are bounded 0:360 in a circle so that 0 and 360 are extremely similar reds. Luckily, these are not Poinsettias so I am not too worried about hitting the red boundary at 0 or 360. When the color data is returned from image analysis, it is a histogram bounded 1:180 with each bin representing 2 degrees of hue and each histogram value being a number of pixels with those hues.
Sometimes the histogram follows a standard distribution for the entire experiment, gaussian or lognormal is common. When that happens I feel pretty good about modeling the circular mean over time (see Only unimodal gaussian histograms reprex).
Sometimes the histogram does not follow a standard distribution and changes a lot between treatment groups. When this comes up I have generally used pairwise comparisons at some set time or pairwise distances between the histograms, but that isn’t very satisfying and I’d like to be able to model the data across time. I have been trying to make a mixture model to capture this, but even starting with what I hoped would be a simplified version has been difficult for me as the model takes about 2 hours to run 600 iterations and did not converge (see Mixture histograms reprex).
My specific questions and examples are below.
1: Are there parts of my mixture model below that seem like they would slow it down/cause problems?
2: Is there a better way to account for the circular shape of Hue than truncating the response term?
3: Is there a way to have the model exclude a mixture component altogether if the histogram is unimodal at that time? I was planning to use a spline to model theta over time by group so that the proportion could go towards 0, but it seems like that will slow the model down significantly more.
4: Did I completely miss some style of modeling that makes more sense here?
Thank you for your time!
Mixture histograms
Apologies for long reprex, not sure what the most painless way to simulate the data is.
library(brms)
library(ggplot2)
dists_a <- list(rlnorm = list(meanlog = log(80), sdlog = 0.3),
rlnorm = list(meanlog = log(100), sdlog = 0.25),
rnorm = list(mean = 120, sd = 10),
rnorm = list(mean = 150, sd = 15),
rmixnorm = list(mean=c(150, 110), sd = c(10, 20), alpha = c(0.7, 0.3)),
rmixnorm = list(mean=c(150, 100), sd = c(10, 5), alpha = c(0.6, 0.4))
)
dists_b <- list(rlnorm = list(meanlog = log(80), sdlog = 0.3),
rlnorm = list(meanlog = log(90), sdlog = 0.25),
rnorm = list(mean = 100, sd = 10),
rnorm = list(mean = 125, sd = 15),
rnorm = list(mean = 150, sd = 10),
rnorm = list(mean = 155, sd = 7)
)
dists <- list("a" = dists_a, "b" = dists_b)
time=24
nreps = 5
min_bin <- 1
max_bin <- 180
set.seed(123)
# showing ggplot of distributions
ggplot(data=data.frame(x=1:180), aes(x=x))+
facet_grid(dist~group)+
lapply(c("a", "b"), function(grp){
lapply(1:6, function(di){
rfun <- names(dists[[grp]])[di]
rargs <- append(dists[[grp]][[di]], 1000)
names(rargs)[length(rargs)]<-"n"
v1 <- do.call(rfun, rargs)
v1[v1>max_bin]<-max_bin
v1[v1<min_bin]<-min_bin
d <- data.frame(group = grp, x = v1, dist = di)
geom_histogram(data=d, aes(x=x, fill=di), alpha=1, binwidth=1, show.legend = FALSE)
})
})+
theme_light()
# Making data
df2 <- do.call(rbind, lapply(c("a", "b"), function(grp){
do.call(rbind, lapply(1:nreps, function(rep){
do.call(rbind, lapply(1:time, function(ti){
whichDist <- ceiling(ti/4)
rfun <- names(dists[[grp]])[whichDist]
rargs <- append(dists[[grp]][[whichDist]], 1000)
names(rargs)[length(rargs)]<-"n"
v1 <- do.call(rfun, rargs)
v1[v1>max_bin]<-max_bin
v1[v1<min_bin]<-min_bin
s1<-hist(v1, breaks=seq(min_bin, (max_bin+1), 1), plot=FALSE)$counts
s1d<-as.data.frame(cbind(data.frame(grp, rep, ti), matrix(s1,nrow=1)))
colnames(s1d)<-c("group","rep", "time", paste0("sim_",min_bin:max_bin))
s1d
}))
}))
})) # data generally comes in wide format
library(data.table)
dt2 <- as.data.table(df2)
df2l <- as.data.frame(melt(dt2, id.vars = c("group", "rep", "time"), # lengthen the data
measure.vars = patterns("sim_*"), value.name = "weight"))
df2l$bin <- as.numeric(gsub("sim_", "", df2l$variable))
df2l<-df2l[df2l$weight>0,] # remove 0 weighted bins
# mixture model
mix <- mixture(gaussian, gaussian)
prior2 <- prior(student_t(3,0,5), nlpar = "A") +
prior(student_t(3,0,5), nlpar = "B")
fit2 <- brm(bf(bin|weights(weight)+trunc(lb=0, ub=180) ~ A + B*time,
#sigma1 ~ 0+time*group, # hoping to model these once the mu1&2 model works
#sigma2 ~ 0+time*group,
#theta1 ~ 0+group:time,
# autocor = ~arma(~time|rep:group,1,1),
A + B ~ 0+group,
nl = TRUE),
family = mix,
prior = prior2,
data = df2l,
iter = 600, cores = 2, chains = 2, backend = "cmdstanr", silent=0, init=0)
That model takes ~2 hours to run and looks pretty terrible.
> fit2
Family: mixture(gaussian, gaussian)
Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity
Formula: bin | weights(weight) + trunc(lb = 0, ub = 180) ~ A + B * time
A ~ 0 + group
B ~ 0 + group
Data: df2l (Number of observations: 22383)
Draws: 2 chains, each with iter = 600; warmup = 300; thin = 1;
total post-warmup draws = 600
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
A_groupa 87.86 1.36 87.53 88.53 1.40 4 68
A_groupb 87.85 1.13 87.45 88.51 1.43 4 84
B_groupa 2.80 0.12 2.70 2.92 1.52 4 47
B_groupb 2.80 0.11 2.71 2.93 1.56 4 41
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1 28.49 21.36 2.24 76.24 1.79 4 26
sigma2 29.45 0.83 28.77 30.78 1.59 4 20
theta1 0.02 0.03 0.00 0.07 2.18 3 12
theta2 0.98 0.03 0.93 1.00 2.23 3 11
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning messages:
1: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors.
2: There were 106 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Only unimodal gaussian histograms
This one works okay but I am still curious if anyone knows of a better way to model this data.
# simulate some data
set.seed(123)
time=25
means <- cbind(seq(80, 130, length.out=time), seq(65, 150, length.out=time))
sds <- cbind(seq(15, 8, length.out=time), seq(20, 5, length.out=time))
min_bin <- 1
max_bin <- 180
set.seed(123)
df1 <- do.call(rbind, lapply(1:time, function(ti){
do.call(rbind, lapply(1:2, function(i){
grp <- letters[i]
do.call(rbind, lapply(1:5, function(rep){
mu <- rnorm(1, means[ti, i], means[ti, i]/10)
sd <- rnorm(1, sds[ti, i], sds[ti, i]/10)
v1 <- rnorm(1000, mu, sd)
v1[v1>max_bin]<-max_bin
v1[v1<min_bin]<-min_bin
s1<-hist(v1, breaks=seq(min_bin, (max_bin+1), 1), plot=FALSE)$counts
s1d<-as.data.frame(cbind(data.frame(grp, rep, ti), matrix(s1,nrow=1)))
colnames(s1d)<-c("group","rep", "time", paste0("sim_",min_bin:max_bin))
s1d
}))
}))
}))
# aggregate it to Mu/Sd
ag1 <- do.call(rbind, lapply(1:nrow(df1), function(i){
sub <- df1[i, -c(1:3)]
hist <- rep(1:180, times = as.numeric(sub))
data.frame(group = df1[i,"group"], rep = df1[i,'rep'], time = df1[i,'time'],
mu = mean(hist), sd = sd(hist))
}))
# basic model of it
prior1 <- prior(student_t(3,0,5), nlpar = "A") +
prior(student_t(3,0,5), dpar="sigma") +
prior(gamma(2,0.1), class="nu")
fit1 <- brm(bf(mu ~ A + B*time,
sigma~ s(time, by=group),
A + B ~ 0+group,
autocor = ~arma(~time|rep:group,1,1),nl = TRUE),
family = student, prior = prior1, data = ag1, iter = 600,
cores = 2, chains = 2, backend = "cmdstanr", silent=0 )
fit1 # should work fine
# alternatively using weights instead of the mean
library(data.table)
dt <- as.data.table(df1)
df1l <- as.data.frame(melt(dt, id.vars = c("group", "rep", "time"),
measure.vars = patterns("sim_*"), value.name = "weight"))
df1l$bin <- as.numeric(gsub("sim_", "", df1l$variable))
prior1b <- prior(student_t(3,0,5), nlpar = "A") +
prior(gamma(2,0.1), class="nu")
fit1b <- brm(bf(bin|weights(weight) ~ A + B*time,
A + B ~ 0+group,
autocor = ~arma(~time|rep:group,1,1),nl = TRUE),
family = student, prior = prior1b, data = df1l, iter = 600,
cores = 2, chains = 2, backend = "cmdstanr", silent=0 )
fit1b # does not love to include a sigma term, but this version works