Dear Stan community,
My question is on how to deal with small sample sizes when fitting dose-response curves.
Consider a dataset with two simulated samples (red and blue below) applied equally to three blocks, but with some degree of block-to-block variation in the horizontal position of the curve (i.e. the curve shape is very similar, but it is shifted to the right or to the left depending on the block it is in):
This would be called a Randomized Complete Block Design, and we try to fit these data using a four-parameter logistic:
y_{i,s,b} = \frac{A}{1+2^{-B(x_i - C_{s,b})}}+D+\epsilon_i
Where i indexes the observations, s the samples (1 and 2) and b the blocks (1, 2 and 3). In addition, A is the range (fixed at 1), B is the slope (fixed at 1), D is the no-dose asymptote (fixed at 0.2) and \epsilon_i \sim \mathcal{N}(0,0.01) (a very small residual). The parameters of interest are C_{s,b}, the log-EC50 (the log-dose at which the response is halfway between the two asymptotes), and \sigma_{logCBlock}, the block-to-block variation for that parameter, described below:
C_{s,b} = logC_s + \zeta_b
where logC_s is the sample-specific EC50 and the \zeta_b is the block-specific “noise”:
logC_s \sim \mathcal{N}(\mu_{logC},\sigma_{logC}), with \mu_{logC}=0 and \sigma_{logC}=1,
\zeta_b \sim \mathcal{N}(0,\sigma_{logCBlock}),
with the hyperprior:
\sigma_{logCBlock} \sim \mathcal{N}(a,b), truncated below at zero; for this example, a=3 and b=2.
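For reference, the generative model above can be sketched in Python (numpy only; the seed, the number of dose levels, and the rejection-sampling step for the truncated hyperprior are illustrative choices, not from the original post):

```python
import numpy as np

def fpl(log_x, A=1.0, B=1.0, log_c=0.0, D=0.2, log_base=2.0):
    """Four-parameter logistic mean, matching fplBPosRange in the Stan code below."""
    return A / (1.0 + log_base ** (-B * (log_x - log_c))) + D

rng = np.random.default_rng(1)  # arbitrary seed

# Hyperparameters as stated in the post
mu_logC, sigma_logC = 0.0, 1.0   # prior on the sample-level log-EC50
a, b = 3.0, 2.0                  # hyperprior on sigma_logCBlock
sigma_eps = 0.01                 # residual sd

# sigma_logCBlock ~ Normal(a, b) truncated at zero (simple rejection sampling)
sigma_block = rng.normal(a, b)
while sigma_block <= 0:
    sigma_block = rng.normal(a, b)

logC = rng.normal(mu_logC, sigma_logC, size=2)  # logC_s, s = 1, 2
zeta = rng.normal(0.0, sigma_block, size=3)     # zeta_b, b = 1, 2, 3
C = logC[:, None] + zeta[None, :]               # C_{s,b} = logC_s + zeta_b

# y_{i,s,b} = fpl(x_i; C_{s,b}) + eps, over an illustrative grid of 12 log-doses
log_x = np.linspace(-13.83, 13.83, 12)
y = fpl(log_x[None, None, :], log_c=C[:, :, None]) \
    + rng.normal(0.0, sigma_eps, size=(2, 3, log_x.size))
```

As a sanity check, at log_x = C_{s,b} the mean response is A/2 + D = 0.7, and the lower and upper asymptotes are D = 0.2 and A + D = 1.2, matching the plateaus visible in the simulated y values of the data frame below.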
I am trying to fit this but get multiple divergent transitions unless the \sigma_{logCBlock} prior is concentrated around the true value of \sigma_{logCBlock}. For example, the simulated data for the reproducible example below uses \sigma_{logCBlock}=1.5, and the inference results in the following pairs plot:
Below is a reproducible example:
modelCode = 'functions {
  real fplBPosRange(real logX, real logBase, real A, real B, real logC, real D) {
    return A/(1+logBase^(-B*(logX-logC)))+D;
  }
}
data {
  int<lower=1> N;                           // size of dataset
  vector[N] logX;                           // dose value on the log scale
  real y[N];                                // response values
  int<lower=1> nSamples;                    // number of samples
  int<lower=1,upper=nSamples> sample[N];    // sample index per observation
  int<lower=1> nBlocks;                     // number of blocks
  int<lower=1,upper=nBlocks> cellBlock[N];  // block index per observation
  real meanLogCLocationPrior;
  real meanLogCScalePrior;
  real blockSigmaLogCLocationPrior;
  real blockSigmaLogCScalePrior;
}
parameters {
  real meanLogC[nSamples];
  // lower bound keeps blockSigmaLogC = location + raw*scale positive
  real<lower=-blockSigmaLogCLocationPrior/blockSigmaLogCScalePrior> blockSigmaLogC_raw;
  vector[nBlocks] blockNoiseLogC_raw;       // non-centered block effects
}
transformed parameters {
  vector[N] logC;
  real blockSigmaLogC;
  vector[nBlocks] blockNoiseLogC;
  blockSigmaLogC = blockSigmaLogCLocationPrior + blockSigmaLogC_raw*blockSigmaLogCScalePrior;
  for (b in 1:nBlocks) {
    blockNoiseLogC[b] = blockNoiseLogC_raw[b]*blockSigmaLogC;
  }
  for (i in 1:N) {
    logC[i] = meanLogC[sample[i]] + blockNoiseLogC[cellBlock[i]];
  }
}
model {
  vector[N] mu_y;
  // implies blockSigmaLogC ~ normal(location, scale), truncated below at zero
  blockSigmaLogC_raw ~ normal(0, 1);
  blockNoiseLogC_raw ~ normal(0, 1);
  meanLogC ~ normal(meanLogCLocationPrior, meanLogCScalePrior);
  for (i in 1:N) {
    mu_y[i] = fplBPosRange(logX[i], 2, 1, 1, logC[i], 0.2);
  }
  y ~ normal(mu_y, 0.01);
}'
data <- data.frame(log2Dose=c(-13.83,-13.83,-13.83,-13.83,-13.83,-13.83,-11.31,-11.31,-11.31,-11.31,-11.31,-11.31,-8.8,-8.8,-8.8,-8.8,-8.8,-8.8,-6.29,-6.29,-6.29,-6.29,-6.29,-6.29,-3.77,-3.77,-3.77,-3.77,-3.77,-3.77,-1.26,-1.26,-1.26,-1.26,-1.26,-1.26,1.26,1.26,1.26,1.26,1.26,1.26,3.77,3.77,3.77,3.77,3.77,3.77,6.29,6.29,6.29,6.29,6.29,6.29,8.8,8.8,8.8,8.8,8.8,8.8,11.31,11.31,11.31,11.31,11.31,11.31,13.83,13.83,13.83,13.83,13.83,13.83),y=c(0.21,0.2,0.21,0.2,0.2,0.2,0.21,0.2,0.21,0.19,0.2,0.21,0.2,0.21,0.23,0.21,0.2,0.21,0.23,0.21,0.25,0.22,0.23,0.24,0.33,0.27,0.41,0.34,0.3,0.46,0.64,0.47,0.82,0.7,0.61,0.86,1.02,0.87,1.11,1.05,1,1.11,1.16,1.11,1.19,1.17,1.14,1.17,1.22,1.18,1.2,1.22,1.2,1.2,1.19,1.2,1.22,1.19,1.2,1.2,1.22,1.22,1.18,1.21,1.2,1.2,1.2,1.18,1.19,1.19,1.2,1.18),sample=c(1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2,1,1,2,1,2,2),cellBlock=c(1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3,1,2,1,3,2,3))
library(rstan)
seed <- 1
iterations <- 8000
stanFit <- rstan::stan(
model_code = modelCode,
data = list(N=nrow(data),
logX=data$log2Dose,
y=data$y,
nSamples=2,
sample=data$sample,
nBlocks=3,
cellBlock=data$cellBlock,
meanLogCLocationPrior=0,
meanLogCScalePrior=1,
blockSigmaLogCLocationPrior=3,
blockSigmaLogCScalePrior=2), # named list of data
chains = 4, # number of Markov chains
warmup = round(iterations/2), # number of warmup iterations per chain
iter = iterations, # total number of iterations per chain
cores = 4, # number of cores (could use one per chain)
refresh = 0, # no progress shown
seed = seed,
pars=c("meanLogC","blockSigmaLogC"),
save_warmup=FALSE,
control=list(adapt_delta=0.9)
)
I have two questions:
- The divergences are very likely due to the small number of blocks. Three blocks is perhaps too few to estimate the block-to-block variation \sigma_{logCBlock} (blockSigmaLogC in the pairs plot). The lp__ vs. blockSigmaLogC panel of the pairs plot shows several draws for which the block-to-block variation is near its highest values, which probably troubles the sampler. If I simulate data with more blocks, the problem goes away. I think I either need to use more blocks or a stronger prior. Would people agree with this, or is there anything else I can do? I ask because, in my experience in the lab, dataset sizes like this are (perhaps unfortunately) typical, so I am trying to figure out strategies to deal with them.
- (Hopefully an easy question): As you can see, the two sample-level parameters meanLogC[1] and meanLogC[2] (logC_1 and logC_2 in the equations) are correlated, and there can sometimes be 8 samples in each block. Do you have any suggestions or examples of how to reduce this correlation?
Thank you in advance for any suggestions.