Dose-response model with partial pooling

I was under the impression that I had already used truncated distributions; e.g. for c I have

vector<lower=0>[J] c;
real<lower=0> mu_c;
real<lower=0> sigma_c;
c ~ normal(mu_c, sigma_c);
mu_c ~ normal(0, 10);
sigma_c ~ cauchy(0, 2.5);

Since c has a lower bound of 0, normal(mu_c, sigma_c) should correspond to a truncated normal, no?

Do you think it might make more sense to have a Poisson prior on c and d, since they correspond to asymptotic counts and y ~ poisson(mu_y)? That would reduce the number of hyperparameters. Perhaps this is the issue here?

Following @arya’s suggestion, I changed model_code to implement non-centred parametrisation (as discussed in the Stan user guide).

model_code <- "
functions {
  real LL4(real x, real b, real c, real d, real ehat) {
    return c + (d - c) / (1 + exp(b * (log(x) - ehat)));
  }
}

data {
  int<lower=1> N;                           // Measurements
  int<lower=1> J;                           // Number of drugs
  int<lower=1,upper=J> drug[N];             // Drug of measurement
  vector[N] x;                              // Dose values
  int y[N];                                 // Response values for drug
}

// Transformed data
transformed data {
}

// Sampling space
parameters {
  vector<lower=0>[J] b_raw;            // slope
  vector<lower=0>[J] c_raw;            // lower asymptote
  vector<lower=0>[J] d_raw;            // upper asymptote
  vector[J] ehat_raw;                  // loge(IC50)

  // Hyperparameters
  real<lower=0> mu_c;
  real<lower=0> sigma_c;
  real<lower=0> mu_d;
  real<lower=0> sigma_d;
  real mu_ehat;
  real<lower=0> sigma_ehat;
  real<lower=0> mu_b;
  real<lower=0> sigma_b;
}

// Transform parameters before calculating the posterior
// Use non-centred parametrisation of b, c, d, ehat
transformed parameters {
  vector<lower=0>[J] b;
  vector<lower=0>[J] c;
  vector<lower=0>[J] d;
  vector[J] ehat;
  vector<lower=0>[J] e;

  b = mu_b + sigma_b * b_raw;
  c = mu_c + sigma_c * c_raw;
  d = mu_d + sigma_d * d_raw;
  ehat = mu_ehat + sigma_ehat * ehat_raw;

  e = exp(ehat);
}

// Calculate posterior
model {
  vector[N] mu_y;

  // Parameters
  b_raw ~ std_normal();
  c_raw ~ std_normal();
  d_raw ~ std_normal();
  ehat_raw ~ std_normal();

  // Priors for hyperparameters
  mu_c ~ normal(0, 10);
  sigma_c ~ cauchy(0, 2.5);
  mu_d ~ normal(max(y), 10);
  sigma_d ~ cauchy(0, 2.5);
  mu_ehat ~ normal(mean(log(x)), 10);
  sigma_ehat ~ cauchy(0, 2.5);
  mu_b ~ normal(1, 2);
  sigma_b ~ cauchy(0, 2.5);

  for (i in 1:N) {
    mu_y[i] = LL4(x[i], b[drug[i]], c[drug[i]], d[drug[i]], ehat[drug[i]]);
  }
  y ~ poisson(mu_y);

}
"

Run times are similar, as are the estimates. I still end up with what I think is a significant number of divergent transitions (somewhere between hundreds and thousands).

I’m not entirely sure I correctly implemented the non-centred parametrisation, in particular in light of the parameter bounds.
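
For reference, one common way to combine a positivity constraint with a non-centred parametrisation is to non-centre on the log scale, which effectively swaps the truncated normal for a lognormal at the group level. A minimal sketch for c only (mu_log_c and sigma_log_c are names introduced here; the same pattern would apply to b and d):

parameters {
  vector[J] log_c_raw;                 // unconstrained; gets a std_normal() prior
  real mu_log_c;                       // hyperparameters on the log scale
  real<lower=0> sigma_log_c;
}
transformed parameters {
  // implies c ~ lognormal(mu_log_c, sigma_log_c), so c > 0 by construction
  vector<lower=0>[J] c = exp(mu_log_c + sigma_log_c * log_c_raw);
}
model {
  log_c_raw ~ std_normal();
}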


Usually for parameters that are constrained to be positive I would go for a log-normal, a chi-square, a gamma, or an inverse-gamma.

As for the truncated densities, you need to divide by the area under the curve to get the correct density. So if y is constrained to be positive, then the truncated N(0,1) density is the N(0,1) density, which we’ll write N(y | 0, 1), divided by the normalization term

\int_0^\infty N(y | 0, 1)\, dy.

Since the support of a truncated normal is smaller, i.e. [0, \infty) instead of (-\infty, \infty), this normalization term ensures that the density integrates to one and is thus a proper density.
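
In Stan, this normalization can be added explicitly with a truncation statement; a minimal sketch, reusing c, mu_c and sigma_c from the model above:

for (j in 1:J) {
  // T[0, ] subtracts normal_lccdf(0 | mu_c, sigma_c) from the log density,
  // i.e. it divides by the area above zero, giving a proper truncated normal
  // even when mu_c and sigma_c are themselves parameters
  c[j] ~ normal(mu_c, sigma_c) T[0, ];
}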

Ok, I will look into this. I’m not convinced that this will make a significant difference though. On that note I also realised that I cannot use Poisson priors on c and d as we cannot have integer arrays as parameters.

Yes, I agree with the mathematical details, but I’m not sure this is relevant in Stan. From what I understand, declaring a lower bound on a parameter restricts sampling to that support, so the density is effectively truncated. For example, from the Prior Choice Recommendations:

[…] half-normal(0,10) (that’s implemented as normal(0,10) with a <lower=0> constraint in the declaration of the parameter) […]

No need to manually normalise densities.
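
For comparison, a minimal sketch of the constraint-only case from that recommendation (tau is a name introduced here). Because the location and scale are fixed constants, the normalising term dropped by this version is itself a constant, so inference is unaffected:

parameters {
  real<lower=0> tau;
}
model {
  // half-normal(0, 10): the normalising term for the truncation at zero is
  // the constant log(2), so dropping it leaves the posterior unchanged
  tau ~ normal(0, 10);
}

When the location and scale are themselves parameters, as for c above, the omitted term depends on mu_c and sigma_c, which is the case the truncation statement in the earlier sketch handles explicitly.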

An update

Log-normal densities for 0-bounded (count-like) parameters

Using lognormal densities for c and d significantly improved the runtime (thanks @arya for the suggestion).

The model now looks like this:

functions {
  real LL4(real x, real b, real c, real d, real ehat) {
    return c + (d - c) / (1 + exp(b * (log(x) - ehat)));
  }
}

data {
  int<lower=1> N;                           // Measurements
  int<lower=1> J;                           // Number of drugs
  int<lower=1,upper=J> drug[N];             // Drug of measurement
  vector[N] x;                              // Dose values
  int y[N];                                 // Response values for drug
}

// Transformed data
transformed data {
}

// Sampling space
// We need to impose parameter constraints on the LL4 parameters:
// Since c,d ~ LogNormal(), we need <lower=0> for c,d
parameters {
  vector<lower=0>[J] b;                     // slope
  vector<lower=0>[J] c;                     // lower asymptote
  vector<lower=0>[J] d;                     // upper asymptote
  vector[J] ehat;                           // loge(IC50)

  // Hyperparameters
  real<lower=0> mu_b;
  real<lower=0> sigma_b;
  real mu_c;
  real<lower=0> sigma_c;
  real mu_d;
  real<lower=0> sigma_d;
  real mu_ehat;
  real<lower=0> sigma_ehat;
}

// Transform parameters before calculating the posterior
transformed parameters {
  vector<lower=0>[J] e;
  real log_sigma_b;
  real log_sigma_c;
  real log_sigma_d;
  real log_sigma_ehat;

  // IC50
  e = exp(ehat);
  
  log_sigma_b = log10(sigma_b);
  log_sigma_c = log10(sigma_c);
  log_sigma_d = log10(sigma_d);
  log_sigma_ehat = log10(sigma_ehat);
}

// Calculate posterior
model {
  // Declare mu_y in model to make it local (i.e. we don't want mu_y to show
  // up in the output)
  vector[N] mu_y;

  // Parameter priors
  c ~ lognormal(mu_c, sigma_c);
  d ~ lognormal(mu_d, sigma_d);
  ehat ~ normal(mu_ehat, sigma_ehat);
  b ~ normal(mu_b, sigma_b);

  // Priors for hyperparameters
  mu_c ~ normal(0, 5);
  sigma_c ~ cauchy(0, 2.5);
  mu_d ~ normal(max(y), 5);
  sigma_d ~ cauchy(0, 2.5);
  mu_ehat ~ normal(mean(log(x)), 5);
  sigma_ehat ~ cauchy(0, 2.5);
  mu_b ~ normal(1, 2);
  sigma_b ~ cauchy(0, 2.5);

  for (i in 1:N) {
    mu_y[i] = LL4(x[i], b[drug[i]], c[drug[i]], d[drug[i]], ehat[drug[i]]);
  }
  y ~ poisson(mu_y);
}

Correlation between parameters and their hyperparameters

There is still a potential issue with the correlation between the LL4 parameters and their hyperparameters. For example, pairs(fit, pars = c("b", "mu_b", "log_sigma_b")) shows a clear, funnel-shaped correlation structure between mu_b and the log-transformed sigma_b.

Question: Is this something to worry about? And if so, what steps can I take to prevent it from happening?

Glad to hear the lognormal worked out. I wonder whether the earlier problem was the missing normalization of the truncated normal.

Yeah, that’s definitely a funnel, but if you’re not getting divergences and the other sampler diagnostics look good, you should be ok.

I notice that, using the original pooled model with normal priors, sigma_d is estimated to be 985, but its prior is cauchy(0, 2.5). If you shift that prior to have more mass near the estimated value, does the performance improve?

Thanks @FJCC for pointing that out. In fact, I had a closer look at the d estimates (including hyperparameters), and here are the results

summary(fit, pars = c("d", "mu_d", "sigma_d"))$summary
#            mean    se_mean          sd     2.5%      25%      50%      75%
#d[1]    3489.865  0.5093660   31.741119 3427.630 3467.723 3490.117 3511.441
#d[2]    2127.745  1.1523709   46.468022 2038.463 2096.270 2127.392 2158.768
#d[3]    2576.638  0.7579097   38.674776 2501.041 2549.911 2575.692 2602.964
#mu_d    3474.843  0.0646205    4.871406 3465.109 3471.562 3474.840 3478.181
#sigma_d 3779.928 40.9239714 1960.299861 1791.430 2589.539 3245.744 4337.591
#           97.5%    n_eff      Rhat
#d[1]    3550.748 3883.154 0.9997746
#d[2]    2221.764 1626.012 1.0003123
#d[3]    2653.945 2603.878 1.0004396
#mu_d    3484.382 5682.874 0.9996138
#sigma_d 8859.847 2294.507 1.0022283

Note that sigma_d is as large as mu_d itself. I find that odd, considering that the measured maxima of the responses per group (which are roughly what d should correspond to) are

by(data_stan$y, data_stan$drug, FUN = max)
#data_stan$drug: 1
#[1] 3475
#------------------------------------------------------------
#data_stan$drug: 2
#[1] 2092
#------------------------------------------------------------
#data_stan$drug: 3
#[1] 2575

Ok, d[1], d[2] and d[3] agree with these numbers. But the mean and sd of these numbers are 2714 and 701, respectively. I’m very confused as to why the mu_d and sigma_d estimates are so large; that doesn’t seem to make sense to me. The whole point of pooling should have been to shrink estimates towards the overall mean.

First, a disclaimer - I am just a novice Stan user trying to learn by participating in this forum. Don’t take anything I say as authoritative.

I am unsure which model produced your last results but in any case, I think of mu_d as being estimated from only three values. d is the upper asymptote of a drug and you only have three of those. In any context, given the three numbers 2100, 2600, and 3500, you can’t say much about the distribution of the population that produced the values other than it is typically in the low thousands, kind of, probably. What really strikes me is the very tight estimate on mu_d coupled with the huge standard deviation on sigma_d. Is the prior on mu_d still normal(max(y), 5)? That is really tight and matches the posterior estimate. That might drive sigma_d to larger values so that d[2] down at 2100 is reasonably probable. That isn’t expensive in terms of the total probability because there is only one value up at 3500 that loses probability as the sigma increases. If you do have the tight prior on mu_d, try making it much broader and not pinned to the global maximum and see how the result changes. Maybe normal(mean(y), 750). Though, again, I am unsure which model you are using so that suggestion might not make sense.
Edit: I wrote the wrong thing with normal(mean(y), 750). What I meant by mean(y) there is mean(y_max), the mean of the three maximum y values, one for each drug.

I think you hit the nail on the head @FJCC! Good spot. The prior on mu_d was way too tight, which meant that the posterior density was entirely driven by the prior.

The updated model now has

mu_d ~ normal(log(max(y)), sqrt(log(max(y))));
sigma_d ~ cauchy(0, 2.5);

I chose log(max(y)) because the median of the lognormal distribution is \exp(\mu), and centring the prior on mu_d around (an estimate of) the median of the asymptotic d values should correspond to a weakly informative prior for mu_d.
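
As a quick numerical check of that prior, using max(y) = 3475 from the per-drug maxima above:

\mu_d \sim \text{N}\!\left(\log(3475),\ \sqrt{\log(3475)}\right) \approx \text{N}(8.15,\ 2.86),

and since the median of a \text{LogNormal}(\mu, \sigma) distribution is e^{\mu}, centring the prior for mu_d at \log(3475) places the prior median of d at \max(y) = 3475. The posterior mean of mu_d reported below, roughly 7.89 (e^{7.89} \approx 2670), sits comfortably within this prior.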

I’ve put the full model (which has the lognormal densities for c and d as proposed by @arya and the revised prior on mu_d) in a public Gist.

You can run the model with

source("https://gist.githubusercontent.com/mevers/04a9de33cda14007d0f42de95f41ac1c/raw/fd2e56dc306907bc774304c67859e187e1b333ee/drc_partial_pooling_28Mar20.R")
summary(fit, pars = c("d", "mu_d", "sigma_d"))$summary
#               mean    se_mean         sd         2.5%          25%
#d[1]    3489.382726 0.53136354 31.4417773 3427.6113664 3468.3930208
#d[2]    2130.578929 1.09318741 46.8088715 2043.3053502 2098.9764028
#d[3]    2577.925938 0.79023112 39.5864489 2504.3899352 2550.4163144
#mu_d       7.890196 0.01387084  0.4580756    7.0108383    7.7181793
#sigma_d    0.652075 0.01697801  0.6281559    0.1514284    0.2881427
#                 50%          75%       97.5%    n_eff      Rhat
#d[1]    3488.4909798 3510.6422227 3552.552667 3501.311 1.0000432
#d[2]    2130.6564313 2160.3129676 2227.081541 1833.442 0.9994563
#d[3]    2576.3669121 2605.1142657 2658.267030 2509.486 1.0020853
#mu_d       7.8863173    8.0569836    8.895245 1090.609 1.0032951
#sigma_d    0.4406763    0.7782541    2.350519 1368.867 1.0014950

Much better! exp(7.89) ≈ 2670, which is very close to 2714, the mean of 3475, 2092 and 2575 (the maximum response per drug).


Some issues remain:

  1. Run times are still quite long (~6 hours) for >1000 drugs.

  2. There are still funnels in the pairs plots, for example in

    pairs(fit, pars = c("d", "mu_d", "log_sigma_d"))

    and similarly for b, c and ehat. Is this something to worry about?

Since mu_d is on a log scale and all of the d[x] parameters are not, I am not surprised that the pairs plot looks pinched in the mu_d direction. Maybe that is just naive on my part.

I meant the funnel in the mu_d vs. log_sigma_d plot. I’m not concerned about the features in the d[i] vs. mu_d pair plots (I don’t think they are examples of Neal’s funnels); as you said, they are due to the difference in scale (linear vs. log).

I’m unsure whether the mu_d vs. log_sigma_d funnel (which also shows up in the mu_c vs. log_sigma_c, mu_b vs. log_sigma_b etc. plots) is any reason for concern (e.g. regarding parameter estimates, sampling performance etc.).

Apparently sigmoid curves are tricky for identifiability reasons.

See these links:
https://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/

Struggling with all this myself!


I hesitate to say much about the quality of your model since I have only a little experience with Stan. There is a nice document describing a good Stan workflow available here, with a substantial section on fit diagnostics. (I think the recommendation for Rhat has changed to < 1.01.) I like using the ShinyStan package to quickly graph the diagnostics.

Any insight is useful. I always use shinystan for diagnostics; it’s a great tool. Thanks also for the link to Michael Betancourt’s document.

Having read through the links that @jroon posted (thanks @jroon, they were very useful!), this seems to be more complex than I had naively thought.

To give some background: The idea is to use a hierarchical model to model dose-response data. The data are based on high-content phenotypic screens and measure the number of cells that have survived following treatment with different drugs at different concentrations. This is a breadth-over-depth experiment, so we expect few dose measurements for a lot of drugs. Using partial pooling, we should be able to improve the estimates of the model parameters for every drug by shrinking them towards the overall mean, which we learn from the full data.


The main issues seem to centre around the non-identifiability (or limited identifiability) of these kinds of models, even in the presence of data across the full range of the sigmoid curve. This is apparent from the correlation between the model parameters b, c, d, and \hat{e}. Identifiability gets even worse if measurements are located towards the tail ends of the sigmoid function.

Still, I think partial pooling should help here. Even if we have limited data for some drugs, the distribution of the parameters b, c, d, and \hat{e} which we learn from all measurements should provide better estimates for those drugs as well.

Unfortunately, in its current form, the model doesn’t seem to do that: it works fine for well-behaved, "good" data, but as soon as we include data that don’t show a clear sigmoid trend, the sampler struggles and gives (lots of) divergent transitions.

Some of the things that were discussed in the links that @jroon posted include

  1. partially pooling some but not all parameters,
  2. using different parameterisations of the model.

On top of the issues I fixed earlier thanks to @arya and @FJCC, I’ve tried

  1. partially pooling d_i \sim N(\mu_d, \sigma_d) but leaving the c’s unpooled,
  2. using a non-centred parametrisation for all model parameters (to deal with the funnels in the hyperparameter correlation plots),
  3. imposing the constraint c_i < d_i on the parameters (this, in combination with b > 0, ensures that the sigmoid response decreases with increasing concentrations; see the sketch below).
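
For point 3, one way to impose c_i < d_i is to parametrise the upper asymptote as the lower asymptote plus a strictly positive gap; a minimal sketch (delta, mu_delta and sigma_delta are names introduced here, J is as in the data block above):

parameters {
  vector<lower=0>[J] c;                       // lower asymptote, as before
  vector<lower=0>[J] delta;                   // positive gap between the asymptotes
  real mu_delta;
  real<lower=0> sigma_delta;
}
transformed parameters {
  vector<lower=0>[J] d = c + delta;           // ensures c[j] <= d[j] for every drug
}
model {
  delta ~ lognormal(mu_delta, sigma_delta);   // partially pool the gap instead of d
}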

None of these attempts have made the fit more stable. Since I’m no Stan expert, this all feels a bit like stabbing in the dark.

Just wanted to say that I was completely unable to fit sigmoids reliably without the reparametrization I discuss in the links @jroon already posted, so I would start there.

I would just add that, as part of the "Dissection of prostate tumour … " preprint, we implemented a model with partial pooling (regularized a.k.a. Finnish horseshoe) over the "slope" parameter of the sigmoid plus a variable number of fixed effects, see here: https://github.com/stemangiola/TABI/blob/master/src/stan_files/DE_sigmoid.stan (the code is not the prettiest, but I hope it can help you; also, the Finnish horseshoe is a tricky beast and, if I can justify it, I usually keep to "normal" partial pooling).

As general advice, if a simpler model has problems, even minor ones (e.g. correlated parameters), it is IMHO usually better to try to resolve those before adding complexity to the model. My experience is that when I don’t invest time in getting the basic parts of the model just right, I end up "stabbing in the dark" a lot and still end up revisiting the simple model, but after wasting quite a lot of time.

Best of luck with your model and hope you’ll find my inputs useful.

8 posts were merged into an existing topic: Hierarchical Gompertz model

Thanks @martinmodrak and @jroon for the interesting comments and feedback. Martin, I had a look at the preprint, interesting paper; it seems we’re working in very similar areas.

I think (hope) that some of the fitting difficulties you’ve encountered may be avoided in our case, where most of the data are expected to cover the full range of the sigmoid response. To give a bit more context, we are measuring cell survival following treatment with ~4000 drugs at around 10 different concentrations in a high-content phenotypic screen. Concentrations were chosen such that the sigmoid response should be covered for a large fraction of drugs. Of course, there will be drugs with very high levels of toxicity or very low efficacies which would result in all measured responses being located in the tail ends of the sigmoid response. In fact, this was one of the motivations for using a hierarchical model with partial pooling: to learn the distribution of critical model parameters from all measurements to improve estimation on the level of individual drugs.

Thanks to your explorations, I started adopting a similar re-parametrisation of the four-parameter log-logistic model of the form

\text{LL4}'(x; b, c, \theta, \kappa) = c + \frac{(\theta-c) \exp{\left(-\kappa\right)}}{1 + \exp{\left(b \log{x} - \kappa\right)}}

I can see the advantage of this parametrisation in the asymptotic regimes

\begin{aligned} \lim_{x\to\infty}\textrm{LL4}'(x) &= c\\ \lim_{x\to0}\textrm{LL4}'(x) &= c + (\theta - c) \exp{\left(-\kappa\right)}\,. \end{aligned}
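
Both limits follow from rewriting the exponential term in the denominator: for b > 0,

\exp{\left(b \log{x} - \kappa\right)} = x^{b} e^{-\kappa} \to \infty \text{ as } x \to \infty \quad \text{and} \quad \to 0 \text{ as } x \to 0.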

For example, in the presence of (lots of) data at very low concentrations x\to 0, the data should inform c, \theta and \kappa, whereas under the initial parametrisation (see my initial post) they would only help with estimating d.
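
For concreteness, a minimal sketch of the re-parametrised curve as a Stan function, meant as a drop-in replacement for LL4() in the models above (LL4_reparam is a name introduced here):

functions {
  // LL4'(x; b, c, theta, kappa) = c + (theta - c) * exp(-kappa) / (1 + exp(b * log(x) - kappa))
  real LL4_reparam(real x, real b, real c, real theta, real kappa) {
    return c + (theta - c) * exp(-kappa) / (1 + exp(b * log(x) - kappa));
  }
}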

Unfortunately I have been unsuccessful fitting the re-parametrised model to data. I don’t understand (yet) why this happens but will update once I’ve done some more testing.


Hi @jroon and @martinmodrak. It might be better to move these last few posts involving @jroon’s Gompertz model to a separate topic (or perhaps back to the original one). The reason being that my original question deals with quite a different generative model (a four-parameter log-logistic model); I think keeping these two discussions separate will make it easier for future readers to identify relevant posts.

Perhaps an admin/mod can move these posts (@martinmodrak)?

Cheers,
Maurits


Good point, I moved them to the Gompertz topic! Thanks for reminding me.
