Physically-based model: How to parameterize program such that one variable must be larger than another


#1

Hi all,
I’m trying to fit a non-linear model used in soil science, relating soil pressure to water content - the van Genuchten model. I’m getting a large number of divergent transitions (70 - 90%), however. I’ve been reading the Stan manual and poking around online to try and figure out why this might be, but haven’t been successful, yet. I’m starting to think one of my (many?) problems is the way I’ve parameterized the model. One specific thing I’m looking into is that one of my parameters must be greater than another (their subtraction cannot be negative). Specifically, s > r in the model below.

Making-up data in R to be fit in Stan:

library(tidyverse)
library(rstan)
#Creating the non-linear soil water retention curve function: van Genuchten (1980)
#psi: pressure (kPa)
#s: porosity (unitless)
#r: specific retention (unitless)
#a: capillarity (kPa^-1)
#n: pore size distribution index (unitless)
#theta: volumetric soil moisture content (unitless)
vanGen <- function(psi, s, r, a, n) {
  top <- s - r
  bot1 <- 1 + ((a * abs(psi)) ^ n)
  bot2 <- bot1 ^ (1 - (1 / n))
  theta <- r + (top / bot2)
  theta
}
#Generate pressure data that spans orders of magnitude.
psi <- c(1:9 %o% 10^(-2:7)) #kPa.
#Sample the result to make it a but less regular and thus more realistic.
psi <- sample(psi, round(length(psi) * 0.5))
s <- 0.8
r <- 0.2
a <- 1/5 #5 kPa is approximately 50 cm.
n <- 1.56
#Run the deterministic model.
theta <- vanGen(psi, s, r, a, n)
#Add noise to the response.
theta <- rnorm(length(theta), theta, 0.015)
#Visually inspect that the relationship looks right.
qplot(psi, theta) + scale_x_log10()
#Location of stan file on my computer (see stan code below).
fileLoc <- "ExampleModels/NonLinearRegression/vanGenuchten.stan"
#Data to be used in Stan.
data <- list(N = length(theta), theta = theta, psi = psi)
#Stan call.
vgBayes <- stan(
  file = fileLoc,
  data = data,
  chains = 4,
  iter = 1000,
  warmup = 500,
  algorithm = "NUTS",
  control = list(adapt_delta = 0.99)
)

Below is the Stan function I made:

functions {
//Below is the same van Genuchten model made in R.
  real vg(real psi, real s, real r, real a, real n) {
    real top;
    real bot1;
    real bot2;
    real out;
    
    top = s - r; // I think this might be the problem spot, as s must always be greater than r to be physically
// meaningful.
    bot1 = 1 + ((a * fabs(psi)) ^ n);
    bot2 = bot1 ^ (1 - (1 / n));
    out = r + (top / bot2);
    return out;
  }
  
//A root mean squared error function that I made up.
  real RMSEFunc(int N, real[] yPred, real[] yObs) {
    real L;
    real SUM[N];
    real out;
    
    L = N * 1.0; //Wasn't sure the best practice for converting int to real, but this worked.
    for(i in 1:N) {
      SUM[i] = (yPred[i] - yObs[i]) ^ 2;
    }
    out = sqrt(sum(SUM) / L);
    return out;
  }
}
data {
  int N;
  real <lower = 0, upper = 1> theta[N];
  real psi[N];
}
parameters {
  real <lower = 0, upper = 1> s; //Should be physically bounded between r and 1.
  real <lower = 0, upper = 1> r; //Should be physically bounded between 0 and s.
  real <lower = 0> a; //Must be above 0.
  real <lower = 1> n; //Must be above 1.
  real <lower = 0> sigma;
}
transformed parameters{
  real theta_pred[N];
  for(i in 1:N){
    theta_pred[i] = vg(psi[i], s, r, a, n);
  }
}
model {
  s ~ beta(5, 2) T[0, 1]; // Seems like a reasonable distribution for both s and r.
  r ~ beta(2, 5) T[0, 1]; 
  a ~ double_exponential(0, 0.1) T[0, ];
  n ~ double_exponential(1, 0.1) T[1, ];
  sigma ~ cauchy(0, 1) T[0, ]; // As I understand it, this is effectively a half-Cauchy, which is perfect for me.
  for(i in 1:N){
    theta[i] ~ normal(theta_pred[i], sigma) T[0, 1];
    //print(theta[i]);
  }
}
generated quantities {
  real RMSE;
  vector[N] log_lik;
  
  RMSE = RMSEFunc(N, theta_pred, theta);
  
  for(i in 1:N) {
    log_lik[i] = normal_lpdf(theta[i] | theta_pred[i], sigma);
  }
}

What I’d like to be able to do is have a condition somewhere such that s is always greater than r. I tried adding that condition to the parameters block, like:

parameters {
real <lower = r, upper = 1> s;
real <lower = 0, upper = s> r;
...
}

But this generated an error, as r had not yet been defined. I also don’t think I can have a dynamic boundary condition, like that.

If anyone has any ideas as to how to parameterize this problem, or a way to write the function that incorporates this condition, or sees other issues that I could improve upon, I’d be very grateful.

On a side note, even though I have a massive amount of divergence, the outcome of the above model is actually really good. I’m just worried because I’ve read that divergence could imply bias in the posteriors, and I want to make sure that 1) my results are as accurate as possible, and 2) that I haven’t done something extremely naive.

Cheers!


#2

Why is it

parameters {
  real<lower=r, upper = 1> s;
  real<lower=0, upper = 1> r;
}

which is circular. Should it instead be the legal syntax

parameters {
  real<lower=0, upper = 1> s;
  real<lower=0, upper = s> r;
}

?
?


#3

Your second block is how I’ve written it in the program (see the main Stan block in my post) that is currently working (though the deviance is quite high). I actually tried to write the *first* block, but as you point out, it’s ill formed (and not defined). I merely added the first block of code as a conceptual example of the kind of boundaries that I need to enforce.

The crux, is that: 1) I have really high deviances that I’d like to fix, and 2) I don’t know how to reparameterize the program, so I’m starting with the simplest thing I can think of, but even that has got me a bit stumped.

Thanks for looking over my post.

EDIT: Changed *second* to *first*, because that’s what I meant.


#4

Also, I should note that I am currently running a version of this model using the dense mass matrix, to better understand if that helps. I did run a version of that control (i.e., metric = “dense_e”) earlier, and found that it did get rid of the deviance problem, but I didn’t set my tree depth high enough, and it took forever to run, so I’ve been hesitant do it again without fixing some simpler issues. In short, I’ll report back what I find (maybe in another thread, though).


#5

Why not just declare them as an ordered vector.

ordered[2] rs;

rs[1] is r; rs[2] is s.


#6

That’s cheating.

r = inv_logit(r_raw);
s = (1-r) * inv_logit(s_raw) + r;

or

cumsum of a simplex[2]


#7

Ben’s second block is fine. And the way we recommend doing this if the cases are easily enumerated. Otherwise, generalizing @andre.pfeuffer’s suggestion, you can also use the the first K-1 dimensions of a cumulative sum of a simplex.

The original post seemed to imply a between 0 and 1 constraint, as well.


#8

Right; I missed that part. Still, could you do ordered + a softmax?


#9

You could do ordered plus softmax, plus cumulative sum and throw away the last element. But that’s not identified as softmax reduces a degree of freedom.

If you want a parameter vector \alpha constrained so that 0 \leq \alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_N \leq 1, the easiest way to do this as follows.

parameters {
  simplex[N + 1] alpha_simplex;
  ...
}  
transformed parameters {
  positive_ordered[N] alpha = cumulative_sum(alpha_simplex)[1:N];
  ...

alpha_simplex is uniform over simplexes, so alpha is uniform over sequences satisfying the constraint 0 <= alpha[1] <= alpha[2] <= ... <= alpha[N].


#10

UPDATE: Running the model with a dense mass matrix did not fix the problem. In fact, it generated some really weird posteriors. So it would seem that the diagonal matrix is still the best option.


#11

I totally forgot about ordered vectors! One problem, however, is that I cannot constrain these values to be within 1 and 0 - at least not when I define them. I’ve been running the model using this parameterization and it’s unveiled that if my my sigma in:

for(i in 1:N){
    theta[i] ~ normal(theta_pred[i], sigma) T[0, 1];
  }

drops below ~0.024 then I still get a bunch of divergent results. The same happens if I add a hyperparameter to estimate sigma (e.g., sigma ~ inv_chi_square(sigma_nu)). But if I enforce a lower sigma value above 0.024 (i.e., <lower = 0.024>, T[0.024, ]) then I have no more divergent behavior. This doesn’t seem optimal, because it suggests to me that something more nefarious is going on after warm-up, again implying that i should try some of your suggestions, to ensure my posteriors aren’t biased.

Thanks a bunch everyone! I really appreciate your comments.


#12

I’m not understanding this solution. What do r_raw and s_raw represent? Are those hyper parameters? If that’s the case, then I tried this implementation:

parameters {
  real r_raw;
  real s_raw;
  real <lower = 0.0> a;
  real <lower = 1.0> n;
  real <lower = 0.0> sigma_nu;
  real <lower = 0.0> sigma;
}
transformed parameters{
  real theta_pred[N];
  real <lower = 0, upper = 1> s;
  real <lower = 0, upper = 1> r;
  
  r = inv_logit(r_raw);
  s = (1-r) * inv_logit(s_raw) + r;
  for(i in 1:N){
    theta_pred[i] = vg(psi[i], s, r, a, n);
  }
}
model {
  s_raw ~ beta(2, 5) T[0.0, 1.0]; //Specific retention
  r_raw ~ beta(5, 2) T[0.0, 1.0]; //Porosity
  a ~ double_exponential(0.0, 0.1) T[0.0, ];
  n ~ double_exponential(1.0, 0.1) T[1.0, ];
  sigma ~ inv_chi_square(sigma_nu) T[0.0, ];
  for(i in 1:N){
    theta[i] ~ normal(theta_pred[i], sigma) T[0.0, 1.0];
  }
}

I get no transition errors, but now the parameters don’t make much sense now. Am I missing something?

Cheers!


#13

Sorry, last thing. Just to be clear, it’s not that r and s must add up to 1, it is that r and s are between 0 and 1 and r is less than s. Not sure if that was clear. More precisely:
0 <= r <= 1
0 <= s <= 1
r < s


#14

Then you can just declare as

parameters {
  real<lower=0, upper = 1> s;
  real<lower=0, upper = s> r;
}

#15

I didn’t think you could do dynamic boundaries like that! But re-examining pg. 38 of the manual (“Expressions as Bounds”) I can now see that I was wrong. Thanks a bunch for the tip!

After re-running, I’m still getting divergence about 80% of the time. This makes me think that the real problem is, in fact, the sigma value in the call, or at least how I’ve parameterized it:

  for(i in 1:N){
    theta[i] ~ normal(theta_pred[i], sigma) T[0.0, 1.0];
  }

Here’s what I’m dealing with at the moment:

The red triangles are the points of divergence (alpha = 0.3). Hopefully you can see that lower sigma values are associated with divergence. I’ve been trying all kinds of different distributions on sigma, and using hyperparameters in the hopes that NUTS would not longer explore the deviant parameter space, but I can’t get it to move. It doesn’t feel right to impose a lower boundary on sigma, other than 0, but I’m not really sure how else to deal with this, since I’d expect NUTS to move the posterior out of range of these divergent values.

Thoughts?


#16
functions {
//Below is the same van Genuchten model made in R.
  real vg(real psi, real s, real r, real a, real n) {
    real top;
    real bot1;
    real bot2;
    real out;

    top = s - r; // I think this might be the problem spot, as s must always be greater than r to be physically
// meaningful.
    bot1 = 1 + ((a * fabs(psi)) ^ n);
    bot2 = bot1 ^ (1 - (1 / n));
    out = r + (top / bot2);
    return out;
  }

//A root mean squared error function that I made up.
  real RMSEFunc(int N, vector yPred, vector yObs) {
    real L;
    real SUM[N];
    real out;

    L = N * 1.0; //Wasn't sure the best practice for converting int to real, but this worked.
    for(i in 1:N) {
      SUM[i] = (yPred[i] - yObs[i]) ^ 2;
    }
    out = sqrt(sum(SUM) / L);
    return out;
  }
}

data {
  int N;
  vector <lower = 0, upper = 1>[N] theta;
  real psi[N];
}

transformed data {
    vector[N] theta_logit = logit(theta);
}

parameters {
  real<lower=0, upper = 1> s;
  real<lower=0, upper = s> r;
  real <lower = 0> a; //Must be above 0.
  real <lower = 1> n; //Must be above 1.
  real <lower = 0> sigma;
}

transformed parameters {
  vector[N] theta_pred;
  for(i in 1:N){
    theta_pred[i] = vg(psi[i], s, r, a, n);
  }
}

model {
  a ~ normal(0, 0.1);
  n ~ normal(1, 1);
  sigma ~ normal(0, 1);
  theta_logit ~ normal(logit(theta_pred), sigma);
}

generated quantities {
  real RMSE;
  // TBD
  //vector[N] log_lik;

  RMSE = RMSEFunc(N, theta_pred, theta);
/*
  for(i in 1:N) {
    log_lik[i] = normal_lpdf(theta[i] | theta_pred[i], sigma);
  }*/
}

vanGenuchten.stan (1.5 KB)
van.R (1.3 KB)
Compile.R (148 Bytes)

SAMPLING FOR MODEL ‘vanGenuchten’ NOW (CHAIN 1).

extr<-extract(vgBayes)
mean(extr$RMSE)
[1] 0.01240366
rstan:::rstudio_stanc(“vanGenuchten.stan”)
vanGenuchten.stan is syntactically correct.
mean(extr$a)
[1] 0.206548
mean(extr$n)
[1] 1.521859
mean(extr$sigma)
[1] 0.07183029
mean(extr$s)
[1] 0.8011613
mean(extr$r)
[1] 0.1967349

Modified your program slightly. It runs with 0 divergence and can reproduce the parameters.


#17

Thanks for fixiing this. Was the main change constraints?

You can declare-define now, which I think makes programs much more readable by bringing the definition next to the declaration:

real top = s - r;

The use of fabs(psi) is usually bad news if psi is a parameter because it introduces multimodality between psi and -psi. Here, it looks like data.


#18

I changed this to work on the logit-scale. theta_pred and theta are on [0;1] scale.
The assumption theta - theta_pred ~ N(0, sigma)T[0, 1] doesn’t hold.

I further went to change the real[]'s to vector’s.

becomes

For parameters s, r I sticked with Ben’s suggestion. The beta distributions are not needed.
Parameters a, n I call the desperate voodoo part in the original code.
It becomes this:

sigma is small 0.07, so I not decided to go for half-cauchy(0, 1); but half-normal.
Its a classic physically-based model, thus I expect these parameters well-behaved.

As you saw already, I left some fine tuning to the author of the original code.


#19

Thank you! This solution is accurate and wicked fast!

I’m still trying to unpack what you’ve done. Apologies if this is getting cumbersome. I’m just very thirsty to learn and this thread has become a fountain of knowledge.

Logit transformation

The logit transform is kind of blowing my mind, but it seems to be the most important element of removing the divergence I was observing, so I’m just putting down some thoughts.

I’ve not tried working in transformed response space before (I’ve only ever worked with transformed independent variables), so this is a novel realm for me and I don’t entirely understand it, yet. Looking at a graphical comparison of theta versus logit(theta) below, it appears that the relationship is essentially linear, given the domain of theta:

LogitTransformOfTheta

In an effort to understand the utility of this transformation I replaced some of the code with a simple linear transform that multiplied theta by 10, effectively expanding the response space.

Became:

transformed data {
  vector[N] theta_logit = 10 * theta;
}

And

Became

model {
…
theta_logit ~ normal(10 * theta_pred, sigma);
}

Using iter = 10000, warmup = 5000, thin = 2, and chains = 4, I got the following results:

10x transform:

extract(vgBayes, pars = c("s", "r", "a", "n", "sigma", "RMSE")) %>%
purrr::map_df(mean)
      s     r     a     n sigma   RMSE
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 0.802 0.200 0.210  1.54 0.175 0.0174

Logit transform:

      s     r     a     n sigma   RMSE
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 0.802 0.201 0.203  1.56 0.100 0.0176

It appears that the logit transform is more accurate than the 10x transform, may be because it varies across zero?

Does anyone have any thoughts on why this is? Or perhaps a favorite book, website, YouTube series, or article that you’d recommend, but is still comprehensible for someone that has only taken up to linear algebra and integral calculus?

Parameters and desperate voodoo

Uninformative priors are okay?

That’s really interesting that you can get away with not explicitly estimating the r and s parameters. From the manual, this suggests an implicit uniform prior of 0 and 2, which I guess makes sense, since it is in the ball park of the expected result. I’m guessing that’s not a problem, because, as you say, the function is relative well behaved (now).

I played around with removing a and n from the model block, and I got slightly worse results, but not by much, suggesting a moderately informative prior is still useful – probably because a and n are highly correlated, so adding priors helps uncorrelated them a bit.

No need to truncate?

I also find it interesting that you can get away with no truncation. I’m guessing this is because the truncation would only come into play if the posterior creeped into the edge of the truncation values?

In any case, I really appreciate the effort you put into this @andre.pfeuffer!


#20

The following contains some informal thoughts… just to get the picture, not a 100% thorough math
master thesis.

We are not talking about theta, or theta_logit, because this is handled by your model,
which I assume is correct. We are talking about the errors and its distribution.

If we write something, that means

(theta_logit - 10 * theta_pred) / sigma + err = 0

with err ~ normal(0, 1);

The normalized difference is unit-normal distributed.
after fitting must check our errors, if they have an bias. And if they are follow a normal
distribution. I did not do. I corrected which I thought are the most crucial problems.

Actually we want to minimize the MSE, which is BLUE (best linear unbiased estimator).
So we have to check our errors, that they are not trending (zero mean), and furthermore we
the variance is same all over every err.

With the logit transform I just moved the parameter to the unconstrained space. I’m not sure if
the logit is the best transform for that. I did it because the parameter is in [0,1] range.
Suppose you would calculate the % deviation and you would real parameter is 0.7 and your model says
0.8. Then you could do something like:

0.8/0.7-1
[1] 0.1428571

but this is not symmetric:

0.7/0.8-1
[1] -0.125

but if you take the log odds your measure is symmetric:

log(0.7/0.8)
[1] -0.1335314
log(0.8/0.7)
[1] 0.1335314

There are more possible transforms:
https://en.wikipedia.org/wiki/Power_transform

You could try eg. if instead of the logit transform the pure log transform is a better choice.