 # Multi-level truncated regression (with random truncation points)

Hello,

I have a question about how best to model truncated multilevel data. What I have is strong ground-motion data, recorded at stations with seismometers, and I want to model them as a function of variables like magnitude of the earthquake, distance to the station, and some other variables. What we typically use as one target variable is peak ground acceleration (PGA), which is lognormally distributed.
We have a multilevel structure (multiple recordings from one earthquake, multiple recrodings at one station), and typically model that with an event-specific constant and a station-specific constant.
So far, it is easy to build a simple model for the ground-motion data. One problem is that some stations only record data if they are triggered, i.e. if the observed PGA is larger than some threshold (sometimes this threshold changes over time for some stations, and these changes are not always reported, but we’ll ignore that for now), so the data is truncated.

A simplified model looks like this (in reality, the functional form will be different, and priors might be different). I could also make the truncation threshold a vector, to account for different trigger levels for different stations.


data {
int<lower=1> N;  // overall number of records
int<lower=1> NEQ;   // number of earthquakes
int<lower=1> NSTAT; // number of stations

real Ytrunc; // truncation threshold

vector[N] X;       // moment magnitude
vector[N] Y;       // log pga

int<lower=1,upper=NEQ> eq[N];       // earthquake id
int<lower=1,upper=NSTAT> stat[N];     // station id
}

parameters {
real<lower=0> sigma_rec;
real<lower=0> sigma_eq;
real<lower=0> sigma_stat;

real c1;
real c2;

vector[NEQ] eqterm;
vector[NSTAT] statterm;
}

model {
real mu_rec[N];

sigma_rec ~ cauchy(0,0.5);
sigma_eq ~ cauchy(0,0.5);
sigma_stat ~ cauchy(0,0.5);

c1 ~ normal(0,1);
c2 ~ normal(0,1);

eqterm ~ normal(0,sigma_eq);
statterm ~ normal(0,sigma_stat);

for(i in 1:N) {
mu_rec[i] = c1 + c2 * X[i] + eqterm[eq[i]] + statterm[stat[i]];
Y[i] ~ normal(mu_rec[i],sigma_rec) T[Ytrunc,];
}
}



The above model works with simulated data, and also gives reasonable results on real data. One question I have is about vectorization. Truncation does not allow vectorization, but I can rewrite the for loop as

  mu_rec = c1 + c2 * X + eqterm[eq] + statterm[stat];
target += normal_lpdf(Y | mu_rec,sigma_rec)
- normal_lccdf(Ytrunc | mu_rec,sigma_rec);


(edited to account for comment by nhuurre)
which runs much faster, and gives me the same results. Is the only difference that in the vectorized model, no check is performed whether the truncation level is actually lower than the data? Sure, this can be a problem, but that just means that I need to be careful when preparing the data, or am I missing something?

My main question (sorry for the long introduction) is about the truncation, which in reality is not as fixed as it may seem. A seismometer typically measures ground motion in three directions (two horizontal, one vertical), so we have three PGA values. If any of those PGA values is larger than the trigger level, the seismogram is recorded on all three components. This means that one can have a PGA on one component that is smaller than the truncation threshold if one of the other components is larger. In the ground-motion world, we typically model something like the geometric mean of the two horizontal components, and this means that even with a fixed trigger level, there will be a few records where the target variable is below the trigger level, which creates a problem for a model with fixed truncation theshold. One way to deal with this is to discard these records, but that is not really a good idea.
A better way this can be modeled is with a random truncation level, which would be different for each record. If one assumes that the truncation level has a normal distribution with mean \mu_T and standard deviation \sigma_T, then the PDF for the data becomes
f_T(y) = \frac{\phi(\frac{y - \mu}{\sigma}) \Phi(\frac{x - \mu_T}{\sigma_T})}{\int \phi(\frac{x - \mu}{\sigma}) \Phi(\frac{x - \mu_T}{\sigma_T}) dx}
where \phi() is the PDF of a standard normal function, and \Phi is the CDF, and \mu is the same as mu_rec in the Stan code.
The plot shows an example with simulated data, the histogram of the truncated data, and the PDF according to the above equation. And this is where I am stuck, because I’m not sure how I can put this model best into Stan code. I tried calculating the integral using integrate_1d, but that was really slow, and I aborted the run. I also tried to estimate a different truncation threshold for each record, but that seems like not a good solution, especially when it comes to larger data sets.

Does anyone have an idea how best to model this kind of data, or how to implement the randomly truncated model in Stan?

That vectorized truncation doesn’t seem right. Surely it should be

target += normal_lpdf(Y | mu_rec,sigma_rec)
- normal_lccdf(Ytrunc | mu_rec,sigma_rec);


It is normal_lccdf, not normal_lcdf, that gives the probability of observation, and it should be divided out, not multiplied.

Is your data truncated or censored? It sounds like you’d still know which stations were around to measure a particular earthquake, even if the quake wasn’t strong enough to trigger a recording. That’s useful information too, it tells you PGA was below the threshold.

You are right about the truncation, and the way you wrote it is actually the way I had implemented it. I had made an error in transcribing the model.

I had thought about censoring, because as you said an unobserved earthquake at a station provides some information about the maximum PGA. But as I see it, that does not help me with records where the observed PGA (on one component) is lower thatn the trigger level.

The general way to write a truncated distribution is

y ~ distribution(params);
target += -log(probability_of_detection);


The probability of at least one component being above the threshold is the complement of the probability that all are below the threshold.
With three components something like this should work

y[i,1] ~ normal(mu_rec[i,1], sigma_rec);
y[i,2] ~ normal(mu_rec[i,2], sigma_rec);
y[i,3] ~ normal(mu_rec[i,3], sigma_rec);
target += -log1m_exp(
normal_lcdf(Ytrunc| mu_rec[i,1], sigma_rec)
+ normal_lcdf(Ytrunc| mu_rec[i,2], sigma_rec)
+ normal_lcdf(Ytrunc| mu_rec[i,3], sigma_rec)
);

2 Likes

Thanks a lot for your helpe with this topic. What you suggest should work for PGA, but there are also other ground-motion parameters that one typically wants to model, and for those I think one needs a randomly truncate model. I didn’t go into detail in the original post because I thought it was long enough, and solving the problem once should be enough to make it work for those cases as well.
Triggering an instrument is based on ground acceleration, so for PGA we know the truncation threshold. However, if we want to model other things like peak ground displacement (PGA), peak ground velocity (PGV), or the Fourier amplitude spectrum at a particular frequency, or the response spectrum at a particular period (some of these might be important to know to assess the vulnerability of certain building types), then we don’t know the threshold. These ground-motion parameters are correlated with PGA, so there isn’t an exact value of the truncation threshold anymore. I can calculate the mean and standard deviation of the truncation level of e.g. PGV given the correlation with PGA, but then I am still left with a randomly truncated model.
At least that is what I think, but it might be that I thought too much about this particular model that I’m not seeing some other obvious way to model build the model.

I think the idea is still the same. Predict the probability of censoring conditional on the model parameters and then do target += -log1m(...).
If you have both PGA and PGV from a station and PGA is what selects which events are recorded then PGA truncation handles all selection effects; PGV does not need separate truncation.
Conceptually,

pga ~ normal(mu_pga, sigma);
target += -log1m(prob_pga_below_threshold(mu_pga));
pgv ~ normal(mu_pgv, sigma);


If PGA is not available, well, maybe predict it anyway (for the second line) and just omit the first line.