Hello,
I have a question about how best to model truncated multilevel data. What I have is strong ground-motion data, recorded at stations with seismometers, and I want to model them as a function of variables like magnitude of the earthquake, distance to the station, and some other variables. What we typically use as one target variable is peak ground acceleration (PGA), which is lognormally distributed.
We have a multilevel structure (multiple recordings from one earthquake, multiple recrodings at one station), and typically model that with an event-specific constant and a station-specific constant.
So far, it is easy to build a simple model for the ground-motion data. One problem is that some stations only record data if they are triggered, i.e. if the observed PGA is larger than some threshold (sometimes this threshold changes over time for some stations, and these changes are not always reported, but we’ll ignore that for now), so the data is truncated.
A simplified model looks like this (in reality, the functional form will be different, and priors might be different). I could also make the truncation threshold a vector, to account for different trigger levels for different stations.
data {
int<lower=1> N; // overall number of records
int<lower=1> NEQ; // number of earthquakes
int<lower=1> NSTAT; // number of stations
real Ytrunc; // truncation threshold
vector[N] X; // moment magnitude
vector[N] Y; // log pga
int<lower=1,upper=NEQ> eq[N]; // earthquake id
int<lower=1,upper=NSTAT> stat[N]; // station id
}
parameters {
real<lower=0> sigma_rec;
real<lower=0> sigma_eq;
real<lower=0> sigma_stat;
real c1;
real c2;
vector[NEQ] eqterm;
vector[NSTAT] statterm;
}
model {
real mu_rec[N];
sigma_rec ~ cauchy(0,0.5);
sigma_eq ~ cauchy(0,0.5);
sigma_stat ~ cauchy(0,0.5);
c1 ~ normal(0,1);
c2 ~ normal(0,1);
eqterm ~ normal(0,sigma_eq);
statterm ~ normal(0,sigma_stat);
for(i in 1:N) {
mu_rec[i] = c1 + c2 * X[i] + eqterm[eq[i]] + statterm[stat[i]];
Y[i] ~ normal(mu_rec[i],sigma_rec) T[Ytrunc,];
}
}
The above model works with simulated data, and also gives reasonable results on real data. One question I have is about vectorization. Truncation does not allow vectorization, but I can rewrite the for loop as
mu_rec = c1 + c2 * X + eqterm[eq] + statterm[stat];
target += normal_lpdf(Y | mu_rec,sigma_rec)
- normal_lccdf(Ytrunc | mu_rec,sigma_rec);
(edited to account for comment by nhuurre)
which runs much faster, and gives me the same results. Is the only difference that in the vectorized model, no check is performed whether the truncation level is actually lower than the data? Sure, this can be a problem, but that just means that I need to be careful when preparing the data, or am I missing something?
My main question (sorry for the long introduction) is about the truncation, which in reality is not as fixed as it may seem. A seismometer typically measures ground motion in three directions (two horizontal, one vertical), so we have three PGA values. If any of those PGA values is larger than the trigger level, the seismogram is recorded on all three components. This means that one can have a PGA on one component that is smaller than the truncation threshold if one of the other components is larger. In the ground-motion world, we typically model something like the geometric mean of the two horizontal components, and this means that even with a fixed trigger level, there will be a few records where the target variable is below the trigger level, which creates a problem for a model with fixed truncation theshold. One way to deal with this is to discard these records, but that is not really a good idea.
A better way this can be modeled is with a random truncation level, which would be different for each record. If one assumes that the truncation level has a normal distribution with mean \mu_T and standard deviation \sigma_T, then the PDF for the data becomes
f_T(y) = \frac{\phi(\frac{y - \mu}{\sigma}) \Phi(\frac{x - \mu_T}{\sigma_T})}{\int \phi(\frac{x - \mu}{\sigma}) \Phi(\frac{x - \mu_T}{\sigma_T}) dx}
where \phi() is the PDF of a standard normal function, and \Phi is the CDF, and \mu is the same as mu_rec
in the Stan code.
The plot shows an example with simulated data, the histogram of the truncated data, and the PDF according to the above equation.
And this is where I am stuck, because I’m not sure how I can put this model best into Stan code. I tried calculating the integral using integrate_1d
, but that was really slow, and I aborted the run. I also tried to estimate a different truncation threshold for each record, but that seems like not a good solution, especially when it comes to larger data sets.
Does anyone have an idea how best to model this kind of data, or how to implement the randomly truncated model in Stan?