Modeling triangular measurement error

Hi,

Ash trees lose canopy because larvae eat their tissue. One way to determine how close a tree is to dying is to measure the % of remaining canopy. However, the measurement error is very small when canopy is 0% or 100%, and highest when canopy is 50%. Does anybody have a suggestion for how to implement such an error model in Stan? Intuitively the measurement error is something like a + b*(50 - |canopy - 50|).

The related question is the scaling of the observed values (canopy). Physically it is in the interval [0, 100]. Does it make sense to go with the nonlinear scaling atanh(canopy/50 - 1) and then standardize by 2 sd to ensure that predictions are physically feasible? Or are there more elegant approaches?
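For concreteness, a quick numerical sketch of that atanh scaling (the function names are mine; canopy is assumed to be in percent):

```python
import numpy as np

# Sketch of the proposed nonlinear scaling: canopy/50 - 1 first maps
# (0, 100) to (-1, 1), then atanh maps (-1, 1) to (-inf, inf).
# Note that exact 0 or 100 still maps to +/- infinity, the same
# boundary issue as with the logit.
def to_unbounded(canopy):
    return np.arctanh(canopy / 50.0 - 1.0)

def to_canopy(z):
    return 50.0 * (np.tanh(z) + 1.0)

canopy = np.array([10.0, 50.0, 90.0])
z = to_unbounded(canopy)     # finite for interior values; 0 at canopy = 50
back = to_canopy(z)          # round-trips to the original values
```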

Thanks for any advice.


So you have some error \epsilon_1 which you always incur regardless of the canopy percentage (x), i.e. the error at 0% or 100%. The maximum error is \epsilon_2, which occurs at 50%. Let x \sim \text{Beta}(\alpha,\beta). You could model the error like this: N(\epsilon_1,\sigma_1) + P(x \mid \theta)\,N(\epsilon_2,\sigma_2), where \theta=(\alpha,\beta). You can then use the \alpha,\beta parameters to determine whether you want the error term to be symmetric, asymmetric, tight, diffuse, etc.
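A toy simulation of that error model (all parameter values below are made up for illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the suggested error model: a baseline error component plus
# a second component weighted by the Beta density at the canopy
# fraction x, so the extra error peaks mid-range.
def beta_pdf(x, a, b):
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * x ** (a - 1) * (1 - x) ** (b - 1))

eps1, sigma1 = 0.0, 0.01   # baseline error (at 0% / 100% canopy)
eps2, sigma2 = 0.0, 0.05   # extra error near 50% canopy
a, b = 2.0, 2.0            # symmetric Beta: weight peaks at x = 0.5

def measurement_error(x):
    w = beta_pdf(x, a, b)  # 0 at the boundaries, 1.5 at x = 0.5
    return rng.normal(eps1, sigma1) + w * rng.normal(eps2, sigma2)
```

Skewing a and b makes the error peak off-centre, which is what the asymmetric case above refers to.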

Since canopy, rescaled by 100, is in the interval (0,1), you can use the logit transform to map it to (-\infty,\infty) if you need to.

Thanks a lot. I am not very happy when predictions go outside the physically allowable range, so it would be great to use a logit transform on y. Can I still use the same beta-distribution-based error model? It was valid on the original scale of y; will it be valid for the transformed y? Or should I rather standardize y and live with negative/100+ predictions?

I think so, yes. You have two functions, logit and inv_logit. Whatever scale the value is on, you can convert it to the other.

Try writing out your Stan model and I’ll help you work it out on the thread.


Thanks for your help.

I am a bit confused. If I use the logit transformation and my data has 0 or 1, then after transformation it will become infinite. It looks like I may be missing something.

With respect to error model:

data {
  int<lower=1> N;
  int<lower=1> Ncol;
  vector[N] y;           // canopy in [0, 100]
  matrix[N, Ncol] x;     // design matrix
}
transformed data {
  vector[N] y_tr = y / 100;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0, upper=1> ksi;
  vector[2] mu;
  vector<lower=0>[2] sigma;
  vector<lower=0>[2] err;  // constrained positive so the scale below is valid
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 10);
  ksi ~ beta(alpha, beta);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  err ~ normal(mu, sigma);
  y_tr ~ normal(f(x), err[1] + ksi * err[2]);
}

where f is the model for canopy

Did you mean this?

The logit issue comes up a lot and it’s typical to use values very close to 0 and 1 instead of zero and one. E.g. 0.00001 and 0.99999.
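A minimal sketch of that workaround (the epsilon value is arbitrary, matching the example above):

```python
import numpy as np

# Nudge exact 0s and 1s inward by a small epsilon before applying the
# logit, so every transformed observation is finite.
def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 1e-5
y = np.array([0.0, 0.3, 1.0])
y_clamped = np.clip(y, eps, 1.0 - eps)
z = logit(y_clamped)          # finite for every observation
```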

I think error typically has a zero mean; otherwise it’s bias rather than error. I think you could do something like this.

data {
    int<lower=1> N;
    int<lower=1> Ncol;
    vector[N] y;           // canopy in [0, 100]
    matrix[N, Ncol] x;     // design matrix
}
transformed data {
    vector[N] y_tr = y / 100;
}
parameters {
    real<lower=0> d;
    real<lower=0> c;
    real<lower=0> a;
    real<lower=0> b;
}
model {
    for (n in 1:N)
        y_tr[n] ~ normal(f(x[n]), c + d * exp(beta_lpdf(y_tr[n] | a, b)));
}

c is the minimal standard deviation and d sets the scale of the extra, canopy-dependent part. The beta density has support on (0,1), so here it serves to define how y_tr relates to the error scale, i.e. what fraction of the extra scale d to apply for a given % canopy value.
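The shape of that heteroscedastic scale can be checked with a short sketch (a, b, c, d are placeholder values, and beta_pdf stands in for Stan's beta density):

```python
import math

# The beta density is zero at the boundaries for a, b > 1 and largest
# mid-range, so the scale c + d * beta_pdf(y) is smallest at 0% / 100%
# canopy and largest at 50%.  Note the density is not capped at 1, so
# d scales the bump rather than being a strict maximum.
def beta_pdf(x, a, b):
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * x ** (a - 1) * (1 - x) ** (b - 1))

a, b = 2.0, 2.0   # symmetric: widest error at 50% canopy
c, d = 0.01, 0.05

def sd(y_tr):
    return c + d * beta_pdf(y_tr, a, b)
# sd(0.5) is the largest; near 0 or 1 the scale collapses to the floor c.
```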

Please correct me if I’m wrong, but I think you’re trying to perform a linear regression with heteroscedastic error. You know that the error is directly related to % canopy cover (minimal at 0% and 100%, maximal at 50%) but you’re not sure how. You’d like to create a linear model that takes this into account.

I suspect that at this point @maxbiostat may be able to pitch in and give you a more authoritative evaluation of prospects.

Thanks @emiruz for tagging me. @linas, is it possible you need a variant of beta regression? In that case, this and this could help you get started. Let’s iterate until we have something that works for you.

Good luck!

Thanks for the help.

I really don’t know. The data I have spans 5 years. YT_year designates the true thin (1 - canopy; instead of percent I moved to [0, 1]). The current model is attached, but predictions are quite poor. I show the median of the posterior predictive for each tree vs the actual observation. Since I don’t know the functional form, I am using a Bayesian NN. I have to use MPI since otherwise sampling is too slow. However, after I corrected the error model I got the message

original model cliffMPI.stan (4.2 KB)
median of posterior predictive vs observed cliffMPI_cliffMPI_1_0_7_1_100_1predactual.pdf (35.6 KB)
new error model cliffMPIbeta.stan (4.5 KB)

Think I’ll need some more information. What’s the goal here? Do you want to predict how much canopy a tree has left? Do you have any covariates? I understand you want your error model to reflect the fact that measurements closer to 50 are more uncertain. It might just be that Beta regression will do a decent enough job so you don’t need to bother with anything else. Can you share a sample of data (can be simulated data analogous to the real data you want to analyse)?

Yes, the goal is to predict canopy for untreated trees after X years of a program in which some trees are treated with insecticide every few years. Only some percentage of trees is treated; however, the untreated trees in the vicinity benefit from the treated trees, so I need a model for untreated trees. The covariates are tree height, years since the start of treatment, density of trees in the nearby area, distance to the nearest tree, and species (white, green, blue ash). Since I don’t know the true value of thin at the start of treatment, I have it as a parameter YT0, while the observed value at the start of treatment is Y0. I am fine with any approach as long as the predictions are good. Attached is the data file; thin is between 0 and 1.
cliffMPI.data.r (121.7 KB)

Please note that tall trees are treated with a powerful insecticide while small trees are treated with a weak insecticide. So there are 3 densities: the density of untreated trees within some radius of a given untreated tree, the density of small trees in the same area, and the density of tall trees in the same area. There are also 3 nearest distances to the same categories of trees. Since we have tree species, they are one-hot encoded as two binary variables. So in total I have 10 covariates.
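The species encoding described above can be sketched like this (the data frame and values are invented for illustration):

```python
import pandas as pd

# Three species become two binary columns; one level serves as the
# reference category, which is why 3 species need only 2 indicators.
trees = pd.DataFrame({"species": ["white", "green", "blue", "white"]})
dummies = pd.get_dummies(trees["species"], drop_first=True)
# 'blue' is dropped (first alphabetically), leaving 'green' and 'white'.
```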

I found a similar problem in “Beta Regression for Time Series Analysis of Bounded Data, with Application to Canada Google® Flu Trends”. In my case I think I can stick to beta regression with AR(1).
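A toy simulation of that idea, in the spirit of the paper mentioned above: the AR(1) term acts on the logit scale of the previous observation, and the response stays in (0, 1). All parameter values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

intercept, rho, phi = -1.0, 0.6, 50.0   # phi acts as a precision
T = 20
y = np.empty(T)
y[0] = 0.2
for t in range(1, T):
    mu = inv_logit(intercept + rho * logit(y[t - 1]))
    # mean/precision parameterisation of the Beta distribution:
    # mean mu, shape parameters mu*phi and (1-mu)*phi
    y[t] = rng.beta(mu * phi, (1.0 - mu) * phi)
# Every simulated value stays strictly inside (0, 1).
```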

You can use brms for this, I think. See this for how to specify a zero-one-inflated beta regression – which you’ll need because you have a bunch of zeroes and ones.

@emiruz, @paul.buerkner and myself can help you more after you’ve specified your model(s).

Edit: this may also be of interest.


Thanks for the help.

I am struggling with how to represent the time-series (longitudinal) data: some trees may have data for years 0-3 but not for years 4 and 5. As I read it, brms handles generalized linear models well, but I am not sure how to incorporate the time-series (longitudinal) aspect.

I’m starting to grasp the problem. Can you show some “time series” plots so we can have an idea of what kind of behaviour is present in the data? Does the variable (canopy density?) increase over time as well as decrease? Does it oscillate? This will tell us whether an AR(1) model is appropriate. This is an interesting problem!


Yes, canopy changes over time. For some trees it decreases, for some it increases, and for some it oscillates, since the determination of canopy is done by visual observation. Here is data for 20 trees.
data

I think the following may work:


where mu is the mean of the beta regression.


@linas did you get this to work? If not, maybe start a new thread with a more specific question; this one is getting long and it is hard for people to jump in.

I am still working on it. It seems beta regression might be an option; the question is how to integrate beta regression with the time-series aspect. Another problem is how to represent the ragged dataset.

I wonder if you can’t just concatenate all the data.frames and add an “id” column with the year of collection [that’s the ragging variable, right?].
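A sketch of that reshaping (column names and values are made up): stack the per-year tables into one long data frame with a year id column, so ragged coverage is just rows that are present or absent per year.

```python
import pandas as pd

# Two example years; tree 2 has no observation in year 1.
year0 = pd.DataFrame({"tree": [1, 2, 3], "thin": [0.1, 0.2, 0.3]})
year1 = pd.DataFrame({"tree": [1, 3], "thin": [0.15, 0.35]})

# Concatenate into long format, tagging each row with its year.
long = pd.concat(
    [df.assign(year=i) for i, df in enumerate([year0, year1])],
    ignore_index=True,
)
```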
