Estimating mean length-at-birth from an unbalanced sample of unborn and born lengths

For a species of animal, I have a dataset of some lengths of prenatal (unborn) individuals and some lengths of postnatal (born) individuals, and I would like to estimate the mean length-at-birth (and the variance among individuals, if possible).

It is much easier to get lengths from postnatal individuals, so the dataset is very unbalanced, with 31 prenatal lengths and >1000 postnatal lengths. Thus, I’m pretty sure that plain old logistic regression is giving me biased estimates of the average length-at-birth. I’ve played with weighting the cases but I have reason to believe that this is causing bias in the other direction.

I’ve come up with a generative model with some informative priors:

\begin{aligned} \ y_{i} &= \begin{cases} 1 & \text{if } x_i \geq \theta_i \\ 0 & \text{otherwise} \end{cases} \\ \theta_i &= \mu + \delta_i \\ \mu &\sim \mathrm{N}( 170,2 ) \\ \delta_i &\sim \mathrm{N}(0, \sigma) \\ \sigma &\sim \mathrm{N^+}( 10, 10) \\ \end{aligned}

where y_i takes the value 1 for postnatal and 0 for prenatal; x_i is the length of an individual when it was measured; \theta_i is the length at which an individual is born; \delta_i is the difference of an individual’s birth length from the mean birth length; \mu is the average age at birth; \sigma is the standard deviation of individual birth lengths.

This seems to be doing what I want it to do. Individuals grow and then are born at some length, and these lengths differ somewhat among individuals.

get_xy <- function() {
  sigma = abs(rnorm(1,10,10))
  delta = rnorm(1,0,sigma)
  mu = rnorm(1,170,2)
  theta = mu + delta
  length = 100:250
  born = length >= theta
  return(tibble(length,born))
}

plot(get_xy(), type = "l")
replicate(100, lines(get_xy()))

image

The data are structured as follows:

Rows: 1,306
Columns: 2
$ Y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ X <dbl> 31.3, 42.5, 47.0, 69.0, 106.0, 168.0, 173.0, 176.0, 176.0, 165~

The trouble is I’m having trouble figuring out how to fit this model to data. Here’s what I have so far:

data{
    int n; // sample size
    int<lower=0,upper=1> Y[n]; // Born or unborn
    vector[n] X; // lengths
}
parameters{
    vector[n] delta ;
    real<lower=0> sigma;
    real mu;
}
model{
    real theta[n] ;  // length at birth (LAB)
    delta ~ normal( 0 , sigma ) ; // difference from mean 
    mu ~ normal( 170 , 2 ) ; mean LAB
    sigma ~ normal( 10 , 10 ) ; sd of LABs
    theta = mu + delta ;
    
    if ( Y[i] ) ... X[i] > theta[i] ... ; 
    
    }
}

Any assistance would be gratefully received. I’m a first-time poster, so I apologise if this is a bit rough.

Should the first ‘prenatal’ be ‘postnatal’? I guess it is probably easier to get measurements after they are born. Some other questions:

  • Do you have any information on the time (relative to birth?) at which these measurements are taken? It seems a little hard to estimate the length at birth when you don’t know if the postnatal observations were taken at 1 hour or 2 years after birth. In fact I’m not sure it is possible?
  • Logistic regression would have y_{i} as a probabilistic function of x_i, not a deterministic one (the probability of being born would increase as the measured length increases). I can’t see the likelihood contribution that x_{i} and y_{i} make at the moment:
    model {
      .
      .
      .
      for (ii in 1 : n) {
        if (X[i] >= theta[i]) {
            // not obvious what goes here?
          } else {
            // nor here?
          }
      }
    }
    
    This might also introduce a discontinuous gradient with respect to \delta_{i} and \mu? It would depend what the likelihood is.
  • If you had p_{i} = \mu + \delta_{i} + x_{i}\theta with the likelihood being (p_{i})^{y_{i}}(1 - p_{i})^{1 - y_{i}} you could then ask the inverse regression question: ‘what is the value of x_{i} that gives me a p_{i} (chance of being born) of 0.5?’ This might be your best guess at ‘how long was this individual when they were born’?

Hopefully some of these questions are helpful? I’m not sure I see any way of answering the question without some form of information about the time of measurement. Even if all the pre/post-natal observations are taken at the same time, it seems like you need to know when that is to estimate length at birth?

Thank you for your response, @hhau

Yes, sorry for that. I’ve changed this in the original post.

Yes I was wondering that too, but I don’t think we have this sort of information.

Yes, I think this is the crux of the issue: I haven’t expressed a likelihood for either X_i or Y_i. In my generative model, the binary Y_i simply tells us whether \theta_i is less than or greater than X_i. This post seems to get at this idea.

What I have described may boil down to a probit regression model. My main concern at this stage is with finding a way to balance the influence of the two samples. For any given length, the born cases are overrepresented and the unborn cases are underrepresented, so I don’t feel comfortable estimating a probability on sheer relative numbers of cases. It causes a pretty clear downward bias in the estimated mean length-at-birth.

The traditional, non-Bayesian approach to addressing imbalance is to weight the cases by the inverse sample sizes. I tried this and it seemed to cause bias in the opposite direction. When I square-rooted the weights, it seemed about right but this seemed a little arbitrary.

Some other discussions on this here and here.

Perhaps what I really want is to find is where the upper tail of the density of unborn lengths crosses the lower tail of the density of born lengths?