Normal distribution parameter estimation with uncertainty

fitting-issues

#1

Hi,

I have a problem with a model which is reasonably complicated. However, I’ve determined that the underlying problem arises in a very simple situation. Say I have data which I create simply by sampling from a normal with mean/sd of mu and sigma. Then I “measure” each variate with some (normal) uncertainty, say u, so that my “observed” sample is

X ~ N(mu,sigma) + N(0,u)
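Since the sum of two independent normals is itself normal, the constant-u case simply gives X ~ N(mu, sqrt(sigma^2 + u^2)). A quick numpy sanity check of that (the numbers here are illustrative, not the ones used later in the thread):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, u = 4.0, 2.0, 0.5  # illustrative values

# With constant u, the convolution of the two normals is again normal:
# X ~ N(mu, sqrt(sigma**2 + u**2))
x = rng.normal(mu, sigma, size=100_000) + rng.normal(0.0, u, size=100_000)
print(x.mean(), x.std())  # close to 4 and sqrt(4 + 0.25) ≈ 2.06
```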

To generate the posterior distributions of mu and sigma, I use a simple hierarchical model as follows:

data {
    int<lower=0> N;     // Number of samples
    vector[N] x_meas;   // Measured values
    vector<lower=0>[N] uncertainty; // Uncertainty in values
}

parameters {
    real<lower=0,upper=8> mu;
    real<lower=0> sigma;
    vector[N] xtrue;
}

model {
    xtrue ~ normal(mu,sigma);
    x_meas ~ normal(xtrue,uncertainty);
}

This works perfectly fine if u is a single constant. However, let’s say u is dependent on the intrinsic value, so for instance in the simplest case, we could say

u = u_0 for x < mu,
u = u_1 for x > mu

I would expect the same model to work, since it incorporates individual values for the uncertainty. However, it doesn’t seem to. Using mu=4, sigma=2, u_0 = 2 and u_1=0.2 for a sample of 4500, I get the following:

The various red lines/regions represent the pdf of the estimated hyperparameters. Clearly, they gravitate strongly towards a Gaussianized version of the observed distribution rather than the underlying one. Furthermore, the likelihoods of the steps in the chain are higher than the likelihood obtained by re-inputting the actual underlying distribution.

So, is this problem fundamentally ill-posed, unsolvable, or am I just doing something stupid?


#2

Is this really unexpected? Looks right for the model to me. I would check by doing repeat simulations and checking calibration of the estimates.
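A minimal sketch of such a calibration check, using a conjugate normal model with known noise sd as a stand-in so the posterior is closed-form and no Stan run is needed (all names and prior choices below are illustrative, not from the model above): if the fit is calibrated, the posterior CDF evaluated at the true parameter is uniform over repeated simulations.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Stand-in conjugate model: mu ~ N(0, tau), y_i ~ N(mu, s) with s known
tau, s, n, n_sims = 2.0, 1.0, 50, 500

ranks = []
for _ in range(n_sims):
    mu_true = rng.normal(0.0, tau)      # 1. draw the parameter from its prior
    y = rng.normal(mu_true, s, size=n)  # 2. draw data given that parameter
    # Closed-form conjugate posterior for mu
    post_var = 1.0 / (1.0 / tau**2 + n / s**2)
    post_mean = post_var * y.sum() / s**2
    # Posterior CDF at the true value; ~Uniform(0,1) if calibrated
    z = (mu_true - post_mean) / math.sqrt(2.0 * post_var)
    ranks.append(0.5 * (1.0 + math.erf(z)))

# Roughly 50% of true values should land in the central 50% interval
coverage_50 = np.mean([0.25 < r < 0.75 for r in ranks])
```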


#3

Could you clarify what things are in the plot? This might help.

Intrinsic is the distribution of some random variable that isn’t in the model.

Uncertain is generated as a function of intrinsic (and is the data ‘x_meas’ being passed into the model? Not uncertainty?).

The last 4 estimated things in the plot are based on posterior samples of mu (the mean?)?

The posterior estimates of mu seem totally reasonable just kinda looking at it (looks like both of the normals mixed up in “Uncertain” have a mean of about 5). If you want to generate a distribution that looks like “Uncertain”, you’ll probably want to use some sort of generated quantities block for that.


#4

Thanks for the replies! Sorry, I will clarify a bit.

Let xint be the samples created by sampling from the normal distribution N(mu,sigma) (i.e. we create xint with the analog of xint ~ normal(mu,sigma) in Python). There are 4500 of these.

Then x_meas is created by adding some uncertainty to xint, i.e. we create x_meas using something like x_meas = xint + normal(0,uncertainty).

The value of uncertainty is a vector with the same length as xint, where the value is 2 where xint is less than mu, and 0.2 otherwise.

Thus the model I posted above is the exact generative model (as far as I can tell). That is, we have hyperparameters xtrue, which should be estimates of xint, and we try fitting mu and sigma.

On the plot, the blue line is the pdf (Gaussian KDE) of xint. The green line is the pdf of x_meas. The red lines are the pdf of xtrue. The ~ML estimate line is the point in the chain with the highest likelihood. The median is obtained by taking the pdf for every point in the chain and taking the pointwise median over a range of x. The ranges do the same thing, but of course just plot the quantiles of the pdfs.
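For concreteness, the pointwise median and quantile bands over per-draw pdfs can be computed along these lines (a minimal numpy sketch with a simple fixed-bandwidth Gaussian KDE, and synthetic draws standing in for the chain's xtrue samples):

```python
import numpy as np

rng = np.random.default_rng(2)

def kde(samples, grid, bw=0.3):
    # Fixed-bandwidth Gaussian KDE evaluated at each point of `grid`
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bw * np.sqrt(2 * np.pi))

grid = np.linspace(-2.0, 10.0, 200)
draws = rng.normal(4.0, 2.0, size=(100, 500))  # stand-in for chain draws of xtrue

pdfs = np.array([kde(d, grid) for d in draws])    # one pdf per chain step
median_pdf = np.median(pdfs, axis=0)              # pointwise median over steps
lo, hi = np.quantile(pdfs, [0.16, 0.84], axis=0)  # pointwise 68% band
```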

I am not showing at all the estimates of mu and sigma, but it is clear from this plot that mu is estimated with a positive bias with respect to the true distribution.

I am not sure how this is expected – I expect the red lines to broadly match the blue line. I have done many repeat simulations with various parameters.

Many thanks.


#5

Just to be a bit more explicit, here is the python code I run to generate the samples and call PyStan:

    import numpy as np
    import pystan

    mu = 4
    sigma = 2

    # Intrinsic (true) values
    xint = np.random.normal(loc=mu, scale=sigma, size=4500)

    # Uncertainty depends on the intrinsic value
    sd = np.where(xint < mu, 2.0, 0.2)

    # Observed values: intrinsic plus heteroscedastic noise
    x_meas = xint + np.random.normal(scale=sd)

    model = pystan.StanModel(model_code=model_code)  # code was in first post

    fit = model.sampling(data={"N": len(xint),
                               "x_meas": x_meas,
                               "uncertainty": sd},
                         **other_sampling_args)

Cheers.


#6

I don’t like using words like uncertainty for variables, but given that’s what you did, you want that uncertainty parameter to be a scalar to match your description.

Also, you probably don’t want those hard upper and lower bounds on mu; use a simple prior instead. You’re not simulating the data that way, so you shouldn’t expect fitting it that way to give you the right answer. Write a proper generative model, generate the data from it, and then you should be able to recover the parameters. To the extent that the data-generating process and the model don’t match, all bets are off.


#7

Hi Bob,

Thanks a lot for the reply.

It’s quite possible I misunderstand your terminology, but I don’t think uncertainty is a variable – it is data, known exactly (hypothetically). Also, it is not exactly scalar – each measurement that I draw has its own uncertainty – 2 if it is intrinsically below mu, and 0.2 otherwise. In principle, each measurement could have a completely unique uncertainty (and in my real use-case, this is what I have). The fundamental point that is giving me these incorrect results is that this uncertainty depends on the intrinsic value, xint.

I can certainly take away the hard bounds on mu, and use a weak prior. Here are the results of using a N(4,2) prior on mu:

It has pretty much the same result as before.

Sorry to be thick, but I don’t really understand why you say that I’m not simulating the data this way? The way I understand it is that I choose mu and sigma to be 4 and 2 respectively. This implies no knowledge of their prior. From there, the way I generate the data corresponds exactly to the generative model (as far as I can tell).

If I was instead to say that I didn’t choose mu and sigma but rather they were drawn from a normal or lognormal prior for instance, then I guess it would be consistent to use those priors in the generative model. But then I guess the appropriate test would be to run the simulation multiple times and see if it is consistently biased?

The problem is, this whole model works perfectly well if the uncertainty is not dependent on xint. That is, with the exact same model as I posted above, but generating the uncertainty data, for example, using uncertainty ~ normal(0,2) (so each observation still gets a unique uncertainty).

Thanks for your help, and I am sincerely sorry if I am missing something obvious.


#8

For terminology, we use the standard computer science terminology where the data variables are called variables. Sometimes “data” is used for observations of random variables, sometimes for constants, and sometimes for unmodeled predictors (which can be thought of as random but aren’t really modeled as such because the model marginalizes them out).

Yes to running the simulation multiple times to see if it’s biased. See http://andrewgelman.com/2017/04/12/bayesian-posteriors-calibrated/ (I’m working on a case study). The post explains why you want to draw values from the generative process:

  • draw parameters theta from prior p(theta), then

  • draw observed data y from sampling distribution p(y | theta).
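Applied to this thread's setup, those two steps could be sketched like this (the prior choices below are illustrative, not taken from the model above):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4500

# 1. Draw parameters from their priors (illustrative priors)
mu = rng.normal(4.0, 2.0)
sigma = abs(rng.normal(0.0, 2.0))  # half-normal prior on sigma

# 2. Draw observed data from the sampling distribution given those parameters
xtrue = rng.normal(mu, sigma, size=N)
uncertainty = np.where(xtrue < mu, 2.0, 0.2)
x_meas = xtrue + rng.normal(0.0, uncertainty)
```

Fitting data generated this way, over many replications, is what gives the calibration guarantee described below.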

If there’s a different measurement error on each of your observations, that’s fine (I take it that’s what uncertainty is for). If the uncertainty depends on a predictor, you want to model that.

So the big takeaway is that everything will work if you sample from your generative process and then fit with it. It won’t be guaranteed to fit every data set perfectly, but over data sets, you will have calibration.


#9

I think the answer here was in this statement:

I realised that the generative process includes the statement that uncertainty = 2 if xint < mu and 0.2 otherwise, which I was not including in the Stan model (but rather just providing the uncertainty as observed data).

Upon changing my model to reflect this:

data {
    int<lower=0> N;        // Number of samples
    vector[N] x_meas;      // Measured values
    real<lower=0> sd_norm; // Uncertainty in values
}

parameters {
    real mu;
    real<lower=0> sigma;
    vector[N] xtrue;
}

transformed parameters {
    vector<lower=0>[N] uncertainty; // Individual uncertainties

    for (i in 1:N){
      if (xtrue[i] < mu)
        uncertainty[i] = sd_norm;
      else
        uncertainty[i] = sd_norm / 10.0;
    }
}

model {
    mu ~ normal(4,2);
    xtrue ~ normal(mu,sigma);
    x_meas ~ normal(xtrue,uncertainty);
}

(where I have been using sd_norm=2), I get the following pdf:

That is, the posterior sits right on top of the input.

I now have to be able to translate this into my real problem, which is a bit more complex, and perhaps I’ll ask another question about that later :)

Thanks for all the help!