# Normal distribution parameter estimation with uncertainty

Hi,

I have a problem with a reasonably complicated model. However, I’ve determined that the underlying problem arises in a very simple situation. Say I have data which I create simply by sampling from a normal with mean mu and standard deviation sigma. Then I “measure” each variate with some (normal) uncertainty, say u, so that my “observed” sample is

X ~ N(mu,sigma) + N(0,u)
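When u is a single constant, this sum is itself normal (the convolution of independent normals), with variance sigma^2 + u^2. A quick numpy check with illustrative values (the u = 0.5 here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, u = 4.0, 2.0, 0.5   # illustrative values; u held constant
n = 200_000

# X ~ N(mu, sigma) + N(0, u): the sum of independent normals is
# itself normal, with variance sigma^2 + u^2.
x = rng.normal(mu, sigma, n) + rng.normal(0.0, u, n)

print(x.mean())   # close to mu
print(x.std())    # close to sqrt(sigma^2 + u^2)
```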

To generate the posterior distributions of mu and sigma, I use a simple hierarchical model as follows:

``````stan
data {
  int<lower=0> N;                 // Number of samples
  vector[N] x_meas;               // Measured values
  vector<lower=0>[N] uncertainty; // Uncertainty in values
}

parameters {
  real<lower=0, upper=8> mu;
  real<lower=0> sigma;
  vector[N] xtrue;
}

model {
  xtrue ~ normal(mu, sigma);
  x_meas ~ normal(xtrue, uncertainty);
}
``````

This works perfectly fine if u is a single constant. However, let’s say u is dependent on the intrinsic value, so for instance in the simplest case, we could say

u = u_0 for x < mu,
u = u_1 for x > mu

I would expect the same model to work, since it incorporates individual values for the uncertainty. However, it doesn’t seem to. Using mu=4, sigma=2, u_0=2 and u_1=0.2 for a sample of 4500, I get the following: the various red lines/regions represent the pdf of the estimated `xtrue` values. They clearly gravitate strongly towards a Gaussianized version of the observed distribution rather than the underlying one. Furthermore, the likelihoods of the steps in the chain are higher than the likelihood obtained by re-inputting the actual underlying distribution.
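Incidentally, the direction of the pull is visible in the data alone: the heavy noise below mu smears that half of the sample far more than the other, so the observed sample is skewed upward relative to the intrinsic one even though the added noise is mean-zero. A quick numpy sketch re-creating the setup with the values above:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma, u0, u1, N = 4.0, 2.0, 2.0, 0.2, 4500
xint = rng.normal(mu, sigma, N)          # intrinsic values
u = np.where(xint < mu, u0, u1)          # uncertainty depends on xint
x_meas = xint + rng.normal(0.0, u)       # scale broadcasts over u

# The noise is mean-zero, so the sample means agree...
print(xint.mean(), x_meas.mean())
# ...but the heavy smearing below mu pulls the observed median
# above the intrinsic median.
print(np.median(xint), np.median(x_meas))
```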

So, is this problem fundamentally ill-posed, unsolvable, or am I just doing something stupid?

Is this really unexpected? It looks right for the model to me. I would check by doing repeat simulations and checking the calibration of the estimates.

Could you clarify what things are in the plot? This might help.

Intrinsic is the distribution of some random variable that isn’t in the model.

Uncertain is generated as a function of Intrinsic (and is that the data `x_meas` being passed into the model, rather than `uncertainty`?).

The last 4 estimated things in the plot are based on posterior samples of mu (the mean?)?

The posterior estimates of mu seem totally reasonable just from looking at it (it looks like both of the normals mixed into “Uncertain” have a mean of about 5). If you want to generate a distribution that looks like “Uncertain” in the plot, you’ll probably want to use some sort of generated quantities block for that.

Thanks for the replies! Sorry, I will clarify a bit.

Let `xint` be the samples created by sampling from the normal distribution N(mu,sigma) (i.e. we create `xint` with the analog of `xint ~ normal(mu,sigma)` in python). There are 4500 of these.

Then `x_meas` is created by adding some uncertainty to `xint`, i.e. we create `x_meas` using something like `x_meas = xint + normal(0,uncertainty)`.

The value of `uncertainty` is a vector with the same length as `xint`, where the value is 2 where `xint` is less than `mu`, and 0.2 otherwise.

Thus the model I posted above is the exact generative model (as far as I can tell). That is, we have latent parameters `xtrue`, which should be estimates of `xint`, and we try fitting the hyperparameters `mu` and `sigma`.

On the plot, the blue line is the pdf (Gaussian KDE) of `xint`. The green line is the pdf of `x_meas`. The red lines are the pdfs of `xtrue`. The ~ML estimate line is the point in the chain with the highest likelihood. The median line is obtained by computing the pdf for every point in the chain and taking the median over a range of x. The shaded ranges do the same, but plot the quantiles of the pdfs.
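For reference, curves like the blue and green lines can be produced with SciPy’s Gaussian KDE; a minimal sketch, re-creating the data as described above (the seed and grid are arbitrary):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Re-create the data described above (mu=4, sigma=2, u0=2, u1=0.2).
xint = rng.normal(4.0, 2.0, 4500)
x_meas = xint + rng.normal(0.0, np.where(xint < 4.0, 2.0, 0.2))

grid = np.linspace(-6, 14, 400)
pdf_int = gaussian_kde(xint)(grid)     # blue line: intrinsic pdf
pdf_meas = gaussian_kde(x_meas)(grid)  # green line: observed pdf
```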

I am not showing the estimates of `mu` and `sigma` at all, but it is clear from this plot that `mu` is estimated with a positive bias with respect to the true distribution.

I am not sure why this would be expected – I expect the red lines to broadly match the blue line. I have done many repeat simulations with various parameters.

Many thanks.

Just to be a bit more explicit, here is the python code I run to generate the samples and call PyStan:

``````python
import numpy as np
import pystan

mu = 4
sigma = 2

xint = np.random.normal(loc=mu, scale=sigma, size=4500)
sd = np.where(xint < mu, 2., 0.2)

x_meas = xint + np.random.normal(scale=sd)

model = pystan.StanModel(model_code=model_code)  # code was in first post

fit = model.sampling(data={"N": len(xint),
                           "x_meas": x_meas,
                           "uncertainty": sd})   # plus other sampling args
``````

Cheers.

I don’t like using words like `uncertainty` for variable names, but given that’s what you did, you want that `uncertainty` parameter to be a scalar to match your description.

Also, you probably want a simple prior on `mu` rather than those hard upper and lower bounds—you’re not simulating the data that way, so you shouldn’t expect fitting it that way to give you the right answer. Write a proper generative model, generate the data from it, and then you should be able to recover the parameters. To the extent that the data-generating process and the model don’t match, all bets are off.

Hi Bob,

Thanks a lot for the reply.

It’s quite possible I misunderstand your terminology, but I don’t think `uncertainty` is a variable – it is data, known (hypothetically) exactly. Also, it is not exactly a scalar – each measurement that I draw has its own uncertainty: 2 if it is intrinsically below `mu`, and 0.2 otherwise. In principle, each measurement could have a completely unique uncertainty (and in my real use-case, this is what I have). The fundamental point that gives me these incorrect results is that this uncertainty depends on the intrinsic value, `xint`.

I can certainly take away the hard bounds on `mu`, and use a weak prior. Here are the results of using a N(4,2) prior on `mu`: It has pretty much the same result as before.

Sorry to be thick, but I don’t really understand why you say that I’m not simulating the data that way. The way I understand it is that I choose `mu` and `sigma` to be 4 and 2 respectively, which implies no knowledge of their prior. From there, the way I generate the data corresponds exactly to the generative model (as far as I can tell).

If I was instead to say that I didn’t choose `mu` and `sigma` but rather they were drawn from a normal or lognormal prior for instance, then I guess it would be consistent to use those priors in the generative model. But then I guess the appropriate test would be to run the simulation multiple times and see if it is consistently biased?

The problem is, this whole model works perfectly well if the uncertainty is not dependent on `xint`. That is, with the exact same model as I posted above, but generating the `uncertainty` data using, for example, `uncertainty ~ normal(0,2)` (so each observation still gets a unique uncertainty).

Thanks for your help, and I am sincerely sorry if I am missing something obvious.

For terminology, we use the standard computer-science convention in which all named quantities are called variables. Sometimes “data” is used for observations of random variables, sometimes for constants, and sometimes for unmodeled predictors (which can be thought of as random but aren’t really modeled as such, because the model marginalizes them out).

Yes to running the simulation multiple times to see if it’s biased. See http://andrewgelman.com/2017/04/12/bayesian-posteriors-calibrated/ (I’m working on a case study). The post explains why you want to draw values from the generative process:

• draw parameters theta from prior p(theta), then

• draw observed data y from sampling distribution p(y | theta).

If there’s a different measurement error on each of your observations, that’s fine (I take it that’s what `uncertainty` is for). If the uncertainty depends on a predictor, you want to model that.

So the big takeaway is that everything will work if you sample from your generative process and then fit with it. It won’t be guaranteed to fit every data set perfectly, but over data sets, you will have calibration.
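That calibration check can be sketched without Stan, using a toy conjugate model where the posterior is available in closed form (the prior, sample size, and interval level here are all illustrative): draw the parameter from its prior, draw data given the parameter, and count how often a central posterior interval covers the drawn value:

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(3)

# Toy conjugate model: mu ~ N(0, 1); y_i ~ N(mu, 1), i = 1..n.
# Posterior: mu | y ~ N(n * ybar / (n + 1), 1 / (n + 1)).
n, sims, z50 = 20, 1000, 0.6745   # 0.6745 = z-quantile for a 50% interval
hits = 0
for _ in range(sims):
    mu = rng.normal(0.0, 1.0)          # draw parameter from the prior
    y = rng.normal(mu, 1.0, n)         # draw data given the parameter
    post_mean = n * y.mean() / (n + 1)
    post_sd = 1.0 / sqrt(n + 1)
    lo, hi = post_mean - z50 * post_sd, post_mean + z50 * post_sd
    hits += lo <= mu <= hi             # does the interval cover mu?

print(hits / sims)   # close to 0.50 when the model is calibrated
```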

I think the answer here was in this statement:

I realised that the generative process includes the statement that `uncertainty = 2` if `xint < mu` and 0.2 otherwise, which I was not including in the Stan model (but rather just providing the uncertainty as observed data).

Upon changing my model to reflect this:

``````stan
data {
  int<lower=0> N;        // Number of samples
  vector[N] x_meas;      // Measured values
  real<lower=0> sd_norm; // Uncertainty in values
}

parameters {
  real mu;
  real<lower=0> sigma;
  vector[N] xtrue;
}

transformed parameters {
  vector<lower=0>[N] uncertainty; // Individual uncertainties

  for (i in 1:N) {
    if (xtrue[i] < mu)
      uncertainty[i] = sd_norm;
    else
      uncertainty[i] = sd_norm / 10.;
  }
}

model {
  mu ~ normal(4, 2);
  xtrue ~ normal(mu, sigma);
  x_meas ~ normal(xtrue, uncertainty);
}
``````

(where I have been using `sd_norm=2`), I get the following pdf: the posterior now sits right on top of the input.

I now have to be able to translate this into my real problem, which is a bit more complex, and perhaps I’ll ask another question about that later :)

Thanks for all the help!