Unable to recover simulated parameters

I’ve generated some fake data per my data generating process and I’m trying to recover it using Stan. But I’m having some trouble. Let me preface this by saying that I’m actually not sure if the model is estimable. Would love some input on that as well.

Let’s imagine that there are two buckets b_1 and b_2, each with a different quantity of water (t_1 and t_2). Based on some unknown ratio p_12, water is transferred from b_1 to b_2. The remaining water remains in b_1. Similarly, based on an unknown ratio (p_21), water is transferred from b_2 to b_1 and the remaining water remains in b_2. I have the final quantities of water in b_1 and b_2 (d_1 and d_2). I’m trying to recover t_1, t_2, p_12 and p_12. In summary, the DGP is as follows:

d_1 = t_1 * p_11 + t_2 * p_21
d_2 = t_1 * p_12 + t_2 * p_22

I set strong priors on p, just to see if the model is identifiable. But so far i’ve not been able to recover the original parameters. The Stan code is below:

data {
  int N;
  matrix[N, 2] d;
}

parameters {
  matrix<lower=0, upper=1>[2, 2] p;
  matrix<lower=0, upper=100>[N, 2] t;
}

model {
  p[1, 1] ~ normal(0.8, 0.1);
  p[1, 2] ~ normal(0.1, 0.1);
  p[2, 1] ~ normal(0.3, 0.1);
  p[2, 2] ~ normal(0.9, 0.1);

  for (n in 1:N) {
    d[n, 1] ~ normal(t[n, 1] * p[1, 1] + t[n, 2] * p[2, 1], 5);
    d[n, 2] ~ normal(t[n, 1] * p[1, 2] + t[n, 2] * p[2, 2], 5);
  }
}

The R code for the simulated data is below:

library(dplyr)
library(rstan)

p11 <- 0.8
p12 <- 0.2
p21 <- 0
p22 <- 1

n_rows <- 10000
n <- n_rows * 2
p <- matrix(c(p11, p12, p21, p22), ncol = 2)
orig_volume <- matrix(round(runif(n, 0, 100)), ncol = 2)
actual_volume <- matrix(rnorm(length(orig_volume), orig_volume %*% p, 0.5),
                        ncol=2)

fit <- stan("deconvolve.stan", data = list(N = n_rows,
                                           d = actual_volume),
            cores = 4)

The model fits without a hitch. However, the recovered p’s are below:

        mean     se_mean        sd       2.5%        25%        50%       75%
p[1,1] 0.4941585 0.1936347 0.2740242 0.204176857 0.22033892 0.4969178 0.7679620
p[1,2] 0.4936669 0.3318837 0.4695399 0.007956682 0.02413307 0.4966741 0.9630371
p[2,1] 0.4900769 0.1943634 0.2750601 0.198570377 0.21497245 0.4891215 0.7648648
p[2,2] 0.5048441 0.3306266 0.4677627 0.020220718 0.03729904 0.5042397 0.9727109
  
         97.5%     n_eff     Rhat
p[1,1] 0.7819635 2.002680 32.08681
p[1,2] 0.9765943 2.001581 54.61768
p[2,1] 0.7791202 2.002748 31.65371
p[2,2] 0.9857066 2.001592 54.26038

All the p’s are close to 0.5, even though p11 = 0.8 and p12 = 0.2. What’s going on here?

Also I know the model code can be improved via vectorization. Would love to get some pointers on that as well.

I did not read the whole code but I think your can simplify the model by setting p_11 = 1 - p_12 and p_22 = 1 - p_21. It follows from the “the remaining water remains” in the original bucket. Now you only estimate 2 parameters instead of 4. I think this will help with making the model identifiable.

Good point. I modified the code to do that. But the estimates for p are still close to 0.5.

data {
  int N;
  matrix[N, 2] d;
}

parameters {
  real<lower=0, upper=1> p11;
  real<lower=0, upper=1> p21;
  matrix<lower=0, upper=100>[N, 2] t;
}

transformed parameters {
  matrix<lower=0, upper=1>[2, 2] p;
  p[1, 1] = p11;
  p[1, 2] = 1 - p11;
  p[2, 1] = p21;
  p[2, 2] = 1 - p21;
}

model {
  p11 ~ normal(0.8, 0.1);
  p21 ~ normal(0.1, 0.1);

  for (n in 1:N) {
    d[n, 1] ~ normal(t[n, 1] * p[1, 1] + t[n, 2] * p[2, 1], 5);
    d[n, 2] ~ normal(t[n, 1] * p[1, 2] + t[n, 2] * p[2, 2], 5);
  }
}

This results in the following results:

          mean   se_mean        sd       2.5%        25%       50%       75%
p[1,1] 0.4861357 0.2787574 0.3943592 0.08172101 0.09205796 0.4872789 0.8802503
p[1,2] 0.5138643 0.2787574 0.3943592 0.10882753 0.11974966 0.5127211 0.9079420
p[2,1] 0.4808310 0.2788023 0.3944226 0.07620828 0.08655144 0.4804124 0.8750201
p[2,2] 0.5191690 0.2788023 0.3944226 0.11452641 0.12497995 0.5195876 0.9134486
          97.5%    n_eff     Rhat
p[1,1] 0.8911725 2.001387 66.93439
p[1,2] 0.9182790 2.001387 66.93439
p[2,1] 0.8854736 2.001385 67.45610
p[2,2] 0.9237917 2.001385 67.45610

Note that the original p’s are:

 p[1, 1] = 0.8
 p[1, 2] = 0
 p[2, 1] = 0.2
 p[2, 2] = 1

I’m starting to think the original problem is ill-posed. Can anyone point me to what this problem is called/how to go about solving it? It seems similar to deconvolution.

Could it because in the R simulation, you set the scale of the normal to .5 and in the stan model you have a scale of 5?

Is this valid for the original Ps?

p[1, 1] = 0.8
p[1, 2] = 0
p[2, 1] = 0.2
p[2, 2] = 1

Shouldn’t p22 (the water that was in bucket 2 and stays in bucket 2) plus p21 be one? Actual mistake? Or just typo?

p[1, 2] = 0

Estimating an effect that takes a value at the edge of the valid parameter space gets hard. No advice here – just saying it can get tricky to read the values right.

The posteriors you have seem super wide though, and the Rhats are huge and n_eff small. Plot them as histograms (or pop the fit open in shinystan, launch_shinystan(fit)) and see what’s going on. There’s something suspicious here.

Yup that was a typo. So I was able to prove that the model is not estimable in the current form ie. there are infinitely many solutions. It appears like the solution that is finally recovered is just indexing based on noise. The way I was able to prove this was by setting stronger priors on the elements of p. I think I’ve also identified what this class of problems is called - blind source/signal separation.

Thanks for following up.

Sometimes you can get reasonable inference using priors to identify the model. This can happen when there’s some compound function of the parameters that you care about that is identified. Such models can sometimes be reparameterized to be identifiable, but often the non-identifiable one is desirable because it allows things like symmetric priors.

Bob, I’m not sure I follow. Care to elaborate a bit?

Yes, if you have y ~ normal(a + b, 1), then a and b are not identified, which makes the model improper. People (like Andrew and Jennifer in their regression book), would run these improper models with slow Gibbs samplers and then just compute a derived statistic c = a + b that was identified. This won’t work in HMC, because it quickly uncovers the pathological posterior and refuses to behave. But, if you put priors, a, b ~ normal(0, 2) then you get a proper posterior and a + b is properly identified. (Though in this case, you just want to use c as a parameter directly and not fool around with non-identifiable compositions.)

Hmm… interesting. If you just want c, why bother with a + b? Could you point me to the Chapter/section in the regression book that they are doing this? It will probably be the easiest way for me to understand.

It’s usually more involved than this. I don’t know why everyone wants to do this. Andrew’s theory is that people in engineering are trained to look at a lot of problems as data transforms rather than generatively.

Here’s the latest example, which was literally the last message to which I responded before this one:

Presumably the function f they’re using is more complicated than addition, but it presents exactly the same problem.

In the IRT section they fiddle around with a bunch of this stuff because the model’s not identified. Just to be clear, you do not and cannot do what they do in BUGS in Stan—I show that and explain why in the manual chapter on problematic posteriors. I’m not even convinced what Gelman and Hill are doing works in BUGS—they never validated it and the guarantees for Gibbs all go out the window with improper posteriors as you don’t even have a density from which to sample!