I’ve generated some fake data per my data generating process and I’m trying to recover it using Stan. But I’m having some trouble. Let me preface this by saying that I’m actually not sure if the model is estimable. Would love some input on that as well.
Let’s imagine that there are two buckets b_1 and b_2, each with a different quantity of water (t_1 and t_2). Based on some unknown ratio p_12, water is transferred from b_1 to b_2. The remaining water remains in b_1. Similarly, based on an unknown ratio (p_21), water is transferred from b_2 to b_1 and the remaining water remains in b_2. I have the final quantities of water in b_1 and b_2 (d_1 and d_2). I’m trying to recover t_1, t_2, p_12 and p_12. In summary, the DGP is as follows:
d_1 = t_1 * p_11 + t_2 * p_21
d_2 = t_1 * p_12 + t_2 * p_22
I set strong priors on p, just to see if the model is identifiable. But so far i’ve not been able to recover the original parameters. The Stan code is below:
data {
int N;
matrix[N, 2] d;
}
parameters {
matrix<lower=0, upper=1>[2, 2] p;
matrix<lower=0, upper=100>[N, 2] t;
}
model {
p[1, 1] ~ normal(0.8, 0.1);
p[1, 2] ~ normal(0.1, 0.1);
p[2, 1] ~ normal(0.3, 0.1);
p[2, 2] ~ normal(0.9, 0.1);
for (n in 1:N) {
d[n, 1] ~ normal(t[n, 1] * p[1, 1] + t[n, 2] * p[2, 1], 5);
d[n, 2] ~ normal(t[n, 1] * p[1, 2] + t[n, 2] * p[2, 2], 5);
}
}
The R code for the simulated data is below:
library(dplyr)
library(rstan)
p11 <- 0.8
p12 <- 0.2
p21 <- 0
p22 <- 1
n_rows <- 10000
n <- n_rows * 2
p <- matrix(c(p11, p12, p21, p22), ncol = 2)
orig_volume <- matrix(round(runif(n, 0, 100)), ncol = 2)
actual_volume <- matrix(rnorm(length(orig_volume), orig_volume %*% p, 0.5),
ncol=2)
fit <- stan("deconvolve.stan", data = list(N = n_rows,
d = actual_volume),
cores = 4)
The model fits without a hitch. However, the recovered p’s are below:
mean se_mean sd 2.5% 25% 50% 75%
p[1,1] 0.4941585 0.1936347 0.2740242 0.204176857 0.22033892 0.4969178 0.7679620
p[1,2] 0.4936669 0.3318837 0.4695399 0.007956682 0.02413307 0.4966741 0.9630371
p[2,1] 0.4900769 0.1943634 0.2750601 0.198570377 0.21497245 0.4891215 0.7648648
p[2,2] 0.5048441 0.3306266 0.4677627 0.020220718 0.03729904 0.5042397 0.9727109
97.5% n_eff Rhat
p[1,1] 0.7819635 2.002680 32.08681
p[1,2] 0.9765943 2.001581 54.61768
p[2,1] 0.7791202 2.002748 31.65371
p[2,2] 0.9857066 2.001592 54.26038
All the p’s are close to 0.5, even though p11 = 0.8 and p12 = 0.2. What’s going on here?
Also I know the model code can be improved via vectorization. Would love to get some pointers on that as well.