Trouble estimating a global variance parameter on a covariance matrix

Hi all,

I want to recover multiple correlated random walks from binomial data. Given that the global variance for the covariance matrix (here phi) may not be known with certainty, I set a prior on it (here the same as the one that generated the data). The model seems to have issues getting good draws when the value is very small (like 0.005). I’d appreciate any advice on a better parameterization of the model.

data {
  int N;                        // number of polls
  int S;                        // number of units
  int T;                        // number of time points
  int s[N];                     // unit of each poll
  int t[N];                     // time point of each poll
  cov_matrix[S] cov_mat_mu_b;   // covariance structure of the walk (scale set by phi)
  int n_supp[N];                // supporters in each poll
  int n_resp[N];                // respondents in each poll
}
transformed data {
  cholesky_factor_cov[S] cholesky_cov_mu_b;
  // divide by the total sd, sqrt(1' * Sigma * 1), so that phi alone sets the overall scale
  cholesky_cov_mu_b = cholesky_decompose(cov_mat_mu_b)/sqrt(rep_row_vector(1.0, S) * cov_mat_mu_b * rep_vector(1.0, S));
}
parameters {
  matrix[S, T] raw_mu_b;   // non-centered innovations
  real<lower = 0> phi;     // global scale of the walk
}
transformed parameters {
  matrix[S, T] mu_b;
  vector[N] theta;
  cholesky_factor_cov[S] cholesky_cov_mu_b_scaled = cholesky_cov_mu_b * phi;
  // non-centered random walk: correlated innovations scaled by phi, added to the previous value
  mu_b[, 1] = cholesky_cov_mu_b_scaled * raw_mu_b[, 1];
  for (i in 2:T) mu_b[:, i] = cholesky_cov_mu_b_scaled * raw_mu_b[:, i] + mu_b[:, i - 1];
  for (i in 1:N) theta[i] = mu_b[s[i], t[i]];
}
model {
  phi ~ normal(0.02, 0.01);
  to_vector(raw_mu_b) ~ std_normal();
  n_supp ~ binomial_logit(n_resp, theta);
}
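
For reference, here is what the random walk in the transformed parameters block amounts to in R; this is just a rough sketch, with Sigma, T_len, and L_scaled as stand-in names for cov_mat_mu_b and friends:

set.seed(1)
S <- 3
T_len <- 5
Sigma <- rWishart(1, 10, diag(S))[,,1]
phi <- 0.02
## normalized Cholesky factor times phi, mirroring transformed data + transformed parameters
L_scaled <- phi * t(chol(Sigma)) / sqrt(sum(Sigma))
raw <- matrix(rnorm(S * T_len), S, T_len)    # raw_mu_b ~ std_normal()
mu <- matrix(NA, S, T_len)
mu[, 1] <- L_scaled %*% raw[, 1]
for (i in 2:T_len) mu[, i] <- L_scaled %*% raw[, i] + mu[, i - 1]
## every increment mu[, i] - mu[, i - 1] has covariance L_scaled %*% t(L_scaled)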

What’s happening in the denominator here:

cholesky_cov_mu_b = cholesky_decompose(cov_mat_mu_b)/sqrt(rep_row_vector(1.0, S) * cov_mat_mu_b * rep_vector(1.0, S));

I would expect that if it were the Cholesky of the covariance it would just be cholesky(cov), without the division.

I guess we would expect this model to go fast if the data weren’t there. I wonder if your data is very informative or something?

Like centered is better than non-centered when you have informative data (8-schools example here: QR decomposition of parameters for multilevel regression? - #5 by bbbales2)

Also, hellooooo, long time no talk. Stop by the Slack and say hi some time: Mc-stan community slack

I am dividing by the current total standard deviation of the covariance matrix so that I only have to multiply in the transformed parameters block rather than having to divide and multiply.

sqrt(rep_row_vector(1.0, S) * cov_mat_mu_b * rep_vector(1.0, S));

gives the total standard deviation, sqrt(1' * Sigma * 1), and I can change the total scale of the covariance matrix by multiplying by the square of the new standard deviation divided by the old one. If I decompose it into the Cholesky factor and divide by the old standard deviation up front, I only have to multiply by phi afterwards.
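
A quick R check of that scaling, just to convince myself (Sigma is a stand-in for cov_mat_mu_b):

S <- 3
Sigma <- rWishart(1, 10, diag(S))[,,1]
phi <- 0.02
total_sd <- sqrt(sum(Sigma))       # same as sqrt(1' * Sigma * 1)
L <- t(chol(Sigma)) / total_sd     # what the transformed data block computes
L_scaled <- phi * L                # what the transformed parameters block computes
sum(L_scaled %*% t(L_scaled))      # total variance of the rescaled covariance, ~phi^2
phi^2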

The data is drawn directly from the DGP, so I’d assume it’s very informative? On the other hand, with only 1000 respondents per poll a sampling standard error of 1.5 percentage points or so is still a lot, as there isn’t that much over-time variation.

edit: Also, hi! I didn’t know there was a Slack now?

I discovered it a couple months ago. I am trying to use it more to keep in touch with people.

Yeah just give centering a go as a guess and check thing. If there are divergences you know you did a bad thing and we’re just back to where we started.

You mean centering for the random walk? Without having the total variance as a parameter, the model runs as smooth as butter or whatever the expression is in English. I am wondering whether it’s just because the value is so close to 0.

Yeah so like:

mu_b[, 1] ~ multi_normal_cholesky(rep_vector(0, S), cholesky_cov_mu_b_scaled);

And mu_b is a parameter now.

That seems familiar enough.

I expect this would also sample really well if you commented out the likelihood (then it’s just sampling the prior), which makes me wonder if the data is super informative in some way that is not helpful?

Here is the R code for the simulation. I find it hard to believe that the data is that informative. I need a lot of ‘polls’ to recover the random walk, and 30 polls for 30 time points and 3 units is usually not enough. I am well aware that my R code is horrible.

library(MASS)      # mvrnorm
library(boot)      # inv.logit (gtools also provides one)
library(dplyr)
library(tibble)
library(tidyr)
library(ggplot2)

T_max <- 30
S_max <- 3
phi <- abs(rnorm(1, 0.02, 0.01))
## Draw a covariance matrix and rescale it so its total sd, sqrt(1' Sigma 1), equals phi
mu_b_cov_mat <- rWishart(1, 100, toeplitz(seq(S_max/2, 1, length.out = S_max)/S_max))[,,1]
national_cov_matrix_error_sd <- sqrt(t(rep(1, S_max)) %*% mu_b_cov_mat %*% rep(1, S_max))[1,1]
mu_b_cov_mat <- mu_b_cov_mat * (phi/national_cov_matrix_error_sd)^2
## Previous electoral outcome
prev_ele_days <- rpois(1, 1200)
prev_ele <- mvrnorm(1, runif(S_max, -0.1, 0.1), mu_b_cov_mat * prev_ele_days)
## Random walk
mu_b <- matrix(NA, nrow = T_max, ncol = S_max)
mu_b[1, ] <- mvrnorm(1, prev_ele, mu_b_cov_mat)
for (t in 2:T_max) mu_b[t, ] <- mvrnorm(1, mu_b[t - 1, ], mu_b_cov_mat)
## Organized data for plotting
mu_b_df <- as.data.frame(mu_b)
colnames(mu_b_df) <- as.character(seq(1, S_max))
mu_b_df <- mu_b_df %>% 
  add_column(t = seq(1, T_max)) %>% 
  pivot_longer(-t, names_to = "s", values_to = "p_dem") %>%
  mutate(p_dem = inv.logit(p_dem),
         s = as.integer(s))
ggplot(data = mu_b_df, aes(x = t, y = p_dem, group = as.factor(s), color = as.factor(s))) + 
  geom_line() + 
  theme_bw() +
  scale_x_continuous(breaks = seq(1, T_max, as.integer(T_max/5))) +
  labs(color = "Units", y = "Share") +
  theme(axis.title.x = element_blank())
## Polls
df <- sim_polls(30, mu_b, 1000, 1000)

sim_polls is

sim_polls <- function(N, mu_b, lower_bound, upper_bound, prob_s = NULL){
  # Simulate N polls from the latent support matrix (only mu_b, only state polls)
  # N           = number of polls
  # mu_b        = T x S matrix of unit-level support on the logit scale
  # lower_bound = smallest possible poll sample size
  # upper_bound = largest possible poll sample size
  # prob_s      = sampling probabilities over units (uniform if NULL)
  T_max <- nrow(mu_b)
  S_max <- ncol(mu_b)
  if (is.null(prob_s)){
    prob_s <- rep(1/S_max, S_max)
  }
  t_polls <- sample(1:T_max, N, replace = TRUE)                 # time point of each poll
  s_polls <- sample(1:S_max, N, replace = TRUE, prob = prob_s)  # unit of each poll
  n_polls <- floor(runif(N, lower_bound, upper_bound))          # poll sample sizes
  n_theta <- rep(NA, N)
  for (n in 1:N) n_theta[n] <- mu_b[t_polls[n], s_polls[n]]
  r_polls <- rbinom(N, n_polls, inv.logit(n_theta))             # supporters in each poll
  df_polls <- data.frame(t = t_polls, s = s_polls, n = n_polls, r = r_polls)
  return(df_polls)
}

Running the simulation I get this df:

> df
    t s    n   r
1   9 1 1000 541
2   6 3 1000 530
3  28 2 1000 534
4  30 2 1000 525
5  20 2 1000 573
6   3 3 1000 513
7  19 1 1000 559
8  16 2 1000 524
9   3 2 1000 547
10 13 3 1000 535
11 18 1 1000 547
12 12 3 1000 500
13 22 3 1000 486
14 20 1 1000 553
15 20 2 1000 538
16 11 2 1000 525
17  9 3 1000 500
18 22 1 1000 534
19  4 3 1000 524
20 15 3 1000 519
21 18 1 1000 538
22  8 1 1000 552
23 18 3 1000 496
24 20 3 1000 502
25 28 3 1000 529
26  1 2 1000 532
27 14 3 1000 517
28  8 2 1000 560
29 26 1 1000 554
30 19 1 1000 551

So that’s 1000 responses for each poll, which seems fairly informative. For p = 0.5 we’d expect the standard error to be about 0.016. And if we look at the variation in the responses here, one is 496 and one is 553, so the difference in the MLE estimates is about 0.057, which is a lot bigger than the standard error, so I think this counts as informative.
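
In R, just to spell those numbers out (nothing here beyond the table above):

n <- 1000
sqrt(0.5 * 0.5 / n)    # binomial standard error at p = 0.5, about 0.016
(553 - 496) / n        # spread between two of the observed polls, about 0.057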

Like, in the 8-schools example you really can’t tell the 8 schools apart – the standard errors are too high. Here you really can separate things.

But I guess the other thing here is that these times are sorta randomly far apart.

It looks like there are 30 days. Are you sampling random variables at all 30 days or just ones with polls at them?

I am estimating \mu_b for all days just for simplicity here. I have another model that skips days without data by multiplying the Cholesky factor by the square root of the time difference (a quick simulation check of that scaling is below), but that is not really the MWE.

Yes, time points/units are randomly assigned polls.
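
Just to double-check the square-root-of-the-time-difference trick, a small simulation; Sigma and k are placeholder names, with k playing the role of the gap between observed time points:

library(MASS)                                             # mvrnorm
set.seed(1)
S <- 3
Sigma <- rWishart(1, 10, diag(S))[,,1]
k <- 4                                                    # days between observed time points
steps <- replicate(k, mvrnorm(1e4, rep(0, S), Sigma))     # 1e4 x S x k daily increments
summed <- apply(steps, c(1, 2), sum)                      # walk value after k skipped days
cov(summed)                                               # ~ k * Sigma
k * Sigma                                                 # so the Cholesky factor scales by sqrt(k)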

You might be in a situation where you want to do centering for the days where you have data and non-centering for the days where you don’t have data.

I was gonna suggest this. Why is this a problem?

It’s not. It’s just not as readable.

Right now I estimate only the days when I have data and later fill in the gaps, given that I know how the process behaves.

data {
  int N;
  int S;
  int T;
  int t_unique[T - 1];    // time differences between consecutive observed time points
  int s[N];
  int t[N];
  cov_matrix[S] cov_mat_mu_b;
  int n_supp[N];
  int n_resp[N];
  // prior data
  vector[S] prev_ele;
  int prev_ele_days;
}
transformed data {
  cholesky_factor_cov[S] cholesky_cov_mu_b;
  real sqrt_t_unique[T - 1];
  vector[S] prev_ele_logit;
  real sqrt_prev_ele_days;
  cholesky_cov_mu_b = cholesky_decompose(cov_mat_mu_b)/sqrt(rep_row_vector(1.0, S) * cov_mat_mu_b * rep_vector(1.0, S));
  sqrt_t_unique = sqrt(t_unique);
  prev_ele_logit = logit(prev_ele);
  sqrt_prev_ele_days = sqrt(prev_ele_days);
}
parameters {
  matrix[S, T] raw_mu_b;
  real<lower = 0> phi;
}
transformed parameters {
  matrix[S, T] mu_b;
  vector[N] theta;
  cholesky_factor_cov[S] cholesky_cov_mu_b_scaled = cholesky_cov_mu_b * phi;
  mu_b[, 1] = sqrt_prev_ele_days * cholesky_cov_mu_b_scaled * raw_mu_b[, 1] + prev_ele_logit;
  for (i in 2:T) mu_b[:, i] = sqrt_t_unique[i - 1] * cholesky_cov_mu_b_scaled * raw_mu_b[:, i] + mu_b[:, i - 1];
  for (i in 1:N){
    theta[i] = mu_b[s[i], t[i]];
  }
}
model {
  phi ~ normal(0.02, 0.01);
  to_vector(raw_mu_b) ~ std_normal();
  n_supp ~ binomial_logit(n_resp, theta);
}

Change:

parameters {
  matrix[S, T] raw_mu_b;
  real<lower = 0> phi;
}
transformed parameters {
  matrix[S, T] mu_b;
  vector[N] theta;
  cholesky_factor_cov[S] cholesky_cov_mu_b_scaled = cholesky_cov_mu_b * phi;
  mu_b[, 1] = sqrt_prev_ele_days * cholesky_cov_mu_b_scaled * raw_mu_b[, 1] + prev_ele_logit;
  for (i in 2:T) mu_b[:, i] = sqrt_t_unique[i - 1] * cholesky_cov_mu_b_scaled * raw_mu_b[:, i] + mu_b[:, i - 1];
  for (i in 1:N){
    theta[i] = mu_b[s[i], t[i]];
  }
}
model {
  phi ~ normal(0.02, 0.01);
  to_vector(raw_mu_b) ~ std_normal();
  n_supp ~ binomial_logit(n_resp, theta);
}

To something like:

parameters {
  matrix[S, T] mu_b;
  real<lower = 0> phi;
}
model {
  vector[N] theta;
  matrix[S, S] cholesky_cov_mu_b_scaled = cholesky_cov_mu_b * phi;  // plain matrix: constrained types are not allowed for local variables
  mu_b[, 1] ~ multi_normal_cholesky(prev_ele_logit, sqrt_prev_ele_days * cholesky_cov_mu_b_scaled);

  for (i in 2:T) {
    mu_b[:, i] ~ multi_normal_cholesky(mu_b[:, i - 1], sqrt_t_unique[i - 1] * cholesky_cov_mu_b_scaled);
  }
  
  for (i in 1:N) {
    theta[i] = mu_b[s[i], t[i]];
  }

  phi ~ normal(0.02, 0.01);
  n_supp ~ binomial_logit(n_resp, theta);
}

I feel like I got fooled by thinking that non-centered is always better…

Well, it very well might still be; this is just an easy guess-and-check thing to try :P

This code made Stan very unhappy. :/ No divergent transitions anymore, but it exceeds max treedepth, has very low Neff, and throws a whole bunch of other warnings. I just checked, though: my model was unable to actually recover \phi with the available data (the estimate is far smaller than the real value).

[Figure: prior and posterior for \phi, with the black line marking the real value.]

Lolol

Hmmm, alright, does the non-centered parameterization work better for estimating phi? Like do you get in the ballpark there?

I get an Neff of 9 for \phi and the recovery isn’t any better.

So since you were asking about small phi – does this model work if phi is large? Edit: Like in the simulated data?

I mean, large in this case is like bigger than 0.02, but even then not really. It seems to want to be very small.