Hierarchical model with summations does not fit simulated data

Hello Stan pros,

I’m trying to model outcomes based on weighting different aspects. The model I’m using is essentially

r_d \sim \mathcal{N}\left( \sum^K_i (w_{di} \sum^3_j(\alpha_j a_{dij} )), \sigma^2 \right)

I’ve taken the model from literature where it was fit using EM-algorithms, but I was looking to use Stan to learn about the model and Stan.

There are 1 \leq d \leq N outcomes, K properties and 3 ratings per property. The ratings are the a_{dij}, \alpha_j is the coefficient mixing the ratings and w_{di} are individual weights for each property, which I’m actually interested in. For simplicity and possible performance increases, the outcome is standardised on sample mean and stdev.
Since the properties are correlated, the individual weightings will be too, thus there is a multivariate normal prior on the w's:

w_d \sim \mathcal{N}(\mu, \Sigma)

This resulted in my following Stan Code:

data {
    int<lower=1> N;
    int<lower=1> K;
    vector[N] y;
    real input[N, K, 3];
}

parameters {
    cholesky_factor_corr[K] Lcorr;
    vector<lower=0>[K] sigma;
    vector[K] mu_weight;

    vector[K] weight[N];

    vector[3] alpha;
    real<lower=0> sigma_sq;
}

transformed parameters {
    real mu_rating[N];
    cholesky_factor_cov[K] L_Sigma;

    // Calculate terms and mean for likelihod
    for (i in 1:N) {
        mu_rating[i] = sum(weight[i,] * (to_row_vector(alpha) * to_matrix(input[i,,])')) * sigma_sq;
    }
    
    // Precalc MultiNormal Sigma
    L_Sigma = diag_pre_multiply(sigma, Lcorr);
}

model {
    // Priors
    sigma ~ cauchy(0, 2.5);
    Lcorr ~ lkj_corr_cholesky(1);
    mu_weight ~ normal(0, 5);
    alpha ~ normal(0, 10);
    sigma_sq ~ normal(0, 5);

    // Model for weights
    for (i in 1:N)
        weight[i] ~ multi_normal_cholesky(mu_weight, L_Sigma);

    // Model for outcome
    y ~ normal(mu_rating, 1);
}

The model performs very, very slowly with a lot of divergent transitions and low BFMI - especially as my data-set has N \approx 20000 and K \approx 150. To investigate the problems with my model, I used some simulated data and looked at the resulting posterior:

# Dimensions
N_SIM = 100
K_SIM = 10

# Model scale
sigma_sq_SIM = abs(rcauchy(1, 0, 2.5))

# Mean weightings
mu_weight_SIM = rnorm(K_SIM, 0, 5)
# Mixing coefficients
alpha_SIM = rnorm(3, 0, 10)

# Ratings between 0 and 1
input_SIM = array(rbeta(N_SIM*K_SIM*3, shape1=1, shape2=5), c(N_SIM, K_SIM, 3))

# Generate covariance matrix for weights
A <- matrix(runif(K_SIM^2)*2-1, ncol=K_SIM)
Sigma_SIM = t(A) %*% A

# Individual weightings drawn from posterior
weight_SIM <- rmvnorm(N_SIM, mu_weight_SIM, Sigma_SIM)

# Generating Outcome
mu_rating_SIM <- rep(NA, N_SIM)
y_SIM <- rep(NA, N_SIM)
for (i in 1:N_SIM) {
  mu_rating_SIM[i] <- sum(weight_SIM[i,] * (alpha_SIM %*% t(input_SIM[i,,])))
  y_SIM[i] <- rnorm(1, mean = mu_rating_SIM[i], sd = sigma_sq_SIM)
}

# Running Stan
smfit_simulated <- sampling(stanmodel, data = list(N = N_SIM,
                                                   K = K_SIM,
                                                   y = y_SIM,
                                                   input = input_SIM,
                                                   simulate_only = 0),
                            init_r = 1, chains = 1, iter = 1000)

The model is already fitting quite slowly, but more importantly, the posterior estimates are nowhere near the simulated data. In fact, the mu_weight-parameter posterior estimates are bimodal - which they shouldn’t be as they are generated from a normal distribution.

This results in some questions I am not yet able to answer (trying to learn here…):

  1. Did I not properly set up the model? That is, is there a problem with the model in general that leads to the problems when using Stan?
  2. How can I reparameterize the model for improved quality and performance?
  3. How can I improve the performance in terms of speed aside from the reparameterization?
  4. Is it realistic to have a model that works well on the data-set of the size explained above? Or might a different approach, such as variational Bayes or EM, be more promising?

I’m hoping for some valuable lessons on how to set up better models. Simple regressions from textbooks did not prepare me for this… :)

Thanks!
Chris

What was the E step marginalizing out? I’m used to EM being used for mixtures, but this doesn’t seem to be a mixture so much as something like a regression with some multiplicative terms (like the 2PL IRT model, say).

The main problem you’re going to get in a model like this is multiplicative identifiability. Rewriting your normal location, you get

\sum_{i=1}^K \sum_{j = 1}^3 (w_{d,i} \cdot \alpha_j) \cdot a_{d,i,j}

it’s easy to see that only w \cdot \alpha is identified. For example, you get mode switching by swapping signs on w and \alpha. A typical solution is to give one of them a strong prior, say \mathsf{Normal}(0, 1) constrained to be positive. Then the other one can have a free scale and free sign.

Are there other constraints like that the \alpha sum to zero or form a simplex or something?

Given you’re putting a complex prior on the weight (which should be vectorized for a really big efficiency gain).

Also, rather than doing things like to_matrix and to_row_vector, you might want to just code up the types you want.

Algebra is good for reducing number of arithmetic operations, which take time for the chain rule and autodiff. For example,

sum(weight[i, ] * (to_row_vector(alpha) * to_matrix(input[i,,])')) * sigma_sq;

should be

weight[i, ] * sigma_sq * sum(alpha * input[i]);

when you get the types sorted out.

In general, I’d suggest starting simpler. Rather than that multivariate prior, just use a fixed prior. Then move to something hierarchical. And then move to something multivariate. It’s easier to debug building up stepwise.

P.S. It really helps if the math and code use the same variable names (e.g., w and w rather than w and weight, y and y rather than y and input, etc.). We also find it helpful if the index variables are n in 1:N rather than d in 1:N because it’s easier to line up.

1 Like

Thanks for your very helpful answer, Bob! Sorry for the delayed response. Unfortunately, I wasn’t able to come back to this model sooner…

Your post helped me to understand the difficulties with the model, especially considering the type conversions, and to improve the model in some regards. For now I’m using a hierarchical model without the multivariate normal, which I’d like to improve for speed before extending it further:

data {
    int<lower=1> N_Participants;
    int<lower=1> K_Aspects;
    vector[N_Participants] y_rating;
    matrix[K_Aspects, 3] input[N_Participants];
}

parameters {
    // Weights for aspects
    row_vector<lower=0>[K_Aspects] weight_std[N_Participants];
    // Mixing coefficients for sentiment scores
    row_vector[3] alpha;
    // Hyper parameter for weights
    real<lower=0> mu_weights;
    real<lower=0> sigma_weights;

    // Model variance
    real<lower=0> sigma_sq;
}

transformed parameters {
    real mu_rating[N_Participants];
    row_vector<lower=0>[K_Aspects] weight[N_Participants];

    // Calculate terms and mean for likelihod
    for (n in 1:N_Participants) {
        weight[n] = mu_weights + weight_std[n] * sigma_weights;
        mu_rating[n] = sum(weight[n] .* (alpha * input[n]'));
    }
}

model {
    // Priors
    alpha ~ normal(0, 10);
    sigma_sq ~ normal(0, 5);
    
    mu_weights ~ normal(0, 1);
    sigma_weights ~ normal(0, 5);

    // Model for weights
    for (n in 1:N_Participants) {
      weight_std[n] ~ normal(0, 1);
    }

    // Model for outcome
    y_rating ~ normal(mu_rating, sigma_sq);
}

generated quantities {
    vector[N_Participants] y_sim;

    // Generate simulations
    for (i in 1:N_Participants) {
        y_sim[i] = normal_rng(mu_rating[i], sigma_sq);
    }
}

The model is still quite slow, but is improved over my initial attempt.

To be honest: I’m not sure. The paper says

Using the current parameter estimate, calculate the hidden variables as follows:
For each review d: (1) calculate the aspect rating using equation (11). (2) Calculate w_d by solving the problem given in equation (13) below.

Equation (11) is simply a_{di} = \sum_{j = 1}^3 \alpha_j a_{dij} and (13) is using the MAP estimation for the log of the joint distribution of Pr(r_d, w_d):

f_{MAP}(d) = \log p(w_d | \mu, \Sigma) p(r_d | \sum_{i = 1}^K (w_{di} \sum_{j = 1}^3(\alpha_j a_{dij})), \delta^2)

find the w_d which maximize f_{MAP}(d).

The weights w_d, should sum to one, so a simplex might be appropriate. I assumed, however, that this increases the complexity of the model and is not critical in the first place.

Is there anything else I can to in order to improve the efficiency of the sampling?

Step (1) is presumably the E-step where the aspect rating is whatever’s getting marginalized. In some simple settings the expectations can just be plugged in and used as values.

It actually reduces the complexity of the model by one degree of freedom. A K-simplex involves only K - 1 unconstrained parameters. It adds some computation, but extra computation is almost always worthwhile if the model’s simplified.

1 Like