I’m trying to build a Stan model to analyse the rankings produced by recruitment panels who are reviewing different types of CVs. I want to look at the effect of CV format - but first I am trying to get a simple model that takes the ranking data from panel members and estimates the latent ranking of the candidates. The overall research project (Action Research on Research Culture - Bennett Institute for Public Policy) is looking at scientific recruitment and whether changing the format of CVs (eg Résumé for Research and Innovation (R4RI) guidance – UKRI) affects the shortlisting.
I have based the code on the Thurstonian model post and have looked at the other posts on Thurstonian modelling (I have added more comments to help me understand the code, or to show where I haven't understood it…).
The model nearly works, but for most data sets two of the candidates are flipped in order. These are not necessarily candidates who are adjacent in the ranking - e.g. the 7th and 10th candidates might be flipped - so I don't think the flipping is caused by error in the estimation. Different pairs of candidates are flipped depending on the seed used for the randomly assigned abilities of the candidates.
I would welcome suggestions on how to improve the model so that it returns the correct ordering of candidates from the dummy data. Many thanks for your expertise.
The code is split into three chunks because it is in a Quarto document. I explain the chunks below, and at the end include one R script that contains all the code.
Below is the R code to generate the synthetic data for 10 candidates (labelled a-j) assessed by 4 recruitment panel members (labelled w-z). It can add measurement error for each panel member (panel_sc_sd), but currently that is set to zero, so all the recruitment panel members supply the same ranking and I am simply trying to recover that ranking.
The ranking is slightly counter-intuitive (when thinking about ranking job candidates) in that the lower the ability score, the better/lower the rank - but as this is the default sorting for the mathematical functions, I have kept that convention to avoid confusion.
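As a tiny illustration of that convention (my own check, with made-up values): rank() gives rank 1 to the smallest score, and order() returns the indices that sort the scores ascending.
# Tiny illustration of the rank/order convention used below (made-up ability scores for three candidates)
abilities <- c(9.1, 2.5, 4.7)
rank(abilities)   # 3 1 2 - the lowest ability gets rank 1
order(abilities)  # 2 3 1 - the indices that put the lowest ability first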
pacman::p_load("tidyverse", "rstan", "reprex") # Package loading here to make it a reproducible example
set.seed(573) # Standard seed is 573
n <- 10 # Number of candidates
p <- 4 # Number of recruiting panel members
candidate_details <- tibble(name = letters[1:n], # Generate letter names for each candidate taken from start of alphabet
sc_ability = runif(n, min = 0, max = 10), # Generate scientific ability for each candidate by random sampling
sc_rank = rank(sc_ability), # Find the rank of each candidate (the lowest ability gets rank 1)
sc_order = order(sc_ability)) # Find the order in which the candidates have to be selected to sort by ascending ability, ie the first element is the index of the candidate with the lowest ability
panel_sc_sd <- 0.0 # Standard deviation of recruiters estimate of sc_ability
panel_names <- letters[(27-p):26] # Generate letter names for each member of the recruiting panel taken from end of alphabet
# Define function to produce panel estimates of scores
find_member_scores <- function(sc_ability){
sc_ability + rnorm(length(sc_ability), mean = 0, sd = panel_sc_sd)
}
# Add panel member latent scores
for(panel_member in seq_along(panel_names)){ # * TO DO * It would be great to replace this with purrr calls (one possible sketch is shown after this chunk)
panelist <- paste0("p_sc_", panel_names[panel_member])
candidate_details <- candidate_details %>%
mutate({{panelist}} := find_member_scores(sc_ability))
}
# Work out how each panel member ranks the candidates
candidate_details <- candidate_details %>%
mutate(across(starts_with("p_sc"), ~ dense_rank(.), .names = "rank_{.col}"))
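For the TODO above about replacing the loop with purrr: here is one sketch that should build the same p_sc_* columns as a replacement for the loop (not to be run in addition to it). It is untested against the rest of the pipeline and reuses the panel_names vector and find_member_scores() function defined above.
# Sketch: one column of latent scores per panel member, built with purrr instead of the loop
panel_score_cols <- panel_names %>%
  purrr::set_names(paste0("p_sc_", .)) %>%                            # name each element p_sc_w, p_sc_x, ...
  purrr::map(~ find_member_scores(candidate_details$sc_ability)) %>%  # one vector of (noisy) scores per member
  tibble::as_tibble()
candidate_details <- dplyr::bind_cols(candidate_details, panel_score_cols)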
The Stan model, based on the Thurstonian models discussed in the Stan forums, is:
//
// This Stan program is the simpler working model from this thread of discussion: https://discourse.mc-stan.org/t/thurstonian-model/1735
// It does not allow sigma to vary between raters
data {
int<lower=0> K; // number of items being ranked
int<lower=0> J; // number of raters
int<lower=1,upper=K> y[J,K]; // *1* y is a 2-D array of ranks; y[j] is the row of ranks that rater j gave to the candidates
real<lower = 0> mu_sd_prior; // the standard deviation of the prior for the mu values (the latent candidate scores)
}
transformed data{
int y_argsort[J, K]; // for each rater, the candidate indices in rank order, so entry 1 is 2 if that rater ranked the 2nd candidate 1st
for (i in 1:J){
y_argsort[i] = sort_indices_asc(y[i]); // sort_indices_asc returns the indices that sort y[i] ascending, ie the candidates in rank order (see *1*)
}
print("sorted arguments:", y_argsort);
}
parameters {
vector[K-1] mu_hat; // the free mu values for K-1 items, used to construct the full mu vector below; the construction identifies the model by ensuring all the mu's sum to 0
//real<lower=0> sigma[J]; // This is the prior for allowing sigma to vary by rater.
ordered[K] z[J]; // an array of J ordered vectors of length K: z[j] holds the latent values that rater j implicitly sampled, one per position in their ranking (smallest first)
// ------K-----
// | z z z z z
// J z z z z z
// | z z z z z
// using these values we then deduce the values of mu (the latent score of each applicant) that gave rise to the values in z
}
transformed parameters{ // this builds a vector of K mu values constrained to sum to zero to identify the model (otherwise the mu's could be translated along the number line without changing the likelihood)
vector[K] mu;
mu = append_row(mu_hat, -sum(mu_hat)); // Initial attempt at a centring (sum-to-zero) constraint - this doesn't constrain the scale, but neither did the previous approach
}
model {
mu_hat ~ normal(0, mu_sd_prior); // vectorised prior on mu_hat (and therefore also on mu)? Adrian's model uses a Beta distribution; does that help identify the model by limiting the mu's to 0-1?
//sigma ~ cauchy(0, 1);
for (j in 1:J){ // for each rater
z[j] ~ normal(mu[y_argsort[j]], 1); // The intention was for this to be: z[j] ~ normal(mu[y_argsort[j]], sigma[j]); but the author couldn't make the model identified like this
// mu[y_argsort[j]] sorts the mu's into the order that they were rated by rater J and, I think, because z is defined as ordered this provides the constraint that
// the values must be in the order that rater J put them.
}
}
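To check my reading of the transformed data block: as far as I understand, sort_indices_asc() in Stan behaves like order() in R, so applying it to one rater's row of ranks gives the candidate indices in rank order. A small R illustration with made-up ranks:
# Made-up ranks from one rater for 5 candidates: candidate 3 was ranked 1st, candidate 1 was ranked 2nd, ...
ranks_one_rater <- c(2, 4, 1, 5, 3)
order(ranks_one_rater)  # 3 1 5 2 4 - candidate indices in rank order, which is what y_argsort[j] should hold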
And here is the R code to prepare the simulated data, run the model, and display the flipped ranking. The Stan code sits in its own Quarto chunk, which is compiled with output.var = "model_1", so the R code samples from the compiled model object with sampling() rather than stan(). I've added a version of the code that can be copied and pasted directly at the end of the post.
# Prepare data for Stan
# Select the ranking rows from candidate_details data frame, make into a matrix and transpose
panel_ranking_data <- candidate_details %>%
select(starts_with("rank")) %>%
as.matrix() %>%
t()
mu_sd_prior <- 0.5
data_list <- list(
K = n,
J = p,
y = panel_ranking_data,
mu_sd_prior = mu_sd_prior
)
fit1 <- sampling(
model_1, # Stan program object from previous chunk
data = data_list, # named list of data
chains = 4, # number of Markov chains
warmup = 1000, # number of warmup iterations per chain
iter = 2000, # total number of iterations per chain
cores = 4, # number of cores (could use one per chain)
refresh = 100, # progress shown
verbose = FALSE # TRUE gives more logging
)
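# Basic fit checks before comparing ranks, in case the flipping is a sampling rather than a model problem
rstan::check_hmc_diagnostics(fit1)                   # reports divergences, max treedepth and energy warnings
max(summary(fit1)$summary[, "Rhat"], na.rm = TRUE)   # largest Rhat across parameters - should be close to 1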
# Get a list of mu estimates and their ranks so can put it into the candidates details frame for comparisons
est_ranks <- summary(fit1)$summary %>% # * TO DO * This feels very clunky - surely there is an easier way to get a data frame of parameters (one alternative is sketched after this chunk)
as.data.frame() %>% # If I change this to as_tibble I seem to lose the parameter names
mutate(parameter = rownames(.)) %>%
select(parameter, everything()) %>%
as_tibble() %>% # Convert to a tibble now that the parameter names are stored in their own column (as_data_frame() is deprecated)
filter(str_detect(parameter, "mu\\[")) %>%
mutate(name = letters[as.numeric(str_extract(parameter, "(?<=\\[).+?(?=\\])"))]) %>% # Extract the number of the mu parameter ie candidate from the parameter name
mutate(est_sc_rank = dense_rank(mean)) %>%
mutate(est_sc_order = order(mean)) %>%
select(name, mean, est_sc_rank, est_sc_order)
candidate_details <- candidate_details %>%
left_join(est_ranks, by = "name")
ggplot(candidate_details, aes(x = sc_order, y = est_sc_order)) +
geom_point() +
xlim(0, 10) +
ylim(0, 10)
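For the TODO above about extracting the parameter summaries, here is one possibly tidier sketch (not checked against the rest of the pipeline): it keeps the parameter names with tibble::rownames_to_column() and pulls the candidate index out of mu[...] with readr::parse_number().
# Sketch: pull the mu[...] rows from the fit summary in a single pipe
est_ranks_alt <- summary(fit1)$summary %>%
  as.data.frame() %>%
  tibble::rownames_to_column("parameter") %>%                  # keep the parameter names as a column
  dplyr::filter(stringr::str_detect(parameter, "mu\\[")) %>%   # keep only the mu parameters
  dplyr::mutate(name = letters[readr::parse_number(parameter)], # candidate letter from the index inside mu[...]
                est_sc_rank = dplyr::dense_rank(mean),
                est_sc_order = order(mean)) %>%
  dplyr::select(name, mean, est_sc_rank, est_sc_order)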
Here is all the code combined into one R script:
pacman::p_load("tidyverse", "rstan", "reprex") # Package loading here to make it a reproducible example
set.seed(573) # Standard seed is 573
n <- 10 # Number of candidates
p <- 4 # Number of recruiting panel members
candidate_details <- tibble(name = letters[1:n], # Generate letter names for each candidate taken from start of alphabet
sc_ability = runif(n, min = 0, max = 10), # Generate scientific ability for each candidate by random sampling
sc_rank = rank(sc_ability), # Find the rank of each candidate (the lowest ability gets rank 1)
sc_order = order(sc_ability)) # Find the order in which the candidates have to be selected to sort by ascending ability, ie the first element is the index of the candidate with the lowest ability
panel_sc_sd <- 0.0 # Standard deviation of recruiters estimate of sc_ability
panel_names <- letters[(27-p):26] # Generate letter names for each member of the recruiting panel taken from end of alphabet
# Define function to produce panel estimates of scores
find_member_scores <- function(sc_ability){
sc_ability + rnorm(length(sc_ability), mean = 0, sd = panel_sc_sd)
}
# Add panel member latent scores
for(panel_member in seq_along(panel_names)){ # * TO DO * It would be great to replace this with Purrr calls
panelist <- paste0("p_sc_", panel_names[panel_member])
candidate_details <- candidate_details %>%
mutate({{panelist}} := find_member_scores(sc_ability))
}
# Work out how each panel member ranks the candidates
candidate_details <- candidate_details %>%
mutate(across(starts_with("p_sc"), ~ dense_rank(.), .names = "rank_{.col}"))
model_1_text <- "
//
// This Stan program is the simpler working model from this thread of discussion: https://discourse.mc-stan.org/t/thurstonian-model/1735
// It does not allow sigma to vary between raters
data {
int<lower=0> K; // number of items being ranked
int<lower=0> J; // number of raters
int<lower=1,upper=K> y[J,K]; // *1* y is a 2-D array of ranks; y[j] is the row of ranks that rater j gave to the candidates
real<lower = 0> mu_sd_prior; // the standard deviation of the prior for the mu values (the latent candidate scores)
}
transformed data{
int y_argsort[J, K]; // for each rater, the candidate indices in rank order, so entry 1 is 2 if that rater ranked the 2nd candidate 1st
for (i in 1:J){
y_argsort[i] = sort_indices_asc(y[i]); // sort_indices_asc returns the indices that sort y[i] ascending, ie the candidates in rank order (see *1*)
}
}
parameters {
vector[K-1] mu_hat; // the free mu values for K-1 items, used to construct the full mu vector below; the construction identifies the model by ensuring all the mu's sum to 0
//real<lower=0> sigma[J]; // This is the prior for allowing sigma to vary by rater.
ordered[K] z[J]; // an array of J ordered vectors of length K: z[j] holds the latent values that rater j implicitly sampled, one per position in their ranking (smallest first)
// ------K-----
// | z z z z z
// J z z z z z
// | z z z z z
// using these values we then deduce the values of mu (the latent score of each applicant) that gave rise to the values in z
}
transformed parameters{ // this builds a vector of K mu values constrained to sum to zero to identify the model (otherwise the mu's could be translated along the number line without changing the likelihood)
vector[K] mu;
mu = append_row(mu_hat, -sum(mu_hat)); // Initial attempt at a centring (sum-to-zero) constraint - this doesn't constrain the scale, but neither did the previous approach
}
model {
mu_hat ~ normal(0, mu_sd_prior); // vectorised prior on mu_hat (and therefore also on mu)? Adrian's model uses a Beta distribution; does that help identify the model by limiting the mu's to 0-1?
//sigma ~ cauchy(0, 1);
for (j in 1:J){ // for each rater
z[j] ~ normal(mu[y_argsort[j]], 1); // The intention was for this to be: z[j] ~ normal(mu[y_argsort[j]], sigma[j]); but the author couldn't make the model identified like this
// mu[y_argsort[j]] sorts the mu's into the order that they were rated by rater J and, I think, because z is defined as ordered this provides the constraint that
// the values must be in the order that rater J put them.
}
}
"
# Prepare data for Stan
# Select the ranking rows from candidate_details data frame, make into a matrix and transpose
panel_ranking_data <- candidate_details %>%
select(starts_with("rank")) %>%
as.matrix() %>%
t()
mu_sd_prior <- 0.5
data_list <- list(
K = n,
J = p,
y = panel_ranking_data,
mu_sd_prior = mu_sd_prior
)
fit1 <- stan(
model_code = model_1_text, # Stan program as character vector
data = data_list, # named list of data
chains = 4, # number of Markov chains
warmup = 1000, # number of warmup iterations per chain
iter = 2000, # total number of iterations per chain
cores = 4, # number of cores (could use one per chain)
refresh = 100, # progress shown
verbose = FALSE # TRUE gives more logging
)
# Get a list of mu estimates and their ranks so can put it into the candidates details frame for comparisons
est_ranks <- summary(fit1)$summary %>% # * TO DO * This feels very clunky - surely there is an easier way to get a data frame of parameters
as.data.frame() %>% # If I change this to as_tibble I seem to lose the parameter names
mutate(parameter = rownames(.)) %>%
select(parameter, everything()) %>%
as_tibble() %>% # Convert to a tibble now that the parameter names are stored in their own column (as_data_frame() is deprecated)
filter(str_detect(parameter, "mu\\[")) %>%
mutate(name = letters[as.numeric(str_extract(parameter, "(?<=\\[).+?(?=\\])"))]) %>% # Extract the number of the mu parameter ie candidate from the parameter name
mutate(est_sc_rank = dense_rank(mean)) %>%
mutate(est_sc_order = order(mean)) %>%
select(name, mean, est_sc_rank, est_sc_order)
candidate_details <- candidate_details %>%
left_join(est_ranks, by = "name")
ggplot(candidate_details, aes(x = sc_order, y = est_sc_order)) +
geom_point() +
xlim(0, 10) +
ylim(0, 10)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.3 (2023-03-15)
#> os macOS Big Sur ... 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/London
#> date 2023-04-13
#> pandoc 2.19.2 @ /Applications/RStudio - 2023.03.0-386.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
#> digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.0)
#> evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#> knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.0)
#> rmarkdown 2.18 2022-11-09 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────