I’ve been experimenting with a changepoint model for stylometry, heavily based on the changepoint model in the Stan User’s Guide and on this article by Riba and Ginebra, which aims to detect a known but debated change of authorship in the 15th-century novel Tirant lo Blanc by Joanot Martorell.
The basic stylometric approach used in the article relies on the frequency of short words in the text, which is known to act as a stylistic ‘fingerprint’ distinguishing one author’s writing from another’s.
I’ve implemented the model in Stan, adapting the changepoint example in the User’s Guide to use a multinomial likelihood over word-length frequencies. Applied to Tirant lo Blanc, the Stan version successfully reproduces the expected changepoint somewhere around Chapter 371.
To test a counterexample where a changepoint isn’t expected, I’ve also run the model on Mary Shelley’s Frankenstein. In this case the model fits successfully but gives a far less conclusive set of results, with many possible changepoints, each at low probability. That doesn’t seem particularly surprising, so I’m happy with it.
The purpose of this experiment, though, was ultimately to apply it to the Voynich Manuscript. I’ve run the model on that data, but in this case I’m getting some extremely high \hat{R} values and a definite lack of mixing between the chains when I inspect the traceplots. I started off with a ‘flat’ Dirichlet prior with all \alpha values set to 1, as used for the test models; I’ve toyed with this to some extent to express the expected shape of the multinomial of word-length frequencies in natural language, but it doesn’t seem to help much.
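For concreteness, the kind of informative \alpha I mean might look something like the sketch below, written as a base shape times a concentration. The base proportions and the value of kappa are purely illustrative (and assume ten word-length categories), not taken from any real corpus:

transformed data {
  // Hypothetical informative Dirichlet prior: a base word-length distribution
  // scaled by a concentration kappa. A larger kappa pulls theta_e and theta_l
  // more strongly towards base_shape; the flat prior I used corresponds to
  // every alpha value being 1.
  real<lower=0> kappa = 20;
  vector[10] base_shape =
      [0.05, 0.17, 0.21, 0.19, 0.14, 0.10, 0.07, 0.04, 0.02, 0.01]';
  vector[10] alpha_informative = kappa * base_shape;
}

In the model below, something like alpha_informative would simply replace the alpha vector that is currently passed in as data.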
I’m now a little out of ideas. My naive suspicion is that there simply isn’t enough text in the Voynich data for this kind of analysis. If that’s the case, is there anything I can do to improve things? I saw some suggestions to use a prior other than a Dirichlet on the multinomial parameters, but couldn’t find any good examples or advice on that; I’ve sketched one possibility after the model code below.
More generally, this relates to knowing when to ‘give up’ on a model. The Stan output suggests that running more iterations may help, but I don’t have a good feel for when I’d be throwing good computation after bad. (Tens of thousands of iterations? Hundreds of thousands?) I know there are no strictly general rules, but I’d love to hear rules of thumb! Any advice or help on this front would be greatly appreciated!
I’ve reproduced the Stan code for the multinomial changepoint model below; the full code, with the .r files and associated data, is on GitHub here: https://github.com/WeirdDataScience/weirddatascience/tree/master/voynich04-changepoint
data {
  int<lower=1> num_obs;                 // Number of observations (rows/pages) in data.
  int<lower=1> num_cats;                // Number of word-length categories in data.
  int<lower=0> y[num_obs, num_cats];    // Counts: y[i, k] is the count of category k in row i.
  vector<lower=0>[num_cats] alpha;      // Dirichlet prior (pseudo-count) values.
}
transformed data {
  // Uniform prior across all possible changepoint locations.
  real log_unif = -log(num_obs);
}
parameters {
  // Two sets of multinomial parameters:
  // one (early) for before the changepoint, one (late) for after it.
  simplex[num_cats] theta_e;
  simplex[num_cats] theta_l;
}
transformed parameters {
  // This uses dynamic programming to reduce runtime from quadratic to linear in num_obs.
  // See <https://mc-stan.org/docs/2_19/stan-users-guide/change-point-section.html>
  vector[num_obs] log_p;
  {
    vector[num_obs + 1] log_p_e;
    vector[num_obs + 1] log_p_l;
    log_p_e[1] = 0;
    log_p_l[1] = 0;
    for (i in 1:num_obs) {
      log_p_e[i + 1] = log_p_e[i] + multinomial_lpmf(y[i] | theta_e);
      log_p_l[i + 1] = log_p_l[i] + multinomial_lpmf(y[i] | theta_l);
    }
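    // For a changepoint at row s, the marginal log probability of the data is
    //   log_unif + log_p_e[s] + (log_p_l[num_obs + 1] - log_p_l[s]),
    // i.e. rows 1..s-1 are drawn from theta_e and rows s..num_obs from theta_l.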
    log_p = rep_vector(log_unif + log_p_l[num_obs + 1], num_obs)
            + head(log_p_e, num_obs)
            - head(log_p_l, num_obs);
  }
}
model {
  // Priors
  theta_e ~ dirichlet(alpha);
  theta_l ~ dirichlet(alpha);
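  // Marginalise the discrete changepoint location out of the likelihood by
  // log-sum-exp over all candidate locations (as in the User's Guide example).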
  target += log_sum_exp(log_p);
}
generated quantities {
  simplex[num_obs] changepoint_simplex;   // Simplex of locations for the changepoint.
  // Convert the log posterior over locations to a simplex of probabilities.
  changepoint_simplex = softmax(log_p);
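  // A posterior draw of the changepoint location itself could also be produced
  // here, e.g. by declaring an integer alongside changepoint_simplex and
  // setting it with categorical_logit_rng(log_p), as in the User's Guide example.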
}
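On the ‘different prior’ point above, one possibility I’ve seen mentioned is to drop the Dirichlet and instead parameterise each simplex as the softmax of unconstrained log-odds with normal priors (a logistic-normal-style prior). Below is a minimal sketch of how that might look here, showing only the blocks that change; the names eta_e_raw and eta_l_raw and the prior scale of 1.5 are arbitrary choices of mine, and the data, transformed data, log_p computation, and generated quantities stay exactly as above.

parameters {
  // num_cats - 1 free log-odds per segment; the last category is pinned at 0
  // so that the softmax is identified.
  vector[num_cats - 1] eta_e_raw;
  vector[num_cats - 1] eta_l_raw;
}
transformed parameters {
  simplex[num_cats] theta_e = softmax(append_row(eta_e_raw, 0));
  simplex[num_cats] theta_l = softmax(append_row(eta_l_raw, 0));
  // ... the log_p dynamic-programming code is unchanged from above ...
}
model {
  // Weakly informative priors on the log-odds; a tighter scale shrinks both
  // segments towards a uniform distribution over word-length categories.
  eta_e_raw ~ normal(0, 1.5);
  eta_l_raw ~ normal(0, 1.5);
  target += log_sum_exp(log_p);
}

With this parameterisation the concentration of the prior is controlled by the normal scale rather than by Dirichlet pseudo-counts, and the prior could be centred on a non-uniform word-length distribution by giving the log-odds non-zero prior means.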