Help with Multinomial Changepoint Model

I’ve been experimenting with a changepoint model for stylometry, based heavily on the changepoint model in the User’s Guide and on this article by Riba and Ginebra, which aims to detect a known but debated change in authorship in the 15th-century novel Tirant lo Blanc by Joanot Martorell.

The basic stylometric approach used in the article relies on the frequency of short words in the text, which is known to act as a stylistic ‘fingerprint’ to distinguish between authors’ writing styles.

I’ve implemented the model in Stan, adapting the changepoint example in the User’s Guide to a multinomial representing word frequencies. Applied to Tirant lo Blanc, the Stan version successfully reproduces the expected changepoint somewhere around Chapter 371.

To test a counter-example where a changepoint isn’t expected, I’ve also run the model on Mary Shelley’s Frankenstein. In this case, the model fits successfully, but provides a far more inconclusive set of results, with many possible changepoints each with lower probability. That doesn’t seem particularly surprising, so I’m happy with that.

The purpose of this experiment, though, was ultimately to apply it to the Voynich Manuscript. I’ve run the model on that data, but in this case I’m getting some extremely high \hat{R} values, and a definite lack of mixing between the chains when I assess the traceplot. I started off with a ‘flat’ Dirichlet prior with the \alpha values at 1, as used for the test models; I’ve toyed with this to some extent to express the expected shape of the multinomial of word-length frequencies in natural language, but this doesn’t seem to be of much help.

I’m now a little out of ideas. My naive suspicion is that there isn’t enough text in the Voynich data for this kind of analysis. If that’s the case, is there anything I can do to improve things? I saw some suggestions to use a different prior on the multinomial than a Dirichlet, but couldn’t find any good examples or advice on that.

More generally, this relates to knowing when to ‘give up’ on a model. The Stan output suggests that running more iterations may help, but I don’t have a good feeling for when I might be throwing good computation after bad. (Tens of thousands of iterations? Hundreds of thousands?) I know there aren’t strictly general rules, but I’d love to hear rules of thumb! Any advice or help on this front would be greatly appreciated!

I’ve reproduced the Stan code for the multinomial changepoint model in this post, but I’ve uploaded the full code, with .R files and associated data, to GitHub here: https://github.com/WeirdDataScience/weirddatascience/tree/master/voynich04-changepoint


data {

	int<lower=0> num_obs;					// Number of observations (rows/pages) in data.
	int<lower=0> num_cats;					// Number of categories in data.
	int y[num_obs, num_cats];  			// Matrix of observations.

	vector<lower=0>[num_cats] alpha;		// Dirichlet prior values.

}

transformed data {

	// Uniform prior across all time points for changepoint.
	real log_unif;
	log_unif = -log(num_obs);

}

parameters {

	// Two sets of parameters.
	// One (early) before changepoint, one (late) for after.
	simplex[num_cats] theta_e;
	simplex[num_cats] theta_l;

}

transformed parameters {

	// This uses dynamic programming to reduce runtime from quadratic to linear in num_obs.
	// See <https://mc-stan.org/docs/2_19/stan-users-guide/change-point-section.html>
	vector[num_obs] log_p;
	{
		vector[num_obs + 1] log_p_e;
		vector[num_obs + 1] log_p_l;

		log_p_e[1] = 0;
		log_p_l[1] = 0;

		for( i in 1:num_obs ) {
			log_p_e[i + 1] = log_p_e[i] + multinomial_lpmf(y[i,] | theta_e );
			log_p_l[i + 1] = log_p_l[i] + multinomial_lpmf(y[i,] | theta_l );
		}

		log_p =
			rep_vector( log_unif + log_p_l[num_obs + 1], num_obs ) +
			head( log_p_e, num_obs ) - head( log_p_l, num_obs );
	}
}

model {

	// Priors
	theta_e ~ dirichlet( alpha );
	theta_l ~ dirichlet( alpha );

	target += log_sum_exp( log_p );

}

generated quantities {

	simplex[num_obs] changepoint_simplex;	// Simplex of locations for changepoint.

	// Convert the log posterior to a simplex. 
	changepoint_simplex = softmax( log_p );

}

Are you getting a multimodal posterior?

What do plots of vector[num_obs] log_p look like for different chains?

Thanks for the quick reply. I’m almost sure that I am getting a multimodal posterior.

Sorry for the ignorance, but how exactly would I construct that plot?

I’ve had a play, and I think that this is the appropriate plot. (At least, it’s the one that makes most sense to me.) I’ve included the code used to generate it, in case I’ve misinterpreted.

Plot code
library( tidyverse )
library( magrittr )

library( rstan )

library( ggthemes )

# Read fit object
voynich_multinomial_fit <-
	readRDS( "work/multinomial_changepoint_voynich_fit.rds" )

# Extract per-chain draws of log_p into a long tibble.
log_p_data <-
	as.data.frame( rstan::extract( voynich_multinomial_fit, pars="log_p", permuted=FALSE ) ) %>%
	as_tibble() %>%
	pivot_longer( cols=everything(),
		names_to=c("chain", "folio"),
		names_pattern="chain:(.)\\.log_p\\[(.*)\\]",
		values_to="log_p" ) %>%
	mutate( folio = as.numeric( folio ) )

# Plot
gp_log_p <-
	ggplot( log_p_data, aes( x=folio, y=log_p, colour=chain ) ) +
	geom_point() +
	labs( x="Folio", y="log_p" ) +
	theme_few() +
	scale_colour_brewer( palette="Dark2" ) +
	theme( axis.title.x = element_text( angle=0 ),
		axis.title.y = element_text( angle=90 ) )

ggsave( "output/voynich_multinomial_fit-chains_log_p_plot.png",
	plot=gp_log_p, width=16, height=9 )

I have a sneaking suspicion that Chain 4 shouldn’t look like that.

Hmm, hard to think about this on a log scale. Can you plot exp(log_p)?
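
E.g., reusing the log_p_data tibble from your plot code, maybe something like this (just a sketch; the raw log_p values will underflow to zero if exponentiated directly, so I’d rescale by the per-chain maximum first):

# Rescale within each chain before exponentiating, to avoid underflow.
p_data <-
	log_p_data %>%
	group_by( chain ) %>%
	mutate( p = exp( log_p - max( log_p ) ) ) %>%
	ungroup()

gp_p <-
	ggplot( p_data, aes( x=folio, y=p, colour=chain ) ) +
	geom_point() +
	labs( x="Folio", y="exp(log_p) (rescaled per chain)" ) +
	theme_few() +
	scale_colour_brewer( palette="Dark2" )

ggsave( "output/voynich_multinomial_fit-chains_exp_log_p_plot.png", plot=gp_p, width=16, height=9 )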

Oops! Sorry for not responding. I was sort of asking to reproduce this analysis for the new book:

I was assuming this was done graphically?

Oh, you mean the plot for the changepoint probabilities? I’m just plotting across the simplex generated from log_p in the generated quantities block. So, for the Tirant lo Blanc analysis the output looks like this:

If that’s the right sort of thing, I can definitely split that across chains for the problematic model.

(And thank you for replying! Given the level of expertise and discussion here, I find it quite intimidating to ask questions at all.)

Yeah that’s the plot I’m curious about.

You’re always welcome to ask questions here! If we had all the answers we’d just write it in the manual and we wouldn’t need forums! Questions are the point!

Also, overestimate our expertise at your own risk: a lot of what I say here is guesswork :D.

Just IMHO, but if you suspect there isn’t enough data, then reducing the number of categories in your multinomials may help, i.e. just make them coarser.
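
E.g., something like this rough sketch, assuming your counts sit in a num_obs x num_cats matrix y (names here are hypothetical):

# Coarsen the multinomial by merging word-length categories,
# e.g. lengths 1 and 2 into a single category.
# Assumes `y` is a num_obs x num_cats matrix of word-length counts.
y_coarse <- cbind( y[, 1] + y[, 2], y[, -(1:2)] )

stan_data <- list(
	num_obs  = nrow( y_coarse ),
	num_cats = ncol( y_coarse ),
	y        = y_coarse,
	alpha    = rep( 1, ncol( y_coarse ) )	# Flat Dirichlet on the reduced categories.
)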

Thanks! That’s good advice, and I may try that as a last resort. Given that the model is looking at frequencies of usage of short words, though, joining categories might defeat the point of the model.

True but maybe worth trying quickly because it may confirm your suspicion that the parameter space is too big for the data.

The question then becomes a slightly different one because the problem becomes more clear.


Here’s the chain-by-chain breakdown:

This is, really, not at all surprising: a suspicion of one changepoint, with an alternative that there is no changepoint, shown by the concentration at the last couple of pages.

My confusion, really, is what I can ‘do’ with this output. My understanding has always been that if the chains don’t mix then the validity of all chains is in doubt. Looking at this, I’m wary of the interpretation that these could both be possibilities to report.

Appreciated! :)

If the model isn’t fitting as well as you’d like, then it’s investigation time.

Maybe start with what are the differences between the thetas. Like, there’ll presumably be the theta_e/theta_l values for the chains where the changepoint happens earlier, but also the theta_e for the chains where the changepoint is right at the end.

Are they the same? Are they different? Does that give any clues about what happens?
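
Something like this might be a start (a rough sketch, assuming the fit object is voynich_multinomial_fit as in your plot code):

# Per-chain posterior means of theta_e and theta_l, to compare across chains.
theta_draws <- rstan::extract( voynich_multinomial_fit,
	pars=c( "theta_e", "theta_l" ),
	permuted=FALSE )

# Average over iterations (dim 1), keeping chains (dim 2) and parameters (dim 3).
theta_chain_means <- apply( theta_draws, c( 2, 3 ), mean )
round( theta_chain_means, 3 )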

Maybe try more or fewer categories, if that’s an easy thing to scale up or down, just to see what happens.

I’m getting a nice set of \hat{R}'s of 1 across all parameters when I combine words of length 1 and 2 into one category (although that effectively makes this a binomial). The result being that no changepoint is detected.

This idea, though, and the tip to look at the \theta_e and \theta_l values, is enough for me to chew on for a little while. Thanks for the help so far, @bbbales2 and @emiruz, and I’ll be back either when I see something working satisfactorily or when I’ve run out of ideas again!

Here are some more thoughts (IMHO as usual). It may be that your model is identifiable only (1) when there is a change point or (2) by accident. I think if you draw the words at random such that there is no change point, you may have an identifiability issue, because nothing prevents simple label switching between your theta parameters. I.e. maybe it’s not lack of data, but just lack of a change point plus identifiability problems.

I’d suggest trying something that prevents label switching from being likely. For example, just to test, you could try giving your thetas different priors so that they are not equivalent.
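
As a rough sketch (all names hypothetical: this assumes your count matrix is y, and that you edit the Stan model to take separate alpha_e and alpha_l data vectors, with theta_l ~ dirichlet( alpha_l ) in the model block):

# Break the symmetry between theta_e and theta_l with different priors.
# Assumes the Stan model now declares alpha_e and alpha_l in its data block.
stan_data <- list(
	num_obs  = nrow( y ),
	num_cats = ncol( y ),
	y        = y,
	alpha_e  = rep( 1, ncol( y ) ),                 # Flat prior for the early segment.
	alpha_l  = seq( 2, 1, length.out=ncol( y ) )    # Mildly asymmetric prior for the late segment.
)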

I actually did some tests with random word assignments earlier in this process and, in general, my memory is that I was finding a “changepoint” detected either in the first few pages or in the last page or so. That seemed, to me, to be the expected behaviour when there wasn’t a changepoint. (I believe that the Riba and Ginebra article suggests that this would be the case.)

My latest run of the model combines, based on your suggestion, the category for words of length 1 and 2, and also includes words up to length 4. (For some reason, I seem to have mistakenly removed words of length 4 in earlier runs.)

I also gave the Dirichlet prior \alpha's of 0.6, with the intention that the multinomial distribution should tend towards a slight dominance for one category over the others. (This is based on the distribution of word length frequencies expected in natural languages, and seen in the Voynich, which I discuss in this post: https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/)

With these changes, I’m now again getting a nicely mixed set of chains, and a single changepoint detected around Folio 30, in a similar place to the earlier, multimodal plots:

I’m going to carry on playing with this, but for now I’m tentatively happy that the model is meeting some of my expectations in the first instance. (This changepoint falls in the same place as a similar analysis applied to the results of an LDA topic model rather than word-length frequencies, so I have a suspicion that there’s something going on there.)

I’d be more than happy to hear any more suggestions or caveats, but even at this stage I’m incredibly grateful for all the help!

This confuses me a bit. What if the change point genuinely is in the first few or the last few pages? How will you tell between that and a random result?

My suggestion was to give theta_e and theta_l slightly different priors so they can’t take each other’s place.

I’d suggest testing your model’s sensitivity to the prior generally. If the prior ends up being instrumental to your result, then a different prior might give you a different result…
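
A crude way to check that (just a sketch; stan_data and compiled_model here are hypothetical stand-ins for your data list and compiled stanmodel object):

# Sketch of a prior-sensitivity check: refit with a few Dirichlet
# concentrations and compare the resulting changepoint posteriors.
library( rstan )

alphas <- c( 0.5, 1, 2 )
fits <- lapply( alphas, function( a ) {
	stan_data$alpha <- rep( a, stan_data$num_cats )
	sampling( compiled_model, data=stan_data )
} )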

Definitely something I’ll think about in detail. Within reason, though, I’m not sure how strong the different priors on the \theta's would have to be to solve the problem. (If I understand it correctly, which I may not).

Thinking about it, I actually don’t really see how \theta_e and \theta_l could take each other’s place, given the model. It’s not just fitting two multinomials; it’s fitting one multinomial before a given point in the sequence of folios, and the other from that point on. Maybe I’m misunderstanding the question.

(This is probably more clearly put across in the User’s Guide Poisson example via this line:

lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);

… which explicitly puts across the before/after element.)
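
(As far as I can tell, the transformed parameters block is computing exactly this marginalization over the changepoint s with a uniform prior:

p(y \mid \theta_e, \theta_l) = \sum_{s=1}^{N} \frac{1}{N} \prod_{t < s} \text{Multinomial}( y_t \mid \theta_e ) \prod_{t \geq s} \text{Multinomial}( y_t \mid \theta_l )

where N is the number of folios, so a “changepoint” at the final folio corresponds to \theta_e explaining essentially the whole document.)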

This (perhaps mis-)understanding is why I’m not surprised to see the “non-changepoint” occurring at the final page or so. It seems, intuitively, to fit with the interpretation that a single \theta_e fits well for the majority of the document, so that \theta_l applies only to the last folio.

I’ll admit to having put a lot of faith in the changepoint model from the User’s Guide, though that doesn’t go into detail on what would happen when there isn’t a significant changepoint in the data. Very happy to hear from more experienced changepoint experts on this!

I ended up using \alpha=1 when pulling the model to pieces because it was conceptually the simplest. \alpha<1 was where I wanted the prior to be, due to the expected concentration of the multinomial for certain word lengths. In this case it doesn’t affect the location of the changepoint per se, but it does seem to help with getting the model to converge.

Say there’s no change in a text. Then for any given change point you’d expect the distribution before and after to be the same. If that’s the case then they can take each other’s place, and in that event the model isn’t identifiable (AFAIK).

Notably, in that event the change point could also be anywhere; it’d make no difference. So you’d have to test more than just where the change point is; you’d also need to test whether the multinomial distributions are significantly different from each other (IMHO).

I’m not familiar with the data. Aren’t these word counts?

So combining words of length 1 and 2 still gives counts.

Also, what’s a word of length 1? It’s not letters, is it?

Yes, this is a count of the number of words of a given length on a given page. So words of length one, in English, would obviously be “I” and “a”, for example. (Given an unknown alphabet, language, and set of symbols you could also theoretically have things like “&” and possibly even single digit numbers.)

Specifically, this is only looking at short words of length \leq4. (I had originally, and mistakenly, only used words of length \leq3, which is why combining words of length 1 and 2 resulted in only two categories.)