Search engine optimization for the Stan docs

Reposting from the thread “Where to flag (small) problems in User Guide?” (#5 by mcol), where the discussion drifted from the original topic. The main question is whether we should have docs/latest or similar as an alias to the docs for the latest version.

The discussion started with this observation by @Max_Mantei and with @Bob_Carpenter asking for suggestions on behalf of @mitzimorris

This provoked some discussion on whether we should change how we serve the docs to improve indexing.

Disclaimer: My understanding of SEO is superficial and comes from running several hobby/non-profit sites, so I may easily be wrong about things.

Some points to recap from the original discussion + my commentary:

@mitzimorris noted that

we have “unversioned” links, e.g.: “https://mc-stan.org/docs/stan-users-guide/index.html”

this redirects to latest by virtue of a script which rewrites the redirects every time we do a release.

I see two potential reasons why this proves insufficient:

  • Since people are redirected, they copy-paste the target (versioned) link when sharing on the web. So all/most of the links found around the web are to a specific version.
  • My understanding is that Google follows redirects and indexes the target of the redirect (counting all “unversioned” links as links to a specific version). This means that even if the Google crawler enters the docs via an unversioned link, it will see all further links as versioned. Also, speculatively, even if everybody used the “unversioned” links, those would count as links to older versions until the source pages and the links themselves are reindexed, which may take time.

As an alternative, we could use an alias: today, docs/latest would serve the same content as docs/2_21, except that all links within the documentation under docs/latest would also point to docs/latest, while all links under docs/2_21 would keep pointing to docs/2_21. (If all our links within the documentation are relative, this could be achieved via a single Alias directive in the Apache configuration.)
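The point about relative links can be checked with a small sketch. It assumes, purely for illustration, that pages cross-reference each other with relative hrefs such as ../reference-manual/index.html (a made-up example path) and that a docs/latest tree exists, which it does not today. Python's standard urllib.parse.urljoin resolves hrefs the same way a browser or crawler would.

```python
from urllib.parse import urljoin

# Hypothetical relative href from a User's Guide page to the Reference Manual.
RELATIVE_HREF = "../reference-manual/index.html"
# Hypothetical absolute href that hard-codes a version.
ABSOLUTE_HREF = "/docs/2_21/reference-manual/index.html"


def resolve(page_url: str, href: str) -> str:
    """Resolve an href against the URL of the page that contains it."""
    return urljoin(page_url, href)


versioned_page = "https://mc-stan.org/docs/2_21/stan-users-guide/index.html"
aliased_page = "https://mc-stan.org/docs/latest/stan-users-guide/index.html"

# Relative links stay inside whichever tree the reader entered through.
versioned_target = resolve(versioned_page, RELATIVE_HREF)
aliased_target = resolve(aliased_page, RELATIVE_HREF)

# An absolute, versioned link would leak out of the alias.
leaked_target = resolve(aliased_page, ABSOLUTE_HREF)

print(versioned_target)
print(aliased_target)
print(leaked_target)
```

So the same files served under two paths behave as two self-contained trees only as long as no link hard-codes the version.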

This can improve search results in exchange for a risk of breaking links as the docs are updated. Additionally, people would need to take care to share version-specific links when this is necessary, while now care is needed to share a link to the latest content.

@mitzimorris further noted that

Google suggests “canonical links” - would this work?
How to Specify a Canonical with rel="canonical" and Other Methods | Google Search Central

I don’t think this is the right abstraction: docs/latest is semantically different from docs/2_21, as the latter is considered final and unchanging while the former should update, so it would not be useful to tell Google to treat them (temporarily) as the same page.
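For concreteness, the canonical-link suggestion would mean each versioned page carries a tag in its head naming the unversioned URL as the preferred one. Below is a minimal sketch, assuming the docs/<version>/... layout seen in the URLs in this thread. The helper name canonical_tag is made up, and nothing here reflects how the docs are actually built.

```python
import re

# Matches a version segment such as "2_21" directly under /docs/
# (assumed URL layout, based on the links quoted in this thread).
VERSION_SEGMENT = re.compile(r"/docs/\d+_\d+/")


def canonical_tag(versioned_url: str) -> str:
    """Build the rel="canonical" tag a versioned page would carry."""
    unversioned = VERSION_SEGMENT.sub("/docs/", versioned_url, count=1)
    return f'<link rel="canonical" href="{unversioned}" />'


tag_2_21 = canonical_tag("https://mc-stan.org/docs/2_21/stan-users-guide/index.html")
tag_2_18 = canonical_tag("https://mc-stan.org/docs/2_18/stan-users-guide/index.html")

print(tag_2_21)
print(tag_2_18)
```

Every version of a page ends up declaring the same canonical URL, which is exactly the "treat them as the same page" signal questioned above.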

Final note from @mitzimorris:

perhaps it’s a third order detail, but a distinct URL for ‘latest’ would lose having the version in the URL. by contrast, in Boost, clicking on the “take me to latest link” brings me to a versioned URL - e.g. - search “boost regex” shows 1_66 -
“https://www.boost.org/doc/libs/1_66_0/libs/regex/doc/html/index.html”
clicking on the “take me to latest” links takes me to:
“https://www.boost.org/doc/libs/1_71_0/libs/regex/doc/html/index.html”

It is true that we will lose the version from most URLs, and there is a tradeoff. When I read an old blog post that links to our docs, do I want to see the latest version or the then-latest version? (Hard to say IMHO.) And do we care about the latest docs ranking high in Google results?

In the wild, I’ve seen both approaches. Elm, for example, has latest as an alias; see elm-geometry 4.0.0. The version appears in the URL only after you explicitly click a specific version. Googling for “elm geometry” shows me the latest link on top, which brings me to version 3.1.0, even though it was released just a few days ago. The Google result actually has the wrong title (it says version 3.0.0), so the page has not been reindexed since the 3.1.0 release.

As Mitzi notes, Boost doesn’t do this; they have release as a redirect to 1.71 (as of this writing), and when I search “boost regex” the top link is for the 1.66 version.

Also related, @nhuurre noted that

When I look at Stan Reference Manual page I don’t see any indication that it’s old. The Overview page says version 2.18 but doesn’t tell how recent that is or where to find the other versions. Shouldn’t the old docs link to the latest version?

And @mitzimorris let us know this is being worked on (bookdown header for old version of docs should have links to latest. · Issue #106 · stan-dev/docs · GitHub), and this is what it looks like:

Which I think is great for both redirect/alias approaches!

Surprisingly enough to me, googling “stan rep_matrix” today gave me https://mc-stan.org/docs/2_21/functions-reference/matrix-broadcast.html (the current version of the functions reference!) as first link. Did something already change? (I don’t want to distract from the broader discussion, though.)

I think that how stale the results get depends on how frequently the doc page is used and linked to over time. So, e.g., “stan inverse wishart distribution” gives me 2.18 on top while “stan wishart distribution” gives me 2.21 on top, presumably because people link to Wishart more than to Inverse Wishart, so new links proliferate faster. I would also guess that Google notices when the pages are the same between versions and somehow clusters them, possibly favoring the older links as a side effect.

we don’t have control over the Apache configuration. we are using GitHub Pages to serve the docs, which limits what we can do. furthermore, the docs are built using bookdown/gitbook, which is just useful enough to warrant not trying to roll our own.

my experience with building static doc sites comes from my previous job where I wrestled bad unversioned docs from Atlassian’s Confluence to a jekyll-based docset. we had control of the server, so more was possible. so I think I have a good understanding of what we’re up against here. however, it would be great to find an experienced website designer who knows jekyll and/or hugo and understands the limitations of github pages websites to help us sort out these issues.

my preference is to follow boost and keep everything versioned, but going the ‘latest’ route is acceptable.

Added this to all the docs - old versions now link to latest - try it for yourself - the following is a link to “https://mc-stan.org/docs/2_18/reference-manual/”

Excellent!

Thinking about this more, the “old version” label makes the difference between latest as redirect or alias very small, so it is IMHO not reasonable to spend any energy to move from the status quo. Thanks @mitzimorris for doing this.

The only suggestion I would have would be to consider making the “old version” label more prominent somehow. But that is also not very high priority.