CI: automatically check links in online docs
Motivation
The online docs contain hundreds of links, internal and external. It is desirable to check these automatically, at least before releasing a new version, and preferably as part of the CI.
A full check that includes external URLs is time-consuming. It cannot be done for each MR; at most, it can be run as a nightly check.
It is much more important to regularly check internal URLs because they are more easily broken. We quite often move or replace pages, or reorganize parts of the page hierarchy, and this easily breaks links (unless we write and always use dedicated scripts for moving pages and maintaining links, but that would be more effort than writing and automatically running a link checker).
To conclude, we need fast automatic checking of internal links. This is more easily done for Markdown sources than for HTML pages because the latter contain a lot of navigation links that are auto-generated by Hugo and therefore must not be checked.
Of extant link checkers, I tried checklink (Debian package w3c-linkchecker) and linkinator (installed via npm). The latter even has a Markdown mode. Results were disappointing. Lots of crawling, but an obvious broken link was just not found.
Therefore I conclude that we have to write our own simple, dedicated internal-links checker for Hugo markup.
Desired solution
- Create Python script checkhugolinks.py, to be used as checkhugolinks.py <dir>, which recursively reads all files in or under the given directory, and checks existence of internal links.
- Integrate this check in the BornAgain tests.
Steps towards solution
- Create Python script that reads argument <dir> from command line. If there is no command-line argument, then print "usage" info, and exit.
- Print list of all files in and under directory <dir>.
- Read each file into a single text string.
- Use loop for m in re.finditer(...) to print all occurrences of a certain pattern. Try pattern r'\[.*?\]\((.*?)\)' to find all links.
- Use further pattern matching to distinguish internal from external links.
- For each internally linked page, check existence.
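
The steps above could be sketched roughly as follows. This is only a starting point, not a finished checkhugolinks.py, and it rests on several assumptions that are not settled in this issue: only *.md files are read, links starting with "/" are resolved relative to <dir>, all other links relative to the linking file, a page foo may exist as foo.md, foo/_index.md or foo/index.md, and links written with Hugo shortcodes ({{< ... >}}) are skipped. All function names are hypothetical.

```python
"""Sketch of checkhugolinks.py: check internal links in Hugo Markdown sources."""

import re
import sys
from pathlib import Path

# Pattern proposed in the steps above: matches [text](target).
LINK_RE = re.compile(r'\[.*?\]\((.*?)\)')
# A target that starts with a URI scheme (http:, mailto:, ...) or "//" is external.
EXTERNAL_RE = re.compile(r'^([a-zA-Z][a-zA-Z0-9+.-]*:|//)')


def find_links(text):
    """Return the targets of all Markdown links found in text."""
    targets = []
    for m in LINK_RE.finditer(text):
        parts = m.group(1).split()  # drop an optional link title after the URL
        targets.append(parts[0] if parts else '')
    return targets


def is_internal(target):
    """True for targets that this sketch treats as checkable internal links."""
    # Assumption: empty targets, pure anchors, and shortcode links are skipped.
    if not target or target.startswith('#') or target.startswith('{{'):
        return False
    return EXTERNAL_RE.match(target) is None


def target_exists(root, source, target):
    """Check whether target maps to a file, under the assumed page layout."""
    path = target.split('#', 1)[0].split('?', 1)[0]
    if path.startswith('/'):
        base = root / path.lstrip('/')   # site-absolute link
    else:
        base = source.parent / path      # link relative to the linking file
    candidates = [base, Path(str(base) + '.md'),
                  base / '_index.md', base / 'index.md']
    return any(c.is_file() for c in candidates)


def check_tree(root):
    """Return (file, target) pairs for all broken internal links under root."""
    broken = []
    for source in sorted(root.rglob('*.md')):
        text = source.read_text(encoding='utf-8')  # whole file as one string
        for target in find_links(text):
            if is_internal(target) and not target_exists(root, source, target):
                broken.append((source, target))
    return broken


def main(argv):
    """Command-line entry point; returns a process exit code."""
    if len(argv) != 2:
        print(f'Usage: {argv[0]} <dir>', file=sys.stderr)
        return 2
    root = Path(argv[1])
    if not root.is_dir():
        print(f'Not a directory: {root}', file=sys.stderr)
        return 2
    broken = check_tree(root)
    for source, target in broken:
        print(f'{source}: broken internal link: {target}')
    return 1 if broken else 0
```

To run this as a script, the file would end with the usual main guard calling sys.exit(main(sys.argv)). A nonzero exit code on broken links would let a test runner treat the check as a failing test.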