Find and fix unresolved template links within an Episode XML body

Links like [link text]({{ page.root }}/destination.html) are not parsed correctly by our commonmark parser and are output as text. Use this to find these missing links and transform them into link or image elements.

Usage

fix_links(body)

find_broken_links(body)

fix_broken_links(fragments)

make_link_patterns(ns = "md:")

get_link_fragment_nodes(node)

fix_broken_link(nodes)

links_within_text_regex()

text_to_links(txt, ns = NULL, type, sourcepos = NULL)

make_link(txt, pattern, type = "rel_link")

find_between_nodes(a, b, include = TRUE)

Arguments

body: an XML document.
ns: a namespace object
node: a node determined to be a text representation of a link destination
txt: text derived from xml2::xml_text()
type: either "image" or "link".
sourcepos: defaults to NULL. If this is not NULL, it's the sourcepos attribute of the text node(s) and will be applied to the new nodes.
pattern: a regular expression that is used for splitting the link from the surrounding text.

Value

fix_links(): the modified body

find_broken_links(): a list where each element represents a fragmented link. Each element contains two items:
parent: the parent paragraph node for the link
nodes: the series of four or five nodes that make up the link text

get_link_fragment_nodes(): the preceding three or four nodes, which will be the text of the link or the alt text of the image.

text_to_links(): if ns is NULL, a character vector of XML markup for the new nodes; otherwise, a nodeset of new XML nodes.

Details

Motivation

Jekyll uses the liquid template language, which can break some syntax expected by commonmark. If this syntax appears in a link context, that link is rendered as text. Carpentries Lessons created before 2023 use Jekyll and have this templating embedded in many links.

To convert a pre-workbench lesson to use The Workbench, we need to make sure all the links are accurately represented so that invalid syntax and broken links do not sneak into the lesson.

Implementation Details

For example, a valid line with a link that looks like [Home](index.html) and other text will appear in XML as:

...
<link destination="index.html">Home</link>
<text> and other text</text>
...

However, if a link uses liquid templating for a variable such as: [Home]({{ page.root }}/index.html) and other text, it will appear in XML as

...
<text asis="true">[</text>
<text>Home</text>
<text asis="true">]</text>
<text>({{ page.root }}/index.html) and other text</text>
...

Note: the nodes with the asis attribute come from tinkr protecting square brackets. When we run fix_links(), these nodes are collapsed into a link:

...
<link destination="{{ page.root }}/index.html">Home</link>
<text> and other text</text>
...

And with that we can further transform the link to replace the liquid templating with something that makes sense in sandpaper.

find_broken_links() uses the pattern generated by make_link_patterns() to search for potential links.

fix_broken_links() uses the output of find_broken_links() to replace the node fragments with links.

make_link_patterns() is a generator that creates an XPath query to search for liquid markup following a closing bracket.
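
As a rough illustration of what such a query matches, the sketch below applies the XPath printed by make_link_patterns() (copied from the Examples section) to a hand-written fragment that mirrors the broken-link XML shown earlier. The fragment and the variable names are assumptions made for illustration, and this is not necessarily how find_broken_links() is implemented.

```r
library(xml2)

# Hand-written stand-in for the XML of a fragmented link (assumed for illustration)
frag <- paste0(
  '<document xmlns="http://commonmark.org/xml/1.0"><paragraph>',
  '<text asis="true">[</text>',
  '<text>Home</text>',
  '<text asis="true">]</text>',
  '<text>({{ page.root }}/index.html) and other text</text>',
  '</paragraph></document>'
)
doc <- read_xml(frag)

# Namespace object built the same way as in the Examples section
md <- c(md = "http://commonmark.org/xml/1.0")
class(md) <- "xml_namespace"

# XPath as printed by make_link_patterns() with the default ns = "md:"
xpath <- paste0(
  ".//md:text[@asis][text()=']']",
  "/following-sibling::md:text",
  "[(contains(text(), '({{') and contains(text(), '}}'))]"
)

# The query selects the text node holding the liquid destination,
# i.e. the node that directly follows the protected closing bracket
hits <- xml_find_all(doc, xpath, ns = md)
xml_text(hits)
```

The matched node is the last of the fragment nodes; the nodes that precede it (the brackets and the link text) are what get_link_fragment_nodes() is described as collecting.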

get_link_fragment_nodes(): Get the source for the link node fragments

fix_broken_link() takes a set of nodes that comprises a single link and recomposes them into a link or image node.
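
To make the recomposition step more concrete, here is a minimal base R sketch of one piece of it: separating the link destination from the trailing text in the node that follows the closing bracket. The regular expression and variable names are illustrative assumptions, not the pattern pegboard itself uses.

```r
# Text of the node that follows the closing bracket in the example above
node_text <- "({{ page.root }}/index.html) and other text"

# Capture everything inside the first pair of parentheses, then the remainder.
# This pattern is an illustrative assumption, not the one pegboard uses.
pattern <- "^\\(([^)]*)\\)(.*)$"
parts <- regmatches(node_text, regexec(pattern, node_text))[[1]]

destination <- parts[2]  # the link destination, liquid markup intact
trailing    <- parts[3]  # text that should remain an ordinary text node
```

Under these assumptions, destination would become the destination attribute of the new link node and trailing would stay behind as a text node.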

links_within_text_regex(): finding different types of links within markdown text can be challenging because it involves characters used in regex for grouping and character classes. In general, I want to do two things with text that I get back from a document:

split the links out from the text
identify which parts of the resulting vector are links.

This way, I can convert the links to links and the text to text.
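
The two steps can be sketched in base R with the patterns that links_within_text_regex() prints in the Examples section. The input string and the final step that separates link text from destination are assumptions added for illustration, and that last step only handles a simple inline link.

```r
# Patterns copied from the output of links_within_text_regex() in the Examples
to_split   <- "(?<!(\\]|\\)|\\!))\\[|\\](?!(\\]|\\[|\\())|\\)"
find_links <- "(?<!\\[)\\](\\[|\\()"

txt <- "see [a link](b.org) for details"

# 1. split the links out from the text
pieces <- strsplit(txt, to_split, perl = TRUE)[[1]]

# 2. identify which pieces are links
is_link <- grepl(find_links, pieces, perl = TRUE)

# Separate a link piece into its text and destination (illustrative step,
# assuming an inline link of the form "text](destination")
link_parts <- strsplit(pieces[is_link], "](", fixed = TRUE)[[1]]
```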

text_to_links(): Splits links away from text and returns a nodeset to insert

make_link(): makes a link depending on the link type

Examples

loop <- fs::path(lesson_fragment(), "_episodes", "14-looping-data-sets.md")
e <- Episode$new(loop, fix_links = FALSE)
e$links  # five links
#> {xml_nodeset (5)}
#> [1] <link sourcepos="36:8-36:75" destination="https://docs.python.org/3/libra ...
#> [2] <link sourcepos="42:25-42:77" destination="https://docs.python.org/3/libr ...
#> [3] <link sourcepos="43:9-43:61" destination="https://docs.python.org/3/libra ...
#> [4] <link sourcepos="125:17-125:118" destination="https://pandas.pydata.org/p ...
#> [5] <link sourcepos="148:62-148:129" destination="https://docs.python.org/3/l ...
e$images # four images
#> {xml_nodeset (4)}
#> [1] <html_block sourcepos="174:1-174:86" xml:space="preserve">&lt;img src="ht ...
#> [2] <html_block sourcepos="176:1-176:49" xml:space="preserve">&lt;img src=".. ...
#> [3] <image sourcepos="180:1-180:74" destination="https://carpentries.org/asse ...
#> [4] <image sourcepos="182:1-182:38" destination="../no-workie.svg" title="">\ ...

# fix_links() ---------------------------------------------------------------
e$body <- asNamespace("pegboard")$fix_links(e$body)
e$links  # eight links
#> {xml_nodeset (8)}
#> [1] <link sourcepos="36:8-36:75" destination="https://docs.python.org/3/libra ...
#> [2] <link sourcepos="42:25-42:77" destination="https://docs.python.org/3/libr ...
#> [3] <link sourcepos="43:9-43:61" destination="https://docs.python.org/3/libra ...
#> [4] <link sourcepos="125:17-125:118" destination="https://pandas.pydata.org/p ...
#> [5] <link sourcepos="148:62-148:129" destination="https://docs.python.org/3/l ...
#> [6] <link destination="{{ page.root }}/index.html" sourcepos="178:1-178:92">\ ...
#> [7] <link destination="{{ site.swc_pages }}/shell-novice" sourcepos="178:1-17 ...
#> [8] <link destination="{{ page.root }}{% link index.md %}" sourcepos="186:1-1 ...
e$images # five images
#> {xml_nodeset (5)}
#> [1] <html_block sourcepos="174:1-174:86" xml:space="preserve">&lt;img src="ht ...
#> [2] <html_block sourcepos="176:1-176:49" xml:space="preserve">&lt;img src=".. ...
#> [3] <image sourcepos="180:1-180:74" destination="https://carpentries.org/asse ...
#> [4] <image sourcepos="182:1-182:38" destination="../no-workie.svg" title="">\ ...
#> [5] <image destination="{{ page.root }}/no-workie.svg" sourcepos="184:1-184:7 ...

asNamespace("pegboard")$make_link_patterns()
#> .//md:text[@asis][text()=']']/following-sibling::md:text[(contains(text(), '({{') and contains(text(), '}}'))]

# links_within_text_regex() -------------------------------------------------
helpers <- pegboard:::links_within_text_regex()
helpers
#>                                         to_split 
#> "(?<!(\\]|\\)|\\!))\\[|\\](?!(\\]|\\[|\\())|\\)" 
#>                                       find_links 
#>                           "(?<!\\[)\\](\\[|\\()" 
txt <- "text ![image text](a.png) with [a link](b.org) and text"
res <- strsplit(txt, helpers["to_split"], perl = TRUE)[[1]]
data.frame(res)
#>                        res
#> 1 text ![image text](a.png
#> 2                    with 
#> 3            a link](b.org
#> 4                 and text
grepl(helpers["find_links"], res, perl = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE

# text_to_links() -----------------------------------------------------------
txt <- "Some text [and _a link_]({{ page.root }}/link.to#thing), 
some other text."
pegboard:::text_to_links(txt, type = "link")
#> [1] "<text>Some text </text>"                                                           
#> [2] "<link destination='{{ page.root }}/link.to#thing'><text>and _a link_</text></link>"
#> [3] "<text>, \nsome other text.</text>"                                                 
md <- c(md = "http://commonmark.org/xml/1.0")
class(md) <- "xml_namespace"
pegboard:::text_to_links(txt, md, "link")
#> {xml_nodeset (3)}
#> [1] <text>Some text </text>
#> [2] <link destination="{{ page.root }}/link.to#thing">\n  <text>and _a link_< ...
#> [3] <text>, \nsome other text.</text>