Find and fix unresolved template links within an Episode XML body
Source: R/fix_links.R, R/utils.R
fix_links.Rd
Links like [link text]({{ page.root }}/destination.html) are not parsed correctly by our commonmark parser and are output as text. Use this to find these unparsed links and transform them into link or image elements.
Usage
fix_links(body)
find_broken_links(body)
fix_broken_links(fragments)
make_link_patterns(ns = "md:")
get_link_fragment_nodes(node)
fix_broken_link(nodes)
links_within_text_regex()
text_to_links(txt, ns = NULL, type, sourcepos = NULL)
make_link(txt, pattern, type = "rel_link")
find_between_nodes(a, b, include = TRUE)
Arguments
- body
an XML document.
- ns
a namespace object
- node
a node determined to be a text representation of a link destination
- txt
text derived from xml2::xml_text()
- type
either "image" or "link".
- sourcepos
defaults to NULL. If this is not NULL, it's the sourcepos attribute of the text node(s) and will be applied to the new nodes.
- pattern
a regular expression that is used for splitting the link from the surrounding text.
Value
fix_links(): the modified body.
find_broken_links(): a list where each element represents a fragmented link. Each element contains two items:
- parent: the parent paragraph node for the link
- nodes: the series of four or five nodes that make up the link text
get_link_fragment_nodes(): the preceding three or four nodes, which will be the text of the link or the alt text of the image.
text_to_links(): if ns is NULL, a character vector of XML node strings; otherwise, a nodeset of new XML nodes.
Details
Motivation
Jekyll implements the Liquid template language, which can break some syntax expected by commonmark. If this syntax appears in a link context, that link is rendered as text. Carpentries lessons created before 2023 use Jekyll and have this templating embedded in many links.
To convert a pre-Workbench lesson to use The Workbench, we need to make sure all the links are accurately represented so that invalid syntax and broken links do not sneak into the lesson.
Implementation Details
For example, a valid line with a link that looks like [Home](index.html) and other text will appear in XML as a link node with the destination index.html, followed by a text node for the remaining text.
However, if a link uses Liquid templating for a variable, such as [Home]({{ page.root }}/index.html) and other text, it will appear in XML as:
...
<text asis="true">[</text>
<text>Home</text>
<text asis="true">]</text>
<text>({{ page.root }}/index.html) and other text</text>
...
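Note: the nodes with asis attributes are from tinkr protecting square brackets. When we run fix_links(), these nodes are collapsed into a single link node whose destination still contains the Liquid template ({{ page.root }}/index.html), followed by a text node for the remaining text.
With that, we can further transform the link to replace the Liquid templating with something that makes sense in sandpaper.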
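The parsing difference described above can be checked without pegboard. The following is a minimal sketch, assuming the commonmark R package is installed; the variable names are illustrative. It only tests whether a link node is produced and does not reproduce the asis protection that tinkr adds.

```r
library(commonmark)

# A plain markdown link and the same link with a Liquid variable
valid  <- "[Home](index.html) and other text"
liquid <- "[Home]({{ page.root }}/index.html) and other text"

# Render both lines to commonmark XML (returned as a character string)
valid_xml  <- markdown_xml(valid)
liquid_xml <- markdown_xml(liquid)

# The spaces inside {{ page.root }} make the destination invalid,
# so only the first line should contain a <link> element
has_link <- c(
  valid  = grepl("<link ", valid_xml, fixed = TRUE),
  liquid = grepl("<link ", liquid_xml, fixed = TRUE)
)
print(has_link)
```

A link destination that is not wrapped in angle brackets may not contain spaces, which is why the Liquid version falls back to plain text.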
find_broken_links() uses the pattern generated by make_link_patterns() to search for potential links.
fix_broken_links() uses the output of find_broken_links() to replace the node fragments with links.
make_link_patterns() is a generator that creates an XPath query to search for Liquid markup following a closing bracket.
get_link_fragment_nodes() gets the source nodes for the link fragments.
fix_broken_link() takes a set of nodes that comprises a single link and recomposes them into a link or image node.
links_within_text_regex() returns the regular expressions used to find links within text. Finding different types of links within markdown text can be challenging because it involves characters used in regex for grouping and character classes. In general, I want to do two things with text that I get back from a document:
- split the links out from the text
- identify which parts of the resulting vector are links.
This way, I can convert the link pieces to link nodes and the remaining pieces to text nodes.
text_to_links() splits links away from text and returns a nodeset to insert.
make_link() makes a link depending on the link type.
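To illustrate the search step, here is a minimal sketch that applies the XPath query printed by make_link_patterns() (copied from the Examples output) to a hand-written XML fragment using xml2. The fragment and the variable names are illustrative assumptions, not part of pegboard, and the sketch covers only the search, not the repair.

```r
library(xml2)

# Hand-written fragment mimicking the protected nodes shown above
doc <- read_xml(paste0(
  '<document xmlns="http://commonmark.org/xml/1.0"><paragraph>',
  '<text asis="true">[</text><text>Home</text><text asis="true">]</text>',
  '<text>({{ page.root }}/index.html) and other text</text>',
  '</paragraph></document>'
))

# Namespace prefix used by the query
ns <- c(md = "http://commonmark.org/xml/1.0")
class(ns) <- "xml_namespace"

# Query as printed by make_link_patterns() in the Examples
xpath <- paste0(
  ".//md:text[@asis][text()=']']/following-sibling::md:text",
  "[(contains(text(), '({{') and contains(text(), '}}'))]"
)

# Find text nodes that hold a Liquid destination after a closing bracket
hits <- xml_find_all(doc, xpath, ns = ns)
xml_text(hits)
```

The matched node is the text representation of the link destination; the sibling nodes before it hold the link text.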
Examples
loop <- fs::path(lesson_fragment(), "_episodes", "14-looping-data-sets.md")
e <- Episode$new(loop, fix_links = FALSE)
e$links # five links
#> {xml_nodeset (5)}
#> [1] <link sourcepos="36:8-36:75" destination="https://docs.python.org/3/libra ...
#> [2] <link sourcepos="42:25-42:77" destination="https://docs.python.org/3/libr ...
#> [3] <link sourcepos="43:9-43:61" destination="https://docs.python.org/3/libra ...
#> [4] <link sourcepos="125:17-125:118" destination="https://pandas.pydata.org/p ...
#> [5] <link sourcepos="148:62-148:129" destination="https://docs.python.org/3/l ...
e$images # four images
#> {xml_nodeset (4)}
#> [1] <html_block sourcepos="174:1-174:86" xml:space="preserve"><img src="ht ...
#> [2] <html_block sourcepos="176:1-176:49" xml:space="preserve"><img src=".. ...
#> [3] <image sourcepos="180:1-180:74" destination="https://carpentries.org/asse ...
#> [4] <image sourcepos="182:1-182:38" destination="../no-workie.svg" title="">\ ...
# fix_links() ---------------------------------------------------------------
e$body <- asNamespace("pegboard")$fix_links(e$body)
e$links # eight links
#> {xml_nodeset (8)}
#> [1] <link sourcepos="36:8-36:75" destination="https://docs.python.org/3/libra ...
#> [2] <link sourcepos="42:25-42:77" destination="https://docs.python.org/3/libr ...
#> [3] <link sourcepos="43:9-43:61" destination="https://docs.python.org/3/libra ...
#> [4] <link sourcepos="125:17-125:118" destination="https://pandas.pydata.org/p ...
#> [5] <link sourcepos="148:62-148:129" destination="https://docs.python.org/3/l ...
#> [6] <link destination="{{ page.root }}/index.html" sourcepos="178:1-178:92">\ ...
#> [7] <link destination="{{ site.swc_pages }}/shell-novice" sourcepos="178:1-17 ...
#> [8] <link destination="{{ page.root }}{% link index.md %}" sourcepos="186:1-1 ...
e$images # five images
#> {xml_nodeset (5)}
#> [1] <html_block sourcepos="174:1-174:86" xml:space="preserve"><img src="ht ...
#> [2] <html_block sourcepos="176:1-176:49" xml:space="preserve"><img src=".. ...
#> [3] <image sourcepos="180:1-180:74" destination="https://carpentries.org/asse ...
#> [4] <image sourcepos="182:1-182:38" destination="../no-workie.svg" title="">\ ...
#> [5] <image destination="{{ page.root }}/no-workie.svg" sourcepos="184:1-184:7 ...
asNamespace("pegboard")$make_link_patterns()
#> .//md:text[@asis][text()=']']/following-sibling::md:text[(contains(text(), '({{') and contains(text(), '}}'))]
# links_within_text_regex() -------------------------------------------------
helpers <- pegboard:::links_within_text_regex()
helpers
#> to_split
#> "(?<!(\\]|\\)|\\!))\\[|\\](?!(\\]|\\[|\\())|\\)"
#> find_links
#> "(?<!\\[)\\](\\[|\\()"
txt <- "text ![image text](a.png) with [a link](b.org) and text"
res <- strsplit(txt, helpers["to_split"], perl = TRUE)[[1]]
data.frame(res)
#> res
#> 1 text ![image text](a.png
#> 2 with
#> 3 a link](b.org
#> 4 and text
grepl(helpers["find_links"], res, perl = TRUE)
#> [1] TRUE FALSE TRUE FALSE
# text_to_links() -----------------------------------------------------------
txt <- "Some text [and _a link_]({{ page.root }}/link.to#thing),
some other text."
pegboard:::text_to_links(txt, type = "link")
#> [1] "<text>Some text </text>"
#> [2] "<link destination='{{ page.root }}/link.to#thing'><text>and _a link_</text></link>"
#> [3] "<text>, \nsome other text.</text>"
md <- c(md = "http://commonmark.org/xml/1.0")
class(md) <- "xml_namespace"
pegboard:::text_to_links(txt, md, "link")
#> {xml_nodeset (3)}
#> [1] <text>Some text </text>
#> [2] <link destination="{{ page.root }}/link.to#thing">\n <text>and _a link_< ...
#> [3] <text>, \nsome other text.</text>