16 Deployment
As we saw in the previous chapter, deployment of lesson content requires two things:
- a system provisioned with R, pandoc, and Git that is capable of installing any R packages (including those that require compillation or external C libraries).
- access to a remote Git repository with the ability to create and push branches.
Point 2 is easily taken care of by any remote Git hosting service such as GitHub, GitLab, or GitTea. Point 1 is also fairly straightforward because many hosting services will be able to run Ubuntu or some other flavour of Linux1. The challenge is: how do you do this in a way that is reliable, up-to-date, and fast? When you consider the fact that some lessons will be using R Markdown to render content with an arbitrary set of packages and that those packages are not known until build time, this challenge becomes even more difficult.
Whatever workflow is building the workbench needs to do these things, which are covered in the sections below:
- provision R
- provision pandoc
- install and cache a Workbench installation
- install and cache the required packages for building markdown lessons
16.1 Provisioning R
We use the r-lib/actions/setup-r@v2 action to set up R’s environmental variables. The way we use it, it does not actually install R because R comes installed by default on GitHub’s Ubuntu runners 2, saving about a minute’s worth of installation time.
Another alternative is to use containers from the Rocker project or the base image of the R-universe if you want to work with Docker containers.
16.2 Provisioning pandoc
Again, we use r-lib for this task, with the r-lib/actions/setup-pandoc@v2 action. Note: we use the default installation of pandoc, which is 2.19.2 as of this writing, and is expected to match the version of pandoc used in RStudio.
Alternative ways to provision pandoc can be found in pandoc’s installation guide.
16.3 Caching
In order to ensure that the process moves as rapidly as possible, we need to be able to cache the packages used in our workflow. This is a bit of a complex topic, but GitHub has written up a guide for caching dependencies. Effectively, our strategy for the cache is to ensure we can restore the packages that we have previously downloaded and (in the case of provisioning the workbench) installing only the updates we need. To do this, we need to define the folder to restore from and two keys:
- Folder: path to the R package library used, which for a normal R installation is the
${R_LIBS_USER}
environment variable. For a {renv} cache, it will be in the${RENV_PATHS_ROOT}
environment variable. key
a very specific key to restore a valid cache that can be used immediatelyrestore-keys
a less-specific key to restore an invalid cache that might be able to be updated
16.3.1 Restoring an outdated cache with restore-keys
In the case of R packages, you need to be concerned about two things to restore a cache, valid or not:
- The OS version
- The R version
Remember that R packages in the libary are built for the specific operating system and thus, if the operating system changes, the cache cannot be recovered. Similarly, if the R version has changed, a package may work, but there is no guarantee that it will work and thus the cache cannot be recovered.
Thus, when we define the prefix for the restore-keys
and the key
, we define it like this:
restore-keys: ${{ os-version }}-${{ r-version }}-
The important part of having the restore-keys
is that, when it is hit, the cache will be updated when the workflow completes, thus recording the updates that were found.
cache-version
key
You might see reference to a $CACHE_VERSION
key in some of the workflows. This is an outdated feature that will be removed in the future. There are times when a cache becomes borked and it needs to be reset quickly. Sometime in 2022, GitHub added the ability for maintainers to delete a cache with a button click in the “actions” tab, so this became trivial.
Before then, we had to rely on a trick that had maintainers register a CACHE_VERSION
secret, which was recommended to set it to the date. This was tacked on to the end of the restore key so that if a cache needed to be reset, a maintainer could update the CACHE_VERSION
secret and the cache would be invalidated completely.
16.3.2 Restoring a valid cache with key
A valid cache can be restored and used immediately without needing to reinstall any resources. To determine the validity of a particular cache, it’s important to understand where your updates are coming from.
16.3.2.1 key
For The Workbench
In terms of The Workbench, you want to make sure you pull in any updates from the R-universe and CRAN, because they will have any bugfixes that we did not consider. In practise, the way we do this is by saving a file from the output of remotes::package_deps()
, which will get the recursive dependencies for The Workbench, check their package versions against the versions of the installed package and report which packages have updates available.
However, you might be noticing something that is amiss: if you are checking for outdated packages before any packages are installed, then how does this help with invalidating the cache if you don’t know the package versions you have installed before you restore from the cache? The trick is that whenever you query the packages in this way, you are always comparing against a (near) empty R library, so the result will be the same across runs if and only if none of the packages have updated in the upstream repositories.
For example, this is the output of checking for dependencies of the {cowsay} package:
> deps <- remotes::package_deps("cowsay")
> deps
-----------------------------
Needs update
package installed available is_cran remoteNA 0.0.3 TRUE CRAN
rmsfact NA 1.5-4 TRUE CRAN
fortunes NA 1.5.2 TRUE CRAN
crayon NA 0.8.2 TRUE CRAN cowsay
When the data frame is saved in the workspace and the file is hashed, it will produce the same hash across builds because the installed
the only column that will vary will be the available
column (unless “cowsay” changes dependencies).
It might be possible for us to centralize this caching so that we run remotes::package_deps("sandpaper")
outside of the workflow in https://files.carpentries.org so that we can save a few seconds of time installing remotes and querying dependencies if the restore is successful.
16.3.2.2 key
for the {renv} package cache
The goal of the {renv} package cache is only to reliably restore it for reproducibility (say that five times fast), thus, the only key that we need to validate the cache is the renv/profiles/lesson-requirements/renv.lock
file itself. If the hash for that is identical across runs, then the hash is valid and we can restore the cache as usual.
16.4 Provisioning The Workbench
The Workbench Packages are provisioned with the carpentries/actions/setup-sandpaper@main composite action.
This workflow does the following:
- Sets up environment variables and options to allow us to fetch from the R-universe and RSPM
- Fetches the system dependencies json record from the R-universe and installs them 3.
- Restores the Cache and Installs any missing/outdated dependencies
- Installs any custom versions of sandpaper/varnish
16.5 Provisioning The Package Cache
The {renv} package cache is provisioned with the carpentries/actions/setup-lesson-deps@main composite action. This action relies on the {vise} package to determine and install the system dependencies for the packages in the cache (e.g. it’s the reason why the spatial packages can be installed for the spatial lessons).
This has the following steps and is run only if the renv/
folder exists.
- Sets up environment variables and options to allow us to fetch from the R-universe and RSPM
- Determine and install system dependencies from the lockfile4
- provision the packages with
sandpaper::manage_deps()
16.6 A bit of History
In theory, these can all be taken care of with a Docker container and, indeed, we have written a Dockerfile
to do just this, piggybacking off of the R-universe base image. You might be wondering: why don’t we use a Docker image to build the lessons? Why do we use the runners for GitHub Actions? When we initially built The Workbench, building R packages on Ubuntu always required compilation, so we used macOS runners so that we could get compiled packages most of the time. The key point here is most of the time.
The release cycle of R packages on CRAN will release the source of the package first and build the binaries for macOS and Windows in the few days following. Importantly, these binaries would have the C libraries bundled with the packages that required them, so the installation would just work. During these few days, users will be prompted with a message asking them if they would like to install the binary version or compile the latest source. However, on GitHub runners, the machine always defaulted to the latest version, so sometimes, just after a package updated, we would get issues where a package (e.g. {stringi}) would fail to compile because the proper C library was not installed. This was especially problematic for a situation where we needed to provision an arbitrary set of packages for R-based lessons.
In November 2021, we officially switched our runners over to use Ubuntu with carpentries/actions#31 and carpentries/sandpaper#211. These allowed us to use the binary packages from the Posit Package Manger (previously RStudio Package Manager) to provision our builds and parse the necessary system dependencies.
This code initially lived inside of the github workflow, but the code got complex enough that it was worthwhile to port it to a specific package, which eventually became {vise}. This package was intended as a way to split off the {renv} system from {sandpaper} into its own package (akin to something like {capsule}, but I never had the bandwitdh to properly separate the {renv} components from the {sandpaper} components (though that may be easier now that the {flow} package exists for analysis of code pathways).
Perhaps not RedHat or CentOS, which are notoriously strict about updating their C libraries.↩︎
see https://github.com/carpentries/sandpaper/pull/279 for information about how we found that out.↩︎
See the ssl error of July 2023 for one consequence of this↩︎
At the moment, this is sort of complex because {remotes} package has not been updated on CRAN since 2021 and does not know about ubuntu 22.04, which is the version of runners that GitHub uses. We do this because installing from CRAN is faster.↩︎