December 14, 2021

DevOps for ROS Projects Part 2

by Tyler Weaver
We at PickNik are in a unique position, having developed dozens of ROS projects with various DevOps toolchains. New clients often ask us what tooling we like. This post is specifically about the tools we’ve chosen for our open-source software (OSS) projects and how you might copy our pattern for your own projects. If you haven’t read ROS DevOps Part I, be sure to review it before continuing.

Code Repository

If you read our first post about DevOps tools you know that we love GitHub for hosting projects (both open source and closed source). We are not alone in this: it has become the place where “The world builds software.” We have had the opportunity to use many other tools for hosting repositories, and nothing really compares to GitHub in user experience, features, defaults, and community adoption. You can read our previous take on GitHub if you want a more detailed comparison to its competitors.

Continuous Integration

Continuous integration (CI) is the primary thing that has changed for us in the last year. Previously we were using Travis and were happy with it up until the end of 2020, when it took a sharp nosedive. What happened to Travis has been covered in many places.

Whether the downfall of Travis was crypto mining or a change in ownership and funding is irrelevant. The fact of the matter was that Travis changed its policies and pricing structure in a way that made it useless for us and many other OSS projects that were happily using Travis.

Travis even became unusable for our closed source projects because of how the pricing structure changed. We would much rather have spent time working on our client projects or open-source contributions to robotics instead of migrating away from Travis, but they basically gave us no choice.

Background

Once we knew we had to leave Travis we investigated several options. The first was copying much of the setup that Navigation2 was using and adapting that for MoveIt2 and our other OSS projects. This year Ruffin White gave an excellent talk at ROS World 2021 about how he designed the Nav2 project’s CI. The talk is available to watch online, and there is excellent documentation on how his system works. I recommend you watch and read his work regardless of how your project does CI, as you will learn how to improve your use of Docker and caching, pipeline your builds and tests, and design a CI system that can be easily debugged.

My initial attempt at migrating off Travis was to copy the setup that Navigation2 uses. To explain why that approach didn’t work well for us I have to explain some differences between the Navigation2 and MoveIt projects. Navigation2 is a ROS 2 only project and was not a port of move_base from ROS 1. As a result, it does not require support for building a ROS 1 version. I could have adapted the approach for ROS 1 projects, but I found it much more efficient to build our system on top of the industrial_ci project, which has a goal of having no real interface change between ROS 1 and ROS 2. This means that the same CI config for MoveIt on ROS 1 works on MoveIt2 and all the smaller repos.

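The second major difference in our approach compared to Navigation2 is that we are using GitHub Actions, while Navigation2 uses CircleCI. This is a much smaller difference than that of copying Navigation2’s CI config versus writing one that uses industrial_ci. CircleCI has these clear benefits over GitHub Actions:

  • Debugging failed jobs via SSH
  • More capable pipelining and caching, with no 5 GB per-repo cap on the cache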

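On the other hand, GitHub has these advantages over CircleCI:

  • Better integration with GitHub itself
  • A much better system for building custom actions that other projects can re-use
  • No limits on CI usage for public open-source repos
  • The GitHub container registry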

First, let’s talk about what we are losing by not using CircleCI and why it didn’t matter as much to us relative to what we were getting by using GitHub Actions. Debugging via SSH is a killer feature of CircleCI, and combined with their caching system it makes it really easy to spin up a runner exactly where your job failed, SSH into it, and debug the system. The reason this doesn’t matter as much to us is that one of the killer features of industrial_ci replaces that workflow almost entirely. With industrial_ci you can run it locally and reproduce nearly the exact workflow that happens on the CI runner. I wrote a short guide on how to use that feature on the moveit.ros.org website. What this does not replace is debugging the action config itself. So far we have mostly gotten by on that front by making changes and observing them running in Actions. There is, however, a project for testing GitHub Actions workflows locally.

The second thing that CircleCI does better currently is pipelining and caching. Specifically, there is no 5 GB per-repo cap on the cache, and the system for invalidating caches and pipelining is fancier. The only place where this has been a major drawback for us is with our code coverage testing. Currently, that is the slowest part of our CI runs on the MoveIt project because the caches are too big to store and restore within the 5 GB limit. We hope to find techniques to overcome this in the future.

Now onto the upsides. The better integration into GitHub manifests itself in a couple of ways. The first is that contributors do not have to log into a separate website and give that website permissions on their GitHub account to view the results of the CI runs. Navigation2 has partially overcome this with bots that post the results of the CI runs onto GitHub. However, there are still many limitations for users who do not want to give CircleCI permissions on their accounts. As developers, we already use so many different tools that it is always nice when we can use fewer of them.

Secondly, GitHub has a much better system for building custom actions that other projects can re-use. This is how we use industrial_ci and a handful of other actions to simplify our configs. As an example, here is all we need to do to invoke industrial_ci within our workflow:

   - uses: 'ros-industrial/industrial_ci@master'
     env: ${{matrix.env}}

This checks out the industrial_ci action from their repo and passes in the variables we’ve set for the job. Long term we hope to standardize the config we use across packages so it can be made even simpler in each repo. Keeping your per-repo configuration as simple and standard as possible is nice because it means that once you have a setup that works well it can be shared between many repos. This is the magic of modular software for your CI config.

Third, there are no limits on the number of CI runs on your public open-source repos. CircleCI gives a fixed amount of free CI per GitHub org. Navigation2 itself is using about one-third of the free CI from CircleCI, and we did not want to impact their ability to use CircleCI, or be forced to migrate to a different GitHub organization, by using it ourselves too. Limits like this need to exist to make sure people are using the CI for what it is intended for, not mining crypto. For now, at least, Microsoft seems to be absorbing the cost of unlimited OSS CI. We hope for our sake that their techniques for identifying and removing people who violate their terms of service continue to work well enough.

Lastly, there is the GitHub container registry. In the past, DockerHub had free regular builds for OSS projects that worked well for us. As DockerHub has explored various ways to monetize this system, they have limited its usefulness to us and others in several ways. As a result, both the Navigation2 project and MoveIt2 have migrated to building our Docker images on GitHub and storing them in the GitHub container registry.

Implementation

Currently, we have four different GitHub workflows that we use on our projects. You can find them in the MoveIt2 repo. They are:

  • ci.yaml - The primary workflow that builds and runs tests
  • docker.yaml - Builds docker images
  • format.yaml - Runs pre-commit to check formatters and linters
  • prerelease.yaml - For running ROS pre-release tests

ci.yaml

This is the primary workhorse of CI. ci.yaml may look complex at first, but I will break down what each part of it does to make it clear how you could reuse it for your project. The block under “on” defines when it will run. For this workflow we want to be able to run manually, on pull requests, and when there is a push into the main branch of MoveIt2 after a PR lands.
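
As a rough illustration, the three triggers described above could be expressed like this. It is a minimal sketch rather than a copy of the MoveIt2 file, and it assumes the default branch is named main.

```yaml
# Sketch of the trigger block: manual runs, pull requests, and pushes to main.
on:
  workflow_dispatch:   # allows starting the workflow by hand from the Actions tab
  pull_request:        # runs on every pull request
  push:
    branches:
      - main           # runs after a PR lands on main
```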

The second section defines the jobs. At the top of this, there is a matrix config that defines a few different environment variables and splits the run into 4 separate parallel jobs. Specifically, we test on both the main and testing versions of ROS Galactic and Rolling. We also run code coverage reporting, an IKFast test, and clang-tidy on different jobs.
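
A matrix of this shape might look roughly like the sketch below. ROS_DISTRO and ROS_REPO are industrial_ci variables; the job name, the runner, and the exact variable values are assumptions made for illustration, and the extra flags for code coverage, IKFast, and clang-tidy are left out.

```yaml
# Sketch only: four parallel jobs, one per matrix entry.
jobs:
  default:                       # hypothetical job name
    strategy:
      fail-fast: false           # let the other jobs finish if one fails
      matrix:
        env:
          - ROS_DISTRO: galactic
            ROS_REPO: main
          - ROS_DISTRO: galactic
            ROS_REPO: testing
          - ROS_DISTRO: rolling
            ROS_REPO: main
          - ROS_DISTRO: rolling
            ROS_REPO: testing
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Each matrix entry is handed to industrial_ci as environment variables.
      - uses: ros-industrial/industrial_ci@master
        env: ${{ matrix.env }}
```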

After that, we have the env section that defines the environment variables used by each run. These are used in the various steps to configure the behavior of both caching and industrial_ci. For a detailed explanation of many of these, take a look at the extensive documentation for industrial_ci. There are a couple of particularly tricky ones that I’ll single out to make it clear what they are doing.

   CCACHE_DIR: ${{ github.workspace }}/.ccache

Within the GitHub action runner, there are only a few directories which you should write into. The primary one is defined by this workspace variable. We create and put our ccache directory here. ccache is a tool for caching the output of the compiler and is one of the key pieces to how we speed up our builds.

   BASEDIR: ${{ github.workspace }}/.work

The BASEDIR is a variable within industrial_ci that defines where it will create the three different ROS workspace directories during its run. Defining this is critical to our caching techniques that you will see later.

   CLANG_TIDY_BASE_REF: ${{ github.base_ref || github.ref }}
   CC: ${{ matrix.env.CLANG_TIDY && 'clang' }}
   CXX: ${{ matrix.env.CLANG_TIDY && 'clang++' }}

This section uses a fancy feature that we helped add to industrial_ci to speed up clang-tidy testing. clang-tidy takes a ton of time to execute and therefore we want to only run clang-tidy on ROS projects in our repo with code changes in the PR that is being tested. With this feature, the typical clang-tidy job in a PR takes about 20 minutes to execute as opposed to the hour it takes to execute the clang-tidy job on main after a push. When a PR is merged we run clang-tidy on all the packages within MoveIt2. This feature offers a nice speedup for repos with many different ROS packages.

The section under steps defines what happens in each job. The first block titled “Free up disk space” deletes a bunch of stuff on the GitHub runner we don’t use to free up disk space for our code-coverage runs which take a ton of disk space for debug symbols. You can copy this into your project if you find yourself getting errors that you ran out of disk space.
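
A step along these lines could look like the following sketch. The directories removed here are assumptions about what a hosted Ubuntu runner ships with and are not taken from the MoveIt2 workflow, and CCOV is a hypothetical matrix flag marking the coverage job.

```yaml
# Sketch only: reclaim disk space before a large coverage build.
- name: Free up disk space
  # CCOV is a hypothetical matrix flag; other jobs skip this step.
  if: matrix.env.CCOV
  run: |
    df -h /
    # Assumed locations of large preinstalled toolchains we do not need.
    sudo rm -rf /usr/local/lib/android /usr/share/dotnet
    # Drop unused Docker images and containers on the runner.
    docker system prune --all --force
    df -h /
```

Printing the disk usage before and after makes it easy to see in the log how much space the step actually recovered.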

After the standard action to check out the repo we have the workspace caching steps. Here we restore a cache of the upstream workspace if there has been no change to the .repos files or the CI workflow file. After the upstream workspace, we have a similar one for the target workspace. This can help give you some of the benefits of incremental builds. This is one area where our approach is much less nuanced than the approach in the Navigation2 project.
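
One way to express the upstream workspace cache is sketched below with the standard cache action. The key layout, the file patterns, and the use of matrix.env.ROS_DISTRO are assumptions for illustration; upstream_ws is the directory industrial_ci creates under BASEDIR.

```yaml
# Sketch only: the cache is reused until a .repos file or the workflow changes.
- name: Cache upstream workspace
  uses: actions/cache@v2
  with:
    path: ${{ env.BASEDIR }}/upstream_ws
    # hashFiles changes whenever any matching file changes, which
    # invalidates the cache and forces a fresh upstream build.
    key: upstream_ws-${{ matrix.env.ROS_DISTRO }}-${{ hashFiles('*.repos', '.github/workflows/ci.yaml') }}
```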

Next, you have the ccache caching step. You will notice we are not using the standard cache action from GitHub and are instead using one from someone else. This is because the standard GitHub cache action will not store a cache on a failed CI run. In the case of the ccache cache, we want it to be uploaded every time to speed up subsequent builds.
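
The key scheme for a cache that is written on every run can be sketched as follows. This example deliberately uses the standard actions/cache action, which has the limitation described above, because its inputs are well documented; the key names and the use of matrix.env.ROS_DISTRO are assumptions.

```yaml
# Sketch only: a unique key per run, with fallbacks to the most recent match.
- name: Cache ccache
  uses: actions/cache@v2   # swapped in for illustration; see the note above
  with:
    path: ${{ env.CCACHE_DIR }}
    # The run id makes the key unique, so a new cache is saved each run.
    key: ccache-${{ matrix.env.ROS_DISTRO }}-${{ github.sha }}-${{ github.run_id }}
    # On a miss, restore the newest cache whose key starts with one of these.
    restore-keys: |
      ccache-${{ matrix.env.ROS_DISTRO }}-${{ github.sha }}
      ccache-${{ matrix.env.ROS_DISTRO }}
```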

Next, you have the “generate ikfast packages” step. This is specific to MoveIt but gives you a simple example of how a step can be as basic as running a script within your repo.

After all that setup there is the big one, the industrial_ci step. This is where all the real magic happens in building and testing our code. Everything leading up to this has just been to configure this step or to cache various things to speed up this step.

After that, we have steps for uploading the results of the tests. This is nice because in any action where there was a failure the test results files are uploaded so they can easily be downloaded and looked at separately from the one giant log that is produced by the run. Secondly, there is the action for uploading our code coverage reporting to codecov.io. This is what generates the nice code coverage reports in each PR.
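
These upload steps could be sketched roughly as below. The artifact name, the file paths, and the CCOV matrix flag are hypothetical; only the action names and their inputs are taken from the actions' own documentation.

```yaml
# Sketch only: keep test results from failed runs and publish coverage.
- name: Upload test artifacts (on failure)
  uses: actions/upload-artifact@v2
  if: failure()            # only runs when an earlier step failed
  with:
    name: test-results     # hypothetical artifact name
    path: ${{ env.BASEDIR }}/target_ws/**/test_results/**/*.xml
- name: Upload coverage report
  uses: codecov/codecov-action@v2
  if: matrix.env.CCOV      # hypothetical flag set only on the coverage job
  with:
    files: ${{ env.BASEDIR }}/coverage.info   # hypothetical report location
```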

Lastly, there is a step to delete some of the contents of the target workspace directory before it is cached. You’ll note that we can’t cache the code coverage runs yet because they would be much too big for the 5 GB per-repo cache limit. The primary reason is that debugging symbols take up a ton of space.

docker.yaml

This one is the action that builds all of our Docker images, including those used by the ci.yaml workflow. These are then uploaded into both DockerHub and the GitHub container registry. This action runs each time a PR is merged into main and on a regular schedule each week to make sure our images stay up to date with upstream changes.
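
A trimmed-down sketch of such a workflow is shown below. It pushes only to the GitHub container registry and omits DockerHub; the cron time, Dockerfile path, and image name are hypothetical placeholders rather than values from the MoveIt2 repo.

```yaml
# Sketch only: build an image weekly and on every push to main.
on:
  schedule:
    - cron: '0 6 * * 0'    # weekly; the day and time are arbitrary
  workflow_dispatch:
  push:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write      # needed to push to ghcr.io
    steps:
      - uses: actions/checkout@v2
      - uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v2
        with:
          context: .
          file: .docker/ci/Dockerfile               # hypothetical path
          push: true
          tags: ghcr.io/example-org/example-repo:ci # hypothetical image name
```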

format.yaml

This one is another great example of using an action created by someone else. pre-commit itself is a magical tool that makes running formatters like clang-format and black on your projects stupid easy. We have a guide on how to use it on moveit.ros.org. This action just tests that the developer did this and that their PR passes all the formatters and linters. The config we use is available in the MoveIt2 repo.
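
A formatting workflow built on the pre-commit action might look like this minimal sketch. It assumes the repo already has a pre-commit config at its root, and it leaves out any extra tool installation (for example clang-format) that the hooks may need.

```yaml
# Sketch only: run every configured pre-commit hook against the repo.
name: Formatting (pre-commit)
on:
  workflow_dispatch:
  pull_request:
  push:
    branches:
      - main
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2   # pre-commit itself runs on Python
      - uses: pre-commit/action@v2.0.3  # fails the job if any hook fails
```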

prerelease.yaml

This is the last one, and it is only set up to be run manually with a workflow dispatch. The awesome industrial_ci project has built-in features for running the ROS pre-release tests for every version of ROS. Having this action greatly speeds up preparing a release: we can run it on Actions, it doesn’t tie up any of our machines while it runs, and when it does fail the output logs are in a public place so we can easily collaborate on fixing the pre-release test.
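
A manual-only workflow of this kind can be sketched as follows. PRERELEASE is the industrial_ci variable that switches on the pre-release test; the distro value and job name are assumptions for illustration.

```yaml
# Sketch only: a pre-release test that runs only when started by hand.
on:
  workflow_dispatch:
jobs:
  prerelease:                  # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: ros-industrial/industrial_ci@master
        env:
          ROS_DISTRO: rolling  # assumed distro; pick the one being released
          PRERELEASE: true     # ask industrial_ci to run the pre-release test
```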

Automating Backporting

Additionally, we have been automating some of the work of backporting PRs from the main branch into release branches with Mergify. Our configuration for that integration can be found here.
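
For readers unfamiliar with Mergify, a backport rule generally has the shape sketched below. The rule name, label, and target branch are hypothetical and are not copied from our configuration.

```yaml
# Sketch only: when a merged PR on main carries the label,
# Mergify opens a backport PR against the listed branch.
pull_request_rules:
  - name: backport to galactic at reviewer's discretion   # hypothetical rule name
    conditions:
      - base=main
      - label=backport-galactic   # hypothetical label
    actions:
      backport:
        branches:
          - galactic              # hypothetical release branch
```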

Future

Our approach to CI is constantly evolving as we have time to improve it. In the long term we hope to find ways to improve caching and pipelining to reduce our build times to be closer to what the Navigation2 project has achieved. We will also continue to contribute features to industrial_ci so it can be used by all of our projects seamlessly.

Client and Internal Projects

One huge advantage of this approach for us is that much of what we have done for our OSS projects has seamlessly benefited our client and internal projects as well. Even those that are not on GitHub itself benefit, as anything we do in industrial_ci applies wherever the CI system works. We encourage you to learn from our setup and let us know if you find ways to improve upon it.

Conclusion

Our approach to DevOps is constantly evolving. Due to the nature of our business, we have used many of the tools available on the market with various customers. Every tool and approach has tradeoffs. Our current approach reflects what we believe provides us and our customers the most value at the lowest cost. If you find anything we are doing to be confusing, see an area we could improve, or need help with your setup please reach out to us. We are always looking for feedback and are happy to help.