My vision of Nixpkgs CI.

Posted on 2021-08-21

Nixpkgs has a major CI problem. We want to be able to quickly merge changes to a huge number of packages and ensure that the system is healthy. The problem isn’t just that Nixpkgs is the largest repository of fresh packages (although that certainly doesn’t help); there are also unique challenges because of how Nix and Nixpkgs work.

Problems

Pessimistic Rebuilds

One of the unique features of Nix is that each package explicitly references exact configurations of all of its dependencies. This means that a package indirectly references the exact versions and build configuration of all transitive dependencies. This provides a huge benefit, as you can configure common dependencies differently for different packages without any conflicts. The downside is that a package must be rebuilt when any of its dependencies change. This means that changes to core packages such as the compiler or shell require just about every package in the collection to be rebuilt. This is extremely expensive and extremely time-consuming. Without proper management, updating a core package could stall releases for days even if all the required builds and tests were successful.
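The effect can be illustrated with a minimal Python sketch. This is a toy model, not how Nix actually computes store paths: the package set, the version numbers, and the build_id and rebuilds helpers are all made up for illustration. Each package id hashes its own version together with the ids of its direct dependencies, so a change anywhere below a package changes that package too.

```python
import hashlib

# Hypothetical toy package set: name -> (version, direct dependencies).
PACKAGES = {
    "gcc": ("10.3.0", []),
    "bash": ("5.1", ["gcc"]),
    "openssl": ("1.1.1k", ["gcc"]),
    "curl": ("7.76.1", ["gcc", "openssl"]),
    "hello": ("2.10", ["gcc", "bash"]),
}

def build_id(name, packages):
    """Hash a package's name and version together with the ids of its direct dependencies."""
    version, deps = packages[name]
    h = hashlib.sha256(f"{name}-{version}".encode())
    for dep in sorted(deps):
        # The dependency id already covers that dependency's own dependencies,
        # so the result depends on the whole transitive closure.
        h.update(build_id(dep, packages).encode())
    return h.hexdigest()[:12]

def rebuilds(old, new):
    """Names whose id differs between two package sets, i.e. what must be rebuilt."""
    return sorted(
        n for n in new
        if n not in old or build_id(n, old) != build_id(n, new)
    )

bumped_leaf = dict(PACKAGES, curl=("7.78.0", ["gcc", "openssl"]))
bumped_core = dict(PACKAGES, gcc=("11.2.0", []))

print(rebuilds(PACKAGES, bumped_leaf))  # only curl changes
print(rebuilds(PACKAGES, bumped_core))  # every package changes
```

In this toy set, bumping the leaf package touches one id, while bumping the compiler touches all five.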

(This will be improved with a content-addressed store. However, this is not coming soon, and it still can’t help if there are legitimate differences, for example a new compiler that generates a different binary than the previous one.)

Atomicity

Nixpkgs is distributed as an atomic snapshot of the source tree. This means that as packages are updated in the Nixpkgs repo they are released to users in the same order. A package that is slow to build, or broken for a week, can’t be released when it is ready. The only option is stalling the entire release pipeline. Incrementally releasing changes in the source so that updates are always small is also impossible, as Pessimistic Rebuilds mean that a tiny indivisible change can trigger a mass rebuild.

Breakage Intolerance

This is somewhat related to Atomicity. If a broken package is released, any system that uses that package will be unable to update. This is because a NixOS system is a single atomic package (and many dependencies). If any dependency is broken the new system can’t build. This means that shipping broken releases is very damaging. It doesn’t just prevent updating the broken packages, it prevents updating any packages (unless users do some significant hack and slash to splice packages between versions).

This could be resolved by relaxing atomicity, but there are real benefits of an atomic system that would be nice to preserve if possible.

Goals

Low Latency Changes

I would like it to be possible to make small changes rapidly. Anyone should be able to submit a patch to Nixpkgs and have it released within a day. Not only does this make contributing more satisfying but it is also important for time-sensitive security updates. For now I am going to ignore review time (if you want to help review patches we can always use some hands) so I’ll set the following, mostly arbitrary, goals for the time from pressing the “Accept” button (whatever that exact control ends up looking like) on a “small” patch to the change being available in the nixos-unstable channel. I will define “small” as less than 10h of total build time and less than 2h sequential build time.

Percentile	Latency
50	6h
95	12h
99	48h

These goals are very aggressive. Currently the channel is only scheduled to update once a day and sees about 100 commits in that timeframe.
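To make these targets concrete, here is a minimal Python sketch of how they could be checked against measured merge-to-channel latencies. The function names and the sample data are hypothetical, and the nearest-rank percentile is just one reasonable choice of definition.

```python
# Targets from the table above: percentile -> maximum latency in hours.
TARGETS = {50: 6, 95: 12, 99: 48}

def is_small(total_build_hours, sequential_build_hours):
    """The definition of a "small" patch used in this post."""
    return total_build_hours < 10 and sequential_build_hours < 2

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

def missed_targets(latencies_hours):
    """Map each missed percentile to the latency that was actually observed."""
    observed = {p: percentile(latencies_hours, p) for p in TARGETS}
    return {p: v for p, v in observed.items() if v > TARGETS[p]}

# Made-up latencies (in hours) for ten small patches.
sample = [1, 2, 3, 4, 5, 5, 6, 7, 9, 30]
print(missed_targets(sample))
```

With this made-up sample the median and the 99th percentile are within their limits, but the single 30h outlier pushes the 95th percentile past 12h.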

Little Manual Effort

Right now there are two main paths for a patch: direct-to-master and the staging workflow. If your change affects few enough packages, you can manually build all dependants to show that nothing broke. If you do that, the change can be merged straight to master. If your change affects too many packages and mere mortals can’t build all of the changed dependants, then you just build enough packages to reasonably test the change, then merge to the staging branch.

The staging workflow is a necessary but painful process. A snapshot of staging is pulled into staging-next and built. If it builds it is merged to master and the process starts over. If it doesn’t (it often doesn’t) then humans inspect the failures, make patches or rollbacks, and try again. This is a largely manual and tedious process. However it does allow anyone to make massive changes to Nixpkgs even if they don’t have a build farm at their disposal. It also means that by the time the staging changes are merged to master they have already been built so that the channel blockage is minimal. (In general you will need to rebuild the packages changed since staging was last updated from master, but it shouldn’t require a mass rebuild.)

Methods

Merge Queue

Instead of merging into the target branch and then checking that the build succeeds (you do check that your change built after merging, right?), a Merge Queue style system should be used. I wrote about this in the past, but the TL;DR is that once a patch is approved, it is merged into a “Candidate Commit” and CI is run on that Candidate. The Candidate is only pushed to the target branch once CI is successful. If the build fails, the merge is aborted and the PR author is notified. This shifts the responsibility for merge conflicts from some build-cop who monitors the target branch to the PR author. I think this leads to a much more scalable and stable system, not only because it spreads out the work, but also because the PR author likely has more context and motivation.
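As a rough illustration of that flow, here is a minimal Python sketch. It is not how Bors or any real bot is implemented: the MergeQueue class is hypothetical, commits are plain strings, and ci_passes is a stand-in for a real CI run.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MergeQueue:
    # Stand-in for a real CI run over a candidate history.
    ci_passes: Callable[[List[str]], bool]
    # Commits already on the target branch; only ever extended by green candidates.
    target: List[str] = field(default_factory=list)
    # Patches that were sent back to their authors.
    bounced: List[str] = field(default_factory=list)

    def submit(self, patch: str) -> bool:
        """Test the patch on top of the target branch and push only if CI is green."""
        candidate = self.target + [patch]
        if self.ci_passes(candidate):
            self.target = candidate
            return True
        # The target branch is untouched; the author is notified instead.
        self.bounced.append(patch)
        return False

# Toy CI: any history containing a patch named "bad-*" fails.
queue = MergeQueue(ci_passes=lambda commits: not any(c.startswith("bad-") for c in commits))
for patch in ["a", "bad-b", "c"]:
    queue.submit(patch)
print(queue.target, queue.bounced)
```

The point of the sketch is that the target branch never contains a commit that failed CI, so no revert is ever needed.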

Batching

In order to make this feasible, changes are batched before being tested. If the batch doesn’t succeed, it is automatically bisected to identify the incompatible change, which is then removed from the queue. The rest of the changes are then retested and eventually merged without human intervention.
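A minimal Python sketch of this batch-and-bisect loop follows. The merge_batch function is hypothetical, and it relies on two assumptions that a real system would have to handle: the base is green, and a failure stays a failure when more changes are stacked on top of it.

```python
from typing import Callable, List, Tuple

def merge_batch(
    base: List[str],
    batch: List[str],
    ci_passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Return (merged history, rejected changes) for one batch."""
    merged = list(base)
    rejected: List[str] = []
    pending = list(batch)
    while pending:
        if ci_passes(merged + pending):
            merged += pending
            break
        # Binary search for the shortest failing prefix of `pending`.
        # Invariant: the prefix of length `lo` passes, the prefix of length `hi` fails.
        lo, hi = 0, len(pending)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if ci_passes(merged + pending[:mid]):
                lo = mid
            else:
                hi = mid
        merged += pending[:lo]            # everything before the culprit is good
        rejected.append(pending[hi - 1])  # the change that turned the build red
        pending = pending[hi:]            # retest the remainder without it
    return merged, rejected

# Toy CI: any history containing a change named "bad*" fails.
ci = lambda commits: not any(c.startswith("bad") for c in commits)
print(merge_batch([], ["a", "bad1", "b", "c", "bad2", "d"], ci))
```

Each bad change costs roughly log2(n) extra CI runs instead of one run per change, which is what makes batching worthwhile when most changes are good.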

Merge Queue vs Revert Commits

A common question is the difference between a Merge Queue and the channel-tracking branches such as nixos-unstable, as these branches are also only updated after CI passes. I think that there are no fundamental differences, but how it works in practice is completely different for the following reasons:

A Merge Queue has auto-rollback. While auto-rollback could be added to master upon breakage it isn’t as clean as bouncing back the change without ever marking it as “merged”. It keeps all of the discussion in one place rather than applying a revert commit and trying to notify everyone interested.
master is marked as the default branch so it is often used as the base of a patch. This can be annoying, as it may be broken or not built yet. Making sure that the default branch is always green and always built gives a better contributor experience.
It is easier to set policies for individual changes, as the Merge Queue can re-order them as needed. For example, mass rebuilds could get a different priority (to allow more batching), or one change could be merged without batching (maybe it is an important security update and should be pushed ASAP).

Current Tools

The most popular tool for implementing this workflow on GitHub is Bors. However this tool has some limitations that make it unusable for Nixpkgs. Hopefully we can contribute the required features to avoid creating a second bot. The problems are discussed below.

Required Features

Optimized bisection

Bisection in Nix is quite easy. Since builds are automatically strongly cached and depend only on the expressions used to build them, running git bisect run nix build -f . target-package will reliably find the commit that triggered the failure. However, all of the Merge Queue tools that I have evaluated use the exact same CI workflow for bisection as they do for the regular run. This leaves a huge performance concern for Nixpkgs, as simply evaluating all of the packages takes up to 5 minutes. However, since Nix is a lazy language, evaluating a specific package typically takes only a fraction of a second. This means that even if the package that failed is the first one that is built (and it may not be), the regular workflow will take 5 minutes to even check if that build is in the cache, or to start building it. By using the knowledge of which package (or packages) failed, those could be tested first, which would lead to very quick bisection in most cases. (It could still be slow if you are bisecting between a lot of mass rebuilds, as you still need to rebuild every changed dependency of that package, but at least you aren’t wasting time when the package hasn’t changed.)
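As a small sketch of the idea, the following Python snippet assembles a targeted bisection that only builds the packages already known to have failed, rather than re-running the full CI workflow at each step. The bisect_commands helper and the last-green and candidate ref names are hypothetical; the git bisect and nix build invocations are the ones mentioned above. The snippet only prints the commands and does not run them.

```python
import shlex
from typing import List

def bisect_commands(good: str, bad: str, failed_attrs: List[str]) -> List[List[str]]:
    """Command lines for a bisection that builds only the known-failing attributes."""
    return [
        # git bisect start takes the bad revision first, then the good one.
        ["git", "bisect", "start", bad, good],
        # Each step evaluates and builds just the failed packages, not the whole set.
        ["git", "bisect", "run", "nix", "build", "-f", ".", *failed_attrs],
        # Return the working tree to where it was before bisecting.
        ["git", "bisect", "reset"],
    ]

# "last-green" and "candidate" are placeholder ref names.
for argv in bisect_commands("last-green", "candidate", ["hello"]):
    print(shlex.join(argv))
```

A bot would pass the attribute names reported by the failed CI run as failed_attrs, so that each bisection step pays only for evaluating and building those packages.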

Mass Rebuilds

As nice as it would be to kill staging and rely on the Merge Queue to keep master green, this is not feasible. Mass rebuilds take days, so blindly scheduling these in the same queue as smaller updates will result in huge delays. Simply building these before queuing them for a merge is also infeasible, as there are not enough CI resources to build each mass rebuild separately. Batching is key for these releases.

In order to handle this case we will continue the staging workflow and it will work very similarly to how it works today. Mass-rebuilds will use a two-stage Merge Queue. First a batch of changes will be merged together on top of the latest master. Then this batch will be built (and bisected to remove bad changes as necessary). Once the build succeeds it will be put into the master queue like any small change. Since everything has already been built this build shouldn’t be too slow and shouldn’t block other updates for too long. If this merge is rejected (most likely due to an incompatibility with new master changes) the whole process will be restarted.

Note that there isn’t a forward-moving staging branch anymore. That branch is constantly updated and rewritten by the Merge Queue system. Ideally the PR author wouldn’t need to do anything differently for small or large changes. The bot would just check the number of changed packages and put it in the small or slow queue automatically.
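The routing and the two-stage flow described in this section could look roughly like the following Python sketch. The TwoStageQueue class, the threshold of 500 rebuilt packages, and the PR names and rebuild counts are all made up for illustration; the sketch also assumes that the staging batch has already been built successfully before it is promoted.

```python
from collections import deque

class TwoStageQueue:
    def __init__(self, threshold=500):
        # Hypothetical cutoff: PRs rebuilding at least this many packages are mass rebuilds.
        self.threshold = threshold
        self.small = deque()  # entries here are tested and merged into master
        self.slow = deque()   # mass rebuilds waiting to be batched and built

    def submit(self, pr, rebuild_count):
        """Route a PR automatically; the author does nothing different for large changes."""
        if rebuild_count >= self.threshold:
            self.slow.append(pr)
        else:
            self.small.append(pr)

    def promote_batch(self):
        """Move all pending mass rebuilds, as one already-built batch, into the master queue."""
        if not self.slow:
            return None
        batch = tuple(self.slow)
        self.slow.clear()
        self.small.append(batch)
        return batch

queue = TwoStageQueue()
queue.submit("curl-bump", 3)
queue.submit("gcc-bump", 40000)
queue.submit("glibc-bump", 38000)
print(queue.promote_batch())
print(list(queue.small))
```

If the promoted batch is later rejected by the master queue, its PRs would simply be resubmitted to the slow queue and the process restarted.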

Marking Packages Broken

While keeping absolutely everything working on the master branch is an attractive proposition, it is unfortunately infeasible for a large project like Nixpkgs. Necessary updates to core packages like compilers and core libraries will occasionally break dependent packages. While it would be nice to fix all of these breakages, it is often too much work for the upstream maintainer, and sometimes these packages are maintainerless or their maintainers are inactive.

In order to keep these foundational packages fresh it must be acceptable to mark these packages and tests as broken (when it makes sense and after a reasonable grace period). This allows CI to proceed without these packages, unblocking the change without requiring the upstream maintainer to adopt every dependent package.

There is a lot of nuance here. Of course we can’t just let the GCC maintainer mark GRUB as broken to get a compiler upgrade through. That would break a default NixOS system! But we shouldn’t be blocking important updates because a package used by 2 users is broken. The exact heuristics need to be nailed down, and this is especially difficult as we don’t have detailed usage statistics. However, making these judgement calls is critical so that we can balance stability and freshness.

I have an open RFC for a Nixpkgs breaking change policy, as I see it as a critical policy change before we can start to implement the technical changes mentioned above.