TFS: Quarantine Builds

Continuous Integration is the practice of continually compiling and testing the source code changes made by a team, usually on a central server, so that the team can build better-quality software. In practice, through tools like CruiseControl.NET or TFS Integrator, this usually means running a build script whenever a checkin changes the central source code repository. As an example, when I check code into Readify’s Team Foundation Server, TFS Integrator will receive a notification of that checkin and trigger a Team Build that compiles all of my code, tests it, and packages it up into an installer ready to be deployed.

If you’ve ever worked on a project as part of a team, you’ve no doubt experienced one of the biggest annoyances that working on a group software project can offer:

  1. Someone checked in broken code, possibly caused by not doing as much testing as they could have, or by not doing a “get-latest” before checking in.
  2. You’ve just done a “get-latest”
  3. Now you can’t compile, and you have to wait until they fix their code and check it in again before you can continue

Continuous Integration can go a long way towards eliminating this problem. By automatically kicking off the build process on every checkin, the team can keep an eye on the build system through programs like CCTray for CruiseControl.NET, or my Team Build Monitor Gadget for TFS. Before you perform a “get-latest”, you only have to look to see whether the latest build was successful before continuing. It’s a good way to tell whether the code in the source code repository is “safe” to get.

However, as projects and solutions start to grow, builds can start to take a long time to complete. On a project I worked on last year, we had about 20 developers and over 60 projects in one of the solutions. The continuous build process did a compile (in both debug and release mode), ran unit tests (debug and release), ran FxCop and code coverage analysis and did a whole bunch of other things. From memory, the whole process took about 30 minutes to complete, and with so many people on the team the server was just constantly building.

This long build process meant that we couldn’t simply look to see whether the last build was successful, because the server was almost always building. Remember that continuous integration usually happens “after the fact”, that is, after the code has been committed to the central source code repository. So even with continuous integration, it was very common for me to perform a “get latest” only to find that the code wouldn’t compile, or that some of the unit tests had been broken.

Quarantine Builds

This got me thinking - how do you ensure that when someone performs a “get-latest”, they are always getting code that has passed all of the build quality checks? One of the ideas I came up with was that of a quarantine build. The idea of a quarantine build is that as new checkins are made, they are made into a “quarantined” source control repository, then the build process takes place, and then they are “promoted” into the central source code repository. Here’s a diagram:

The key principle is: nothing makes it into the central source control repository unless it has been tested first.

If, in the diagram, Developer #3 checked in some rogue code, the quarantine build would fail, and the code would not be promoted to the central source control repository. Developer #3 would then be notified that their code did not pass quarantine, and would have to fix it before the code is “released from quarantine”.

In most environments a continuous integration server is configured to wait for a short time after a checkin before commencing a build. This is done so that a batch of checkins made at around the same time can be built together, to save on having to kick off build after build every few minutes. A quarantine build, on the other hand, would have to build each checkin separately. That way if Developers #1, #2 and #3 check in around the same time, but Developer #3’s code fails, #1 and #2 would still be promoted.

This means that checkins would need to be queued and built in sequence. If multiple quarantine builds ran at the same time, two changes might each work when tested individually, but be incompatible once both were promoted - bringing us back to the original problem.
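To make the queueing idea concrete, here is a minimal sketch in Python. It is only a toy model, not anything TFS provides: the names `Checkin`, `QuarantineQueue` and `toy_build` are made up for illustration, and the "repository" is just a dictionary of file names to contents. Each checkin is built on top of whatever has already been promoted, one at a time, and is promoted only if that build passes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List

# A snapshot of the repository: file path -> file contents.
Snapshot = Dict[str, str]


@dataclass
class Checkin:
    author: str
    changes: Snapshot  # files added or modified by this checkin


@dataclass
class QuarantineQueue:
    # Hypothetical build-and-test step: returns True if the snapshot is good.
    build: Callable[[Snapshot], bool]
    trunk: Snapshot = field(default_factory=dict)
    pending: Deque[Checkin] = field(default_factory=deque)
    rejected: List[Checkin] = field(default_factory=list)

    def submit(self, checkin: Checkin) -> None:
        """Place a checkin in quarantine; nothing reaches the trunk yet."""
        self.pending.append(checkin)

    def process_all(self) -> None:
        """Build queued checkins strictly one at a time, in arrival order."""
        while self.pending:
            checkin = self.pending.popleft()
            # Build the checkin on top of everything promoted so far.
            candidate = {**self.trunk, **checkin.changes}
            if self.build(candidate):
                self.trunk = candidate          # promote
            else:
                self.rejected.append(checkin)   # stays out of the trunk


def toy_build(snapshot: Snapshot) -> bool:
    # Stand-in for "compile and run unit tests": fails on any broken file.
    return all("BROKEN" not in body for body in snapshot.values())


queue = QuarantineQueue(build=toy_build)
queue.submit(Checkin("dev1", {"a.cs": "ok"}))
queue.submit(Checkin("dev2", {"b.cs": "ok"}))
queue.submit(Checkin("dev3", {"c.cs": "BROKEN"}))
queue.process_all()

print(sorted(queue.trunk))                  # ['a.cs', 'b.cs']
print([c.author for c in queue.rejected])   # ['dev3']
```

Because each candidate is built against the trunk as it stands after the previous promotion, two checkins that only conflict with each other cannot both slip through.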

This approach enables two things:

  • Any code that could possibly pollute the central repository is rejected, before it even gets close. You’ll never need to ask “who broke the build?”, because each checkin is evaluated in isolation.
  • The central repository will always be guaranteed to be in a working state. Whenever someone performs a “get latest”, they can be assured that the code will work.

Of course, the quarantine builds would need to be pretty fast in order to make this design workable. I’d suggest splitting the builds:

  • Quarantine build: Compiles code (only in Debug mode) and performs tests (only in debug mode).
  • Integration build: As above, but also in Release mode. Also runs code coverage stats, FxCop checks, bundles the installer, deploys to a testing server, and any other “non-essential” tasks.

The integration build would be performed against the central source code repository, and could batch checkins before proceeding. Quarantine builds would be optimized for speed - the aim would be to keep each quarantine build to under ~5 minutes. I’m sure that by only performing a compile and running unit tests, with the help of some expensive hardware, this would be achievable. If spending money on hardware makes you cringe, imagine how much money is wasted when even just 2 developers have to wait half an hour because someone “broke the build”, remembering that this happens continually.
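As a rough illustration of the split, the two builds could be described as two step lists, with the integration build being a superset of the quarantine build. This is a sketch only: the step names and the `run_profile` helper are invented for the example, and `run_step` stands in for whatever actually invokes your build tooling.

```python
from typing import Callable, List, Sequence, Tuple

# Fast checks that gate promotion into the central repository.
QUARANTINE_STEPS = ("compile:debug", "test:debug")

# Everything above, plus the slower "non-essential" work.
INTEGRATION_STEPS = QUARANTINE_STEPS + (
    "compile:release",
    "test:release",
    "coverage",
    "fxcop",
    "package-installer",
    "deploy-to-test",
)


def run_profile(
    steps: Sequence[str], run_step: Callable[[str], bool]
) -> Tuple[bool, List[str]]:
    """Run steps in order, stopping at the first failure.

    Returns (succeeded, steps_attempted).
    """
    attempted: List[str] = []
    for step in steps:
        attempted.append(step)
        if not run_step(step):
            return False, attempted
    return True, attempted


# Pretend every step passes: the quarantine build only runs two steps.
ok, attempted = run_profile(QUARANTINE_STEPS, lambda step: True)
print(ok, attempted)  # True ['compile:debug', 'test:debug']
```

Keeping the quarantine list short is the whole point: anything that does not need to block a checkin belongs in the integration list instead.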

Sounds great, how do I do it?

I dunno. I haven’t thought that far ahead yet :)

If you’re familiar with Team Foundation Server, this might be accomplished by performing a build based on a shelveset instead of a changeset, and then automatically promoting shelvesets to changesets when they succeed. However, the few people I’ve spoken to about this give me the impression that you can’t perform a build from a shelveset (I’m sure you could with a lot of custom code, but it won’t be easy).

You could then give developers permission to create shelvesets but not changesets, and grant checkin permission only to the quarantine service. This would protect your central repository, although it would make it difficult to see who was responsible for a change by examining the item history, since every changeset would be checked in by the service. It would also make it hard to use shelvesets as shelvesets (rather than “quarantined checkins”), unless the service was smart enough to know which shelvesets to quarantine and promote, and which to ignore.
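One simple way the service could tell the two kinds of shelveset apart is a naming convention. The sketch below assumes a made-up `quarantine/` prefix and works on plain shelveset names; it does not call any TFS API, and how the names would be fetched is left open.

```python
from typing import Iterable, List, Tuple

# Hypothetical convention: only shelvesets named with this prefix
# are treated as quarantined checkins.
QUARANTINE_PREFIX = "quarantine/"


def split_shelvesets(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate shelvesets to build and promote from ordinary ones."""
    to_quarantine: List[str] = []
    ignored: List[str] = []
    for name in names:
        if name.startswith(QUARANTINE_PREFIX):
            to_quarantine.append(name)
        else:
            ignored.append(name)
    return to_quarantine, ignored


to_build, left_alone = split_shelvesets(
    ["quarantine/fix-login", "wip-refactor", "quarantine/add-report"]
)
print(to_build)    # ['quarantine/fix-login', 'quarantine/add-report']
print(left_alone)  # ['wip-refactor']
```

Ordinary shelvesets stay usable for their normal purpose, and the service only ever touches the ones a developer has explicitly marked.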

In short, I can’t think of a way to accomplish this with TFS out of the box, though I’m up for suggestions.

I am convinced that introducing quarantine builds would see the end of wondering “who broke the build?”. Of course, no build system is going to ensure that your software actually does what the customer wants it to, but at least it guarantees that the code in the repository passes the tests and measures that you put in place. Whether continuous integration systems of the future will adapt to enforce those checks before code reaches the shared repository is something that I’m keen to see.

10 Responses to “TFS: Quarantine Builds”

  1. JetBrains have hit on this concept in their Team City product:
    http://www.jetbrains.com/teamcity/features/ide_integrations.html#Pre-tested_delayed_Commit

    They call it delayed commit.

  2. This is a horrible idea Paul. Every dev’s workstation should be the quarantine area.
    You are presenting a technical solution to a social problem, and those never work in the long run. If you have a dev that is checking in code without running unit tests and continually breaking the build, then you deal with that developer. Spending a bunch of time building out a quarantine server is just putting in more roadblocks for the good devs and their tested code.

  3. This is pretty much similar to how they build Windows… each team has its own branch in their own build server, and only once it’s built OK and it’s integration time do they merge it into the main branch. You just take that idea and substitute team for individual!

  4. In that case high level languages and IDE’s are a “technical problem to a social solution” … those pesky humans, always making mistakes eh?! tschh

  5. Ryan - I think that hits it on the head. Thanks for the link.

    Damian - Unfortunately even on the best, most dogmatic teams, broken builds do happen. This is just a solution to stop those broken builds affecting the rest of the team.

    Jack - That’s right, but it’s not just individual branches though, it’s individual checkins. If a developer had a private branch there’s still nothing to stop them polluting the shared repository when they are integrating their changes. If you do builds for each checkin, and don’t allow parallel builds, you can keep the codebase clean and working at all times.

  6. What a timing. I’m actually working on something like that. We are using AccuRev as a main source control. As a quarantine I’m working on TFS or combination TFS + CC.NET. Generally I would add one more element to that puzzle. In many companies some quality management software is used, such as HP Quality Center etc. I’m thinking about creating pipeline between quarantine server and QC and back.

    Cheers

    J.

  7. Clearcase has the concept of a “recommended” build. Each build label can be tagged with a status. We start with INITIAL which, after compilation (but prior to unit test) can be promoted to BUILT, and finally, after unit testing is successful, is promoted to RECOMMENDED.

    When you rebase (or get latest) you get the RECOMMENDED version.

    @Damian - you still need to be pragmatic - if it’s affecting the team - it’s good to be able to reduce the hurt.

  8. I agree that builds break. I find the best thing to do here is a culture of the person responsible for breaking the build fixing it before anything else. Admittedly this works best on smaller teams / projects where the build happens in a matter of minutes, not hours.

  9. We work against what someone called “branches-for-purpose”. All code changes are made in these branches, and all testing is made against builds from these branches. Changes from “Main” are integrated into these branches as necessary, and all changes are tested together in the branches before promoting back to the main.

    Only the dev-leads have the rights to merge back into “Main”. When any changes are made against Main (which shouldn’t be as often as changes to branches), a continuous integration build kicks off. This is also the time when projects-in-flight do their integration merges to get the “latest.”

    This way we know that whatever is in the Continuous Integration drop-box is known-good bits, ready to release.

  10. Hi Paul,
    This is exactly what I was looking for - quarantine builds! Is there some way we can do the same? On a TFS specific note, can we use TFS Integrator and some custom MSBuild scripts on TFS to pick up shelvesets and build them with the rest of the source code? Please do let me know.
    Thanks,
    Deepthi
