Zettabyte Storage

Friday, January 05, 2007

Waiting around is hard work. Let me explain: we take data backup seriously. A large part of that seriousness is the discipline to build and apply good tests and good testing procedures against our core backup code. The test suites we run against Perseus (our file mirroring agent) are split into four main components: feature tests, unit tests, upgrade tests, and the integration test. Before a release of Perseus gets anywhere near the Zettabits patch network, it has to pass every one of these tests. Once we are done vetting a release against our test suite, we push the changes out to our 'testing' network, on which we run our internal dev machines. After we poke and prod it in a production-like environment to our satisfaction, we push it out to our Beta network. The zBox that hosts our giant code repository runs on the Beta patch network, so by the time we get code to this stage, we're staking our own data on its stability and correctness. Before we push a patch live, we always do a full restore to a fresh zBox from our own backups. Although this process generally produces exemplary code, it can take a frustratingly long time to get changes into the field.
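
Very roughly, those gates boil down to a strictly ordered checklist. The sketch below is purely illustrative (in Python, with stub functions standing in for stages that are obviously not a single function call in real life):

    # Illustrative only: each stub stands in for an entire rollout stage.
    def run_full_test_suite():      return True   # feature, unit, upgrade, integration
    def soak_on_testing_network():  return True   # our internal dev machines
    def soak_on_beta_network():     return True   # includes the zBox hosting our repo
    def restore_to_fresh_zbox():    return True   # full restore from our own backups

    RELEASE_GATES = [run_full_test_suite, soak_on_testing_network,
                     soak_on_beta_network, restore_to_fresh_zbox]

    def ready_to_ship():
        # A patch goes live only if every gate passes, in order.
        return all(gate() for gate in RELEASE_GATES)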

I think the best way to put the test suite in perspective is with a simple line count: the test suite is 3.5 times larger than the core code.

The feature test suite is the first test set we wrote against Perseus, before we had even a line of code. Each feature test looks for a single specific feature (e.g. unicode filename support for directories) and runs a complete backup/restore cycle, checking the results at each stage for correctness. As we add features to Perseus, our feature tests give us quick feedback about our progress implementing each one.
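
To give a flavor of what one of these looks like, here is a minimal sketch (in Python, for illustration); backup() and restore() are hypothetical stand-ins for the real harness hooks:

    # A minimal sketch of a feature test; backup() and restore() are
    # hypothetical stand-ins for the real test harness.
    import os, shutil, tempfile

    def test_unicode_directory_names(backup, restore):
        """One feature, one full backup/restore cycle, checked at each stage."""
        src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
        try:
            os.makedirs(os.path.join(src, "répertoire_été"))  # the feature under test

            archive = backup(src)                        # stage 1: back the tree up
            assert archive is not None, "backup produced nothing"

            restore(archive, dst)                        # stage 2: restore it elsewhere
            assert "répertoire_été" in os.listdir(dst)   # stage 3: verify correctness
        finally:
            shutil.rmtree(src)
            shutil.rmtree(dst)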

On the other hand, the unit test suite picks at individual bits of code. Generally, this involves overriding much of the rest of the system with dummy modules. These modules then lie their interfaces off to the module under test, in the hope of shaking a wrong result out of it.
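
A toy version of the idea, with a made-up Mirror class standing in for the module under test and a mocked transport doing the lying, might look like this:

    # Toy illustration only: Mirror is a made-up stand-in for a unit under
    # test, and the mocked transport deliberately reports a bogus hash.
    import hashlib
    from unittest import mock

    class Mirror:
        def __init__(self, transport):
            self.transport = transport

        def verify(self, path, data=b""):
            # Compare our local hash against whatever the transport claims.
            return hashlib.sha1(data).hexdigest() == self.transport.fetch_hash(path)

    def test_mirror_rejects_lying_transport():
        transport = mock.Mock()
        transport.fetch_hash.return_value = "deadbeef"   # the dummy module lies
        assert Mirror(transport).verify("some/file") is False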

The integration test suite is our "big-bang" test. It is multi-tiered: as the test runs, we add files, remove files, rename files, and change and update files, backing up and restoring to verify the content several times over the course of the test. This test attempts to catch every use case that we can imagine and rolls them all into a single big cruncher that we can run to get a yes/no answer.
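
Grossly simplified, and again with hypothetical backup() and restore() hooks, the shape of that test is something like this:

    # Grossly simplified; backup() and restore() are hypothetical hooks.
    import filecmp, os, shutil, tempfile

    def cycle_matches(src, backup, restore):
        """Back the tree up, restore it elsewhere, and compare the two trees."""
        out = tempfile.mkdtemp()
        try:
            restore(backup(src), out)
            cmp = filecmp.dircmp(src, out)
            return not (cmp.diff_files or cmp.left_only or cmp.right_only)
        finally:
            shutil.rmtree(out)

    def integration_test(backup, restore):
        src = tempfile.mkdtemp()
        try:
            with open(os.path.join(src, "a.txt"), "w") as f:    # add a file
                f.write("first")
            assert cycle_matches(src, backup, restore)

            os.rename(os.path.join(src, "a.txt"),
                      os.path.join(src, "b.txt"))               # rename it
            with open(os.path.join(src, "b.txt"), "w") as f:    # change it
                f.write("second")
            assert cycle_matches(src, backup, restore)

            os.remove(os.path.join(src, "b.txt"))               # remove it
            assert cycle_matches(src, backup, restore)
        finally:
            shutil.rmtree(src)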

The upgrade tests are smaller versions of the integration test. They work similarly to the integration test; however, they change the version of Perseus in between test phases. This ensures that when we push a new version of Perseus, no matter what version happens to be running on a client's box in the field, it will cleanly transition to the new code. The upgrade test runs for every version of Perseus that has been in the field, upgrading each one to the current version.
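
In rough Python, with hypothetical install_version(), run_backup_phase(), and verify_restore() helpers and an illustrative version list, the loop looks about like this:

    # Illustrative only: the helpers and version numbers are hypothetical.
    FIELDED_VERSIONS = ["1.0", "1.1", "1.2"]
    CURRENT_VERSION = "1.3"

    def upgrade_test(install_version, run_backup_phase, verify_restore):
        for old in FIELDED_VERSIONS:
            install_version(old)                # start from a version that shipped
            run_backup_phase()                  # phase 1: back up under the old code

            install_version(CURRENT_VERSION)    # swap versions mid-test
            run_backup_phase()                  # phase 2: continue under the new code
            assert verify_restore(), "upgrade from %s broke the archive" % old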

Between these tests, we have a pretty good idea of how well we are doing when working on Perseus. On my desktop and on the pro edition zBox, they all run in about an hour; on our standard edition zBoxes, it takes more like two or three hours. The longest test is the restore we run against our own massive archives: even on our business-class cable connection, that takes several days. Of course, if any of the tests fails, we have to start over at the beginning.

Waiting for tests to finish can be trying when we have so much work invested in the code - I want to know if it works now. On the other hand, the assurance of having such a rigorous test suite makes the wait well worth it.
