Zettabyte Storage

Wednesday, March 07, 2007

I hate to brag, but....


The case that I designed, and thoroughly bragged about already last week, was awarded Protocase's "Case of the Month" for January. I held off congratulating myself about it (at least in this forum) until now because the official page had not gone up. Now that it has been officially announced, I feel no qualms about celebrating my accolades a little less surreptitiously. Specifically, go here - read and be amazed. We spent a long time (probably too long) poking at and discussing the text. I think we do a good job of generating marketing material; the problem is that it takes much longer and feels far less productive than coding, so it is always the last thing we want to do. Let us know what you think.

Wednesday, February 21, 2007

switch( Protocase ) { case 'awesome': return true; }


The main reason we were so busy last week is that we were doing final prep on a new case design. The new Pro unit is smaller, sleeker, quieter, and more power-efficient. In fact, it's even pretty awesome looking. That, however, is not why I am writing this post. I am writing to give props to Protocase. They have excellent customer service. The first prototype unit they sent us had the silkscreening on backwards.
Whoops!
They got a new unit into production just hours after we shot them an email explaining the problem.
Much Better!

The fixed case arrived just two days after the original and we had enough time to finish the case by our expected deadline (although we did have to stretch it all the way to the end).

Fully assembled, it is something we are quite proud of. Admittedly, a picture alone is hard-pressed to do it justice. The real beauty of the case is that it is built like a tank - 14-gauge solid steel construction, a solid all-metal button, and no flimsy plastic to wear or break. You really have to hold one to get a feel for the level of quality that has gone into its construction. The units are quite small and densely packed with high-tech goodies - just enough room left for good airflow, something we have given a great deal of thought to. Since we switched to an external power converter, the box is massively overcooled; we like that trait in a case. The unit puts out only 30W at full tilt with the new power system - a vast improvement over our old power supply. The two intake fans are low-speed Panaflow NMB-series units; with the power supply moved outside and the fans running slow, the box is amazingly quiet. The fans are covered on the outside with anodized black mesh filters for easy cleaning; we know, because we have been using one of the earlier-model cases to house ZBS's core router for the past 8 months, and it is phenomenally easy to clean and maintain.

And of course, I can hardly fail to mention the shiny black powder-coated, scratch resistant finish with our gorgeous logo emblazoned on the top.

Friday, February 09, 2007

XMLRPC WTF

XMLRPC is, naturally enough, an XML-based RPC (remote procedure call) mechanism. The purpose of an RPC is to hide the rough details of doing networking stuff: you call an RPC function and, to the programmer, it looks like any other function call. Of course, it isn't just another function call (latency, potential for failure, etc.), but that shouldn't really matter to the programmer. Thus, any mechanism that calls itself RPC has the explicit design constraint of being totally and utterly transparent to the user.

The problem is that, with default settings, XMLRPC will corrupt most any non-ASCII data that you put through it.

The core of the problem is that XMLRPC's default character encoding is ISO-8859-1 and data chunks are not, by default, placed in CDATA sections. ISO-8859-1 is a code-paged character encoding: every character is a single byte, and any byte with the high bit set (outside base ASCII) refers to a non-ASCII character in the current code page.* The problem occurs because any character in an XML document that is not in a CDATA section must be in the character set of the XML document itself. Thus, the character set that your XMLRPC toolkit thinks you are using will change the internal representation of the character. Even if you are using the same character set on both sides of the call, you will get corrupted text out if that character set is not ISO-8859-1. Let's see an example to clarify.

Say we want to send the string "₢" across an XMLRPC connection. That symbol (I originally copied it off of some Unicode FAQ or other, so don't blame me if it is a vile invective in your native language) is encoded in UTF-8 as the bytes 0xe2, 0x82, 0xa2. This can be discovered by running 'echo -n "₢" > test.txt && hexdump -C test.txt' with your console in UTF-8 mode (so you can use some other character if you find "₢" to be particularly offensive). What we want XMLRPC to do is encode the three-byte sequence for "₢" as a whole; however, our toolkit doesn't know that our character set is UTF-8, so it handles each byte independently. Ergo, the toolkit reads 0xe2 (b1110,0010) as the ISO-8859-1 character at that code point (note that the high bit is set) and escapes it as the numerical equivalent of itself: &#226;. A correct implementation** would know the data is UTF-8 and encode the character as a whole: &#8354;. The real magic is what happens on the other side. Since the XML document is in ISO-8859-1 mode, the decoding process takes each of the three numeric references and decodes it independently to its own character. Handed back to us re-encoded in our local UTF-8, this comes out as the bytes 0xc3, 0xa2, 0xc2, 0x82, 0xc2, 0xa2, 0x0a.*** That looks like "â‚¢" when printed - not very like "₢" at all really.

By now I'm sure you are asking why this error is XMLRPC's fault: aren't I the one who failed to encode my strings correctly before sending them to the XMLRPC handler? No, this is not my fault: remember what RPC means. As soon as the remote procedure call loses its transparency, it is no longer a procedure call; it is a networking protocol. The purpose of an RPC implementation is to give operational transparency; by default, XMLRPC does not give operational transparency because it will mangle binary data that you put into it. What if it were not a character (like ₢) that we wanted to send, but a bitmap image? Should we localize a bitmap to a specific codepage before we send it? Will the localization function crash when it comes up against crazy new sequences of random binary data? Probably. This is definitely the fault of the XMLRPC protocol and not the programmer or the implementer.

Fortunately, not all is lost. XML is incredibly robust, and any sane implementation of XMLRPC is using an off-the-shelf XML implementation, so it will decode UTF-8 internals and CDATA sections without batting an eye. The RPC mechanism doesn't need to know that we're tricking it into working correctly. There are two things we can do at the toolkit level to make XMLRPC mostly work. The first is to tell our toolkit to use CDATA sections to enclose our data. This ensures that the data is not transliterated into some other character coding; however, it does have a downside. The internet is not 8-bit character safe most of the time****, so if you are sending this XML document "into the wild" over the internet, it may get shredded - the 8-bit characters in the document may be mutilated by other internet technologies that we don't have control over. Our other option is to set the character encoding of the document to the character encoding of the data we are packing into it. This will also prevent any weird transliteration. The problem with this method is that it doesn't help us with random binary data, or if we have multiple encodings to send.
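
As a concrete sketch of both fixes, here is what they look like with PHP's xmlrpc-epi based extension (the method name and payload are made up, and the option names are that extension's output options - your toolkit will spell them differently):

<?php
/* Fix 1: wrap string data in CDATA sections so it is never transliterated. */
$request = xmlrpc_encode_request( 'example.method', array( "₢" ),
    array( 'escaping' => array( 'cdata' ) ) );

/* Fix 2: declare the document encoding to match the data we are packing in. */
$request = xmlrpc_encode_request( 'example.method', array( "₢" ),
    array( 'encoding' => 'utf-8' ) );
?>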

For our purposes, setting CDATA works as a perfect fix because we have complete control of the channel. In other situations, your mileage may vary.


* - You will probably immediately notice that this extended range does not come close to covering all characters in every language; thus, there is a code page for practically every language in existence. (Asian symbol-based languages are the notable exception here, since even the extended range cannot cover even a tenth of the codepoints they need to represent.)

** - PHP code is:

<?php
/* What a correct implementation produces: libxml knows the document is
   iso-8859-1, so it escapes the UTF-8 character as a numeric reference
   when serializing. */
$xml = simplexml_load_string( '<?xml version="1.0" encoding="iso-8859-1"?><test></test>' );
$xml->addChild( 'raw', "₢" );
print( "XML: " . $xml->asXML() . "\n" );
/* Should print something like (assuming this source file is saved as UTF-8):
   XML: <?xml version="1.0" encoding="iso-8859-1"?><test><raw>&#8354;</raw></test> */
?>


*** - PHP code is:

<?php
/* What the receiving side sees: the three numeric references decode
   independently, and SimpleXML hands the result back to us as UTF-8. */
$raw = '<?xml version="1.0" encoding="ISO-8859-1"?><test><raw>&#226;&#130;&#162;</raw></test>';
$xml = simplexml_load_string( $raw );
print( "CODE: " . $xml->raw . "\n" );
/* Prints the bytes 0xc3 0xa2 0xc2 0x82 0xc2 0xa2 - "â‚¢" on a UTF-8 console. */
?>


Edit (23 Feb 2007): My entities were getting escaped correctly by HTML, leading to the wrong (err.. correct) displayed characters onscreen. The Blogger interface is such that I sometimes forget that it will not always do the hard work of writing the actual web-page for me. I have manually encoded what I wanted to print into another layer of entities so that they will show up correctly.

Edit (13 Oct 2010):
**** - I've heard that pigeons hate 8-bit encodings. Otherwise, this statement has not been true enough in at least 20 years to merit even passing mention. I'm not even sure where I first heard it or why I believed it to be true, since as far as I can tell, it's not.

Sorry I haven't put up a blog post in a couple of weeks. Things have been crazy busy here and I haven't had time even to upload some of the posts that I had proto-finished. We have a momentary lull between bouts of heavy work, though, so I'm going to take some time this week to finish up and post all the stuff that has been lying around waiting for its final touches. Expect a great deal more blogging shortly.

Sunday, January 21, 2007

A Perfunctory Semantic Note about ECMAScript Javascript

There are probably quite a few of you out there who have started reading my series on Making Javascript Useful and are wondering why I insist on referring to the ECMAScript language as Javascript. There are a few reasons.

First, almost nobody knows what ECMAScript is, aside from myself and you, the one reader whose eyes have not yet glazed over. If I were to call it, somewhat more properly, ECMAScript, then essentially nobody would ever find this series, and even if they did, they probably wouldn't know or care what ECMAScript is, because they only know how to program in Javascript. Everyone who knows what ECMAScript is knows that Javascript is ECMAScript; the opposite is certainly not true.

The second, and more important, reason is more than mere pedantry. ECMAScript is a language definition; Javascript is the language that is available "in the wild" on essentially every computer in the world. Although it might be true to say that SpiderMonkey (name your own favorite backend here, I don't care) is an implementation of the ECMAScript language, SpiderMonkey is not the program that runs your scripts; Firefox (or another appropriate browser) is the program that runs your scripts. This program comes with baggage like the DOM, AJAX, Netscape 2 compatibility, the W3C event model, etc., whereas the ECMAScript language doesn't really care one way or the other what interfaces your program provides to the interpreter. Since we only really care (at the moment, at least) about making our ECMAScripts run "in the wild", we are really worried about Javascript. ECMAScript will come along for the ride, most certainly, but Javascript is the real target of our adventure.

Tuesday, January 16, 2007

Making Javascript Useful: Part 1, Simple Classes

Before I get into this post, I should mention that I am not the first person to write what is essentially the first three articles of this series. It's also probably not the best discussion of the topic. When I got started seriously using Javascript, much of this information did not exist (at least on the web where I looked). After writing Part 0 and most of Part 1 (this post), I realized that the internet (1) is not static and (2) tends towards ever greater knowledge. Sure enough, when I looked again, there was much more information than had been available only a year ago. Now that I no longer need the information, it is more plentiful than the grains of sand on a beach - it prickles similarly too.

From my cursory scan of Google's top picks, one of the best places to go looking for advanced Javascript is http://javascript.crockford.com - the article "JavaScript: The World's Most Misunderstood Programming Language" even has a section called "Lisp in C's Clothing". Not only am I late to the party, but Mr. Crockford has much catchier wording. All is not lost, however; there is as yet very little (good) information about polymorphism out there. Besides, this is all really just introductory material for the real topic of this series: making javascript and the web a real programming and distribution environment. That goes well beyond OOP and should provide me with interesting things to say well into the future.

Sadly, I couldn't find the article that originally pointed out how to do some very basic OOP things with Javascript. So props go to whoever the mystery hacker happened to be, even if I can't remember their name.

And now on to our main feature:

What it all comes down to is playing up to your talents. When you get down to it, Javascript has a very odd set of talents.

The things we want to do with Javascript are essentially the things we want to do with any other programming language: build tools powerful enough that doing the really hard work can be made entirely someone else's problem. What are the tools that javascript* comes with "out of the box"?

  1. generic Functions and Objects and the things we like doing with them (closures, hashing, et al.)
  2. a plethora of high- and low-level syntax (operators, if/then, while, function, et al.)
  3. free and easy memory management (when and if it works at all)
  4. a powerful string library
There are other things that come to mind (and probably some important ones that don't), but these are the big 4. This is actually a much bigger set of functionality than most environments give us. For instance, C only gives us 2**. Everyone who builds a significantly large piece of code in C ends up reimplementing 1, 3, and 4 out of hand; Apache has its pools and buckets, gtk has GObject, etc.

The one thing that Javascript doesn't give us (and, ironically, the Big Thing for its namesake Java) is OOP. Of course we have the "Object" object, but what we really want is a class template, instances, public and private data, interfaces, super/subclassing, etc.

Without further ado, let's define a class:


function MyClass()
{
    <blah>
}


You are probably wondering why our "class" is defined with the 'function' keyword. Remember that the 'function' keyword is just a shortcut for declaring something like Function( "" ). Since a Function is just another 'Object', we can tack properties (hash-table keys) onto it willy-nilly; if those properties happen to be other functions, so much the better. Specifically, we are going to use the 'this' keyword from inside the function to tack properties onto the new instance object that 'new' hands us (more on 'new' below). Thus, like a real class, our "class" will be a wrapper for function definitions, like so:

Defining public functions:


function MyClass()
{
    this.MyPublicMethod = function()
    {
    };
}


The problem here is the value of 'this'. I've said that 'this' refers to the object we are constructing, but it is not clear why that should be the case. If you just call the MyClass function directly (it is really a function, after all), you would not expect 'this' to refer to anything useful, since MyClass resides in the top-level containing block, and 'this' always refers to an Object(ish). Enter the 'new' keyword. The 'new' keyword creates a fresh Object (with its prototype wired to MyClass.prototype) and calls MyClass with 'this' bound to that new object. The 'new' keyword is what creates the class's instance.
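
To make that concrete, here is a rough sketch of what 'new MyClass()' does behind the scenes (this glosses over the prototype wiring, which we won't need until we get to polymorphism):

var instance = {};          /* a fresh, empty object                      */
MyClass.call( instance );   /* run MyClass with 'this' bound to instance  */
/* instance now carries every property the constructor tacked onto 'this' */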

We instantiate the class like so:


var my_class_instance = new MyClass();
/*
typeof my_class_instance == "object"
typeof my_class_instance.MyPublicMethod == "function"
*/


Note: in the definition of MyClass there is a semicolon after the definition of MyPublicMethod. Consider this a litmus test: if you understand what we're doing with the Function object, it should be obvious why this is needed. If you don't understand why there is a semicolon here, you should think about it until you do: it is important.***

Defining Public Data:


function MyClass()
{
    this._myPublicData = 0;
}


Since you understand that 'this' is just tagging keys into the instance object, it should be pretty obvious that we can do the same thing with other types of data as well.
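
For example (a hypothetical usage of the class above):

var my_instance = new MyClass();
my_instance._myPublicData = 42;              /* anyone can read or write it */
window.alert( my_instance._myPublicData );   /* 42 */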

Private functions and variables are a little trickier. First, let's review the concept of the "closure". I think I must have accidentally slept through the lecture where this was defined formally, but it's actually really easy. Just go read Wikipedia's article on it if you are not familiar. (http://en.wikipedia.org/wiki/Closure_%28computer_science%29) One of the things that I purposely failed to mention earlier is that the functions we define inside the class 'function' form a "closure" over its local variables. Essentially, what this boils down to is that the variables you use in the constructor are kept around with their current values for as long as the inner functions live, and are available to any function you define within the class 'function'. They are not available to anything that is not declared inside the function.
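
If you would rather see it than read about it, here is a minimal sketch of a closure (my own example, not Wikipedia's):

function make_adder( x )
{
    /* the inner function "closes over" x, so x outlives this call */
    return function( y ) { return x + y; };
}

var add_five = make_adder( 5 );
window.alert( add_five( 2 ) ); /* 7 - x is still alive inside the closure */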

Defining Private Stuff:


function MyClass()
{
    var _myPrivateData = 0;
    var _myPrivateFunction = function()
    {
        _myPrivateData++;
    };
}

var my_instance = new MyClass();
/* window.alert( my_instance._myPrivateData ); // shows "undefined" - the property is not on the instance */
/* my_instance._myPrivateFunction(); // throws a TypeError - the function never left the closure */


These examples give us everything that we need to know to define, instantiate, and use basic (non-polymorphic) objects. In my next article, I will wrap up the basics by implementing polymorphism in javascript.
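
To tie the pieces together, here is a minimal sketch (the names are mine) of a class with public data, private data, and a public method that reaches the private data through the closure:

function Counter()
{
    var _count = 0;                 /* private: lives only in the closure */
    this.label = "my counter";      /* public: a property on the instance */
    this.increment = function()     /* public method using private data   */
    {
        _count++;
        return _count;
    };
}

var c = new Counter();
c.increment();                      /* 1 */
c.increment();                      /* 2 */
/* c._count is undefined - the private variable never leaves the closure */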

* - I don't care about your browser's DOM tree and its cute little HTML renderer - we are talking about core ecmascript here

** - Keep in mind that I said a "powerful" string library; the cstdlib hardly counts.

*** - You are creating a Function object with the 'function' keyword and assigning it to a property of the class's instance object. The semicolon goes after the property assignment, even if it is a function you are assigning. Actually, you don't really _need_ a semicolon here: a conforming Javascript interpreter will be able to deduce the statement ending correctly. Of course, that notion presupposes that there exists some conforming Javascript interpreter. In general, if a semicolon can go there, it should go there, because it will almost always make the interpreter's job easier.

Sunday, January 07, 2007

Making Javascript Useful: Part 0, Taking off the Training Wheels

I am growing fond of Javascript.

Javascript is a fascinating language. In this respect, most serious* languages are fairly ordinary: in the first five minutes of reading a language tutorial, you generally have a grasp of that language's big "thing".** For instance, if you are coding in Ruby, you know that Everything-is-an-Object, so you put on your Noun hat and get to work. In Python, Spacing-is-Blocks-so-you-Better-Make-Small-Modules, so you open your file manager alongside your editor. In PHP, PHP-Interoperates-With-Apache, so you open your ssh terminal and ftp client. In general, this is a "Good Thing." The unifying feature of the language helps the nascent acolyte learn the language by giving her mind something to put a handle on - a solid, defensible concept to inform and direct the learning process.

I don't want to give the impression that a language's big selling point is all there is to the language, or that these languages are reducible to that one "thing". That said, it is important to note that they almost always start that way.

With most languages, the initial concept grows on the programmer until that framework of thinking enables her to do wonderful things within the language. This almost always happens (to me at least) when I start to get a real "feel" for how the language works and know without having to look at a reference manual how it is going to handle something I haven't seen before. If you have ever known that gloating sensation of supreme competence when something new works the first time, you know what I am talking about. When you get to this stage, your abstractions become clear and concise as a matter of habit and the code almost writes itself.

Although this competence comes faster with every language I learn, it generally comes slowly over the course of several months. With Javascript, the competence came overnight, after a full nine months of learning. I'm not bragging about this: it is actually rather sad that I didn't "get it" sooner. The article that really made it click was this one. This is the Mozilla Foundation's Core Javascript reference on Functions. If you work with javascript at all regularly, you owe it to yourself to read and understand that article.

So, have you all gone and read the article? No? Well go do so, it is quite fascinating. You disagree? I suppose you can disagree, but you might miss an important point.

"That was certainly interesting," I can hear you saying***, "but what is the point?" The point is: this is somewhat different from what you will hear about javascript in any tutorial and almost every book about the language. Most presentations of javascript that I have seen point out that javascript looks pretty much like C and leave it at that. So javascript becomes the Javascript-Is-Very-Much-Like-Java-:-That-Hip-And-Popular-Language-You-Actually-Have-Heard-About-And-We-Kinda-Look-Like-That-Too-If-You-Squint-And-Don't-Do-Anything-Too-Complicated language. So coming into javascript programming, you really have No Idea what to expect, except that it's something like C and Java, which, if you've actually used one of these languages is very obviously, poignantly untrue. Thus, you define your one-deep functions with the "function" keyword and use 'if' statements and 'for' loops as if they were the garden variety C variants of those constructs and hope for the best. Perhaps, you will brush up against something more esoteric in the cataclysmic depths of the ecmascript standard; you may notice the odd property that 'everything is a hash table.' Odd things though you may see, it doesn't really keep us from thinking about javascript as a quirky, clunky C'ish variant.

Well, what changes when you read the article? Probably nothing, if you just skimmed it. If you didn't notice it right off, try replacing the word "Function" in that article with the word "Lambda." Notice how the javascript Function object allows you to do Lambda-calculusy things with your code. As it turns out, in many ways, javascript has more in common with Lisp than with C.

Consider the naive recursive Fibonacci number generator in "vanilla" javascript - it looks almost like the C equivalent:


function fib1( n )
{
    if( n < 2 ) {
        return n;
    }
    return fib1( n - 1 ) + fib1( n - 2 );
}


Now we can write this as a Function object:


var fib2_code = "if( n < 2 ) {return n;} return fib2( n - 1 ) + fib2( n - 2 );";
var fib2 = Function( "n", fib2_code );


Now it looks like poorly written C code. Of course, this doesn't really get us anything new or different; however, since this code is a string, we can modify it like a string, similar to the way we can edit Lisp s-expressions:


var myfunc = Function( "n", "if( n < 4 ) { return n - 2; } return n * 2;" );
var fib3_code = fib2_code.replace( /fib2/g, "myfunc" );
var fib3 = Function( "n", fib3_code );
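
For the curious, here is how these evaluate (my own worked example; remember that myfunc is not a Fibonacci function, so fib3 is deliberately nonsense):

fib2( 10 );   /* 55 - the real 10th Fibonacci number      */
myfunc( 10 ); /* 20                                       */
fib3( 10 );   /* 34 - myfunc( 9 ) + myfunc( 8 ) = 18 + 16 */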


Naturally, this is a rather silly example, but if you use your imagination, you can think of some rather clever (perhaps even devious) constructs that you can build with this technique. At the very least, it expands the "typical" javascript toolbox to include both Lisp'ish and C'ish constructs. With this realization, it should be relatively clear why I find javascript so extraordinary: it combines the mathematical generality and flexibility of Lisp with the familiar syntax of C.
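
As one hypothetical example of such a construct (the names are mine), we can manufacture a self-tracing variant of fib2 from its source string:

var traced_code = "window.alert( 'fib( ' + n + ' )' );" +
    fib2_code.replace( /fib2/g, "fib_traced" );
var fib_traced = Function( "n", traced_code );
fib_traced( 3 ); /* alerts once for every call in the recursion tree */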

Of course, the fun hardly ends here. I'll write an article soon that deals with some of the ramifications of these internals. As it turns out, this will have a dramatic impact on how we implement, abstract, factor, and secure our client-side code.


* - I love BrainF*ck too, but until I see a web-server written in it, it fails my litmus test for being a "serious" language.

** - I am not insulting your favorite language! I love your favorite language too! However, to an outsider, a language's big selling point _is_ the language. Bear with me.

*** - Yes, that was (almost certainly) ventriloquism.

Friday, January 05, 2007

Waiting around is hard work. Let me explain: we take data backup seriously. A large part of that seriousness comes into play in having the discipline to build and apply good tests and good testing procedures against our core backup code. The test suites we run against Perseus (our file mirroring agent) are split into four main components: feature tests, unit tests, upgrade tests, and the integration test. Before a release of Perseus gets anywhere near the Zettabits patch network, it has to pass all of these tests. Once we are done vetting a release against our test suite, we push the changes out to our 'testing' network, on which we run our internal dev machines. After we poke and prod it in a production-like environment to our satisfaction, we push it out to our Beta network. The zBox that hosts our giant code repository runs on the Beta patch network, so by the time we get code to this stage, we're staking our own data on its stability and correctness. Before we push a patch live, we always do a full restore to a fresh zBox from our own backups. Although this process generally produces exemplary code, it can take a frustratingly long time to get changes into the field.

I think the best way to put the test suite in perspective is with a simple line count: the test suite is 3.5 times larger than the core code.

The feature test suite is the first test set we built against Perseus, before we had even a line of code. Each feature test looks for a single specific feature (e.g. unicode filename support for directories) and does a complete backup/restore cycle, checking the results at each stage for correctness. As we add features to Perseus, our feature tests give us quick feedback about our progress implementing each feature.

On the other hand, the unit test suite picks at individual bits of code. Generally, this involves overriding much of the rest of the system with dummy modules. These modules then lie through their interfaces to the module under test, in the hopes of coaxing out a wrong result.
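
As a sketch of the idea only (in Javascript for illustration - this is not Perseus code, and every name here is invented):

/* A dummy transport that always reports success, so we can check whether
   the module under test notices the garbage it gets back. */
var dummy_transport =
{
    send:    function( bytes ) { return { ok: true, sent: bytes.length }; },
    receive: function()        { return "definitely not your data"; }
};

run_mirror_unit_test( dummy_transport ); /* hypothetical test harness entry point */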

The integration test suite is our "big-bang" test. This test is multi-tiered: as the test runs, we add files, remove files, rename files, and change and update files, backing up and restoring to verify the content several times over the course of the test. This test attempts to catch every use case that we can imagine and rolls it into a single big cruncher that we can run to get a yes/no answer.

The upgrade tests are smaller versions of the integration test. They work similarly to the integration test; however, they change the version of Perseus in between test phases. This ensures that when we push a new version of Perseus, no matter what version happens to be running on a client's box in the field, it will cleanly transition to the new code. The upgrade test runs from every version of Perseus that has been in the field to the current version.

Between these tests, we have a pretty good idea of how well we are doing when working on Perseus. On my desktop and on the Pro edition zBox, they all run in about an hour; on our standard edition zBoxes, this takes more like two or three hours. The longest test is the restore we run against our own massive archives. Even on our business-class cable connection, this takes several days. Of course, if any of the tests fails, we have to start over at the beginning.

Waiting for tests to finish can be trying when we have so much work invested in the code - I want to know if it works now. On the other hand, the assurance of having such a rigorous test suite makes the wait well worth it.