Zettabyte Storage

Friday, February 09, 2007


XMLRPC is an XML based RPC (remote procedure call) mechanism (naturally enough). The purpose of an RPC is to make transparent the rough details of doing networking stuff. You call an RPC function and it looks to the programmer as if they made any other function call. Of course, it isn't another function call (latency, potential for failure, etc), but it shouldn't really matter to the programmer. Thus, any mechanism that calls itself RPC has the explicit design constraint of being totally and utterly transparent to the user.

The problem is that, with default settings, XMLRPC will corrupt almost any non-ASCII data that you put through it.

The core of the problem is that XMLRPC's default character encoding is ISO-8859-1 and data chunks are not by default placed in CDATA sections. ISO-8859-1 is a single-byte, code-paged character encoding. The way this works is that a byte with the high bit clear is plain ASCII, while a byte with the high bit set (non base ASCII) selects one extra, non-ASCII character from the current codepage.* The problem occurs because any character in an XML document that is not in a CDATA section must either be in the character set of the XML document itself or be written as a numeric character reference. Thus, the character set that your XMLRPC toolkit thinks you are using will change the internal representation of the character. Even if you are using the same character set on both sides of the call, you will get corrupted text out if that character set is not ISO-8859-1. Let's see an example to clarify.
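As a rough illustration of that rule, here is a minimal sketch in Python (rather than the PHP used in the footnotes). The helper name encode_for_latin1_xml is made up for this example. It shows one character that fits in ISO-8859-1 being stored directly, and one that does not being written as a numeric character reference.

```python
# Python calls ISO-8859-1 "latin-1".

def encode_for_latin1_xml(text: str) -> bytes:
    """Hypothetical helper: encode text for an ISO-8859-1 document,
    turning anything outside that character set into &#NNNN; references."""
    return text.encode("latin-1", errors="xmlcharrefreplace")

# U+00E9 (e with acute accent) is part of ISO-8859-1: one byte, 0xe9.
e_acute = encode_for_latin1_xml("\u00e9")

# U+20A2 (the symbol used in this post) is not, so it must be escaped.
cruzeiro = encode_for_latin1_xml("\u20a2")

print(e_acute)   # b'\xe9'
print(cruzeiro)  # b'&#8354;'
```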

Say we want to send the string "₢" across an XMLRPC connection. That symbol (I originally copied it off of some Unicode FAQ or other, so don't blame me if it is a vile invective in your native language) is encoded in UTF-8 as the bytes 0xe2, 0x82, 0xa2. This can be discovered by running 'echo -n "₢" > test.txt && hexdump -C test.txt' with your console in UTF-8 mode (so you can use some other character if you find "₢" to be particularly offensive). What we want XMLRPC to do is encode the three-byte sequence for "₢" as a whole; however, our toolkit doesn't know that our character set is UTF-8, so it handles each byte independently. Ergo, 0xe2 (binary 1110 0010) is not an ASCII character (note that the high bit is set), so the implementation encodes it as the numerical equivalent of itself: &#226;. The other two bytes get the same treatment and become &#130; and &#162;. A correct implementation** would encode the whole character as &#8354;, the numeric reference for its Unicode code point, which is how a character outside ISO-8859-1 is written in such a document. The real magic is what happens on the other side. Since the XML document is in ISO-8859-1 mode, the decoding process takes each of the three numbers and decodes it independently as a character of its own. Written back out as UTF-8, this comes out as 0xc3, 0xa2, 0xc2, 0x82, 0xc2, 0xa2, 0x0a*** (the trailing 0x0a is just the newline that the test script prints). This looks like "â‚¢" when printed - not very like "₢" at all really.
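The round trip described above can be reproduced in a few lines. This is a sketch in Python rather than PHP. The per-byte escaping on the sending side is an assumption about how a toolkit with the wrong idea of the input encoding behaves, not any particular library's code.

```python
import xml.etree.ElementTree as ET

original = "\u20a2"                     # the symbol used in this post
utf8_bytes = original.encode("utf-8")   # the bytes e2 82 a2

# Sender (assumed behaviour): every byte is treated as its own character
# and escaped as a numeric character reference.
escaped = "".join(f"&#{b};" for b in utf8_bytes)
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    f"<test><raw>{escaped}</raw></test>"
)

# Receiver: a standard XML parser turns each reference into one code point.
decoded = ET.fromstring(document.encode("ascii")).find("raw").text

# Writing those three code points out as UTF-8 takes two bytes apiece.
mangled_bytes = decoded.encode("utf-8")

print(escaped)              # &#226;&#130;&#162;
print(mangled_bytes.hex())  # c3a2c282c2a2
```

The same six bytes fall out of `utf8_bytes.decode("latin-1").encode("utf-8")`, which is the usual signature of UTF-8 text that was mistaken for ISO-8859-1 somewhere along the way.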

By now I'm sure you are asking why this error is XMLRPC's fault: aren't I the one that failed to encode my strings correctly before sending them to the XMLRPC handler? No, this is not my fault: remember what RPC means. As soon as the remote procedure call loses its transparency, it is no longer a procedure call, it is a networking protocol. The purpose of an RPC implementation is to give operational transparency; by default XMLRPC does not give operational transparency because it will mangle binary data that you put into it. What then if it was not a character (like ₢) that we wanted to send, but a bitmap image? Should we localize a bitmap to a specific codepage before we send it? Will the localization function crash when it comes up against crazy new sequences of random binary data? Probably. This is definitely the fault of the XMLRPC protocol and not the programmer or the implementer.

Fortunately, not all is lost. XML is incredibly robust. Any sane implementation of XMLRPC is using an off-the-shelf XML implementation, so it will decode UTF-8 internals and CDATA sections without batting an eye. The RPC mechanism doesn't need to know that we're tricking it into working correctly. There are two things we can do at the toolkit level to make XMLRPC mostly work. The first is to tell our toolkit to use CDATA sections to enclose our data. This will ensure that the data is not transliterated into random other character encodings; however, it does have a downside. The internet is not 8-bit character safe most of the time****, so if you are sending this XML document "into the wild" over the internet it may get shredded - the 8-bit characters in the XML document may be mutilated by other internet technologies that we don't have control over. Our other option is to set the character encoding of the document to the character encoding of the data we are packing into it. This will also prevent any weird transliteration. The problem with this method is that it doesn't help us with random binary data, or if we have multiple encodings to send.
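The second option, declaring the document's encoding to match the data, might look like the following with Python's standard xml.etree.ElementTree. It is a sketch of the idea rather than the configuration of any specific XMLRPC toolkit, and it assumes Python 3.8 or later for the xml_declaration argument.

```python
import xml.etree.ElementTree as ET

# Build the same tiny document as in the footnotes.
root = ET.Element("test")
ET.SubElement(root, "raw").text = "\u20a2"

# Serialise with the document encoding set to UTF-8, matching the data,
# so the character travels as its real bytes (e2 82 a2).
wire_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The receiver reads the declaration and recovers the original character.
received = ET.fromstring(wire_bytes).find("raw").text

print(received == "\u20a2")  # True
```

As the paragraph above notes, this only helps when everything in the document shares one encoding; it does nothing for arbitrary binary data.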

For our purposes, setting CDATA works as a perfect fix because we have complete control of the channel. In other situations, your mileage may vary.

* - You will probably immediately notice that this extended range will not nearly cover all characters in a language; thus, there is a code page for practically every language in existence. (Asian symbol based languages are the notable exception here since even the extended range cannot cover even a tenth of the codepoints they need to represent.)

** - PHP code is:

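// Build an empty ISO-8859-1 document, then add the UTF-8 string as a child.
$xml = simplexml_load_string('<?xml version="1.0" encoding="iso-8859-1"?><test></test>');
$xml->addChild( 'raw', "₢" );
// asXML() serializes in the document's encoding, escaping what does not fit.
print( "XML: " . $xml->asXML() . "\n" );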

*** - PHP code is:

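// Parse a document holding one character reference per UTF-8 byte of "₢".
$raw = '<?xml version="1.0" encoding="ISO-8859-1"?><test><raw>&#226;&#130;&#162;</raw></test>';
$xml = simplexml_load_string( $raw );
// SimpleXML hands strings back as UTF-8, so each reference becomes two bytes.
print( "CODE: " . $xml->raw . "\n" );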

Edit (23 Feb 2007): My entities were getting escaped correctly by HTML leading to the wrong (err.. correct) displayed characters onscreen. The blogger interface is such that I sometimes forget that it will not always do the hard work of writing the actual web-page for me. I have manually encoded what I wanted to print into another layer of entities so that they will show up correctly.

Edit (13 Oct 2010):
**** - I've heard that pigeons hate 8-bit encodings. Otherwise, this statement has not been true enough in at least 20 years to merit even passing mention. I'm not even sure where I first heard it or why I believed it to be true, since as far as I can tell, it's not.

