11 Aug 2010

Does XML really threaten big data?

Dataspora (or Michael Driscoll) has just put out a piece on why XML is incompatible with the idea of big data. It raises some interesting points and is well worth a read to fuel your own rage at XML schemas that make no sense at all (we’ve all seen them).

My main bone of contention with XML, as Michael mentions, is the storage and transaction overhead you have to endure for what could be a trivial volume of data. An XML document can easily triple the number of lines in a file and double its size on disk, whilst the purity of CSV/TSV files gives you just the data.
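To get a feel for that overhead, here is a rough sketch that serialises the same few records as CSV and as XML and compares the sizes. The records, field names and element names are made up for illustration, and the exact ratio will depend entirely on your data and tag names, so treat it as a quick experiment rather than a benchmark.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented purely for illustration.
rows = [
    {"station": "Bank", "zone": "1", "entries": "1200"},
    {"station": "Brixton", "zone": "2", "entries": "800"},
    {"station": "Morden", "zone": "4", "entries": "300"},
]


def to_csv(records):
    """Serialise the records as CSV: one header line plus one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records):
    """Serialise the records as XML, one child element per field."""
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(rows)
xml_text = to_xml(rows)

# Every field name is repeated twice per value in the XML (open and close tag),
# whereas the CSV states each name once in the header.
print("CSV characters:", len(csv_text))
print("XML characters:", len(xml_text))
```

The gap comes from the tags: the longer your element names and the shorter your values, the worse XML looks by comparison.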

It is this last point, however, that attracts me to the notion of XML. It gives you some structure, and I don’t mean structure you have to infer from the column headings in a flat file. It is for this reason that I don’t see the demise of XML in places like the London Data Store, even though some people seem to be calling for it.

As with most stories, there are two sides to using XML to store data. Schemas dictate that data be structured, encoded and formatted in a strict manner, but the reality of working in a world of data is that whatever rules you apply to a process, someone will always manage to break them and upset your parsers.
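As a small illustration of an upset parser, the sketch below feeds one well-formed and one broken document to Python's standard library parser. The sample documents and the try_parse helper are my own inventions, and note that this only checks well-formedness: validating against an actual schema needs a third-party library, which I have left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, invented for illustration.
well_formed = "<record><name>Bank</name><zone>1</zone></record>"
broken = "<record><name>Bank</name><zone>1</record>"  # <zone> is never closed


def try_parse(text):
    """Return the parsed root element, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        # A single unclosed tag is enough to reject the whole document.
        print("Parser upset:", err)
        return None


good_root = try_parse(well_formed)
bad_root = try_parse(broken)

print("well-formed document parsed:", good_root is not None)
print("broken document parsed:", bad_root is not None)
```

With a flat file, a bad row can often be skipped and the rest salvaged; with XML, one slip like this usually costs you the whole document unless you go out of your way to recover.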

Subjective footnote: consuming Google Docs services via XML is horrendously long-winded; that could do with changing.