12 Aug 2010

Data Mining and Crowdsourcing UK Benefit Cheats

The coalition government, and David Cameron in particular, have been lambasted lately over the clampdown on benefit cheats and fraudsters in the UK. This is undoubtedly a serious issue: the UK loses approximately £3bn every year to false benefit claims.

The main point of contention is that the government plans to engage (and has already engaged) commercial personal credit rating companies (e.g. Experian) to do the digging on its behalf. There are quotes floating around that Experian could save the government £17m from housing benefit cheats alone (a deal brokered by Labour). Obviously great news, but the value of the contract that Experian picked up appears to be undisclosed (cue an FOI request, someone).

Instead of further lining private companies' pockets with taxpayers' money, this is a task that could easily be carried out through the goodwill of the general public. Ministers have been quoted as saying that the “public sector cannot do this kind of work”, and that private companies have the skills. This may be so, but there are a good many people in the UK (myself included) who would gladly put in a bit of volunteer time to trawl through the various datasets (DWP, Housing Agency) and look for patterns and trends that suggest benefit fraud.

As a first step, the government would need to anonymise the personal information (a straightforward per-record random hash, assuming integrity is maintained across data sets). This data could then be pushed out via BitTorrent (minimising costs) to those willing to take a look. Individuals could be placed under contract not to disclose findings, and the government could collate and assess the patterns from each set of results.
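To make the hashing step a little more concrete, here is a minimal sketch of one way it could work. It assumes a keyed hash (HMAC-SHA256) with a secret held only by the publisher, so that the same person gets the same token in every data set. The field name `ni_number`, the sample records and the helper names are all made up for illustration; nothing here reflects how the DWP actually stores or would release its data. Bear in mind, too, that replacing the identifier is only the first step: the remaining fields could still identify someone.

```python
import hashlib
import hmac
import secrets


def make_pseudonymiser(secret_key: bytes):
    """Return a function that maps an identifier to a stable, opaque token."""

    def pseudonymise(identifier: str) -> str:
        # Normalise first so the same identifier, formatted differently in
        # two data sets, still produces the same token.
        normalised = identifier.replace(" ", "").upper()
        digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()

    return pseudonymise


def anonymise(records, pseudonymise, id_field="ni_number"):
    """Swap the identifying field for a token, keeping all other fields."""
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k != id_field}
        row["person_token"] = pseudonymise(record[id_field])
        cleaned.append(row)
    return cleaned


# The key is generated once and never published (hypothetical set-up).
key = secrets.token_bytes(32)
pseudonymise = make_pseudonymiser(key)

# Two made-up extracts that refer to the same (fictional) person.
dwp_records = [{"ni_number": "QQ123456C", "benefit": "JSA"}]
housing_records = [{"ni_number": "qq 12 34 56 c", "benefit": "Housing"}]

dwp_public = anonymise(dwp_records, pseudonymise)
housing_public = anonymise(housing_records, pseudonymise)

# The identifier is gone, but the two extracts can still be joined.
print(dwp_public[0]["person_token"] == housing_public[0]["person_token"])
```

A keyed hash is used here rather than a plain one because identifiers with a small, known format could otherwise be recovered by hashing every possible value and comparing.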

Alternatively, the data could be pushed through a system along the lines of the London Datastore, enabling people to comment on, review and assess the data as it comes in. Take a few of the clever chaps at The Guardian who were responsible for publishing the crowdsourcing site for MPs' expenses, and we could start getting a feel for who the most likely cheats are, allowing the DWP to fine-tune its targeting radar, all for a dramatically lower price than Experian is likely to be charging.

Tim O'Reilly has been an advocate of government-wide data revolutions, so why can’t we take that a step further and do something good with all this data, rather than restricting it to pretty plots of historical trends?

11 Aug 2010

Does XML really threaten big data?

Dataspora (or Michael Driscoll) has just put out a piece on why XML is incompatible with the idea of big data. Some interesting points are raised, and it is well worth a read to fuel your own rage at XML schemas that make no sense at all (we’ve all seen them).

My main bone of contention with XML, as Michael mentions, is the storage and transaction overheads you have to endure for what could be a trivial volume of data. You could easily triple the number of lines in a file and double the file storage for an XML document, whilst the purity of CSV/TSV files gives you just the data.
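As a rough illustration of that overhead, the sketch below writes the same handful of made-up records as CSV and as pretty-printed XML using only the Python standard library, then compares the size and line count of each. The records, field names and element names are invented for the example, and the exact ratios will depend entirely on your data and on how the XML is laid out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up records purely for illustration.
rows = [
    {"id": "1", "region": "London", "amount": "120.50"},
    {"id": "2", "region": "Leeds", "amount": "87.00"},
    {"id": "3", "region": "Cardiff", "amount": "310.25"},
]


def to_csv(rows):
    """Serialise the rows as CSV: one header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Serialise the rows as XML with one child element per field."""
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(record, field).text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(rows)
xml_text = to_xml(rows)

print("CSV:", len(csv_text.encode("utf-8")), "bytes,", len(csv_text.splitlines()), "lines")
print("XML:", len(xml_text.encode("utf-8")), "bytes,", len(xml_text.splitlines()), "lines")
```

Every field name is repeated twice per record in the XML (opening and closing tag), which is where most of the extra bytes come from; the CSV names each field once, in the header.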

It is on this last point, however, that XML appeals to me. It gives you some structure, and I don’t mean trying to determine the structure from column headings in a flat file. It is for this reason that I don’t see the demise of XML in places like the London Datastore, even though some people seem to be calling for it.

As with most stories, there are two sides to the use of XML to store data. Schemas dictate that data be structured, encoded and formatted in a strict manner, but the reality of working in a world of data is that whatever rules you apply to a process, someone will always manage to break them and upset your parsers.

Subjective footnote: Consuming Google Docs services via XML is horrendously long-winded; that could do with changing.

10 Aug 2010

Data Scientist - thoughts and reflection

There has been an increased emphasis on the job role “data scientist” of late, and it has been a pleasure reading through the flurry of excitement as self-professed data scientists come forth and confess their virtues.

Perhaps most interesting (for me) was this piece by The Practical Quant:

For the moment, data scientists thrive in smaller startups, internet companies, and other organizations where there is less emphasis on defined roles and tasks. But there really is no reason why large and mature organizations can’t join the fun.

(Link to article)

Whilst I agree that this is the most accurate representation of where we are today, I think the role definition for data scientists is going to fork in two in the years to come.

The first strand will be the group who are able to produce beautifully animated representations of data set trends, and are very proficient at pulling data through R or Python and munging it into a visualisation environment. This is the fork that will get the most attention – and already does – as it publishes graphics and statistics that are visually impressive.

The second strand – or fork – will be eternally immersed in masses of raw data. People taking their careers down this route are likely to be spending their days churning through data feeds and data sets and getting them into an environment fit for analysis and reporting. As the above quote mentions, the cool side of data science is easier in smaller startups because they have more appetite for the new tools that make this easier. Unfortunately there are few mid-to-large size commercial organisations that are going to buy into the “big data” hardware and software stack anytime soon – there will need to be a sea change to enable some truly talented people to make the most of massive data.

Those data-rich organisations – Facebook, Google, Twitter, LinkedIn – have invested heavily in big data stacks, because that is their core business. Commercial organisations that do not have data at the core of their business are slower to the game, and may not even get there. For the time being, most of us working in the world of data can only drool (and weep) when we read up about such infrastructures as LinkedIn’s.

2 Aug 2010

Welcome to the Big Data Blog

I am introducing the new Big Data Blog with a traditionally non-informative posting to say welcome.

There seem to be a few blogs and sites out there at the moment, and my aim is to get this blog to be among the best of these.

For future reference, this blog on Posterous will focus on big data, the data deluge, and the technologies and systems that allow us to work day to day with big data. Perhaps more important, though, is the focus on what this all means on the ground, rather than as recounted by a tech journalist (although they write better than I do).

See you shortly!

Big Data's Posterous

Working with and being fascinated with how we manage big data and the data deluge, I couldn't help but start up a blog to write about the challenges, technologies and tools that can help us all when it comes to inordinately large data sets.