10 Aug 2010

Data Scientist - thoughts and reflection

There has been an increased emphasis on the job role “data scientist” of late, and it has been a pleasure reading through the flurry of excitement as self-professed data scientists come forth and confess their virtues.

Perhaps most interesting (for me) was this piece by The Practical Quant:

For the moment, data scientists thrive in smaller startups, internet companies, and other organizations where there is less emphasis on defined roles and tasks. But there really is no reason why large and mature organizations can’t join the fun.

(Link to article)

Whilst I agree that this is the most accurate representation of where we are today, I think the role definition for data scientists is going to fork in two in the years to come.

The first strand will be the group who are able to produce beautifully animated representations of trends in data sets, and who are very proficient at pulling data through R or Python and munging it into a visualisation environment. This is the fork that will get the most attention (and already does), as it publishes graphics and statistics that are visually impressive.

The second strand, or fork, will be eternally immersed in masses of raw data. People taking their careers down this route are likely to spend their days churning through data feeds and data sets and getting them into an environment fit for analysis and reporting. As the quote above suggests, the cool side of data science comes more easily in smaller startups, because they have more appetite for the new tools that enable it. Unfortunately, few mid-to-large-sized commercial organisations are going to buy into the “big data” hardware and software stack any time soon; there will need to be a sea change before some truly talented people can make the most of massive data.

The data-rich organisations (Facebook, Google, Twitter, LinkedIn) have invested fantastically in big data stacks, because data is their core business. Commercial organisations that do not have data at the core of their business are slower to the game, and may not even get there. For the time being, most of us working in the world of data can only drool (and weep) when we read about infrastructures such as LinkedIn’s.