Big Data for non-geeks
Following on from some of the work I’ve done since the beginning of the year on Trends for 2012, I’ve been looking at the concept of ‘big data’ recently and want to share my thoughts.
Essentially, this is a scene-setter and jumping-off point for further questions. It's intended for a management audience, not a technical one, but feedback from any devs and sys guys out there is of course extremely welcome. I am not a data expert, but I am trying to determine how important big data is and how it might drive innovation in businesses.
I'm going to talk about the following:
- What is big data, and what does it mean?
- What kinds of uses can the data be put to?
- What challenges and imperatives are organisations facing?
- What technologies exist to help in this?
What is Big Data?
Essentially, organisations and individuals are now producing more data than conventional database systems are able to deal with effectively. And the more connected we are, the more we use our phones and social media, the more sensors are deployed, the more business processes are automated and monitored, the more data is being generated. The volume of data is increasing exponentially.
IDC, the International Data Corporation, define it thus:
"Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery, and/or analysis"
That seems like a pretty good definition to me.
Now some technology vendors are saying that big data is no different to existing data warehousing and business information analysis - that it's just a buzzword. The consensus from technology analysts, though, seems to be that we really are now in a new realm. The speed at which information flows into organisations, the number of customer contacts organisations have to deal with, the acceptable interval between receiving information, processing it and producing insight that you can base decisions on, and the amount of data that is available for combination are all an order of magnitude greater than they used to be.
That said, the real change is that analysing data of this magnitude has until now only been available to large corporations because it costs so much. Now though, the combination of low-cost but powerful hardware, cloud computing, new open source software tools and advances in processing and storage methods, has made working with big data cheap enough for all organisations to consider doing it.
And they'll need to because the ones that do will get a massive advantage. In a presentation given at the Strata conference in September 2011, McKinsey & Company showed 10-year category growth rate differences between businesses that smartly use their big data and those that do not. And the big data users outperformed their competitors in all categories, in some cases by miles.
Most organisations are probably already engaged in large scale data without necessarily realising it - most are storing data, or having data about them stored by third parties, but are not analysing it yet.
How much data are we talking about here?
A lot. Really, a mindblowing amount.
According to McKinsey, 15 out of 17 industry sectors in the US have more data stored per large company (that's 1000 employees or more) than the Library of Congress (which had 235 terabytes last time anyone counted).
To clarify, a terabyte is roughly a thousand gigabytes, and a petabyte is roughly a thousand terabytes or a million gigabytes. Google is now processing roughly a petabyte of data per hour. For comparison, 5 petabytes is the entire throughput of the US postal service each year. 20 petabytes is about the total amount of hard disk storage that was manufactured in 1995. A Library of Congress-worth, 235 terabytes, would fill roughly 7,300 32GB iPhones; a thousand times that, 235 petabytes, is roughly 200 tennis courts completely covered in them...
In 2010 total data storage in the US increased by more than 3.5 exabytes, and in Europe it increased by more than 2. An exabyte is a thousand petabytes. It’s hard to conceive of this amount. It’s reckoned that 5 exabytes would be enough to store (as text) everything ever said by human beings since the dawn of language.
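For anyone who wants to check the unit arithmetic, here is a minimal sketch in Python. It assumes decimal (powers-of-ten) units, which is what the "roughly a thousand" figures above imply; the `convert` helper and the variable names are my own, purely for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: powers of ten, not the binary (1024-based) units some tools report.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18


def convert(amount, from_unit, to_unit):
    """Convert an amount of storage from one unit to another."""
    return amount * from_unit / to_unit


# The relationships quoted in the text.
petabyte_in_terabytes = convert(1, PB, TB)
petabyte_in_gigabytes = convert(1, PB, GB)
exabyte_in_petabytes = convert(1, EB, PB)

# The 2010 US storage growth figure (3.5 exabytes) restated in petabytes.
us_growth_2010_in_petabytes = convert(3.5, EB, PB)

print(f"1 PB = {petabyte_in_terabytes:,.0f} TB")
print(f"1 PB = {petabyte_in_gigabytes:,.0f} GB")
print(f"1 EB = {exabyte_in_petabytes:,.0f} PB")
print(f"3.5 EB = {us_growth_2010_in_petabytes:,.0f} PB")
```

Swapping in other figures from this post (5 exabytes, say) is just a matter of changing the arguments to `convert`.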
To add to the complexity, an awful lot of this data is 'unstructured', which means video, photos, audio, free text conversations, etc., i.e. not the kind of stuff that people enter on forms.
What kinds of uses can this stuff be put to?
This is not an exhaustive list, and I have no doubt that there will be impacts of this technology that no one has yet imagined. But the impact will certainly be felt in two main areas: decision-making and innovation.
In terms of decision-making: it has the potential to give people a better overview - by making it easier to find information and by combining data from multiple sources into better dashboards and visualisations, even in real-time. It should encourage a more experimental and analytical approach to decision-making and help organisations become more agile and curious. You can look for natural patterns that show up in the data, and run controlled experiments to test out hypotheses and analyse feedback much more quickly. And it can support human decision-making with automated algorithms - think Siri but tailored for your specific business. The hope from some observers (like MIT Sloan) is that this will allow organisations to move away from management by HiPPO - the Highest Paid Person's Opinion...
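To make "controlled experiments" a little more concrete, here is a rough sketch of how an analyst might check whether one version of, say, a web page really outperforms another. It uses a standard two-proportion z-test; the function name and all the visitor and sign-up numbers are made up for illustration, and a real analysis would need more care over sample sizes and how the groups were chosen.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Compare two conversion rates; return the z statistic and two-sided p-value."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / standard_error
    # Probability of a difference at least this large arising by chance alone.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 5,000 visitors saw each version of a page.
z, p_value = two_proportion_z_test(
    successes_a=200, total_a=5000,   # existing page: 200 sign-ups
    successes_b=250, total_b=5000,   # new page: 250 sign-ups
)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be a fluke, which is the kind of evidence that can stand in for the HiPPO.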
In terms of innovation: it will enable entirely new business models and applications, for example by identifying smaller groups of people to customise products and experiences for, even down to the individual. Imagine car insurance based on how an individual actually drives rather than an average for the whole 'segment' - the AA announced recently that they are using GPS data to do precisely that.
An important thing to remember is that a huge amount of value gets captured by consumers, not by service providers. And that combining data from different sources can produce a great deal of additional value.
McKinsey have identified applications of big data analysis across all core internal corporate functions - this means that the big data revolution is going to create change pretty much everywhere.
So, what should organisations be doing about this?
There is lots of advice out there, but, having reviewed a lot of it, I think these are the most important recommendations:
- Inventory data assets (including external data, public data, data owned by consumers, and data available on the commercial market).
- Identify value opportunities (say, by organising innovation workshops to discover and prototype ideas).
- Gain broad awareness of the regulatory situation around data collection and use.
- Address corporate data policy issues as there are significant privacy and ownership implications. It's important to be aware of what’s legally required *and* what sorts of implicit contracts exist between customers and other stakeholders.
- Develop the ability to deploy the technologies that aggregate and manipulate these kinds of data.
- And finally: start building analytic capability - which is a problem because there's a big shortage of talent in this area:
McKinsey estimate the US shortfall in 'deep analytical talent' - that's people who understand statistics and machine learning - at between 140,000 and 190,000 people by 2018 at current rates of graduation; while the shortfall in managers able to consume big data analyses and experiment with data is estimated at 1.5m people - that's the gap between the number of people in management who will need these capabilities and those being created at current course and speed.
The situation in the UK is presumably similar, although one interesting development is that the Open Knowledge Foundation recently announced a School of Data, which it is creating in collaboration with Peer2Peer University. This might be worth keeping an eye on.
What technologies are we talking about here?
There's a new technology stack for big data that starts at the bottom with data storage infrastructure at the data centre and network level and rises up via open source storage and processing technologies like Apache Hadoop's distributed file system (HDFS), the MapReduce data processing model popularised by Google, and new "NoSQL"-style databases that can handle unstructured content and rapid, distributed retrieval. The stack continues on through data-cleaning and management technologies, then analysis, querying and real-time delivery systems via APIs. And then up into services, visualisations, real-time dashboards and the like.
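For the curious, the idea behind MapReduce can be shown in a few lines. What follows is a toy, single-machine sketch in Python that counts words across some documents; it is not Hadoop's actual API, and the function names and sample text are my own. In a real system the "map" and "reduce" steps would run in parallel across many machines, which is where the scale comes from.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all the emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    all_pairs = []
    for document in documents:
        all_pairs.extend(map_phase(document))
    grouped = shuffle_phase(all_pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up sample data, standing in for a much larger document collection.
sample_documents = ["big data is big", "data about data"]
counts = word_count(sample_documents)
print(counts)
```

The point of splitting the work this way is that each document can be mapped independently, and each word can be reduced independently, so the job can be spread over as many cheap machines as you have.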
At the visualisation end there is also the question of whether to use existing visualisation engines, or design and build bespoke solutions closely tailored to the organisation and its users. A combination of both, i.e. commercial analytics packages with bespoke dashboards customised to the context, is probably a good approach.
The ultimate goal, of course, is real insight, delivered to the right person, at the right time, in the right way. There's an argument to be made that telling stories from data, beyond just visualising it, is really at the top of the stack. Kris Hammond, the CTO of Narrative Science (no affiliation) discusses this in a number of posts at his blog "Just to Clarify".
What the technology stack is in any given situation is going to depend on a number of factors - development language, types of data being stored, integration with existing data infrastructure and Business Information systems, availability of project resources, predictions of future scale, etc.
In addition to the stack there is a range of third-party services and startups being created to take advantage of this market. Two of the more interesting are Kaggle, which is a data science competition site where organisations can publish data sets and set challenges for analysts to solve; and ClearStory Data, which seeks to extract value from organisations' legacy and web data. (Again, we have no affiliation with either of those companies.)
In a Nutshell...
One of the main questions I wanted to investigate was whether or not all the talk around Big Data is warranted - whether it really is a big deal or not. And I think it is, the reason being that it's not just about the 'four Vs' (volume, variety, velocity and value, echoing the IDC definition above) but about new uses of data becoming available to all organisations due to the falling cost of computation and storage.
Currently though, it's still located pretty high up on the Gartner Hype Cycle's peak of inflated expectations. It's going to take time and ingenuity to unlock the potential, and it's not feasible for all organisations to simply become 'data-driven' overnight. However, my recommendation for those organisations who are starting on this journey, and even for those who already have significant business information capabilities but currently ignore much of the unstructured data that is available to them, is to experiment. See what data you've got, what you can get, and get some brains in a room for a few days to see if you can figure out some powerful uses for it. It may not change the way you do business immediately, but change will accumulate over time as you learn through doing...
I haven't referenced every single piece of information in this post, just the most important. Instead, I've created a bookmark stack in Delicious with the most important sources I used. Please ask me if you would like to know more specifically where I got some information from.
Title image: Big Data: water wordscape by Marius B on Flickr.