Turning Big Data into Knowledge
In this two-part blog post we first look at the emergence of Big Data and the challenges it brings. In the next post we take a look at how these challenges are being addressed and the benefits this will unlock.
The emerging challenge of Big Data
Over the first years of the third millennium, worldwide digital data experienced huge growth, from scarce to super-abundant. Whether produced by high-tech scientific experiments or simply compiled from the now ubiquitous sources of automatic data collection in ordinary, everyday transactions, this new reality of “Big Data” (or, to be visually precise, “BIG DATA”) has created a need for large-scale management and storage of data that cannot be handled with conventional tools.
Data management tools and hard drive capacity are not increasing fast enough to keep up with this worldwide explosion in digital data. While in economic production we are increasingly asked to “do more with less”, with data we are increasingly asked to “do more with more”.
What is the impact of this new reality, and what are its potential benefits for the world of scientific research? In a world where economies, political freedom, social welfare and cultural growth increasingly depend on our technological capabilities, Big Data management and, most importantly, the knowledge that can be obtained from it have enormous potential to benefit individual organizations.
This topic is covered in two parts: the first comprises this introduction and the main sources of Big Data; the second will explain the challenges of Big Data and its benefits.
Sources of Big Data
Big Data commonly comes from two sources. The first is scientific experiments and instruments, which were the original subject of Big Data study. Most of these come from physics and involve either very large or very small spatial scales. More recently, some areas of biology, DNA research in particular, have also started to make use of Big Data.
The second significant source is simply “everyday” data: the vast quantity of information that is now collected every day at millions of points of citizen interaction, through billions of embedded sensors worldwide.
Prepare for some big numbers:
Physics: Large Hadron Collider (LHC):
The world’s largest and highest-energy particle accelerator, and one of the greatest engineering milestones ever achieved, the LHC produces around 25 petabytes of raw data per year, capturing information on more than 300 trillion (3×10^14) proton-proton collisions. Managing this information is not easy, even with the world’s largest computing grid (170 computing centres in 36 countries). The extraction of information and knowledge from these particularly huge datasets enabled the recent discovery of the Higgs boson, or “god particle”, a discovery that will probably result in the team behind it being awarded the 2013 Nobel Prize in Physics.
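To put the 25 petabytes per year in perspective, it can be converted into an average data rate. The short Python sketch below is only an illustration: the function name is made up for this post, and it assumes decimal petabytes (10^15 bytes) and data spread evenly over a 365-day year, which a real accelerator schedule would not follow.

```python
BYTES_PER_PETABYTE = 10**15           # assumption: decimal (SI) petabytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: data spread evenly over the year


def average_rate_bytes_per_second(petabytes_per_year):
    """Average data rate in bytes per second for a yearly volume in petabytes."""
    if petabytes_per_year < 0:
        raise ValueError("yearly volume cannot be negative")
    return petabytes_per_year * BYTES_PER_PETABYTE / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = average_rate_bytes_per_second(25)  # figure quoted above for the LHC
    print(f"Average rate: about {rate / 10**6:.0f} MB per second")
```

Under these assumptions the yearly figure works out to several hundred megabytes of raw data every second, around the clock.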
When the telescope of the Sloan Digital Sky Survey (SDSS) opened in 2000, it collected more data in one week than had been amassed in the entire history of astronomy. The new Large Synoptic Survey Telescope (LSST), commencing in 2020, will store in 5 days the same amount of data that SDSS will have collected over the 13 years since its inception. Storing and processing these massive datasets from gigapixel telescopes on the Earth’s surface and in space requires very specific tools that have been beyond the current state of the art. Consequently, astronomy, in trying to extract the knowledge needed to create the most accurate “universe map”, is one of the leading protagonists in the field of Big Data.
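The comparison between the two telescopes amounts to a simple ratio of collection periods. The Python sketch below makes that arithmetic explicit; the function name is hypothetical, and it takes the “5 days” and “13 years” figures above at face value and assumes a year of 365.25 days.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def speedup(slow_period_days, fast_period_days):
    """How many times faster the same data volume is collected in the shorter period."""
    if slow_period_days <= 0 or fast_period_days <= 0:
        raise ValueError("periods must be positive")
    return slow_period_days / fast_period_days


if __name__ == "__main__":
    sdss_days = 13 * DAYS_PER_YEAR  # time SDSS needs for the reference volume
    lsst_days = 5                   # time LSST is expected to need for the same volume
    print(f"LSST would collect data roughly {speedup(sdss_days, lsst_days):.0f} times faster")
```

On these figures the newer telescope would gather data at close to a thousand times the rate of its predecessor.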
Everyday data is the kind of data collected by countless automatic recording devices that capture what, how, and where we purchase, where we go, and more. It’s really striking how our lives have changed in the last couple of decades. All of these improvements, the accompanying multiplication in consumption and goods, and the resulting transactions and communications are being captured through hundreds of receptors. In addition, user-generated content such as digital media files, video, photos and blogs is being generated and stored on an unprecedented scale. Our locations (GPS, GLONASS, Galileo), money transactions (credit card, NFC payments, etc.), several different forms of communication and even what we think and do in our free time (via social networks) are being collected by different corporate and government bodies.
One of the most accurate studies to date, published in the journal Science in 2011, estimated that in 2007 humanity was able to store around 295 exabytes of data (1 exabyte = 1,000,000 terabytes). Global data was calculated to have reached 800 exabytes in 2009, and by the end of 2013 it is forecast to reach more than 3 zettabytes (3×10^21 bytes, or 3,000,000,000,000,000,000,000 bytes). Impressive. Many challenges obviously arise when the data to be handled grows by roughly 60% a year, and issues abound in relation to how to process and extract useful information from what is 95% raw data.
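The yearly growth implied by these estimates can be sanity-checked with a compound annual growth rate calculation. The Python sketch below is an illustration only: the function names are made up for this post, and it assumes that the three estimates quoted above are directly comparable, which may not hold since they come from different sources and the 2007 figure refers to storage capacity.

```python
# Estimates quoted in this post, in exabytes (1 zettabyte = 1,000 exabytes).
ESTIMATES_EB = {2007: 295, 2009: 800, 2013: 3000}


def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two values, as a fraction (0.5 = 50%)."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def growth_between(first_year, last_year):
    """Implied yearly growth between two of the quoted estimates."""
    return annual_growth_rate(
        ESTIMATES_EB[first_year],
        ESTIMATES_EB[last_year],
        last_year - first_year,
    )


if __name__ == "__main__":
    for first, last in [(2007, 2009), (2009, 2013), (2007, 2013)]:
        print(f"{first}-{last}: about {growth_between(first, last):.0%} per year")
```

On these numbers the implied rate varies with the years compared, from roughly 40% to 65% per year, so the 60% figure is best read as an order-of-magnitude guide rather than a precise measurement.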
In our next post we’ll look at how these challenges are being addressed.