Roadmap for data-driven applications, from a concept to a fully integrated system

Data Science and Big Data applications have undoubtedly been growing fast in recent years; the blooming of new data sources and the emergence of accessible, affordable cloud infrastructures have contributed widely to this movement. However, only a few applications reach the level of maturity necessary to become fully functional systems. Most data-driven applications peak as a “proof of concept,” and one of the main reasons is the lack of a solid integration plan with current systems or of a successful validation plan.

The path from raw data to wisdom is long and complex. Moving from data to information requires comprehension of the relations within the data and of the overall context. Knowledge is then reached only after fully understanding the patterns within that information. Finally, one must dig into the details (the underlying principles) to gain insight and move from knowledge to real, “applicable” wisdom.
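The ladder above can be illustrated with a small, purely hypothetical sketch: the sensor name, threshold, and rule below are illustrative assumptions, not part of any particular system.

```python
# Hedged sketch of the data -> information -> knowledge -> wisdom ladder,
# using hypothetical greenhouse temperature readings.

# Data: raw values with no context.
readings = [21.5, 22.1, 35.8, 22.0, 21.7]

# Information: the same data placed in context (what, where, in which unit).
information = {"sensor": "greenhouse-3", "unit": "celsius", "readings": readings}

# Knowledge: a pattern extracted from the information.
mean = sum(readings) / len(readings)
outliers = [r for r in readings if abs(r - mean) > 5]  # illustrative threshold

# Wisdom: an actionable decision grounded in underlying principles
# (here a toy rule; a real system would encode domain expertise).
action = "inspect sensor or ventilation" if outliers else "no action needed"
print(action)  # -> inspect sensor or ventilation
```

The point of the sketch is that each rung adds something the previous one lacked: context, then patterns, then a decision.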

There are three pillars in any data-driven application, namely: Data Acquisition (DA), Information Processing (IP) and Knowledge Discovery (KD). Data Acquisition should cover not only the technical aspects of consuming services or data sources with different formats, coverage or scope, but also tackle the sociological, legal and limiting aspects of every data source (e.g. data provenance). Information Processing should be built over a solid mathematical framework, from Data Mining algorithms to Simulation Tools; it should also provide answers in terms of performance and precision that support the application concept. Lastly, Knowledge Discovery should serve as an interface from processed data to human perception, from metrics to representations of those metrics (e.g. dashboards). In some cases, integration of these components into the current system may be critical.
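One way to picture the three pillars is as composable stages of a pipeline. The sketch below is a minimal illustration under assumed names (`acquire`, `process`, `discover` and the `Record` type are inventions for this example, not a prescribed API).

```python
# Minimal sketch of the three pillars (DA -> IP -> KD) as composable stages.
from dataclasses import dataclass

@dataclass
class Record:
    source: str    # data provenance, tracked from acquisition onward
    payload: dict

def acquire(raw_feeds: list[dict]) -> list[Record]:
    """Data Acquisition: normalize heterogeneous sources, keep provenance."""
    return [Record(source=f.get("origin", "unknown"), payload=f) for f in raw_feeds]

def process(records: list[Record]) -> dict:
    """Information Processing: apply the mathematical core (here, a toy aggregate)."""
    values = [r.payload["value"] for r in records if "value" in r.payload]
    return {"count": len(values), "mean": sum(values) / len(values) if values else None}

def discover(metrics: dict) -> str:
    """Knowledge Discovery: turn metrics into a human-facing representation."""
    return f"{metrics['count']} records, mean value {metrics['mean']:.2f}"

feeds = [{"origin": "api", "value": 3.0}, {"origin": "csv", "value": 5.0}]
print(discover(process(acquire(feeds))))  # -> 2 records, mean value 4.00
```

Keeping the three stages as separate functions mirrors the separation the article argues for: each pillar can be tested, replaced, and matured independently.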

The application must be developed on all three fronts simultaneously, starting from a concept, a purely speculative idea, and moving to a proof of concept, a prototype application tested only over a simplified set of data. The proof of concept should then be further tested in a laboratory environment, in which data samples are produced artificially to feed the application and to carry out performance and reliability tests, before being leveraged into a relevant environment. In a relevant environment the application should be capable of working with real-time data feeds, although not yet at an operational level, and robustness and stability should be assessed at this stage. Finally, the application should be moved from the relevant environment into actual operations.
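The maturity ladder just described can be written down as an ordered sequence of stages. The names below are illustrative (they loosely mirror technology-readiness levels) and the no-skipping rule is an assumption consistent with the progression the article describes.

```python
# Hedged sketch of the maturity ladder as an ordered enum.
from enum import IntEnum

class Maturity(IntEnum):
    CONCEPT = 1               # purely speculative idea
    PROOF_OF_CONCEPT = 2      # prototype tested over a simplified data set
    LABORATORY = 3            # artificial data feeds; performance/reliability tests
    RELEVANT_ENVIRONMENT = 4  # real-time feeds, not yet operational; robustness tests
    OPERATIONS = 5            # fully integrated, operational system

def promote(stage: Maturity) -> Maturity:
    """Advance exactly one stage; assumes levels may not be skipped."""
    if stage is Maturity.OPERATIONS:
        raise ValueError("already in operations")
    return Maturity(stage + 1)

stage = Maturity.CONCEPT
while stage is not Maturity.OPERATIONS:
    stage = promote(stage)
print(stage.name)  # -> OPERATIONS
```

Making the stages explicit in code is one way to force each promotion (e.g. laboratory to relevant environment) to be a deliberate, reviewable step rather than an implicit drift.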

Tags: Big Data, Data Science