Big data is a very fashionable trend nowadays. Most firms indeed sit on a pool of unexploited internal (and external) data that can be leveraged to support strategic decisions. But the steps to success are not that straightforward. Best-practice lessons on how to run a new Big Data project are already widely available online. So why not focus this time on the worst things one can do to screw up such a project?
1/ Don’t bother with your goal, it will naturally emerge from the data
Without a goal, one will naturally seek the most accessible data. That is the best way to drown in massive amounts of irrelevant data without any real guidance on what to do with it. The first criterion for seeking new data is really not its accessibility, but its relevance to the business issues at hand.
2/ Expect Big Data to deliver the business answer you already reached by classical means
Once the right question is asked, one should not really expect any given answer. That is precisely the advantage of Big Data: it unravels new relations within the data and brings insights that would otherwise remain invisible. Furthermore, the business questions themselves should not be set in stone. By gaining more insight into your business activity, you can refine these questions and increase your chances of digging out more precise and more relevant answers.
3/ Start with a “Hydra” data lake, put as much data as you can in it, and think later about the business questions you want to answer
Physically unifying the data from various sources is really not a prerequisite for demonstrating the added value of combining them. A quick-and-dirty POC can already test that value and suggest what priority to give the IT investment such a migration represents. But this, of course, requires asking the right business questions first (which, in any case, you did not do if you followed the first counter-recommendation).
4/ Think Big from the start
The price of hosting a Big Data infrastructure (licenses, skilled staff, etc.) can be high. In principle, multiplying data sources in order to link them is good practice. But again, this does not replace a good old statement of requirements. And even when these needs are precisely determined, to ensure a good return on investment one should rather enrich the data progressively, starting by breaking down the silos of internal data, which is not necessarily massive but is intrinsically rich and offers the advantage of leveraging existing infrastructure.
5/ Be afraid of the Cloud
Moving your data to the Cloud is actually a convenient way to optimize your infrastructure costs, especially when your data volume grows to meet new business needs.
6/ Keep data scientists and business experts apart
One has already heard of the “tunnel effect”: on one side, business experts alone draft the expression of needs; on the other, developers spend months designing and building a solution to meet them. The risk is ending up completely off the initial needs, because of misinterpretations or the various choices made along the way. In a Big Data project, the main part of the Data Scientist’s job is to enrich and deploy all the resources behind the different data sources. Having taken part in the business discussions from the very start helps them identify the important processing steps to perform, or simply gives them leads on how to implement them. It also helps them figure out how to display results in a way that is meaningful to the business users. One good practice is, for instance, to adopt an Agile approach, where those who ask the business questions and those who tune the algorithmic answers work and progress together hand in hand.
7/ Don’t bother too much with data cleansing and structuring
A significant part of the job, once the goals and the proper data sources are identified, is actually to rework the accumulated data to make it clean and consistent. That can be a source of headaches, but taking these steps for granted is obviously the best way to end up with senseless answers. Another part is to structure the data, by categories such as products, departments or geographical origins, to help select the data you really need to answer a specific question and to ease extracting valuable information from it.
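As a purely illustrative sketch, assuming a hypothetical sales.csv extract with product, department and amount columns, here is what a minimal cleansing and structuring pass can look like in pandas:

```python
import pandas as pd

# Hypothetical raw extract; the file name and columns are assumptions for illustration.
df = pd.read_csv("sales.csv")

# Cleansing: drop exact duplicates and rows missing the key business fields.
df = df.drop_duplicates()
df = df.dropna(subset=["product", "department", "amount"])

# Consistency: normalize labels and enforce types so "Paris " and "paris" agree.
df["department"] = df["department"].str.strip().str.lower()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

# Structuring: aggregate by business category to ease later selection and analysis.
by_department = df.groupby(["department", "product"])["amount"].sum()
print(by_department.head())
```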
8/ Spend as much time as you need processing the data, and don’t bother too much about how to present your conclusions to the requesters
These two phases are equally important. In the end, the requesters only see, and make use of, the outcome of the second one. Neglecting the presentation of results may render the time spent on processing and analysis meaningless.
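Sticking with illustration only (the figures below are invented), even a single clearly labeled chart often does more for the requesters than a raw table dump; a minimal sketch with matplotlib:

```python
import matplotlib.pyplot as plt

# Made-up aggregates for illustration; in practice these come from the analysis phase.
departments = ["north", "south", "east", "west"]
revenue = [1.2, 0.8, 1.9, 0.6]  # in millions of euros (assumed unit)

fig, ax = plt.subplots()
ax.bar(departments, revenue)
ax.set_xlabel("Department")
ax.set_ylabel("Revenue (M€)")
ax.set_title("Revenue by department")
fig.tight_layout()
fig.savefig("revenue_by_department.png")  # hand the requesters a figure, not a dump
```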
9/ Think that correlation means causality
One can correlate the divorce rate in Maine with the per-capita consumption of margarine, or the total revenue generated by arcades with the number of computer science doctorates awarded in the US. These obviously fortuitous correlations remind us that correlation does not mean causality. One has to keep that in mind; otherwise, one can easily draw erroneous conclusions that lead to unfruitful decisions.
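To see how easily the trap arises, here is a small self-contained sketch on synthetic data (the numbers are made up): two causally unrelated series that merely share a downward trend come out almost perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2020)
t = years - 2000

# Two made-up, causally unrelated quantities that both happen to trend downward.
divorce_rate = 5.0 - 0.05 * t + rng.normal(0, 0.05, t.size)
margarine_consumption = 8.0 - 0.20 * t + rng.normal(0, 0.20, t.size)

# The shared trend alone drives the correlation close to 1.
r = np.corrcoef(divorce_rate, margarine_consumption)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```

Detrending both series, for example by correlating year-over-year changes instead of raw levels, makes most of this spurious correlation vanish.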
10/ Last but not least, just rely on machine learning, not human learning
Big Data is a tool that works well in the proper hands and when applied to the right problems. Often, a Big Data solution answers only a single aspect of a problem, leaving the bigger picture terra incognita. That is where a human brain can invest its creativity, exploiting the data to bring new solutions to the other parts of the problem.
– Written by Selim SEDDIKI, Data Scientist – AEROW Decision