Dipping in the Data Lake

Over the past couple of years, the term “data lake” has become increasingly prominent. It was first used in 2010 by James Dixon, the Chief Technology Officer of Pentaho, and more recently the concept has been hyped heavily by technology product vendors promoting their offerings in this space.

It’s well known by now that although most enterprises collect and store a lot of data in warehouses, a lot of data is not collected at all, and certainly not managed. With so much focus on data being the new oil, the data lake concept was proffered as a way of centrally storing all possible kinds of data in their native formats, so that if any of it was ever needed it could be accessed from the lake. A data lake stores all types of data as they are ingested from whatever sources it is hooked up to, without requiring the data to conform to a model or structure. In addition, because the data in the lake is centrally available, it would be visible to all departments and therefore not siloed in the traditional manner. Hadoop makes this feasible, with massively parallel processing performed at the data level, followed by storage.

There have been debates about whether all of this makes sense and whether the hype has gone too far in focusing on the creation of the lake rather than on how it could be made useful. Some advocate the concept of data reservoirs, where data is not completely ungoverned, but is subject to a level of curation, with the purpose of becoming available to downstream applications. There is obvious merit in doing this as well.

Personally, I prefer to go back to basics in order to make the choice. A data lake is an implementation of the concept of big data storage. As such, it is something that technology has made available, with the message: “whatever data you want, you’ll find it in there.” But technology must serve a purpose, otherwise it could well end up a white elephant. And that, to me, is why we have to go back to the business and start from there, with a question that needs an answer.

From there, we go to data science, which brings together three roles: a business person who must creatively figure out which pieces of data to bring together to study the problem and arrive at a solution, an IT person who does the technical work of extracting and helping to cleanse and integrate the needed data, and a statistician who does the mathematical analysis on the data that’s made available. In theory, it would be nice if whatever data was needed were already available in a data lake, but in practice, would it really be efficient to maintain a lake? It would probably be better either to source the data that’s actually needed (if it’s not already available), or to maintain a limited variety of data that is described by metadata and managed in terms of security and ageing.
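To make the idea of a limited, managed set of data more concrete, here is a minimal sketch in Python of what such a catalog might look like. Everything in it is an assumption made for illustration: the `DatasetEntry` fields, the dataset names, the role names, and the choice to measure a dataset's age from its last ingestion date. It is not a description of any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetEntry:
    """Metadata kept for one curated dataset (all fields are hypothetical)."""
    name: str
    source: str
    allowed_roles: frozenset   # security: roles permitted to read the data
    retention_days: int        # ageing: how long data stays before archiving
    last_ingested: date


def can_read(entry, role):
    """Return True if the given role is allowed to read this dataset."""
    return role in entry.allowed_roles


def due_for_archiving(entries, today):
    """Return names of datasets older than their retention period.

    Age is measured from the last ingestion date, which is an assumption
    of this sketch rather than a fixed rule.
    """
    return [
        e.name
        for e in entries
        if today - e.last_ingested > timedelta(days=e.retention_days)
    ]


# Illustrative catalog with two made-up datasets.
catalog = [
    DatasetEntry("sales_orders", "erp",
                 frozenset({"finance", "analytics"}), 365, date(2024, 1, 10)),
    DatasetEntry("web_clicks", "clickstream",
                 frozenset({"analytics"}), 30, date(2024, 1, 10)),
]

print(due_for_archiving(catalog, date(2024, 3, 1)))  # prints ['web_clicks']
```

The point of the sketch is that each kind of data admitted has an owner-defined access list and a retention period from day one, rather than being ingested first and governed later.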

Even if Hadoop has made the collection and storage of Big Data affordable, it might still be a waste to keep ingesting and collecting every type of data simply because it’s now possible to do so. As new data continuously flows in, the older data would need to be archived at regular intervals. Once that’s done, what are the chances that it’ll ever really be looked up again? Slim to none, in all likelihood. So why bother collecting so much of it unless it’s required? With data from a growing number of IoT streams just around the corner, and the volume of multimedia data from social media alone already enormous, the risk of drowning in a data lake set up without a strategic objective beyond data availability becomes rather real.

A more prudent first step might be to create a data reservoir for the kinds of data already needed, and have it work as a source for existing applications. After that, new data sources can be added as and when needed.
