Are Data Lakes Just Dumping Grounds?

Stan Christiaens
October 15, 2018 Big Data, Cloud & DevOps

Big data. Although the term is ubiquitous today, it wasn’t so long ago that “big data” wasn’t part of the everyday lexicon. The growth of data, in all its forms, during the last five years has been dizzying, and has caught many organizations worldwide flat-footed.

In light of this, finding a way to deal with all of this data has become big business. IDC estimates that worldwide revenue for big data and business analytics will grow to more than $260 billion by 2022. Organizations have made significant investments in hardware, software, and services to deal with the onslaught of data.

Data lakes quickly emerged as a technology front-runner in the race to make data more digestible – and to finally get it in one place. Data lakes are flexible, scalable, and offer an easy way to store data. They serve as central repositories for all types of "raw" data – structured, semi-structured, and unstructured. The data's structure and requirements aren't defined until the data is needed. Ideally, a data lake is the go-to location for data scientists and business users alike, fueling all analytics activities across the business.

The reality is that getting insight and value from so much data is challenging. Forrester finds that between 60% and 73% of all enterprise data goes unused for analytics. It's all too common for the majority of users to find only a small percentage of truly valuable data in this wide array of assets. In the rush to aggregate our data somewhere, lakes have become swamps of undefined data from a variety of sources. Data scientists and everyday business users struggle to find and understand data. Even worse, once they find a source, can they trust it?

This raises the question: Are we fooling ourselves? Can a single, centralized repository for the business really exist?

The answer is yes. It’s not the lake, but rather how we organize and govern it.

The first issue to address when it comes to data lakes is how we organize them. It's easy to misalign on the purpose and content of the lake. Therefore, it's imperative to establish a comprehensive set of processes and controls before a single byte of data finds its way into the lake. Key questions about the data should include:

  • What is it?
  • Who owns it?
  • Why should it be put in?
  • Does it really belong in the lake?
  • Is it the right source?

The next issue to address is how the lake is governed. Gartner has long warned that data lakes without the right level of governance will devolve into disconnected data pools. A common misconception is that governance builds walls between business users and data. The opposite is true: governance creates transparency across organizations. So much data, generated so quickly, makes it difficult to understand the data's origin, format, and lineage, as well as how it is organized, classified, and connected. These unknowns result in poorer-quality outcomes; knowing these features is critical to the data's use. Data governance provides the structure and management the lake desperately needs, making data more accessible and meaningful, resulting in greater trust and quality. Without such a framework, it's impossible to know what's in the lake, who owns it, or its overall value.

Every data governance effort should include a data catalog to serve as a single source of intelligence for data users to discover and consume data. A data catalog should contain data for all of the categories comprising the lake, and the catalog should identify the most valuable data sets. For example, if the majority of users only use 10% of the data in the lake, the catalog must detect and label those assets as the most valuable. This allows data scientists and business users to spend less time concerned about the quality of the data. Instead, they can focus their energy on analyzing that trusted data to gain new insights and better meet customer needs.
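Surfacing the most valuable assets can start as simply as ranking catalog entries by how often they are actually queried. The following is a minimal sketch under assumed inputs – the access log, dataset names, and the 80% usage threshold are all illustrative, not drawn from any particular catalog product:

```python
from collections import Counter

# Hypothetical access log: one entry per dataset queried by a user.
access_log = ["sales", "sales", "customers", "sales", "inventory",
              "customers", "sales", "clickstream"]

counts = Counter(access_log)
total = sum(counts.values())

# Label the datasets that together account for ~80% of all usage
# as the "high value" tier of the catalog.
high_value, covered = [], 0
for name, n in counts.most_common():
    high_value.append(name)
    covered += n
    if covered / total >= 0.8:
        break

print(high_value)  # ['sales', 'customers', 'inventory']
```

In a real catalog the log would come from query-engine audit trails rather than a hard-coded list, but the principle is the same: usage data, not guesswork, decides which assets get flagged as trusted and valuable.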

The promise of big data is to enable organizations to analyze their data to gain better insight and make more informed decisions, as never before. To realize this promise, we must look at how we collect that data and the processes we put in place to ensure we give data the appropriate meaning and context. Otherwise, we will only create more data dumping grounds. It’s not too late to implement the right level of governance to ensure your data lake becomes a dynamic tool that enables users to improve decision-making and drive innovation.
