
Data Lakes Still Need Governance Life Vests


Guest blog post by Gabriel Lowy

As a central repository and processing engine, data lakes hold great promise for raising return on data assets (RDA).  Bringing analytics directly to data in its native formats can accelerate time-to-value by providing data scientists and business users with increased flexibility and efficiency. 

But to realize higher RDA, data lakes still need governance life vests.  Without data governance and integration, analytics projects risk drowning in unmanageable data that lacks proper definitions or security provisions. 

Success with a data lake starts with data governance.  The purpose of data governance is to ensure that information accessed by users is consistently valid and accurate to improve performance and reduce risk exposure. 

Data governance is a team sport.  Collaboration among data scientists, IT, and business teams defines the use cases that the architecture and analytics software will support. 

Understanding the business and technical requirements that identify the value data provides is the first step in developing a data governance cycle.  Data governance establishes guidelines for consistent data definitions across the enterprise.  It also defines who has access to specific data and for what purposes it may be used. 
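Such a governance policy can be made concrete in code. The sketch below is a minimal, hypothetical registry that records an enterprise-wide definition and owner for each data asset, plus which users may access it and for what purpose; the class names, fields, and sample data are illustrative assumptions, not part of any real governance product.

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    name: str
    definition: str   # single, enterprise-wide definition of the asset
    owner: str        # accountable data steward
    # user -> set of purposes that user is allowed to use the data for
    allowed: dict = field(default_factory=dict)

registry = {}

def register(asset: DataAsset):
    """Add an asset to the governance catalogue."""
    registry[asset.name] = asset

def grant(asset_name: str, user: str, purpose: str):
    """Record that `user` may access `asset_name` for `purpose`."""
    registry[asset_name].allowed.setdefault(user, set()).add(purpose)

def can_access(asset_name: str, user: str, purpose: str) -> bool:
    """Check an access request against the governance policy."""
    return purpose in registry[asset_name].allowed.get(user, set())

# Illustrative usage
register(DataAsset("customer_churn",
                   "Monthly churn rate per account, net of reactivations",
                   "crm_team"))
grant("customer_churn", "analyst_a", "reporting")

print(can_access("customer_churn", "analyst_a", "reporting"))  # True
print(can_access("customer_churn", "analyst_a", "marketing"))  # False
```

The point of the sketch is that definitions, ownership, and usage permissions live in one auditable place rather than in scattered spreadsheets.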

Without data governance, it’s impossible to know whether the information presented is accurate, whether it has been manipulated – and if so, how, by whom, and with what method – or whether it can be audited, validated, or replicated.  As departments maintain their own data – often in spreadsheets – and increasingly rely on outside data sources, a verifiable audit trail is compromised, exposing the firm to compliance violations. 

Including security teams in data governance is also crucial. By understanding what data will be brought into the data lake and the user access permissions, security teams can better understand potential risks.  They can build stronger protection around critical data assets while becoming more resilient and responsive to incidents.


Gaining Context in a Data Lake


As the data lake becomes the repository for more internal and external data, IT must integrate the data lake with the existing infrastructure.  One of the benefits of a data lake is that it can ingest data without a rigid schema or manipulation.  Integration reduces errors and misunderstandings, resulting in better data management.

Modern data integration technologies automate much of the data quality, cataloguing, indexing and error handling processes that often encumber IT teams.  Metadata management is all the more critical in a data lake.  Companies need to manage the diversity of terminology and definitions by maintaining strong metadata while providing users with the flexibility to analyze data using modern tools. 

A semantic database to manage metadata provides context into what’s in the data lake and its interrelationships with other data.  This goes beyond the basic capabilities of the Hadoop Distributed File System by making data queries more organized and systematic.
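One simple way to picture semantic metadata is as subject–predicate–object triples layered over raw files. The sketch below is an assumption-laden toy, not a real semantic database: the paths, predicates, and dataset names are invented to show how such a layer can answer "what is this file, and what does it relate to?"

```python
# A tiny in-memory triple store: each fact is (subject, predicate, object).
triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

# Describe what a raw file in the lake actually contains, how it is
# defined, and how it relates to other data sets. (Illustrative names.)
add("/lake/raw/orders_2015.parquet", "contains", "sales_orders")
add("sales_orders", "defined_as", "Confirmed orders net of cancellations")
add("sales_orders", "joins_on", "customer_id")
add("customer_master", "joins_on", "customer_id")

def related(term, predicate):
    """All subjects linked to `term` through `predicate`."""
    return {s for (s, p, o) in triples if p == predicate and o == term}

# Which data sets share the customer_id key, and could be joined?
print(related("customer_id", "joins_on"))
# {'sales_orders', 'customer_master'}
```

A production system would use an RDF store or a metadata catalogue rather than a Python set, but the idea is the same: the lake stores bytes, while the semantic layer stores meaning and relationships.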

To facilitate integration, companies may also consider partitioning clusters into separate tiers.  These tiers can be based on data type and usage, or on an aging schedule (e.g. current, recent, archive).  Each tier can be assigned different classifications and governance policies based on the characteristics of the different data sets. 
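An aging schedule like the one mentioned above can be sketched as a simple tier-assignment rule. The thresholds below (90 days for "current", two years for "recent") are assumptions chosen for the example; the article does not prescribe specific cutoffs.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical aging schedule: thresholds are assumptions, not from
# the article. Anything older than the last threshold is "archive".
TIERS = [
    (timedelta(days=90),  "current"),
    (timedelta(days=730), "recent"),
]

def tier_for(last_modified: date, today: Optional[date] = None) -> str:
    """Assign a data set to a tier based on its age."""
    today = today or date.today()
    age = today - last_modified
    for limit, name in TIERS:
        if age <= limit:
            return name
    return "archive"

# Illustrative usage with a fixed reference date
today = date(2016, 1, 1)
print(tier_for(date(2015, 12, 1), today))  # current
print(tier_for(date(2015, 1, 1), today))   # recent
print(tier_for(date(2010, 1, 1), today))   # archive
```

Each tier name could then key into a different governance policy: retention period, access controls, replication factor, and so on.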



The scale, cost and flexibility of Hadoop allow organizations to integrate, catalogue, discover and analyze more types of data than ever before at faster speeds.  But a data lake is not a panacea for data management. 

Data governance is the key to success with data lakes.  Understanding the business use cases facilitates sound technical decisions.  It enables companies to integrate historical data with newer big data formats without the need for traditional ETL (extract, transform, load) tools. 

However, vigilance around data quality, context and usage is essential.  It is key to eliminating organizational and departmental data silos – one of the primary objectives of a data lake.  It also instills more confidence in users that the data they are working with is trustworthy.  Such confidence results in more reliable and predictable models and decision outcomes. 

A strong data management platform ensures users can access data assets as they need them, share information when required, and have the tools to “see” analytics results without the pre-definitions of restricted data sets inherent in legacy business intelligence platforms. 

The elegance of analytics systems and processes lies not in the gathering, storing and processing of data.  Today’s technology has made this relatively easy and common.  Rather, it lies in the integration and management needed to provide the highest quality data in the timeliest fashion at the point of decision – regardless of whether the decision maker is an employee or a customer.  

Data lakes hold the promise of realizing higher RDA by making data more valuable.  The more valuable a company’s data, the higher its RDA.  And higher data efficiency strengthens competitiveness, financial performance and company valuation.


Gabriel Lowy, Technology Content Writer


