Filling the Data Lake with Hadoop and Pentaho

by Zachary Zeus
July 20, 2017

What is the “Filling the Data Lake” blueprint?

The blueprint for filling the data lake describes a modern data onboarding process for ingesting big data into Hadoop data lakes that is flexible, scalable, and repeatable. It streamlines data ingestion from a wide variety of data sources and business users, reduces dependence on hard-coded data movement procedures, and simplifies regular data movement into the data lake at scale.

The “Filling the Data Lake” blueprint provides developers with a roadmap to scale data ingestion processes easily and automate every step of the data pipeline, while simultaneously improving operational efficiency and lowering costs.

“Developers and data analysts need the ability to create one process that can support many different data sources by detecting metadata on the fly and using it to dynamically generate instructions that drive transformation logic in an automated fashion,” says Chuck Yarbrough, Senior Director of Solutions Marketing at Pentaho.

Within the Pentaho platform, this process is referred to as metadata injection.  It helps organizations accelerate productivity and reduce risk in complex data onboarding projects by dynamically scaling out from one template to hundreds of actual transformations.
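To make that idea concrete, here is a minimal Python sketch of “detecting metadata on the fly” (a generic illustration, not Pentaho’s metadata injection API; the file name and type labels are assumptions): the field names and rough types inferred from a source file become the instructions that drive a template transformation.

    import csv

    def detect_metadata(path, sample_rows=100):
        """Infer field names and rough types from a delimited file's header and a sample of rows."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            fields = next(reader)  # the header row supplies the field names
            samples = [row for _, row in zip(range(sample_rows), reader)]

        def infer_type(values):
            # Crude inference: try integer, then number, otherwise fall back to string.
            for cast, label in ((int, "integer"), (float, "number")):
                try:
                    for v in values:
                        if v != "":
                            cast(v)
                    return label
                except ValueError:
                    continue
            return "string"

        return [
            {"name": name, "type": infer_type([row[i] for row in samples if i < len(row)])}
            for i, name in enumerate(fields)
        ]

    # e.g. detect_metadata("customers.csv")
    # -> [{"name": "id", "type": "integer"}, {"name": "email", "type": "string"}, ...]

The point is that nothing about a specific source is hard-coded: the metadata itself becomes the input that drives the process.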

Why use this blueprint for big data success?

Today’s data onboarding projects involve managing an ever-changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance. Whether an organization is implementing an ongoing process for ingesting hundreds of data sources into Hadoop or enabling business users to upload diverse data without IT assistance, onboarding projects tend to create major obstacles, such as repetitive manual design, time-consuming development, the risk of manual errors, and the monopolization of IT resources.

Simplify the ingestion of disparate file sources into Hadoop

It’s easy enough to hard-code ingestion jobs to feed one or two data sources into Hadoop, but once you have a successful proof of concept, every business unit will want to get their data in, which creates headaches if you’re manually hard-coding a different transformation for each source. Pentaho’s unique metadata injection capability allows one transformation to become many, boosting productivity and reducing development time. The “instructions” derived from field names, types, lengths, and other metadata can dynamically generate the actual transformations, drastically reducing the time spent designing them.
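Continuing the hypothetical Python sketch above (again, an illustration of the pattern rather than Pentaho’s implementation), a single template transformation can be written so that its behaviour is driven entirely by the injected metadata, which is what allows one transformation to serve many different sources:

    def transform(record, metadata):
        """Cast and optionally rename one record's fields according to injected metadata."""
        casts = {"integer": int, "number": float, "string": str}
        out = {}
        for field in metadata:
            raw = record.get(field["name"], "")
            target_name = field.get("rename", field["name"])
            out[target_name] = casts[field["type"]](raw) if raw != "" else None
        return out

    # The same template handles a customer feed and an orders feed;
    # only the injected metadata differs.
    customer_meta = [{"name": "id", "type": "integer"}, {"name": "email", "type": "string"}]
    order_meta = [{"name": "order_id", "type": "integer"},
                  {"name": "total", "type": "number", "rename": "order_total"}]

    transform({"id": "42", "email": "a@b.com"}, customer_meta)   # {'id': 42, 'email': 'a@b.com'}
    transform({"order_id": "7", "total": "19.95"}, order_meta)   # {'order_id': 7, 'order_total': 19.95}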

Reduce complexity and costs, while ensuring accuracy of data ingestion

Pentaho has accumulated crucial knowledge and best practices by working with several customers to facilitate enterprise-grade Hadoop data onboarding projects. As a result, the Filling the Data Lake blueprint is fairly prescriptive about the data types involved and the expected business benefits. This blueprint:

  • Streamlines data ingestion from thousands of disparate files or database tables into Hadoop
  • Simplifies regular data movement at scale into Hadoop in the Avro format (see the sketch after this list)
  • Reduces dependence on hard-coded data ingestion procedures, minimizing the risk of manual errors
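As a rough sketch of the Avro landing step mentioned above (this assumes the fastavro Python library and a simple mapping of the inferred types; it is not part of the blueprint itself), the injected metadata can also generate the Avro schema, so the output format stays in step with each source automatically:

    from fastavro import parse_schema, writer

    AVRO_TYPES = {"integer": "long", "number": "double", "string": "string"}

    def to_avro_schema(source_name, metadata):
        """Build a nullable Avro record schema from the injected field metadata."""
        return parse_schema({
            "name": source_name,
            "type": "record",
            "fields": [{"name": f.get("rename", f["name"]), "type": ["null", AVRO_TYPES[f["type"]]]}
                       for f in metadata],
        })

    def write_avro(path, source_name, metadata, records):
        """Write transformed records to an Avro file described by the generated schema."""
        with open(path, "wb") as out:
            writer(out, to_avro_schema(source_name, metadata), records)

    # write_avro("customers.avro", "customers", customer_meta, transformed_records)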

How Metadata Injection works

Here’s a high-level example of how the metadata injection process might look within a large financial services organization. This company uses metadata injection to move thousands of data sources into Hadoop using a streamlined, dynamic integration process.
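To tie the earlier sketches together, here is one last hypothetical driver loop in the same vein (reusing detect_metadata, transform, and write_avro from above; the catalog, target paths, and the assumption of an HDFS mount are illustrative, not the firm’s actual pipeline): one process walks a catalog of sources, injects each source’s metadata into the shared template, and lands the result in the lake as Avro.

    import csv

    def ingest_source(source):
        """Detect metadata, transform rows with the shared template, and write Avro."""
        metadata = detect_metadata(source["path"])              # from the earlier sketch
        with open(source["path"], newline="") as f:
            rows = list(csv.DictReader(f))
        records = [transform(row, metadata) for row in rows]    # one template, many sources
        target = f"/data/lake/raw/{source['name']}/{source['name']}.avro"  # assumes an HDFS mount or NFS gateway
        write_avro(target, source["name"], metadata, records)
        return target

    # A simple catalog drives the whole process; onboarding a new source
    # means adding a row here, not writing a new hard-coded job.
    catalog = [
        {"name": "customers", "path": "landing/customers.csv"},
        {"name": "orders", "path": "landing/orders.csv"},
        # ...thousands more
    ]

    for source in catalog:
        print("loaded", ingest_source(source))

In Pentaho’s case the same pattern is expressed through a template transformation driven by metadata injection rather than hand-written scripts, but the shape of the process is the same: one template, a catalog of sources, and metadata doing the work.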

Zachary Zeus

Zachary Zeus is the Co-CEO & Founder of BizCubed. He brings more than 20 years' engineering experience to the business and a solid background in providing large financial services organisations with data capability. He maintains a passion for providing engineering solutions to real-world problems, lending his considerable experience to enabling people to make better data-driven decisions.
