Pentaho announced today that it is open sourcing all big data capabilities in Pentaho Kettle 4.3 and moving Pentaho Kettle from the LGPL to the Apache License 2.0.
Pentaho expects open sourcing to accelerate the development of its big data capabilities by driving downloads and hands-on experimentation among big data developers, analysts and data scientists. As with its other initiatives, Pentaho expects this decision to create advocates within each big data community and the Pentaho Kettle community. The aim is to make PDI/Kettle the de facto standard for operationalising big data, which could in turn provide an on-ramp for new deployments of the full Pentaho Business Analytics suite around the world.
According to Zachary Zeus of BizCubed, "The practical effects of this announcement are that many more people within organisations will be able to use Pentaho Data Integration to 'hard wire' analytics into business processes in an extremely cost-effective manner. It's another tool for creating strategic knowledge assets." He added that "BizCubed will be adding these Big Data developments to Australia and New Zealand training events".
Pentaho believes that Kettle for Big Data delivers the following key benefits to developers, analysts and data scientists:
- Delivers a 10x boost in productivity for developers
  - Visual tools reduce or eliminate the need to write code such as Java MapReduce, Pig, Hive, or NoSQL database scripts;
- Makes big data platforms usable for a much broader range of developers
  - Previously, big data platforms were usable only by developers with deep, specific skills, such as the ability to write Hadoop MapReduce jobs and Pig scripts;
- Enables easy visual orchestration of big data tasks
  - These include Hadoop MapReduce jobs, Pentaho MapReduce jobs, Pig scripts, Hive queries and HBase queries, as well as traditional IT tasks such as data mart/warehouse loads and operational extract-transform-load jobs;
- Leverages the full capabilities of each big data platform
  - Native integration with each one, while enabling easy co-existence and migration between big data platforms and traditional relational databases;
- Provides an easy on-ramp to Pentaho Business Analytics
  - Full data discovery and visualization capabilities, including reporting, dashboards, interactive data analysis, data mining and predictive data analysis.
Integration Q & A
Big data capabilities available under open source Pentaho Kettle 4.3 include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores:
- Apache Cassandra, Hadoop HDFS, Hadoop MapReduce, Apache Hive, Apache HBase, MongoDB, Hadapt Adaptive Analytical Platform and HPCC.
In addition, Pentaho Kettle makes available job orchestration steps for:
- Hadoop, Amazon EMR, Pentaho MapReduce, HDFS File Operations, and Pig scripts.
Pentaho Kettle can execute ETL transforms:
- Outside the Hadoop cluster
- Or within the nodes of the cluster, taking advantage of Hadoop’s distributed processing and reliability
Pentaho Kettle’s Hadoop capabilities work with all major Hadoop distributions:
- Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, Greenplum HD, Hortonworks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Editions.





