Hadoop has scaled up in many ways and opened its doors to technical people of all levels. Of course, Java programmers have an edge over others when it comes to Hadoop development. However, if you are new to high-level languages like Java or Jython, no worries! Apache Pig is there to handle all kinds of data manipulation, whether the data is structured or unstructured. That is what makes “Pig” and “Hadoop” such closely related terms in the Hadoop family. The purpose of Apache Pig is to generate MapReduce jobs over large data sets instead of requiring you to write complex Java code by hand.
Hadoop, however, has evolved over the years, driven by growing user demands in the field of data analysis over massive amounts of data. Consequently, every Hadoop component has shipped new features with each release, and Apache Pig is no exception. In this article, we will take a closer look at the main areas of change across the major Apache Pig releases.
Apache Pig in a Few Words
Apache Pig is a high-level scripting language that makes a Hadoop developer’s life easier when building complex data transformations. Its SQL-like procedural language is widely known as Pig Latin, and it is easy to pick up for users who already know other scripting languages. To handle real business problems, Pig’s User Defined Function (UDF) feature works wonderfully: it can efficiently invoke code written in other languages such as Java and JRuby, and developers can also embed Pig scripts in other languages.
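To get a feel for Pig Latin, here is a minimal sketch (the file paths and field names are purely illustrative): it loads a comma-separated file, filters the rows, and stores the result, and Pig turns these few lines into MapReduce jobs behind the scenes.

```
-- Minimal Pig Latin sketch; paths and field names are hypothetical.
users  = LOAD 'input/users.txt' USING PigStorage(',')
         AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
STORE adults INTO 'output/adults' USING PigStorage(',');
```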
Why is Apache Pig Useful When Hadoop has its MapReduce?
Both MapReduce and Pig do data processing. However, MapReduce works at a low level of abstraction, while Pig processes large data sets at a much higher level; a series of Pig transformations compiles down to a series of MapReduce jobs. Beyond abstraction, there are further framework-level differences between MapReduce processing and Pig processing.
With Pig Latin, you can perform almost all the standard data-processing operations, such as group by, join, filter, union, and order by, each in a line or two. MapReduce, by contrast, natively gives you only grouping (through its shuffle phase); operations like order by, filter, projection, and join are not provided, so the user has to write a custom program for each of them.
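The sketch below (with hypothetical datasets) shows how several of these operations read in Pig Latin; each line would otherwise require its own hand-written MapReduce logic.

```
-- Hypothetical order/customer datasets.
orders    = LOAD 'input/orders'    AS (order_id:int, cust_id:int, amount:double);
customers = LOAD 'input/customers' AS (cust_id:int, name:chararray);
joined    = JOIN orders BY cust_id, customers BY cust_id;      -- join
grouped   = GROUP joined BY customers::name;                   -- group by
totals    = FOREACH grouped
            GENERATE group AS name, SUM(joined.orders::amount) AS total;
big       = FILTER totals BY total > 1000.0;                   -- filter
ranked    = ORDER big BY total DESC;                           -- order by
STORE ranked INTO 'output/top_customers';
```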
Apache Pig Hadoop Versions Over the Years
From its incubation days until today, Apache Pig has evolved through twenty-four releases alongside different versions of Hadoop.
Apache Pig Evolution in Hadoop 1.0 Series
The first release of Apache Pig came with Hadoop 0.18, while the project was still in incubation, and it was not a stable release from the Hadoop perspective. The next release, a maintenance release, was the first one shipped with Pig as a Hadoop subproject. The subsequent releases, from Pig 0.1.1 up to 0.10, brought the following notable changes in Pig functionality and performance.
Features included
- A five-fold performance gain
- Multi-query optimization, which shares computation across multiple queries within a single Pig script (see the sketch after this list)
- Two new join types: skewed join and merge join
- Performance and memory-usage improvements
- The Accumulator interface for UDFs
- New LoadFunc and StoreFunc interfaces
- Support for custom partitioners
- Python UDFs
- Control structures, a reworked query parser, and semantic cleanup
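To make a few of these features concrete, here is a hedged sketch in Pig Latin; the datasets and the UDF script name are hypothetical.

```
-- Multi-query optimization: both STORE statements share the same LOAD,
-- so Pig executes the common work only once.
logs    = LOAD 'input/logs' AS (user:chararray, url:chararray, bytes:long);
by_user = GROUP logs BY user;
STORE by_user INTO 'output/by_user';
by_url  = GROUP logs BY url;
STORE by_url INTO 'output/by_url';

-- Skewed join: for join keys with heavily skewed distributions.
users = LOAD 'input/users' AS (user:chararray, country:chararray);
j1 = JOIN logs BY user, users BY user USING 'skewed';

-- Merge join: both inputs must already be sorted on the join key.
j2 = JOIN logs BY user, users BY user USING 'merge';

-- Python UDFs (run via Jython); myfuncs.py is a hypothetical script.
REGISTER 'myfuncs.py' USING jython AS myfuncs;
```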
The most significant Apache Pig release of the Hadoop 1.0 era was version 0.10.0.
Features included
- Boolean datatype
- JRuby UDFs
- Nested cross/foreach (see the sketch after this list)
- LIMIT by expression
- A default destination for SPLIT
- Map-side aggregation
- Tuple/bag/map syntax support
- Source-code-only distribution
- Better support for Apache Hadoop 2 via separate Maven artifacts
- Better support for Oracle JDK 7
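A short sketch of some of these 0.10.0 features, again with a hypothetical dataset:

```
-- boolean datatype in the schema
sales = LOAD 'input/sales'
        AS (region:chararray, item:chararray, amount:double, returned:boolean);
kept  = FILTER sales BY NOT returned;
by_region = GROUP kept BY region;
top3 = FOREACH by_region {
    -- per-group ordering with a nested LIMIT
    -- (since 0.10, the LIMIT count can also be an expression)
    sorted = ORDER kept BY amount DESC;
    few    = LIMIT sorted 3;
    -- nested foreach (new in 0.10): project inside the nested block
    items  = FOREACH few GENERATE item, amount;
    GENERATE group AS region, items;
};
```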
Apache Pig Evolution in Hadoop 2.0 Onwards
Hadoop 2.x is significantly different from Hadoop 1.x in many ways. It offers:
- More scalability through YARN
- The ability to run non-MapReduce jobs
- High availability for the NameNode
- Native Windows support
- Better cluster utilization
- A move beyond the batch-only approach
Hence, it demands better performance from ecosystem tools like Pig.
The first major release of Apache Pig in the Hadoop 2.x series was 0.12.0.
Features included
- ASSERT operator – for data validation (see the sketch after this list)
- Streaming UDFs – write UDFs that run outside the JVM
- New AvroStorage – now ships as a Pig built-in and is faster
- IN and CASE operators
- BigInteger and BigDecimal data types – some applications need calculations with a high degree of precision, and these types make such precise calculations possible
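Here is a hedged sketch of the 0.12.0 additions; the dataset and values are illustrative.

```
-- BigDecimal datatype for high-precision amounts.
txns = LOAD 'input/txns' AS (id:long, status:chararray, amount:bigdecimal);

-- ASSERT: fail the job if any record violates the condition.
ASSERT txns BY id > 0, 'id must be positive';

-- IN operator inside a FILTER.
open = FILTER txns BY status IN ('NEW', 'PENDING');

-- CASE expression.
labeled = FOREACH txns GENERATE id,
    (CASE status
        WHEN 'NEW'     THEN 'fresh'
        WHEN 'PENDING' THEN 'waiting'
        ELSE 'other'
     END) AS label;
```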
Hadoop 2.x opened the door to non-MapReduce engines, so Apache Pig 0.13.0 brought the changes needed to run on them. Along with that, it included:
- An auto-local mode that runs jobs with small input data in-process (see the sketch after this list)
- Fetch optimization
- Fixed counters for local mode
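These optimizations are controlled through properties; a minimal sketch, assuming the documented property names and an illustrative size threshold:

```
-- Auto-local mode: run in-process when the input is below a size threshold.
set pig.auto.local.enabled true;
set pig.auto.local.input.maxbytes 100000000;  -- roughly 100 MB

-- Fetch optimization: simple DUMP-only queries skip launching a full job.
set opt.fetch true;
```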
As Hadoop introduced the high-performance Apache Tez engine, data processing scaled up from terabytes to petabytes. The main feature of Apache Pig 0.14.0 is Pig on Tez, although Pig on Tez stabilized only in the next release. The release also came with improved Tez auto-parallelism and introduced support for the ORC file format.
The latest release of Apache Pig, 0.17.0, introduced Pig on Spark, an engine that is already a high performer in Hadoop operations.
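A brief sketch of how these engines and the ORC support surface to a Pig user; the script and paths are hypothetical.

```
-- The same script can run on different engines, chosen at launch time:
--   pig -x tez   script.pig    (Tez, since Pig 0.14.0)
--   pig -x spark script.pig    (Spark, since Pig 0.17.0)
data = LOAD 'input/data' AS (id:long, value:chararray);

-- OrcStorage (since Pig 0.14.0) reads and writes ORC files.
STORE data INTO 'output/data_orc' USING OrcStorage();
```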
Hadoop keeps progressing, and Hadoop 3.0 is already on the market with further enhancements, so we can expect new Pig features in upcoming releases.
Bottom Line
Working in a Hadoop environment means working with the whole Hadoop ecosystem and the tools it supports, and once you work with Pig and Hadoop in their integrated form, you will get a much better picture of that ecosystem. Hence, if your ambition is to become a big data Hadoop architect or developer, you must be familiar with the entire ecosystem.
However, passion does not fulfill itself unless you set a goal for it. Hadoop is a vast area to cover, and you need the right orientation; following the path of a renowned certification in this field is probably the best and most useful roadmap to reach that goal. Cloudera is the most sought-after platform for Hadoop, and its CCA Administrator (CCA-131) certification covers the entire Hadoop ecosystem, including tools like Apache Pig, Hive, and Impala.
Whizlabs gives you an opportunity to gain broad knowledge of the subject matter through its self-study guide for the Cloudera Certified Associate Administrator (CCA-131) certification.
It offers complete coverage of the certification preparation, including hands-on practice. So leverage the power of knowledge with us and become a successful Hadoop professional of tomorrow!