Hadoop has scaled up in many ways and opened its doors to technical people of all levels. Of course, Java programmers have an edge over others when it comes to Hadoop development. However, if you are new to high-level languages like Java or Jython, no worries! Apache Pig is there to handle all kinds of data manipulation, whether the data is structured or unstructured. That is what makes “Pig” and “Hadoop” such closely related terms in the Hadoop family. The purpose of Apache Pig is to generate MapReduce jobs over large data sets instead of requiring you to write complex Java code by hand.
Hadoop, however, has evolved over the years, driven by growing user demands in the field of data analysis over massive amounts of data. Consequently, every Hadoop component has shipped new features with each release, and Apache Pig is no exception. In this article, we will take a closer look at the main areas of change across the major Apache Pig releases.
Apache Pig in a Few Words
Apache Pig is a high-level scripting language that makes a Hadoop developer’s life easier when building complex data transformations. Its SQL-like procedural language is widely known as Pig Latin, and it is easy to pick up for users who already know other scripting languages. To handle real business problems, Pig’s User Defined Function (UDF) feature works wonderfully: it can efficiently invoke code written in other languages such as Java and JRuby, and developers can also embed Pig scripts in other languages.
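To get a feel for Pig Latin, here is a minimal sketch (the file paths and field names are purely illustrative): it loads a comma-separated file, filters the rows, and stores the result, and Pig turns these few lines into MapReduce jobs behind the scenes.

```
-- Minimal Pig Latin sketch; paths and field names are hypothetical.
users  = LOAD 'input/users.txt' USING PigStorage(',')
         AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
STORE adults INTO 'output/adults' USING PigStorage(',');
```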
Why is Apache Pig Useful When Hadoop has its MapReduce?
Both MapReduce and Pig do data processing. However, MapReduce works at a low level of abstraction, while Pig processes large data sets at a much higher level; a series of Pig transformations compiles down to a series of MapReduce jobs. Beyond abstraction, there are further framework-level differences between MapReduce processing and Pig processing.
With Pig Latin, you can perform almost all the standard data-processing operations, such as group by, join, filter, union, and order by, each in a line or two. MapReduce, by contrast, natively gives you only grouping (through its shuffle phase); operations like order by, filter, projection, and join are not provided, so the user has to write a custom program for each of them.
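The sketch below (with hypothetical datasets) shows how several of these operations read in Pig Latin; each line would otherwise require its own hand-written MapReduce logic.

```
-- Hypothetical order/customer datasets.
orders    = LOAD 'input/orders'    AS (order_id:int, cust_id:int, amount:double);
customers = LOAD 'input/customers' AS (cust_id:int, name:chararray);
joined    = JOIN orders BY cust_id, customers BY cust_id;      -- join
grouped   = GROUP joined BY customers::name;                   -- group by
totals    = FOREACH grouped
            GENERATE group AS name, SUM(joined.orders::amount) AS total;
big       = FILTER totals BY total > 1000.0;                   -- filter
ranked    = ORDER big BY total DESC;                           -- order by
STORE ranked INTO 'output/top_customers';
```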
Apache Pig Hadoop Versions Over the Years
From its incubation days until today, Apache Pig has evolved through twenty-four releases alongside different versions of Hadoop.
Apache Pig Evolution in Hadoop 1.0 Series
The first release of Apache Pig came with Hadoop 0.18, while the project was still in incubation, and it was not a stable release from the Hadoop perspective. The next release, a maintenance release, was the first one shipped with Pig as a Hadoop subproject. The subsequent releases, from Pig 0.1.1 up to 0.10, brought the following notable changes in Pig functionality and performance.
Features included
- A five-fold performance gain
- Multi-query optimization, which shares computation across multiple queries within a single Pig script (see the sketch after this list)
- Two new join types: skewed join and merge join
- Performance and memory-usage improvements
- The Accumulator interface for UDFs
- New LoadFunc and StoreFunc interfaces
- Support for custom partitioners
- Python UDFs
- Control structures, a reworked query parser, and semantic cleanup
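To make a few of these features concrete, here is a hedged sketch in Pig Latin; the datasets and the UDF script name are hypothetical.

```
-- Multi-query optimization: both STORE statements share the same LOAD,
-- so Pig executes the common work only once.
logs    = LOAD 'input/logs' AS (user:chararray, url:chararray, bytes:long);
by_user = GROUP logs BY user;
STORE by_user INTO 'output/by_user';
by_url  = GROUP logs BY url;
STORE by_url INTO 'output/by_url';

-- Skewed join: for join keys with heavily skewed distributions.
users = LOAD 'input/users' AS (user:chararray, country:chararray);
j1 = JOIN logs BY user, users BY user USING 'skewed';

-- Merge join: both inputs must already be sorted on the join key.
j2 = JOIN logs BY user, users BY user USING 'merge';

-- Python UDFs (run via Jython); myfuncs.py is a hypothetical script.
REGISTER 'myfuncs.py' USING jython AS myfuncs;
```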
The most significant Apache Pig release of the Hadoop 1.0 era was version 0.10.0.
Features included
- Boolean datatype
- JRuby UDFs
- Nested cross/foreach (see the sketch after this list)
- LIMIT by expression
- A default destination for SPLIT
- Map-side aggregation
- Tuple/bag/map syntax support
- Source-code-only distribution
- Better support for Apache Hadoop 2 via separate Maven artifacts
- Better support for Oracle JDK 7
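A short sketch of some of these 0.10.0 features, again with a hypothetical dataset:

```
-- boolean datatype in the schema
sales = LOAD 'input/sales'
        AS (region:chararray, item:chararray, amount:double, returned:boolean);
kept  = FILTER sales BY NOT returned;
by_region = GROUP kept BY region;
top3 = FOREACH by_region {
    -- per-group ordering with a nested LIMIT
    -- (since 0.10, the LIMIT count can also be an expression)
    sorted = ORDER kept BY amount DESC;
    few    = LIMIT sorted 3;
    -- nested foreach (new in 0.10): project inside the nested block
    items  = FOREACH few GENERATE item, amount;
    GENERATE group AS region, items;
};
```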
Apache Pig Evolution in Hadoop 2.0 Onwards
Hadoop 2.x is significantly different from Hadoop 1.x in many ways. It offers:
- More scalability through YARN
- The ability to run non-MapReduce jobs
- High availability for the NameNode
- Native Windows support
- Better cluster utilization
- A move beyond the batch-only approach
Hence, it demands better performance from ecosystem tools like Pig.
The first major release of Apache Pig in the Hadoop 2.x series was 0.12.0.
Features included
- ASSERT operator – for data validation (see the sketch after this list)
- Streaming UDFs – write UDFs that run outside the JVM
- New AvroStorage – now ships as a Pig built-in and is faster
- IN and CASE operators
- BigInteger and BigDecimal data types – some applications need calculations with a high degree of precision, and these types make such precise calculations possible
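Here is a hedged sketch of the 0.12.0 additions; the dataset and values are illustrative.

```
-- BigDecimal datatype for high-precision amounts.
txns = LOAD 'input/txns' AS (id:long, status:chararray, amount:bigdecimal);

-- ASSERT: fail the job if any record violates the condition.
ASSERT txns BY id > 0, 'id must be positive';

-- IN operator inside a FILTER.
open = FILTER txns BY status IN ('NEW', 'PENDING');

-- CASE expression.
labeled = FOREACH txns GENERATE id,
    (CASE status
        WHEN 'NEW'     THEN 'fresh'
        WHEN 'PENDING' THEN 'waiting'
        ELSE 'other'
     END) AS label;
```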
Hadoop 2.x opened the door to non-MapReduce engines, so Apache Pig 0.13.0 brought the changes needed to run on them. Along with that, it included:
- An auto-local mode that runs jobs with small input data in-process (see the sketch after this list)
- Fetch optimization
- Fixed counters for local mode
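These optimizations are controlled through properties; a minimal sketch, assuming the documented property names and an illustrative size threshold:

```
-- Auto-local mode: run in-process when the input is below a size threshold.
set pig.auto.local.enabled true;
set pig.auto.local.input.maxbytes 100000000;  -- roughly 100 MB

-- Fetch optimization: simple DUMP-only queries skip launching a full job.
set opt.fetch true;
```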
As Hadoop introduced the high-performance Apache Tez engine, data processing scaled up from terabytes to petabytes. The main feature of Apache Pig 0.14.0 is Pig on Tez, although Pig on Tez stabilized only in the next release. The release also came with improved Tez auto-parallelism and introduced support for the ORC file format.
The latest release of Apache Pig, 0.17.0, introduced Pig on Spark, an engine that is already a high performer in Hadoop operations.
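A brief sketch of how these engines and the ORC support surface to a Pig user; the script and paths are hypothetical.

```
-- The same script can run on different engines, chosen at launch time:
--   pig -x tez   script.pig    (Tez, since Pig 0.14.0)
--   pig -x spark script.pig    (Spark, since Pig 0.17.0)
data = LOAD 'input/data' AS (id:long, value:chararray);

-- OrcStorage (since Pig 0.14.0) reads and writes ORC files.
STORE data INTO 'output/data_orc' USING OrcStorage();
```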
Hadoop keeps progressing, and Hadoop 3.0 is already on the market with further enhancements, so we can expect new Pig features in upcoming releases.
Bottom Line
Working in a Hadoop environment means working with the whole Hadoop ecosystem and the tools it supports, and once you work with Pig and Hadoop in their integrated form, you will get a much better picture of that ecosystem. Hence, if your ambition is to become a big data Hadoop architect or developer, you must be familiar with the entire ecosystem.
However, passion does not fulfill itself unless you set a goal for it. Hadoop is a vast area to cover, and you need the right orientation; following the path of a renowned certification in this field is probably the best and most useful roadmap to reach that goal. Cloudera is the most sought-after platform for Hadoop, and its CCA Administrator (CCA-131) certification covers the entire Hadoop ecosystem, including tools like Apache Pig, Hive, and Impala.
Whizlabs gives you an opportunity to gain broad knowledge of the subject matter through its self-study guide for the Cloudera Certified Associate Administrator (CCA-131) certification.
It offers complete coverage of the certification preparation, including hands-on practice. So leverage the power of knowledge with us and become a successful Hadoop professional of tomorrow!