Over time, big data analysis has reached a new magnitude, which in turn has changed its mode of operation and its expectations. Today's big data analysis deals not only with massive volumes of data but also with a target of fast turnaround time. Though Hadoop remains the core technology behind big data analysis, it has some shortfalls when it comes to fast processing. However, with the entry of Spark into the Hadoop world, data processing speed has met most expectations.
Moreover, when we talk about Spark, the first term that comes to mind is Resilient Distributed Dataset (RDD), or Spark RDD, which makes data processing faster. It is the key feature of Spark that enables logical partitioning of datasets during computation.
In this blog, we will discuss the technical aspects of Spark RDD, which we believe will help you as a developer understand Spark RDD along with its underlying technical details. In addition, this blog will give you an overview of how RDDs are used in Spark.
Spark RDD and its features
RDD stands for Resilient Distributed Dataset, where each term signifies one of its features.
- Resilient: fault tolerant through the RDD lineage graph (DAG), which makes it possible to recompute lost partitions in case of node failure.
- Distributed: the data of an RDD resides on multiple nodes of the cluster.
- Dataset: the collection of data records that you work with.
Designing such an abstraction on plain Hadoop is a challenge. Spark RDD, however, solves it effectively through lazy evaluation: RDDs in Spark are computed on demand, which saves a lot of data processing time and improves the efficiency of the whole process.
Hadoop MapReduce has many shortcomings that Spark RDD overcomes through its features, and this is the main reason for the popularity of Spark RDD.
Spark RDD Core Features in a Nutshell
- In-memory Computation
- Lazy Evaluation
- Fault Tolerance
- Immutability
- Partitioning
- Persistence
- Coarse-grained Operations
- Location-Stickiness
We will discuss these points gradually in the next sections.
Understanding Spark RDD Technical Features
Spark RDD is a way of representing datasets distributed across multiple nodes so that they can be operated on in parallel. In other words, Spark RDD is the main fault-tolerant abstraction of Apache Spark and also its fundamental data structure. An RDD is an immutable distributed collection of objects that supports data caching through two methods:
- cache()
- persist()
The in-memory caching technique of Spark RDD works on logically partitioned datasets. The beauty of in-memory caching is that if the data does not fit in memory, Spark either spills the excess data to disk or recomputes it when needed; this is part of why it is called resilient. As a result, you can access a cached RDD in Spark whenever you require it, which makes the overall data processing faster.
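As a minimal sketch of caching, assuming a SparkContext named sc is available (for example in spark-shell) and using made-up data for illustration:

val nums = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)
squares.cache()                  // cache() uses the default MEMORY_ONLY storage level
println(squares.count())         // the first action computes and caches the partitions
println(squares.reduce(_ + _))   // later actions reuse the cached data instead of recomputing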
For in-memory processing, Spark can be up to 100 times faster than Hadoop MapReduce.
Operations Supported by Spark RDD
RDD in Spark supports two types of operations:
- Transformations
- Actions
Transformations
A transformation creates a new dataset from an existing one. As a Spark RDD example of a transformation, map passes each dataset element through a function and returns a new RDD that represents the result.
The programmatic view of the above example in different languages would be:
In Scala:
val l = sc.textFile("example.txt")
val lLengths = l.map(s => s.length)
val totalLength = lLengths.reduce((a, b) => a + b)
Now, if you want to use lLengths later, you can call the persist() method as below:
lLengths.persist()
You can refer to the API docs at https://spark.apache.org/ for the detailed list of transformations supported by Spark RDD.
There are two types of transformations supported by Spark RDD:
- Narrow transformation
- Wide transformation
In a narrow transformation, each partition of the output RDD depends on a single partition of the parent RDD, so no data has to move between partitions. In a wide transformation, each partition of the output RDD may be built from many parent RDD partitions; in other words, it is also known as a shuffle transformation.
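As an illustrative sketch of the difference, again assuming a SparkContext sc, map below is a narrow transformation while reduceByKey is a wide (shuffle) transformation:

val words = sc.parallelize(Seq("spark", "rdd", "spark", "hadoop"))
val pairs = words.map(w => (w, 1))       // narrow: each output partition depends on one parent partition
val counts = pairs.reduceByKey(_ + _)    // wide: equal keys are shuffled to the same partition
counts.collect().foreach(println)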
All Spark RDD transformations are lazy: they do not compute their results right away. Instead, they only remember the transformations applied to some base dataset, such as the file in the example above. The transformations are computed only when an action requires a result. This, in turn, results in faster and more efficient data processing.
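To see this laziness concretely, you can inspect the lineage Spark records for the lLengths RDD from the earlier example; defining a transformation only records lineage, and the file is read only when an action runs:

println(lLengths.toDebugString)   // prints the recorded lineage (DAG) of the RDD
println(lLengths.count())         // the action is what triggers reading the file and applying map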
By default, each transformed RDD may be recomputed every time you run an action on it. However, with the persist method, Spark can keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting Spark RDDs on disk or replicating them across multiple nodes.
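The storage behavior can be chosen explicitly through a StorageLevel. A brief sketch follows; note that an RDD's storage level can only be set once, so this would be done instead of the plain persist() call shown earlier:

import org.apache.spark.storage.StorageLevel

lLengths.persist(StorageLevel.MEMORY_AND_DISK)    // spill partitions that do not fit in memory to disk
// alternatives: StorageLevel.DISK_ONLY (disk only), StorageLevel.MEMORY_ONLY_2 (in memory, replicated on two nodes)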
Actions
An action returns a value to the driver program after running a computation on the dataset. For example, reduce is an action that aggregates all the RDD elements using some function and returns the final result to the driver program.
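Here is a short sketch of a few common actions, assuming a SparkContext sc and made-up numbers:

val data = sc.parallelize(Seq(2, 3, 4, 5, 6))
println(data.reduce((a, b) => a + b))   // 20 - aggregates the elements and returns the result to the driver
println(data.count())                   // 5 - number of elements in the RDD
println(data.first())                   // 2 - the first element
data.collect().foreach(println)         // collect brings the whole dataset back to the driver program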
How to Create RDD in Spark?
There are three ways to create a Spark RDD:
- Using parallelized collections
- From external datasets (i.e., external storage systems like a shared file system, HBase, or HDFS)
- From existing Apache Spark RDDs
Next, we will discuss each of these methods to see how they are used to create Spark RDDs.
Parallelized Collections
You can create parallelized collections by calling the parallelize method of the SparkContext interface on an existing collection in your driver program in Java, Scala, or Python. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Spark RDD example of parallelized collections in Scala:
To hold the numbers 2 to 6 as a parallelized collection:
val collection = Array(2, 3, 4, 5, 6)
val prData = spark.sparkContext.parallelize(collection)
Here, the created distributed dataset prData can be operated on in parallel. For example, you can call prData.reduce((a, b) => a + b) to add up the elements in the array.
One of the key parameters for parallelized collections is the number of partitions to cut the dataset into. Spark runs one task for each partition of the cluster, and typically 2-4 partitions per CPU in your cluster work well. Spark sets the number of partitions automatically based on your cluster, but you can also set it manually by passing it as a second parameter to parallelize.
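For instance, a hypothetical sketch that requests four partitions explicitly for the collection defined above:

val prData4 = spark.sparkContext.parallelize(collection, 4)
println(prData4.getNumPartitions)   // prints 4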
External Datasets
Apache Spark can create distributed datasets from any Hadoop-supported storage source, including:
- Local file system
- HDFS
- Cassandra
- HBase
- Amazon S3
Spark supports file formats such as:
- Text files
- Sequence Files
- CSV
- JSON
- Any Hadoop Input Format
For example, you can create a text file RDD by using the textFile method of the SparkContext interface. This method takes a URI for the file (either a local path on the system or a URI such as hdfs://, etc.) and reads the file as a collection of lines.
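A minimal sketch, where both paths are placeholders:

val localLines = sc.textFile("data/example.txt")               // local path (placeholder)
val hdfsLines  = sc.textFile("hdfs://namenode:9000/data/logs") // HDFS URI (placeholder)
println(localLines.count())                                    // the file is read as a collection of lines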
An important point here: if you use a path on the local file system, the file must be accessible at the same path on the worker nodes. Hence, you either have to copy the data file to all worker nodes or use a network-mounted shared file system.
You can also use the DataFrame reader interface to load external datasets and then use the .rdd method to convert the Dataset<Row> into an RDD<Row>.
Let's see the example below of reading a text file, which returns a dataset of strings that is then converted to an RDD:
val exDataRDD = spark.read.textFile("path/of/text/file").rdd
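Likewise, for structured formats such as CSV or JSON, the DataFrame reader returns a Dataset<Row>, and .rdd converts it into an RDD<Row>; the path and the header option below are placeholders for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = spark.read.option("header", "true").csv("path/of/csv/file")   // Dataset[Row] (a DataFrame)
val rowRDD: RDD[Row] = df.rdd                                          // converted to RDD[Row]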
From Existing RDDs
RDDs are immutable; hence you cannot change them. However, using transformations, you can create a new RDD from an existing one. As no in-place mutation takes place, consistency is maintained across the cluster. A few of the operations used for this purpose are:
- map
- filter
- distinct
- flatMap
Example:
val seasons = spark.sparkContext.parallelize(Seq("summer", "monsoon", "spring", "winter"))
val seasons1 = seasons.map(s => (s.charAt(0), s))
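The same pattern applies to the other transformations listed above; here is a brief sketch continuing with the seasons RDD:

val longNames = seasons.filter(s => s.length > 5)   // keep only the longer season names
val letters   = seasons.flatMap(s => s.toList)      // one output element per character
val unique    = letters.distinct()                  // remove duplicate characters (involves a shuffle)
unique.collect().foreach(println)                   // collect is an action that returns the results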
Conclusion
If you are an aspiring candidate preparing for the Hortonworks Spark developer certification (HDPCD), then covering all the above RDD features is a must for both the theoretical and practical aspects of the certification.
Whizlabs offers complete coverage of RDD features for the certification exam through its training videos and study materials. The training guide covers the programming aspects mostly in Scala. So, join the course today and get acquainted with Spark RDD.
Have any questions or suggestions? Just mention them in the comment box below or write to us here, and we'll be happy to answer!