
An Introduction to PySpark RDDs: Transformations, Actions, and Caching

Working with RDDs

Resilient Distributed Datasets (RDDs) are a fundamental data structure in PySpark. An RDD is an immutable, fault-tolerant collection of elements that can be processed in parallel across multiple nodes in a cluster. RDDs can be created from a variety of data sources, including HDFS, local file systems, and external data sources such as Cassandra and HBase.



In this article, we'll explore how to work with RDDs in PySpark. We'll cover the basics of RDDs, how to create them, transformations and actions that can be performed on RDDs, the concept of lazy evaluation, and how to cache RDDs for improved performance.

Understanding RDDs


RDDs are the building blocks of PySpark applications. They provide a simple and flexible way to work with data in a distributed computing environment. An RDD is essentially a collection of elements that can be divided into partitions and processed in parallel across multiple nodes in a cluster.


RDDs are immutable, which means that once created, their contents cannot be changed. However, new RDDs can be created from existing RDDs through a process of transformation.
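As a quick illustration (using sc.parallelize(), which is covered in the next section, to build a small RDD), applying a transformation leaves the original RDD untouched and produces a new one:

from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)  # returns a new RDD
print(rdd.collect())      # Output: [1, 2, 3] (original RDD is unchanged)
print(doubled.collect())  # Output: [2, 4, 6]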


Creating RDDs

RDDs can be created in multiple ways; some of the most common are as follows:


Parallelizing an existing collection

We can parallelize an existing collection using the SparkContext.parallelize() method. This method takes a collection (such as a list or tuple) as an argument and creates an RDD from it.

For example:

from pyspark import SparkContext 
sc = SparkContext("local", "RDD Example")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

Reading from a file

We can also create RDDs by reading data from a file. Spark supports various file formats, such as text files, CSV files, sequence files, and Avro files. To create an RDD from a file, we can use the SparkContext.textFile() method. This method takes a file path as an argument and returns an RDD of strings, where each string is a line in the file.

For example:

from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
rdd = sc.textFile("file.txt")

Transformations on existing RDDs

We can also create RDDs by applying transformations on existing RDDs. We will discuss transformations in detail in the next section.


Transformations and actions on RDDs

RDDs support two types of operations: transformations and actions. Transformations are operations that create a new RDD from an existing RDD, while actions are operations that trigger computation on an RDD and return a value to the driver program or write data to an external storage system.
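
Although the rest of this section focuses on transformations, a few actions appear throughout the examples, so here is a minimal sketch of some commonly used ones (reusing the SparkContext sc created in the earlier examples):

rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())                   # Output: [1, 2, 3, 4, 5]
print(rdd.count())                     # Output: 5
print(rdd.first())                     # Output: 1
print(rdd.take(3))                     # Output: [1, 2, 3]
print(rdd.reduce(lambda a, b: a + b))  # Output: 15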


Transformations

Transformations are lazy operations that create a new RDD from an existing RDD. This means that when a transformation is applied on an RDD, it does not immediately execute the operation. Instead, it creates a new RDD which is dependent on the parent RDD and keeps track of the transformation applied to it. The transformation is only executed when an action is called on the RDD.
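
To see lazy evaluation in practice, consider the following sketch (again reusing sc from the earlier examples): the map() call returns immediately without touching the data, and the computation only runs when an action is invoked.

rdd = sc.parallelize(range(1, 11))
squared = rdd.map(lambda x: x * x)  # only records the transformation; no work is done yet
print(squared.take(5))              # the action triggers the computation; Output: [1, 4, 9, 16, 25]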

Here are some common transformations on RDDs:


map(): This transformation applies a function to each element of the RDD and returns a new RDD with the transformed values.


For example:
from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
transformed_rdd = rdd.map(lambda x: x * 2)

filter(): This transformation filters the elements of the RDD based on a given condition and returns a new RDD with the filtered values.


For example:

from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)

flatMap(): This transformation applies a function to each element of the RDD and returns a new RDD with the flattened results.


For example:

from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
data = ["Hello world", "Goodbye world"]
rdd = sc.parallelize(data)
flat_rdd = rdd.flatMap(lambda x: x.split(" "))

distinct(): This transformation returns a new RDD containing only the distinct elements of the original RDD. For example:

rdd = sc.parallelize([1, 2, 3, 1, 2, 4, 5, 4, 6])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())  # Output: [1, 2, 3, 4, 5, 6] (order may vary)
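
Because each transformation returns a new RDD, transformations can be chained, and nothing is computed until an action is called at the end of the chain. A small illustrative sketch:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(result.collect())  # Output: [20, 40, 60]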


Caching RDDs

In PySpark, an RDD is by default recomputed every time an action is triggered on it. Therefore, if we need to perform several actions on the same RDD, it is recommended to cache it in memory for faster access. We can use the cache() method to cache an RDD.

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd.cache()
print(rdd.count())  # Output: 6
print(rdd.first())  # Output: 1

Once cache() has been called, the first action on the RDD computes it from the source and stores the result in memory. Subsequent actions on the same RDD read the cached data instead of recomputing it.


We can also use the persist() method to specify the storage level of the RDD. The default storage level is MEMORY_ONLY, which keeps the RDD in memory (in PySpark, the cached data is stored as serialized objects). Other storage levels, such as MEMORY_AND_DISK and DISK_ONLY, can be used depending on the size of the RDD and the available resources.

from pyspark.storagelevel import StorageLevel

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd.persist(StorageLevel.DISK_ONLY)
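
Once an RDD has been persisted, we can check its storage level with getStorageLevel() and the is_cached attribute, and release the cached data with unpersist() when it is no longer needed. For example, continuing the snippet above:

print(rdd.getStorageLevel())  # Output: e.g. Disk Serialized 1x Replicated
print(rdd.is_cached)          # Output: True
rdd.unpersist()               # removes the RDD's blocks from memory and disk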


Conclusion

In this article, we have learned about RDDs and their transformations and actions in PySpark. We have also discussed lazy evaluation, which allows PySpark to optimize the execution plan of a job. Additionally, we have covered caching RDDs in memory for faster access and how to control the storage level of RDDs using the persist() method. PySpark provides a powerful framework for processing large datasets, and understanding RDDs is essential for developing efficient PySpark applications.


