Create Spark RDD Using Parallelize Method

At the beginning I will teach you what Spark’s parallelize method is and how to use it to create an RDD. It is a fundamental method that creates an RDD from a local Scala collection. The official description of this method is:

Distribute a local Scala collection to form an RDD.

Params:
       seq – Scala collection to distribute
       numSlices – number of partitions to divide the collection into
Returns:
       RDD representing distributed collection

Note: Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection. Pass a copy of the argument to avoid this.

Note: avoid using parallelize(Seq()) to create an empty RDD. Consider emptyRDD for an RDD with no partitions, or parallelize(Seq[T]()) for an RDD of T with empty partitions.
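
Here is a minimal sketch of that second note in practice, assuming a local SparkSession (the object and value names are only illustrative): emptyRDD gives you an RDD with zero partitions, while parallelize(Seq[T]()) gives you the default number of partitions, all of them empty.

import org.apache.spark.sql.SparkSession

object EmptyRDDSketch extends App {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("empty-rdd-sketch")
    .getOrCreate()

  val sc = spark.sparkContext

  // An RDD of Int with no partitions at all.
  val noPartitions = sc.emptyRDD[Int]

  // An RDD of Int with the default number of partitions, each of them empty.
  val emptyPartitions = sc.parallelize(Seq[Int]())

  println(s"emptyRDD partitions: ${noPartitions.getNumPartitions}")
  println(s"parallelize(Seq[Int]()) partitions: ${emptyPartitions.getNumPartitions}")
  println(s"Both are empty: ${noPartitions.isEmpty} / ${emptyPartitions.isEmpty}")

  spark.stop()
}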

This post is a part of the Free Spark Tutorial!

How Does Spark Parallelize Work?

Let’s assume we have created a Scala collection that we want to parallelize to the Spark Executors, so that the sequence of data can be processed in a distributed way. Take a look at what happens on the Spark Driver and the Spark Executors in the following diagram:

[Diagram: Create Spark RDD Using Parallelize Method]
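
To see that distribution in code, here is a small, illustrative sketch (the object name and the numbers are mine, not from the diagram): the numSlices argument controls how many partitions the local collection is split into, and glom() lets you inspect which elements ended up in which partition.

import org.apache.spark.sql.SparkSession

object ParallelizePartitionsSketch extends App {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("partitions-sketch")
    .getOrCreate()

  // Explicitly ask for 3 partitions instead of the default parallelism.
  val rdd = spark.sparkContext.parallelize(1 to 9, numSlices = 3)

  // glom() groups the elements of each partition into an array,
  // so we can see how the sequence was sliced.
  rdd.glom().collect().zipWithIndex.foreach { case (partition, index) =>
    println(s"Partition $index: ${partition.mkString(", ")}")
  }

  spark.stop()
}

With the range 1 to 9 and numSlices = 3, this prints Partition 0: 1, 2, 3, then Partition 1: 4, 5, 6 and Partition 2: 7, 8, 9.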

How Does Spark Collect Work?

Now we will take a look at how Spark collect works. Basically, the collect() method is used to bring data from the Executors back to the Spark Driver. You have to be very careful when using collect(), because it can kill your Spark Job: a huge amount of data may be transferred from the Executors to the Driver.

Avoid such situations. The Driver is not a place to process data: it is the manager, or boss, of your application, and as we know the boss is not a manual worker 🙂
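
As a rough illustration of that advice (all names and numbers here are my own), this sketch keeps the heavy lifting on the Executors and brings only a small result back to the Driver:

import org.apache.spark.sql.SparkSession

object CollectSafelySketch extends App {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("collect-safely")
    .getOrCreate()

  val bigRDD = spark.sparkContext.parallelize(1 to 1000000)

  // Risky on a real cluster: collect() ships every single element to the Driver.
  // val everything = bigRDD.collect()

  // Safer: aggregate on the Executors and bring back only the final value...
  val total = bigRDD.map(_.toLong).reduce(_ + _)

  // ...or fetch just a small sample for inspection.
  val sample = bigRDD.take(5)

  println(s"Sum computed on the Executors: $total")
  println(s"First 5 elements only: ${sample.mkString(", ")}")

  spark.stop()
}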

[Diagram: how Spark collect() transfers data from the Executors to the Driver]

Create Spark RDD Using Parallelize Method In Scala

The Spark parallelize method is available on the SparkContext object. In Apache Spark, partitions are the fundamental units of parallelism, and an RDD is a collection of partitions.

If you would like to deepen your knowledge of the SparkSession object, please read this post. I described all the main details about SparkSession, SparkContext and SQLContext, which are the fundamentals of Spark.

import org.apache.spark.sql.SparkSession

object RDDTutorialParallelize extends App {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("BigData-ETL.com")
    .getOrCreate()

  val dataAsRDD = spark.sparkContext.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

  println(s"Is RDD empty?: ${dataAsRDD.isEmpty}")
  println(s"First element of RDD: ${dataAsRDD.first}")
  println(s"Number of RDD Partitions: ${dataAsRDD.getNumPartitions}")

  dataAsRDD.collect().foreach(element => println(s"Element: $element"))

}

When you execute the above code you will see the following output. Note that the number of partitions depends on your machine: with master("local[*]") the default parallelism equals the number of available CPU cores, so the 32 shown below will most likely differ for you. If you are interested in why you had to call the collect() method on the RDD before iterating and printing, check this post:

Is RDD empty?: false
First element of RDD: 1
Number of RDD Partitions: 32
Element: 1
Element: 2
Element: 3
Element: 4
Element: 5
Element: 6
Element: 7
Element: 8
Element: 9
Element: 10
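
The short version, as a hedged sketch rather than a rule from this post: foreach called directly on the RDD runs on the Executors, so in cluster mode its println output ends up in the Executor logs, while collect().foreach runs on the Driver and prints to the console shown above. You can see the distinction by adding two lines like these to the example (in local[*] mode both run in the same JVM, so the difference only becomes visible on a real cluster):

  // Runs on the Executors: in cluster mode this prints to the Executor logs, not here.
  dataAsRDD.foreach(element => println(s"Executor side: $element"))

  // Runs on the Driver: the data is first shipped back with collect().
  dataAsRDD.collect().foreach(element => println(s"Driver side: $element"))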

Summary

Cool! You can mark your first meeting with Spark as done! 🙂

Could You Please Share This Post? 
I appreciate It And Thank YOU! :)
Have A Nice Day!
