How to install Apache Spark Standalone on CentOS?

Step #1: Install Java

First, install Java on your machine and check which version was installed.

[root@sparkCentOs pawel] sudo yum install java-1.8.0-openjdk
[root@sparkCentOs pawel] java -version
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
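
If you would rather check the Java version from a script than by eye, a small helper like the one below can pull the major version out of the string that java -version prints. This is only a sketch: the function name java_major_version is made up for illustration, and it assumes the version string looks like "1.8.0_161" (legacy scheme) or "11.0.2" (newer scheme).

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major Java version from a version string.
# "1.8.0_161" -> 8 (legacy "1.x" scheme), "11.0.2" -> 11 (newer scheme).
java_major_version() {
    local version="$1"
    local first="${version%%.*}"      # text before the first dot
    if [ "$first" = "1" ]; then
        local rest="${version#1.}"    # drop the leading "1."
        echo "${rest%%.*}"
    else
        echo "$first"
    fi
}

# java -version writes to stderr, so redirect it before parsing.
# This part only runs when a java binary is on the PATH.
if command -v java >/dev/null 2>&1; then
    installed="$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')"
    echo "Detected Java major version: $(java_major_version "$installed")"
fi
```

With the Java build shown above, the helper should report 8.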

Step #2: Install Scala

In the second step, install Scala.

[root@sparkCentOs pawel] wget
[root@sparkCentOs pawel] tar xvf scala-2.11.8.tgz
[root@sparkCentOs pawel] sudo mv scala-2.11.8 /usr/lib
[root@sparkCentOs pawel] sudo ln -s /usr/lib/scala-2.11.8 /usr/lib/scala
[root@sparkCentOs pawel] export PATH=$PATH:/usr/lib/scala/bin
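
Keep in mind that the export above only changes PATH for the current shell session. One possible way to make it permanent is to append the same line to a profile file such as ~/.bash_profile. The sketch below does that without adding the line twice; the function name add_to_path is hypothetical, and the choice of profile file is an assumption that depends on your shell setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append an "export PATH" line for a directory to a
# profile file, unless exactly that line is already present.
add_to_path() {
    local dir="$1"
    local profile="$2"
    local line="export PATH=\$PATH:$dir"
    touch "$profile"                           # make sure the file exists
    if ! grep -qxF -- "$line" "$profile"; then # -x whole line, -F fixed string
        echo "$line" >> "$profile"
    fi
}

# Example call (commented out so the sketch does not touch your real profile):
# add_to_path /usr/lib/scala/bin "$HOME/.bash_profile"
```

After running the example call, a new login shell (or source ~/.bash_profile) should pick up the Scala directory.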

Step #3: Install Apache Spark

Now download Apache Spark from the official website and install it on your machine.

# Download Spark
[root@sparkCentOs pawel] wget
[root@sparkCentOs pawel] tar xf spark-2.3.1-bin-hadoop2.7.tgz
[root@sparkCentOs pawel] mkdir /usr/local/spark
[root@sparkCentOs pawel] cp -r spark-2.3.1-bin-hadoop2.7/* /usr/local/spark
[root@sparkCentOs pawel] export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
[root@sparkCentOs pawel] export PATH=$PATH:$HOME/bin:/usr/local/spark/bin
[root@sparkCentOs pawel] source ~/.bash_profile
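
Before starting the shell, it can help to confirm that the copy into /usr/local/spark actually produced a usable layout. The following is a minimal sketch, assuming the standard binary distribution with a bin/spark-shell launcher and a jars directory; the function name check_spark_home is invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a directory looks like an unpacked
# Spark binary distribution (launcher script plus jars directory).
check_spark_home() {
    local spark_home="$1"
    if [ ! -x "$spark_home/bin/spark-shell" ]; then
        echo "missing or non-executable: $spark_home/bin/spark-shell" >&2
        return 1
    fi
    if [ ! -d "$spark_home/jars" ]; then
        echo "missing directory: $spark_home/jars" >&2
        return 1
    fi
    echo "Spark layout looks complete in $spark_home"
}

# Example call for the install location used in this post:
# check_spark_home /usr/local/spark
```

If the check fails, the most likely cause is that the archive was not fully extracted or the cp step was skipped.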

Step #4: Run Spark Shell

Please run Spark shell and verify if Spark is working correctly.

[root@sparkCentOs pawel]# spark-shell
2018-08-20 19:57:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://sparkCentOs:4040
Spark context available as 'sc' (master = local[*], app id = local-1534795057680).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

Let’s type some code 🙂

scala> val data = spark.sparkContext.parallelize(
    Seq("I like Spark", "Spark is awesome",
    "My first Spark job is working now and is counting these words"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val wordCounts = data.flatMap(row => row.split(" ")).
        map(word => (word, 1)).reduceByKey(_ + _)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:25
scala> wordCounts.foreach(println)
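
The post does not show what the last command prints. Spark prints one (word,count) pair per line, in no guaranteed order. As a rough cross-check outside Spark, the same three sentences can be counted with standard shell tools. This is not how Spark computes the result, only a way to see which counts to expect; the function name count_words is made up for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read text on stdin and print "word count" pairs.
count_words() {
    # Put one word per line, sort so equal words are adjacent, count them,
    # then swap the columns so the word comes first.
    tr -s ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
}

# The same three sentences used in the Spark shell session.
printf '%s\n' \
    "I like Spark" \
    "Spark is awesome" \
    "My first Spark job is working now and is counting these words" | count_words
```

In this input, "Spark" and "is" each occur three times and every other word occurs once, so the Spark job should report the same counts.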

If you enjoyed this post, please leave a comment below or share it on Facebook, Twitter, LinkedIn, or another social media site. Thanks in advance!