[SOLVED] Apache Spark Check If The File Exists On HDFS? – 1 Super Quick Solution!

Apache Spark Check if the file exists on HDFS? - 1 min solution
Share this post and Earn Free Points!

We will use the FileSystem and Path classes from the org.apache.hadoop.fs library to achieve: Apache Spark Check if the file exists on HDFS.

In this post I will present you how to check if file exists on HDFS using PySpark, Spark with various versions.

Introduction

Apache Spark

Apache Spark is a data processing engine that is designed to handle both batch and streaming workloads. It is fast, flexible, and has a wide range of libraries and APIs for various use cases such as SQL, machine learning, and graph processing.

One of its main features is in-memory processing, which allows it to perform computations faster than traditional MapReduce programs. It also has a distributed architecture that enables it to scale out to large clusters of machines and process large datasets efficiently. Spark has a rich API in various programming languages like Python, Scala, Java, and R and integrates with many popular storage systems and databases like HDFS, S3, Cassandra, and HBase.

In summary, Apache Spark is a powerful tool for data processing and analysis and is widely used in various industries and academia for applications like data pipelines, stream processing, and machine learning.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and fault-tolerant file system that is designed to store very large data sets. It is an open-source file system that is part of the Apache Hadoop project and is commonly used for storing and processing big data in a distributed computing environment.

HDFS is based on the Google File System (GFS) and is designed to run on commodity hardware, which makes it a cost-effective option for storing large amounts of data. It stores data in a distributed manner across a cluster of machines, which allows it to scale out as the data size increases. It also has built-in fault tolerance and can recover from hardware failures without losing data.

HDFS is well suited for storing large files that are written once and read many times, such as log files, scientific data, and multimedia content. It is commonly used in conjunction with other tools in the Hadoop ecosystem, such as MapReduce, Pig, and Hive, to process and analyze large data sets.

Check If The File Exists On HDFS

PySpark Check If File Exists In HDFS

To check if a file exists in HDFS using PySpark, you can use the exists method of the SparkSession object. Here is an example of how to do it:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Set the HDFS path to the file you want to check
hdfs_path = "hdfs:///path/to/file.txt"

# Use the exists method to check if the file exists
if spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")+hdfs_path).exists(hdfs_path):
    print("File exists")
else:
    print("File does not exist")

This will check if the file at the specified HDFS path exists and print a message accordingly.

Alternatively, you can also use the SparkContext object’s wholeTextFiles method to read the file and check if it is empty. Here is an example of how to do it:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext.getOrCreate()

# Set the HDFS path to the file you want to check
hdfs_path = "hdfs:///path/to/file.txt"

# Read the file as an RDD
rdd = sc.wholeTextFiles(hdfs_path)

# Check if the RDD is empty
if rdd.isEmpty():
    print("File does not exist or is empty")
else:
    print("File exists and is not empty")

This will read the file at the specified HDFS path and check if it is empty. If it is empty, it means that either the file does not exist or it is empty.

Spark 2.0 or higher

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {

  val spark = SparkSession.builder
    // I set master to local[*], because I run it on my local computer.
    // I production mode master will be set from spark-submit command.
    .master("local[*]")
    .appName("BigDataETL - Check if file exists")
    .getOrCreate()

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // This methods returns Boolean (true - if file exists, false - if file doesn't exist
  val fileExists = fs.exists(new Path("<parh_to_file>"))

  if (fileExists) println("File exists!")
  else println("File doesn't exist!")

}

//  (Apache Spark Check if the file exists on HDFS?)

 Since Spark 1.6 to 2.0

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {

  val sparkConf = new SparkConf().setAppName(s"BigDataETL - Check if file exists")
  val sc = new SparkContext(sparkConf)

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(sc.hadoopConfiguration)

  // This methods returns Boolean (true - if file exists, false - if file doesn't exist
  val fileExists = fs.exists(new Path("<parh_to_file>"))

  if (fileExists) println("File exists!")
  else println("File doesn't exist!")

}

Summary

That’s all about topic: Apache Spark Check if the file exists on HDFS.

I hope it helped you solve your problems. Enjoy!

Could You Please Share This Post? 
I appreciate It And Thank YOU! :)
Have A Nice Day!

How useful was this post?

Click on a star to rate it!

Average rating 4.8 / 5. Vote count: 762

No votes so far! Be the first to rate this post.

As you found this post useful...

Follow us on social media!

We are sorry that this post was not useful for you!

Let us improve this post!

Tell us how we can improve this post?