Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. It relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Problem
How do you check from Apache Spark whether a file exists on HDFS? We will use the FileSystem and Path classes from the org.apache.hadoop.fs package to achieve it.
Spark 2.0 or higher
package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {
  val spark = SparkSession.builder
    // Master is set to local[*] because this runs on a local machine.
    // In production mode the master will be set from the spark-submit command.
    .master("local[*]")
    .appName("BigDataETL - Check if file exists")
    .getOrCreate()

  // Create a FileSystem object from the Hadoop configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // This method returns a Boolean (true if the file exists, false if it doesn't)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}
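Note that fs.exists checks a single, literal path and does not expand wildcards. If you need to test whether any file matches a pattern, one option is FileSystem.globStatus. Below is a minimal sketch, reusing the fs object from the example above; the glob pattern is purely illustrative:

// A sketch, assuming a hypothetical glob pattern such as "/data/2023/*.csv".
// globStatus returns the matching FileStatus entries (possibly null or empty).
val matches = fs.globStatus(new Path("/data/2023/*.csv"))
val anyFileMatches = matches != null && matches.nonEmpty
if (anyFileMatches) println("At least one file matches!")
else println("No file matches!")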
Spark 1.6 to 2.0
package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {
  val sparkConf = new SparkConf().setAppName("BigDataETL - Check if file exists")
  val sc = new SparkContext(sparkConf)

  // Create a FileSystem object from the Hadoop configuration
  val fs = FileSystem.get(sc.hadoopConfiguration)

  // This method returns a Boolean (true if the file exists, false if it doesn't)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}
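In practice, the check is typically used to guard a read so the job handles a missing input gracefully instead of throwing an exception. A minimal usage sketch, reusing the fs and sc objects from the example above (the path is hypothetical):

// Hypothetical input path, for illustration only
val inputPath = "/data/input.txt"
if (fs.exists(new Path(inputPath))) {
  // Safe to read: the file is present on HDFS
  val lines = sc.textFile(inputPath)
  println(s"Line count: ${lines.count()}")
} else {
  println(s"Skipping: $inputPath does not exist")
}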
That’s all on the topic of checking from Apache Spark whether a file exists on HDFS. Enjoy!
If you enjoyed this post, please add a comment below and share it on Facebook, Twitter, LinkedIn, or other social media.
Thanks in advance!