In this short post I will show you how you can rename or delete a file in HDFS using Apache Spark.
Apache Spark: Rename a File in HDFS
package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {
  val spark = SparkSession.builder
    // I set master to local[*], because I run it on my local computer.
    // In production mode the master will be set from the spark-submit parameters.
    .master("local[*]")
    .appName("BigDataETL")
    .getOrCreate()

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // Base path where Spark will produce the output file
  val basePath = "/bigdata_etl/spark/output"
  val newFileName = "renamed_spark_output"
  // Change the file name from the Spark-generated one to the new one
  fs.rename(new Path(s"$basePath/part-00000"), new Path(s"$basePath/$newFileName"))
}
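Keep in mind that, depending on the Spark version and output format, the produced file is often not named exactly part-00000 (Spark commonly appends a UUID and an extension, e.g. part-00000-<uuid>.csv). Below is a minimal sketch of a more defensive variant, reusing the fs, basePath and newFileName values from the example above, which first looks the file up with globStatus:

// Find the real part file before renaming it
val partFiles = fs.globStatus(new Path(s"$basePath/part-*"))
partFiles.headOption.foreach { status =>
  fs.rename(status.getPath, new Path(s"$basePath/$newFileName"))
}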
Apache Spark: Remove a File / Files from HDFS
To delete files you can either use the Scala process DSL to run a shell command (I described another example in the post: Scala: how to run a shell command from the code level?) or use the FileSystem class from the org.apache.hadoop.fs package.
package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import scala.sys.process._

object SparkDeleteFile extends App {
  val spark = SparkSession.builder
    // I set master to local[*], because I run it on my local computer.
    // In production mode the master will be set from the spark-submit parameters.
    .master("local[*]")
    .appName("BigDataETL")
    .getOrCreate()

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // Delete a directory recursively using the FileSystem class
  fs.delete(new Path("/bigdata_etl/data"), true)
  // ... or do the same with the Scala process DSL
  "hdfs dfs -rm -r /bigdata_etl/data/".!

  // Delete a single file (the recursive flag is false for files)
  fs.delete(new Path("/bigdata_etl/data/file_to_delete.dat"), false)
  // ... or do the same with the Scala process DSL
  "hdfs dfs -rm /bigdata_etl/data/file_to_delete.dat".!
}
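Note that on HDFS, FileSystem.delete typically just returns false when the path does not exist rather than throwing. If you want to log what actually happened, you can guard the call with fs.exists. A short sketch (my own addition, reusing the fs object from the example above):

// Guarded delete: check existence first and inspect the returned flag
val target = new Path("/bigdata_etl/data/file_to_delete.dat")
if (fs.exists(target)) {
  val deleted = fs.delete(target, false) // false: do not recurse, it is a file
  println(s"Deleted $target: $deleted")
} else {
  println(s"$target does not exist, nothing to delete")
}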
Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This should include JVMs on x86_64 and ARM64. It’s easy to run locally on one machine: all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
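To match these requirements in a Scala project, here is a minimal build.sbt sketch (the exact version numbers are illustrative; pick the ones matching your cluster). Spark dependencies are usually marked "provided" because the cluster supplies them at runtime:

// build.sbt
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // Marked "provided": spark-submit puts the Spark jars on the classpath
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"
)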
https://spark.apache.org/docs/latest/
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/.
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Assumptions+and+Goals
That’s all on the topic: Apache Spark rename or delete a file in HDFS. Enjoy! 🙂
If you enjoyed this post, please add a comment below and share it on Facebook, Twitter, LinkedIn, or other social media.
Thanks in advance!
I needed to create an HDFS directory (mkdir) in a Spark Scala application, and this really helped. Thanks!
Thanks John! I am very glad it has helped you! 🙂
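For anyone else looking for directory creation from Spark: a minimal sketch using the same FileSystem object as in the examples above (the path is just an illustration). FileSystem.mkdirs behaves like hdfs dfs -mkdir -p and creates missing parent directories as well:

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Creates the directory and any missing parents; returns true on success
fs.mkdirs(new Path("/bigdata_etl/new_directory"))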