Apache Spark Rename or Delete a File in HDFS – Great Example in 1 Minute?


In this short post I will show you how to rename or delete a file in HDFS using Apache Spark.

Apache Spark: Rename or Delete a File in HDFS

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {

  val spark = SparkSession.builder
    // Master is set to local[*] because this example runs on a local machine.
    // In production mode the master will be set from the spark-submit command.
    .master("local[*]")
    .appName("BigDataETL")
    .getOrCreate()

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // Base path where Spark produces its output files
  val basePath = "/bigdata_etl/spark/output"
  val newFileName = "renamed_spark_output"

  // Rename the generic Spark output file (part-00000) to the new name
  fs.rename(new Path(s"$basePath/part-00000"), new Path(s"$basePath/$newFileName"))

}
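Note: depending on the Spark version and output format, the produced file may be named like part-00000-<uuid>-c000... rather than plain part-00000. Below is a minimal sketch (the object name RenameWithGlob is my own) that uses FileSystem.globStatus to find the part file regardless of its suffix, and that checks the boolean result of rename, which returns false instead of throwing on failure.

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object RenameWithGlob extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BigDataETL")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  val basePath = "/bigdata_etl/spark/output"
  val newFileName = "renamed_spark_output"

  // Match any part file, regardless of the suffix Spark appends to the name
  val partFiles = fs.globStatus(new Path(s"$basePath/part-*"))

  if (partFiles.nonEmpty) {
    // rename returns false on failure instead of throwing, so check the result
    val renamed = fs.rename(partFiles.head.getPath, new Path(s"$basePath/$newFileName"))
    if (!renamed) println(s"Could not rename ${partFiles.head.getPath}")
  } else {
    println(s"No part files found under $basePath")
  }
}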

Apache Spark: Remove a File or Files from HDFS

You can remove files in two ways: run a shell command from the code level using the Scala process DSL (I described another example in the post: Scala: how to run a shell command from the code level?), or use the FileSystem class from the org.apache.hadoop.fs package.

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import scala.sys.process._

object SparkDeleteFile extends App {
  val spark = SparkSession.builder
    // Master is set to local[*] because this example runs on a local machine.
    // In production mode the master will be set from the spark-submit command.
    .master("local[*]")
    .appName("BigDataETL")
    .getOrCreate()

  // Create FileSystem object from Hadoop Configuration
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // Delete a directory recursively using the FileSystem class
  fs.delete(new Path("/bigdata_etl/data"), true)
  // Delete using the Scala process DSL
  "hdfs dfs -rm -r /bigdata_etl/data".!

  // Delete a single file (second argument: recursive = false)
  fs.delete(new Path("/bigdata_etl/data/file_to_delete.dat"), false)
  // Delete using the Scala process DSL
  "hdfs dfs -rm /bigdata_etl/data/file_to_delete.dat".!

}
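Keep in mind that FileSystem.delete returns false both when the delete fails and when the path does not exist. If you want to tell those cases apart, a small helper like the sketch below can check exists first (the SafeDelete object and its name are my own):

package com.bigdataetl

import org.apache.hadoop.fs.{FileSystem, Path}

object SafeDelete {
  // Delete the given path only if it exists, and report what happened
  def safeDelete(fs: FileSystem, pathStr: String, recursive: Boolean = false): Boolean = {
    val path = new Path(pathStr)
    if (fs.exists(path)) {
      fs.delete(path, recursive)
    } else {
      println(s"Path does not exist: $path")
      false
    }
  }
}

// Usage, with the fs object from the example above:
// SafeDelete.safeDelete(fs, "/bigdata_etl/data/file_to_delete.dat")
// SafeDelete.safeDelete(fs, "/bigdata_etl/data", recursive = true)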

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This should include JVMs on x86_64 and ARM64. It’s easy to run locally on one machine; all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).

https://spark.apache.org/docs/latest/
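If you want to build the examples above yourself, a minimal build.sbt matching the versions mentioned (Spark 3.2.1, Scala 2.12) could look like the sketch below; the project name and exact patch versions are my own choice, so adjust them to your cluster.

// build.sbt - a minimal sketch; project name and patch versions are examples
ThisBuild / scalaVersion := "2.12.15"

name := "bigdataetl-examples"

libraryDependencies ++= Seq(
  // spark-sql pulls in spark-core, the SparkSession API and the Hadoop client
  "org.apache.spark" %% "spark-sql" % "3.2.1" % Provided
)

The Provided scope assumes you submit the job with spark-submit, which already puts the Spark jars on the classpath; drop it if you want to run the examples locally from sbt.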


HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/.

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Assumptions+and+Goals
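One practical note for the examples above: FileSystem.get(spark.sparkContext.hadoopConfiguration) returns whatever fs.defaultFS points at, which in local mode may be the local file system rather than HDFS. If you need to target a specific HDFS cluster explicitly, you can pass its URI; a minimal sketch, where the namenode host and port are placeholders:

package com.bigdataetl

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object ExplicitHdfs extends App {
  // The namenode host and port below are placeholders - use your cluster's values
  val conf = new Configuration()
  val fs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf)
  println(fs.getUri)
}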

That’s all about the topic: Apache Spark rename or delete a file in HDFS. Enjoy! 🙂


If you enjoyed this post, please add a comment below and share it on Facebook, Twitter, LinkedIn, or another social media page.
Thanks in advance!

