The withColumn method has been part of the Spark Dataset API since version 2.0.0. It returns a new DataFrame or Dataset that includes a new column, or replaces an existing column of the same name.

Spark withColumn DataFrame

The column expression must refer only to attributes supplied by this DataFrame or Dataset. It is an error to add a column that refers to some other DataFrame or Dataset.

The withColumn() method introduces an internal projection. As a result, calling it multiple times, for instance in a loop to add several columns, can build large query plans, causing performance problems and potentially a StackOverflowException. To avoid this, use select() with multiple columns at once.
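To illustrate the point above, here is a minimal sketch (session and data setup mirror the example below) that adds several columns in a single select() projection instead of stacking withColumn() calls:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val carsDf = Seq(("Ford Torino", 140, 3449, "US"))
  .toDF("car", "horsepower", "weight", "origin")

// One select() adds both new columns in a single projection,
// instead of two stacked withColumn() projections:
val enriched = carsDf.select(
  col("*"),
  (col("horsepower") * lit(0.7457)).as("kilowatt_power"),
  lit("unknown").as("city")
)
```

This keeps the query plan flat no matter how many columns are appended.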

Spark withColumn() is a transformation: Spark waits until an action is called before executing the logic. The method takes two parameters, the column name (colName, a String) and the column expression (col, a Column):

def withColumn(colName: String, col: Column): DataFrame

Example Data

To present some examples, first we must prepare a DataFrame that we will be changing during this tutorial. So let’s create the well-known DataFrame with some data about cars.

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("BigData-ETL.com")
    .getOrCreate()

  import spark.implicits._

  val carsData = Seq(
    ("Ford Torino", 140, 3449, "US"),
    ("Chevrolet Monte Carlo", 150, 3761, "US"),
    ("BMW 2002", 113, 2234, "Europe")
  )
  val columns = Seq("car", "horsepower", "weight", "origin")
  val carsDf = carsData.toDF(columns: _*)

Spark Replace Column Value In DataFrame

The first example replaces an existing column in a DataFrame. The withColumn() method can be used to update the value of an existing column.

Let’s update the weight column and set it to 2000 for all records in the DataFrame.

  carsDf.withColumn("weight", lit(2000))
    .show()
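A common variation, not shown in the snippet above, is a conditional update: replace the value only for rows matching a predicate, and keep the old value otherwise. A sketch using when/otherwise:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val carsDf = Seq(
  ("Ford Torino", 140, 3449, "US"),
  ("BMW 2002", 113, 2234, "Europe")
).toDF("car", "horsepower", "weight", "origin")

// Overwrite weight only for US cars; other rows keep their old value
val updated = carsDf.withColumn(
  "weight",
  when(col("origin") === "US", lit(2000)).otherwise(col("weight"))
)
```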

Spark Add New Column To DataFrame

To add a new column to a DataFrame we can also use the withColumn() method. To create a new column in a DataFrame or Dataset we pass the column name (colName) as the first argument and the column expression as the second.

Let’s create two new columns in the DataFrame, called city and continent, and set the value unknown for both:

  carsDf
    .withColumn("city", lit("unknown"))
    .withColumn("continent", lit("unknown"))
    .show()

Spark Acquire New Column Based On Existing Column In DataFrame

To create a new column that derives its values from another column in the DataFrame, we again use the withColumn() function. It’s a very easy and very common operation.

Let’s create a new column named kilowatt_power, which derives from the horsepower column:

  carsDf
    .withColumn("kilowatt_power", col("horsepower") * lit(0.7457))
    .show()

It gives the following result:

+--------------------+----------+------+------+------------------+
|                 car|horsepower|weight|origin|    kilowatt_power|
+--------------------+----------+------+------+------------------+
|         Ford Torino|       140|  3449|    US|104.39800000000001|
|Chevrolet Monte C...|       150|  3761|    US|           111.855|
|            BMW 2002|       113|  2234|Europe|           84.2641|
+--------------------+----------+------+------+------------------+

Spark Change Column Data Type In DataFrame

The withColumn function can also be used to change a column’s data type in a DataFrame. In SQL we often write:

SELECT cast(weight as int) as weight_as_int from some_table...

In Spark we can accomplish the same result using the pure Spark API, by writing the expression in the withColumn method. To change a column’s data type from Integer to Double we can write:

  carsDf
    .withColumn("horsepower", col("horsepower").cast(DoubleType))
    .show()
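The SQL-style cast from above can also be embedded directly in the DataFrame API via expr(), which parses a SQL expression string. A small sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val carsDf = Seq(("Ford Torino", 140, 3449, "US"))
  .toDF("car", "horsepower", "origin")

// expr() parses a SQL expression, so the SQL-style cast works unchanged
val casted = carsDf.withColumn("horsepower", expr("cast(horsepower as double)"))
```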

Spark Rename Column In DataFrame

To rename an existing column in a Spark DataFrame you can use the built-in method withColumnRenamed, which returns a new DataFrame or Dataset with the column renamed. This is a no-op if the schema doesn’t contain the existing name. The method takes two parameters, both of String type: the existing name and the new name.

In the following example we want to rename the column car to car_name:

  carsDf
    .withColumnRenamed("car", "car_name")
    .printSchema()

The new schema of DataFrame is:

root
 |-- car_name: string (nullable = true)
 |-- horsepower: integer (nullable = false)
 |-- weight: integer (nullable = false)
 |-- origin: string (nullable = true)
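withColumnRenamed renames one column at a time; to rename several columns, one option is to fold a list of name pairs over the DataFrame. This is a small helper pattern of my own, not a built-in API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val carsDf = Seq(("Ford Torino", 140, 3449, "US"))
  .toDF("car", "horsepower", "weight", "origin")

// Apply withColumnRenamed once per (old -> new) pair
val renames = Seq("car" -> "car_name", "horsepower" -> "hp")
val renamed = renames.foldLeft(carsDf) {
  case (df, (from, to)) => df.withColumnRenamed(from, to)
}
```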

Spark Update, Replace or Add Multiple Columns In DataFrame

To change multiple columns of a DataFrame at once you can use the withColumns method from the Spark API (available since Spark 3.3.0), or create a temporary view of the table in the Spark context and then access the columns with pure SQL syntax.

In the following examples I will present both ways; it’s up to you which one you prefer. The most important thing: whichever you choose, the performance is the same, because both are analysed by the Spark Catalyst optimizer.

First example with Spark API and withColumns function:

  val colsMap = Map(
    "kilowatt_power" -> col("horsepower") * lit(0.7457),
    "double_kilowatt_power" -> col("horsepower") * lit(0.7457 * 2)
  )

  carsDf
    .withColumns(colsMap)
    .show()

The second example which uses the SQL and Spark temporary views:

  carsDf.createOrReplaceTempView("cars")
  spark
    .sql(
      "SELECT horsepower * 0.7457 as kilowatt_power, horsepower * 0.7457 * 2 as double_kilowatt_power FROM cars"
    )
    .show()

Spark Delete Column From DataFrame

To delete or drop column from DataFrame or DataSet you should use the drop method from Spark API.

The drop() method returns a new Dataset with the column dropped. This is a no-op if the schema doesn’t contain the column name. The method can only be used to drop top-level columns; the colName string is treated literally, without further interpretation.

In the following example we will drop the car column from the carsDf DataFrame:

  carsDf
    .drop("car")
    .show()
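drop() also has an overload that takes a Column reference instead of a name, which is handy when you already hold a Column object. A short sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val carsDf = Seq(("Ford Torino", 140, 3449, "US"))
  .toDF("car", "horsepower", "weight", "origin")

// drop(col: Column) removes the referenced column
val dropped = carsDf.drop(col("car"))
```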

Spark Drop Multiple Columns From DataFrame

And now the last example: how to drop multiple columns from an existing DataFrame or Dataset:

  carsDf
    .drop("car", "horsepower")
    .show()


Full Code

package com.bigdataetl.sparktutorial.sql

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

object SparkWithColumn extends App {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("BigData-ETL.com")
    .getOrCreate()

  import spark.implicits._

  val carsData = Seq(
    ("Ford Torino", 140, 3449, "US"),
    ("Chevrolet Monte Carlo", 150, 3761, "US"),
    ("BMW 2002", 113, 2234, "Europe")
  )
  val columns = Seq("car", "horsepower", "weight", "origin")
  val carsDf = carsData.toDF(columns: _*)

  println("Replace Column Value In DataFrame")
  carsDf
    .withColumn("weight", lit(2000))
    .show()

  println("Add New Column To DataFrame")
  carsDf
    .withColumn("city", lit("unknown"))
    .withColumn("continent", lit("unknown"))
    .show()

  println("Acquire New Column Based On Existing Column In DataFrame")
  carsDf
    .withColumn("kilowatt_power", col("horsepower") * lit(0.7457))
    .show()

  println("Change Column Data Type In DataFrame")
  carsDf
    .withColumn("horsepower", col("horsepower").cast(DoubleType))
    .show()

  println("Spark Rename Column In DataFrame")
  carsDf
    .withColumnRenamed("car", "car_name")
    .printSchema()

  println("Update, Replace or Add Multiple Columns In DataFrame")
  val colsMap = Map(
    "kilowatt_power" -> col("horsepower") * lit(0.7457),
    "double_kilowatt_power" -> col("horsepower") * lit(0.7457 * 2)
  )

  carsDf
    .withColumns(colsMap)
    .show()

  carsDf.createOrReplaceTempView("cars")
  spark
    .sql(
      "SELECT horsepower * 0.7457 as kilowatt_power, horsepower * 0.7457 * 2 as double_kilowatt_power FROM cars"
    )
    .show()

  println("Spark Delete Column From DataFrame")
  carsDf
    .drop("car")
    .show()
}

GitLab Repository

As usual, please find the full code on our GitLab repository!

Summary

In this tutorial we walked through seven everyday uses of the Spark withColumn method: replacing the values of an existing column, adding new columns, deriving a column from an existing one, changing a column’s data type, renaming columns with withColumnRenamed, updating multiple columns at once with withColumns or plain SQL, and dropping columns with drop. Keep in mind that chaining withColumn many times builds up large query plans, so prefer select() or withColumns() when adding many columns at once.
