In this short Apache Spark tutorial, I will show you how to use the DataFrame API to improve the performance of a Spark application when loading large, semi-structured data sets such as CSV, XML, and JSON.
Apache Spark Tutorial
- Define Static Schema of Data
- #1 Tip: Disable the InferSchema Option And Use a Defined Schema
- #2 Tip: Store Schemas Outside Application Code
- #3 Tip: Use External Tables in Hive
Define Static Schema of Data
In the Spark DataFrame API, you can define a static data schema. To do this, you provide a StructType object that contains a list of StructField objects. Below are two ways to define the same data schema.
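import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// The first way: build the schema programmatically
val schema = new StructType()
  .add("id", IntegerType, false)
  .add("name", StringType, true)
  .add("surname", StringType, true)
  .add("age", IntegerType, true)

// The second way: parse the schema from a DDL string
val schemaFromDDL = StructType.fromDDL("id integer, name string, surname string, age integer")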
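One detail worth checking is that the two definitions above are not strictly identical: the programmatic version marks id as non-nullable, while a DDL string without a NOT NULL clause yields nullable fields. The sketch below prints both schemas so the difference is visible. It assumes only that spark-sql is on the classpath; the object name SchemaNullability is made up for this illustration.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SchemaNullability {
  // Programmatic definition: "id" is explicitly non-nullable.
  val built: StructType = new StructType()
    .add("id", IntegerType, nullable = false)
    .add("name", StringType, nullable = true)
    .add("surname", StringType, nullable = true)
    .add("age", IntegerType, nullable = true)

  // DDL definition: no NOT NULL clause, so every field comes back nullable.
  val parsed: StructType =
    StructType.fromDDL("id integer, name string, surname string, age integer")

  def main(args: Array[String]): Unit = {
    // Print both trees to compare the nullable flags side by side.
    println(built.treeString)
    println(parsed.treeString)
  }
}
```

If the nullability flags matter for your pipeline, the programmatic form is the more explicit of the two.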
#1 Tip: Disable the InferSchema Option And Use a Defined Schema
The following snippet uses the schema defined above and shows how to disable the inferSchema option. With inferSchema disabled and an explicit schema supplied, Spark does not scan the data set to deduce its schema; it uses only the one you provide.
// CSV file
val csvData = spark.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(schema)
  .load("/path/to/csv_data")

// JSON file
val jsonData = spark.read.schema(schema).json("/path/to/json_data")

// XML file (requires the external spark-xml package)
val xmlData = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .schema(schema)
  .load("/path/to/xml_data")
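To try the explicit-schema read without a cluster, a small local experiment can help. The following is a minimal sketch, assuming Spark runs in local mode; the object name ExplicitSchemaRead, the helper readSample, the file name people.csv, and the two sample rows are all invented for this example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object ExplicitSchemaRead {
  val schema: StructType =
    StructType.fromDDL("id integer, name string, surname string, age integer")

  // Writes two made-up rows to a temporary directory and reads them back
  // with the explicit schema, so Spark does not have to infer column types.
  def readSample(spark: SparkSession): DataFrame = {
    val dir = Files.createTempDirectory("csv_data")
    val rows = "1,John,Smith,34\n2,Anna,Nowak,29\n"
    Files.write(dir.resolve("people.csv"), rows.getBytes(StandardCharsets.UTF_8))

    spark.read
      .format("csv")
      .option("header", "false")
      .schema(schema)
      .load(dir.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explicit-schema-read")
      .getOrCreate()
    val df = readSample(spark)
    df.printSchema() // column names and types come from the supplied schema
    df.show()
    spark.stop()
  }
}
```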
#2 Tip: Store Schemas Outside Application Code
A good practice is to keep schemas out of your application code. Otherwise, whenever the data schema changes, you must modify the code, build a new package, and deploy it. Below I show how you can store schemas in a file on HDFS and in a Hive table.
Loading a Schema From a File on HDFS
import org.apache.spark.sql.types.StructType

// Load the schema (a DDL string on the first line of the file) from HDFS
val schemaAsTextFromFile = spark.sparkContext.textFile("/data/schemas/schema_from_hdfs.txt").first()
val schema = StructType.fromDDL(schemaAsTextFromFile)
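A DDL string is not the only text form a schema can take. As a possible variation, the sketch below stores the schema as JSON (via the json method on StructType) and restores it with DataType.fromJson, which also keeps the nullable flags. It is only an illustration: a local temporary file stands in for the HDFS file, and the names SchemaFileRoundTrip, save, and load are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructType}

object SchemaFileRoundTrip {
  // Serialize the schema to its JSON representation and write it to a file.
  def save(schema: StructType, file: Path): Unit = {
    Files.write(file, schema.json.getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Read the JSON back and rebuild the StructType.
  def load(file: Path): StructType = {
    val json = new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
    DataType.fromJson(json).asInstanceOf[StructType]
  }

  def main(args: Array[String]): Unit = {
    val schema = new StructType()
      .add("id", IntegerType, nullable = false)
      .add("name", StringType, nullable = true)

    // A local temp file is used here in place of a file on HDFS (assumption).
    val file = Files.createTempFile("schema", ".json")
    save(schema, file)
    println(load(file).treeString)
  }
}
```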
Loading a Schema From a Hive Table
First, we create a new database and a table in which we will store the schemas.
CREATE DATABASE bigdata_etl;

CREATE TABLE bigdata_etl.schemas (
  id int,
  dataset_name string,
  schema string
)
STORED AS ORC;
Insert a row into the Hive table with the specified data schema.
-- Insert into table
INSERT INTO bigdata_etl.schemas (id, dataset_name, schema)
VALUES (1, 'Test_dataset', 'id integer, name string, surname string, age integer');
Now, in the Spark application code, we can load the table from Hive, filter the row that holds the schema for our data set, and use it to build a StructType object, which we can then use when loading the data.
import org.apache.spark.sql.types.StructType

// Requires a SparkSession with Hive support enabled
val schemasFromHive = spark.table("bigdata_etl.schemas")
val dataSetSchema = schemasFromHive
  .filter(schemasFromHive("dataset_name") === "Test_dataset")
  .select("schema")
  .head
  .getString(0)
val schema = StructType.fromDDL(dataSetSchema)
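Calling head on an empty result fails with an exception, so it can be worth wrapping the lookup in a small helper that returns an Option. The following is a minimal sketch under a few assumptions: a temporary view named schemas stands in for the Hive table so the example runs locally, and the names SchemaRegistry and lookupSchema are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object SchemaRegistry {
  // Returns None when no row matches, instead of throwing from head.
  def lookupSchema(spark: SparkSession, table: String, datasetName: String): Option[StructType] =
    spark.table(table)
      .filter(col("dataset_name") === datasetName)
      .select("schema")
      .take(1)
      .headOption
      .map(row => StructType.fromDDL(row.getString(0)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-registry")
      .getOrCreate()
    import spark.implicits._

    // Temporary view used here in place of the Hive table (assumption).
    Seq((1, "Test_dataset", "id integer, name string, surname string, age integer"))
      .toDF("id", "dataset_name", "schema")
      .createOrReplaceTempView("schemas")

    println(lookupSchema(spark, "schemas", "Test_dataset").map(_.treeString))
    println(lookupSchema(spark, "schemas", "unknown_dataset"))
    spark.stop()
  }
}
```

With a real Hive table, the same helper could be called with "bigdata_etl.schemas" as the table name, provided the session has Hive support enabled.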
#3 Tip: Use External Tables in Hive
You can also use an external table in Hive to improve the execution time of a Spark application when reading data from files. The following example creates an external table in Hive. The LOCATION parameter is key: it points to the HDFS directory that holds the data (in CSV format in this example). Because the schema is defined in the Hive table, Spark will not attempt to infer the schema from the files stored in that location.
CREATE EXTERNAL TABLE bigdata_etl.test_dataset (
  id int,
  name string,
  surname string,
  age int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/csv_data';
Then, in the Spark code, you can simply read the data from that table.
val testDataSetDF = spark.table("bigdata_etl.test_dataset")
In this tutorial, I have shown how you can improve your code when using the DataFrame API. I borrowed some ideas from my friend's blog, where a similar concept was presented for users of the Python API. You can find that post here: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/