How to use the DataFrame API in Spark efficiently when loading data?

In this short tutorial I will show you how to use the DataFrame API to increase the performance of a Spark application when loading large, semi-structured data sets such as CSV, XML and JSON.

General assumption: Define a static schema for the data

In the Spark DataFrame API you can define a static data schema. To do this, you provide a StructType object that contains a list of StructField objects. Below are two ways in which the data schema can be defined.

import org.apache.spark.sql.types._

// The first way
val schema = new StructType()
  .add("id", IntegerType, false)
  .add("name", StringType, true)
  .add("surname", StringType, true)
  .add("age", IntegerType, true)

// Or the second way
val schema = StructType.fromDDL("id integer, name string, surname string, age integer")
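As a quick sanity check, you can print the schema to confirm it is what you expect. A minimal sketch (printTreeString is standard; toDDL assumes Spark 2.4+):

// Inspect the schema defined above
schema.printTreeString()  // tree view of the fields, their types and nullability
println(schema.toDDL)     // DDL form, e.g. id INT, name STRING, surname STRING, age INT

Note that a plain DDL string does not carry the nullable flag, so the StructType.fromDDL variant marks every column as nullable, unlike the explicit false for id in the first variant.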

#1 Tip: Disable the InferSchema option and use a defined schema

In the following example you can find a snippet of code that uses the previously created data schema and shows how to disable the inferSchema option. With inferSchema disabled, Spark will not scan the data set to deduce its schema; it will use only the one you provide.

CSV file
val data = spark.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(schema)
  .load("/path/to/csv_data")

JSON file
val data = spark.read.schema(schema).json("/path/to/json_data")

XML file (using the spark-xml package)
val data = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .schema(schema)
  .load("/path/to/xml_data")
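If you want to see what schema inference actually costs, you can time both variants yourself. A minimal sketch, assuming a spark-shell session and a reasonably large CSV under the same hypothetical path as above (spark.time is available on SparkSession since Spark 2.1):

// Inference forces an extra pass over the data before the actual read
spark.time {
  spark.read.option("inferSchema", "true").csv("/path/to/csv_data").count()
}

// With an explicit schema the scan starts immediately
spark.time {
  spark.read.schema(schema).csv("/path/to/csv_data").count()
}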

#2 Tip: Store schemas outside application code

A good approach is not to store schemas directly in your code. If the data schema changes, you have to modify your code, build a new package and deploy it. In this tutorial I'll show you how you can store schemas in files on HDFS and in a Hive table.

Loading a schema from a file on HDFS

// Load the schema from a file on HDFS
val schemaAsTextFromFile = spark.sparkContext.textFile("/data/schemas/schema_from_hdfs.txt").first()
val schema = StructType.fromDDL(schemaAsTextFromFile)
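For completeness, here is one way such a schema file could be produced in the first place; the single-partition write is an assumption for illustration, and saveAsTextFile creates a directory, which textFile reads back just as well:

// One-off step: persist the schema's DDL string to HDFS so applications can load it later
val ddl = schema.toDDL  // e.g. id INT, name STRING, surname STRING, age INT (Spark 2.4+)
spark.sparkContext.parallelize(Seq(ddl), 1).saveAsTextFile("/data/schemas/schema_from_hdfs.txt")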

Loading a schema from a Hive table

First, we create a new database and a table in which we will store the schemas.

CREATE DATABASE bigdata_etl;

CREATE TABLE bigdata_etl.schemas (
  `id` int,
  `dataset_name` string,
  `schema` string
)
STORED AS ORC;

Insert a row into the Hive table with the specified data schema.

-- Insert into table
INSERT INTO bigdata_etl.schemas (id, dataset_name, `schema`) VALUES (1, 'test_dataset', 'id integer, name string, surname string, age integer');

Now, in the Spark application code, we can load the data from the Hive table, filter out the row containing the schema of the data set we are interested in and, based on it, create a StructType object, which we can then use when loading the data.

val schemasFromHive = spark.table("bigdata_etl.schemas")
val dataSetSchema = schemasFromHive.filter(schemasFromHive("dataset_name") === "test_dataset").select("schema").head.mkString
val schema = StructType.fromDDL(dataSetSchema)
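The schema loaded from Hive can then be used exactly like the hard-coded one, for example when reading the CSV data from earlier (a minimal usage sketch):

// Read the CSV with the schema fetched from the Hive table; no inference pass is needed
val data = spark.read
  .option("header", "false")
  .schema(schema)
  .csv("/path/to/csv_data")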

#3 Tip: Use external tables in Hive

You can also use an external table in Hive to improve the execution time of a Spark application when reading data from files. The following example creates an external table in Hive. The LOCATION parameter is key: it points to the place on HDFS where the data (CSV in this example) is stored. Because the schema is defined on the Hive table, Spark will not attempt to infer it from the files stored in that location.

CREATE EXTERNAL TABLE bigdata_etl.test_dataset (
  `id` int,
  `name` string,
  `surname` string,
  `age` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/csv_data';

Then, in the Spark code, you can simply read the data from that table.

val testDataSetDF = spark.table("bigdata_etl.test_dataset")
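To confirm that the schema is served by the Hive metastore rather than inferred from the files, you can print it; the output below is what I would expect for this table definition:

testDataSetDF.printSchema()
// root
//  |-- id: integer (nullable = true)
//  |-- name: string (nullable = true)
//  |-- surname: string (nullable = true)
//  |-- age: integer (nullable = true)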

Summary

In this tutorial I've shown you how you can improve your code when you use the DataFrame API. Some of the ideas here come from a friend's blog, where a similar concept was presented for people who use the Python API. You can find that post here: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/

If you liked this post, please leave a comment below and share it on Facebook, Twitter, LinkedIn or any other social media site.
Thanks in advance!
