In this short tutorial I will show you how to use the DataFrame API to improve the performance of a Spark application when loading large, semi-structured data sets such as CSV, XML and JSON.
[General assumption] Define a static data schema
In the Spark DataFrame API, you can define a static data schema. To do this, you provide an object of class StructType that contains a list of StructField objects. Below are two ways to define the schema.
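import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// The first way: build the schema field by field (name, type, nullable)
val schema = new StructType()
  .add("id", IntegerType, false)
  .add("name", StringType, true)
  .add("surname", StringType, true)
  .add("age", IntegerType, true)

// or the second way (use one or the other): parse the schema from a DDL string
val schema = StructType.fromDDL("id integer, name string, surname string, age integer")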
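The two definitions above are close but not identical. The sketch below compares them, assuming a Spark version that provides StructType.fromDDL; the object name SchemaDefinitionSketch is made up for illustration. It shows that field names and types match, while nullability differs: fields parsed from a DDL string without a NOT NULL constraint come out nullable.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical object name, used only for this illustration.
object SchemaDefinitionSketch {
  // Programmatic definition: nullability is set explicitly for each field.
  val programmatic: StructType = new StructType()
    .add("id", IntegerType, nullable = false)
    .add("name", StringType, nullable = true)
    .add("surname", StringType, nullable = true)
    .add("age", IntegerType, nullable = true)

  // DDL definition: the same columns parsed from a string.
  val fromDdl: StructType =
    StructType.fromDDL("id integer, name string, surname string, age integer")

  def main(args: Array[String]): Unit = {
    programmatic.printTreeString()
    fromDdl.printTreeString()

    // Compare column names and data types, ignoring nullability.
    val sameNames = programmatic.fieldNames.sameElements(fromDdl.fieldNames)
    val sameTypes = programmatic.fields.map(_.dataType).sameElements(fromDdl.fields.map(_.dataType))
    println(s"same names: $sameNames, same types: $sameTypes")

    // Compare the nullability of the "id" column in both schemas.
    println(s"id nullable (programmatic): ${programmatic("id").nullable}")
    println(s"id nullable (DDL): ${fromDdl("id").nullable}")
  }
}
```

If a column must be declared non-nullable, the programmatic form makes that explicit in code; the DDL form is easier to keep outside the code as plain text.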
#1 Tip: Disable the inferSchema option and use a defined schema
The following snippet uses the previously created schema and shows how to disable the inferSchema option. With inferSchema disabled, Spark does not scan the data set to deduce its schema; it uses only the schema you provide.
// CSV file
val data = spark.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(schema)
  .load("/path/to/csv_data")

// JSON file
val data = spark.read
  .schema(schema)
  .json("/path/to/json_data")

// XML file (requires the spark-xml package; rowTag names the element that maps to one row)
val data = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .schema(schema) // pass a schema that matches the XML rows so Spark does not infer one
  .load("/path/to/xml_data")
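The CSV case can be tried end to end on a local machine. The following is a minimal sketch, not a production setup: the object name, the local[*] master and the two sample rows are made up for illustration, and the data is written to a temporary local file instead of HDFS.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical object name, used only for this illustration.
object ExplicitSchemaCsvSketch {
  val schema: StructType =
    StructType.fromDDL("id integer, name string, surname string, age integer")

  // Writes two made-up rows to a temporary file and returns its URI.
  def writeSampleCsv(): String = {
    val file = Files.createTempFile("people", ".csv")
    Files.write(file, "1,Jan,Kowalski,30\n2,Anna,Nowak,25\n".getBytes(StandardCharsets.UTF_8))
    file.toUri.toString
  }

  // Reads the CSV with the predefined schema, so no inference pass is needed.
  def readWithSchema(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .schema(schema)
      .load(path)

  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the session is usually provided.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explicit-schema-sketch")
      .getOrCreate()

    val data = readWithSchema(spark, writeSampleCsv())
    data.printSchema()
    data.show()

    spark.stop()
  }
}
```

The printed schema should carry the column names and types from the DDL string rather than generic string columns.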
#2 Tip: Store schemas outside application code
A good approach is not to store schemas directly in your code. If the schema lives in the code, every schema change means modifying the code, building a new package and deploying it. In this tutorial, I'll show you how to store schemas in files on HDFS and in a Hive table.
Loading a schema from a file on HDFS
// Load the schema from a file on HDFS (the first line holds the DDL string)
val schemaAsTextFromFile = spark.sparkContext.textFile("/data/schemas/schema_from_hdfs.txt").first()
val schema = StructType.fromDDL(schemaAsTextFromFile)
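The same idea can be wrapped in a small helper. The sketch below is one possible shape, under the assumption that the schema file holds a single DDL string on its first non-empty line; the names SchemaFromFileSketch and loadSchema are made up, and a temporary local file stands in for the HDFS path.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Hypothetical object and helper names, used only for this illustration.
object SchemaFromFileSketch {
  // Reads the first non-empty line of the file and parses it as a DDL schema.
  def loadSchema(spark: SparkSession, path: String): StructType = {
    val ddl = spark.sparkContext
      .textFile(path)
      .map(_.trim)
      .filter(_.nonEmpty)
      .first()
    StructType.fromDDL(ddl)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-from-file-sketch")
      .getOrCreate()

    // A temporary local file stands in for the schema file on HDFS.
    val file = Files.createTempFile("schema_from_hdfs", ".txt")
    Files.write(
      file,
      "id integer, name string, surname string, age integer\n".getBytes(StandardCharsets.UTF_8))

    val schema = loadSchema(spark, file.toUri.toString)
    schema.printTreeString()

    spark.stop()
  }
}
```

Because the helper takes a path, changing the schema only requires replacing the text file, not rebuilding the application.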
Loading a schema from a Hive table
First, we create a new database and a table in which we will store the schemas.
CREATE DATABASE bigdata_etl;

CREATE TABLE `bigdata_etl`.`schemas` (
  `id` int,
  `dataset_name` string,
  `schema` string
)
STORED AS ORC; -- sets the ORC SerDe together with the matching input and output formats
Insert a row into the Hive table with the specified data schema.
-- Insert into table
INSERT INTO bigdata_etl.schemas (id, dataset_name, `schema`)
VALUES (1, 'test_dataset', 'id integer, name string, surname string, age integer');
Now, in the Spark application code, we can load the data from the Hive table, filter for the row that contains the schema of interest, and use it to create a StructType object, which we can then pass in when loading the data.
val schemasFromHive = spark.sqlContext.table("bigdata_etl.schemas")
val dataSetSchema = schemasFromHive
  .filter(schemasFromHive("dataset_name") === "test_dataset")
  .select("schema")
  .head
  .getString(0) // the DDL string stored in the "schema" column
val schema = StructType.fromDDL(dataSetSchema)
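Calling head on an empty result fails with an exception, which happens when no row matches the data set name. The sketch below shows one way to make that case explicit by returning an Option. The names SchemaLookupSketch and lookupSchema are made up, and an in-memory DataFrame stands in for the Hive table so the example runs without a metastore.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical object and helper names, used only for this illustration.
object SchemaLookupSketch {
  // Returns the parsed schema for the data set, or None when no row matches.
  def lookupSchema(schemas: DataFrame, datasetName: String): Option[StructType] =
    schemas
      .filter(schemas("dataset_name") === datasetName)
      .select("schema")
      .take(1)
      .headOption
      .map(row => StructType.fromDDL(row.getString(0)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-lookup-sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the bigdata_etl.schemas Hive table.
    val schemas = Seq(
      (1, "test_dataset", "id integer, name string, surname string, age integer")
    ).toDF("id", "dataset_name", "schema")

    println(lookupSchema(schemas, "test_dataset").map(_.simpleString))
    println(lookupSchema(schemas, "unknown_dataset"))

    spark.stop()
  }
}
```

With a real metastore, the schemas argument would be the DataFrame returned by reading the Hive table, and the caller decides what to do when the result is None.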
#3 Tip: Use external tables in Hive
You can also use an external table in Hive to improve the execution time of a Spark application when reading data from files. The following example creates an external table in Hive. The LOCATION parameter is key: it specifies where on HDFS the data (in CSV format in this example) is stored. Because the schema is defined in the Hive table, Spark does not need to infer it from the files stored in that location.
CREATE EXTERNAL TABLE `bigdata_etl`.`test_dataset` (
  `id` int,
  `name` string,
  `surname` string,
  `age` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/csv_data';
Then, in the Spark code, you can simply read the data from that table.
val testDataSetDF = spark.sqlContext.table("bigdata_etl.test_dataset")
In this tutorial, I have shown how you can improve your code when using the DataFrame API. I borrowed some ideas from a friend's blog, where a similar concept was presented for users of the Python API. You can find that post here (https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/).