In this post I will show you how to save data in ORC, Parquet, Text and CSV formats in Hive, and a few ways you can export data from Hive to a CSV file. For this tutorial I have prepared a table “test_csv_data” with a few records.
Introduction
Hive is a data warehousing tool that runs on top of a Hadoop cluster. It provides a SQL-like interface for querying and managing large datasets stored in Hadoop’s distributed filesystem (HDFS).
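For reference, the table used throughout this post can be created with HiveQL like the sketch below. The schema (id, name, age) matches the one visible in the command output later in the post; the sample rows themselves are made up for illustration.

CREATE TABLE test_csv_data (
  id INT,
  name VARCHAR(20),
  age INT
);

-- Sample rows (illustrative values only)
INSERT INTO test_csv_data VALUES
  (1, 'Tom', 30),
  (2, 'Anna', 28),
  (3, 'John', 41);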
File Formats
ORC", Parquet", Text, and CSV are all file formats that can be used to store data in Hive". Each of these formats has its own advantages and disadvantages, and the choice of which format to use will depend on the specific needs of your application.
ORC (Optimized Row Columnar)
ORC" (Optimized Row Columnar) is a data storage format that is designed to store large amounts of data in a column-oriented manner, which can be more efficient than storing data in a row-oriented format. ORC" is often used in big data" environments, such as Hadoop", where it can be used to store and process large volumes of structured and semi-structured data.
One of the benefits of using ORC is that it allows for efficient querying and data processing, as data is stored in a columnar format, which can be more efficient for certain types of queries. ORC also includes features such as compression, predicate pushdown, and indexing, which can further improve query performance and reduce the amount of data that needs to be read and processed.
ORC" is a popular choice for storing data in Hadoop" and is supported by many tools and technologies in the Hadoop" ecosystem, including Hive", Pig, and Spark. It is also used in other big data" environments, such as Apache Impala" and Apache Drill.
Parquet
Apache Parquet" is a columnar storage format for Hadoop" and other distributed filesystems. It is designed to store and process large amounts of data efficiently, and is widely used in data warehousing and big data" analytics applications.
One of the main advantages of using Parquet is that it allows you to store and process data in a columnar format, which can be much more efficient than storing and processing data in a row-based format. This is because columnar storage allows you to access and process only the specific columns that you need, rather than reading and processing an entire row of data.
In addition to its efficiency, Parquet is also highly portable, as it has been designed to work with a variety of different data processing frameworks and programming languages, including Apache Spark, Apache Flink, and Python.
Overall, Parquet is a popular choice for storing and processing large datasets in distributed systems, and is widely used in data engineering and data science applications.
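Saving a table as Parquet works the same way, via STORED AS PARQUET. A minimal sketch (the table name is illustrative):

CREATE TABLE test_csv_data_parquet
STORED AS PARQUET
AS SELECT * FROM test_csv_data;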
Text
Text is a simple, flat-file format that stores data as plain text, with each line representing a record and fields within each record separated by a delimiter such as a comma or tab. It is easy to work with and can be used to store data of any type, but it is not as efficient as more specialized formats like ORC or Parquet.
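A text-backed Hive table is declared with ROW FORMAT DELIMITED and STORED AS TEXTFILE. A minimal sketch with tab-separated fields (column names taken from the sample table):

CREATE TABLE test_csv_data_text (
  id INT,
  name VARCHAR(20),
  age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;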
CSV (Comma Separated Values)
CSV (Comma Separated Values) is similar to the text format, but slightly more structured. It stores data as plain text, with each line representing a record and fields within each record separated by a comma. CSV is often used to store tabular data and is easy to work with, but it is not as efficient as more specialized formats like ORC or Parquet.
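For CSV data with quoting, Hive ships an OpenCSVSerde that handles separators, quotes and escapes. A minimal sketch (note that, as far as I know, this SerDe exposes every column as STRING regardless of the declared type):

CREATE TABLE test_csv_data_csv (
  id STRING,
  name STRING,
  age STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;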
Avro
Apache Avro" is a serialization format that is widely used in the Apache Hadoop" ecosystem and other big data" systems. It is a binary serialization format that uses a schema to specify the structure of the data being serialized. The schema is stored with the data, so Avro" can be used to deserialize the data without requiring access to the schema that was used to serialize it.
Avro" is designed to be compact, efficient, and easy to use. It is a popular choice for data serialization in big data" systems because it is fast and has a low overhead, making it well-suited for use in high-volume, data-intensive environments.
Avro" is implemented in a number of programming languages, including Java", C, C++, Python", and others. It can be used to serialize and deserialize data in a variety of formats, including JSON and CSV".
Save Data ORC Parquet Text CSV in Hive
Method 1: INSERT OVERWRITE LOCAL DIRECTORY…
Please find the HiveQL syntax below. I use the “INSERT OVERWRITE LOCAL DIRECTORY” statement to create a CSV file as the result of “SELECT * FROM test_csv_data”. With this method we have to execute the HiveQL using the hive or beeline command line, or Hue for instance.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data;
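If you drop the LOCAL keyword, Hive writes the result to a directory in HDFS instead of the local filesystem. A minimal sketch (the target path is illustrative):

INSERT OVERWRITE DIRECTORY '/tmp/bigdataetl/hive2csv_hdfs'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM test_csv_data;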
Command output:
19/12/20 08:19:33 INFO ql.Driver: Compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY "\n" SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Semantic Analysis Completed
19/12/20 08:19:34 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
19/12/20 08:19:34 INFO ql.Driver: Completed compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 0.108 seconds
19/12/20 08:19:34 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/12/20 08:19:34 INFO ql.Driver: Executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY "\n" SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Query ID = hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246
19/12/20 08:19:34 INFO ql.Driver: Total jobs = 1
19/12/20 08:19:34 INFO ql.Driver: Launching Job 1 out of 1
19/12/20 08:19:34 INFO ql.Driver: Starting task [Stage-1:MAPRED] in serial mode
19/12/20 08:19:34 INFO exec.Task: Number of reduce tasks is set to 0 since there's no reduce operator
19/12/20 08:19:34 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
19/12/20 08:19:34 INFO mr.ExecDriver: adding libjars: file:/usr/lib/hive/lib/hive-contrib.jar
19/12/20 08:19:34 ERROR mr.ExecDriver: yarn
19/12/20 08:19:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/12/20 08:19:34 INFO exec.Task: Starting Job = job_1576766377926_0012, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1576766377926_0012/
19/12/20 08:19:34 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1576766377926_0012
19/12/20 08:19:40 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
19/12/20 08:19:40 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
19/12/20 08:19:40 INFO exec.Task: 2019-12-20 08:19:40,838 Stage-1 map = 0%, reduce = 0%
19/12/20 08:19:47 INFO exec.Task: 2019-12-20 08:19:47,130 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec
19/12/20 08:19:48 INFO exec.Task: MapReduce Total cumulative CPU time: 1 seconds 600 msec
19/12/20 08:19:48 INFO exec.Task: Ended Job = job_1576766377926_0012
19/12/20 08:19:48 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/12/20 08:19:48 INFO exec.Task: Copying data to local directory /tmp/bigdataetl/hive2csv_insOverLocal from hdfs://quickstart.cloudera:8020/tmp/hive/cloudera/ee559ee2-25dd-41b7-bd65-a0767a7a03b2/hive_2019-12-20_08-19-33_928_8164278149162791118-1/-mr-10000
19/12/20 08:19:48 INFO ql.Driver: MapReduce Jobs Launched:
19/12/20 08:19:48 INFO ql.Driver: Stage-Stage-1: Map: 1 Cumulative CPU: 1.6 sec HDFS Read: 3712 HDFS Write: 70 SUCCESS
19/12/20 08:19:48 INFO ql.Driver: Total MapReduce CPU Time Spent: 1 seconds 600 msec
19/12/20 08:19:48 INFO ql.Driver: Completed executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 14.245 seconds
19/12/20 08:19:48 INFO ql.Driver: OK
Method 2: Use hive -e
With this method, we first have to create a shell/bash script which we will then execute. In short: we will execute a shell script which fetches data from Hive and writes the output to a file.
Create shell script
[root@quickstart bigdataetl]# vim hive2csv_hive.sh

# Put this command into the shell file
hive -e 'select * from test_csv_data' | sed 's/[\t]/,/g' > hive2csv_hive.csv
Execute the script and see the command output.
[root@quickstart bigdataetl]# ./hive2csv_hive.sh
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 1.161 seconds, Fetched: 8 row(s)
Method 3: Use beeline
beeline -u 'jdbc:hive2://localhost:10000/default' --outputformat=csv2 -f hive2csv_beeline.sql > hive2csv_beeline.csv
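Here beeline connects over JDBC, reads the query from hive2csv_beeline.sql and, thanks to --outputformat=csv2, prints the result set as comma-separated values, which the shell redirects to hive2csv_beeline.csv. Judging from the command output below, the query file contains the same simple select:

-- hive2csv_beeline.sql
select * from test_csv_data;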
Command output:
scan complete in 4ms
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.113 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c): select * from test_csv_data
INFO : Completed executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.002 seconds
INFO : OK
8 rows selected (0.371 seconds)
Closing: 0: jdbc:hive2://localhost:10000/default
Results
Below you can find the files generated under the /tmp/bigdataetl path:
[root@quickstart bigdataetl]# find "$(pwd)"
/tmp/bigdataetl
/tmp/bigdataetl/hive2csv_beeline.csv
/tmp/bigdataetl/hive2csv_insOverLocal
/tmp/bigdataetl/hive2csv_insOverLocal/000000_0
/tmp/bigdataetl/hive2csv_insOverLocal/.000000_0.crc
/tmp/bigdataetl/hive2csv_beeline.sql
/tmp/bigdataetl/hive2csv_hive.csv
/tmp/bigdataetl/hive2csv_hive.sh
Summary
In summary, each of these file formats can be used to store data in Hive, and the choice of which format to use will depend on the specific needs of your application. ORC and Parquet are generally more efficient and optimized for fast read and write performance, while Text and CSV are simpler formats that are easier to work with but may not be as efficient.
That’s all about how to save data in ORC, Parquet, Text and CSV formats in Hive!