How to save data from Hive (Parquet, Text, ORC) to a CSV file or any other file type?


In this post I will show you a few ways to export data from Hive to a CSV file. For this tutorial I have prepared a Hive table “test_csv_data” with a few records in it.

Method 1: INSERT OVERWRITE LOCAL DIRECTORY…

Please find the HiveQL below. I use the “INSERT OVERWRITE LOCAL DIRECTORY” syntax to create a CSV file from the result of “SELECT * FROM test_csv_data”. In this method we execute the HiveQL using the hive or beeline command line, or Hue, for instance.

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data;

Command output:

19/12/20 08:19:33 INFO ql.Driver: Compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Semantic Analysis Completed
19/12/20 08:19:34 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
19/12/20 08:19:34 INFO ql.Driver: Completed compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 0.108 seconds
19/12/20 08:19:34 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/12/20 08:19:34 INFO ql.Driver: Executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Query ID = hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246
19/12/20 08:19:34 INFO ql.Driver: Total jobs = 1
19/12/20 08:19:34 INFO ql.Driver: Launching Job 1 out of 1
19/12/20 08:19:34 INFO ql.Driver: Starting task [Stage-1:MAPRED] in serial mode
19/12/20 08:19:34 INFO exec.Task: Number of reduce tasks is set to 0 since there's no reduce operator
19/12/20 08:19:34 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
19/12/20 08:19:34 INFO mr.ExecDriver: adding libjars: file:/usr/lib/hive/lib/hive-contrib.jar
19/12/20 08:19:34 ERROR mr.ExecDriver: yarn
19/12/20 08:19:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/12/20 08:19:34 INFO exec.Task: Starting Job = job_1576766377926_0012, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1576766377926_0012/
19/12/20 08:19:34 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1576766377926_0012
19/12/20 08:19:40 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
19/12/20 08:19:40 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
19/12/20 08:19:40 INFO exec.Task: 2019-12-20 08:19:40,838 Stage-1 map = 0%,  reduce = 0%
19/12/20 08:19:47 INFO exec.Task: 2019-12-20 08:19:47,130 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.6 sec
19/12/20 08:19:48 INFO exec.Task: MapReduce Total cumulative CPU time: 1 seconds 600 msec
19/12/20 08:19:48 INFO exec.Task: Ended Job = job_1576766377926_0012
19/12/20 08:19:48 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/12/20 08:19:48 INFO exec.Task: Copying data to local directory /tmp/bigdataetl/hive2csv_insOverLocal from hdfs://quickstart.cloudera:8020/tmp/hive/cloudera/ee559ee2-25dd-41b7-bd65-a0767a7a03b2/hive_2019-12-20_08-19-33_928_8164278149162791118-1/-mr-10000
19/12/20 08:19:48 INFO ql.Driver: MapReduce Jobs Launched: 
19/12/20 08:19:48 INFO ql.Driver: Stage-Stage-1: Map: 1   Cumulative CPU: 1.6 sec   HDFS Read: 3712 HDFS Write: 70 SUCCESS
19/12/20 08:19:48 INFO ql.Driver: Total MapReduce CPU Time Spent: 1 seconds 600 msec
19/12/20 08:19:48 INFO ql.Driver: Completed executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 14.245 seconds
19/12/20 08:19:48 INFO ql.Driver: OK
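Note that this statement writes one part file per mapper into the target directory (here a single 000000_0). When a query runs with several mappers you can concatenate the part files into one CSV. The sketch below simulates that with stand-in files and illustrative paths; on a real cluster the part files come from Hive itself:

```shell
# Sketch: INSERT OVERWRITE LOCAL DIRECTORY writes one part file per mapper
# (000000_0, 000001_0, ...). These files are stand-ins created only to
# demonstrate the concatenation step.
out=/tmp/bigdataetl_demo/hive2csv_insOverLocal
mkdir -p "$out"
printf '1,Ann,30\n' > "$out/000000_0"
printf '2,Bob,25\n' > "$out/000001_0"

# Concatenate all part files (the hidden .crc checksums are skipped
# automatically, since the glob does not match dotfiles) into one CSV.
cat "$out"/0* > /tmp/bigdataetl_demo/hive2csv.csv
cat /tmp/bigdataetl_demo/hive2csv.csv
```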

Method 2: hive -e

With this method we first have to create a shell/bash script to execute. In short: the script fetches data from Hive and writes the output to a file.

Create shell script

[root@quickstart bigdataetl]# vim hive2csv_hive.sh

# Put this command into the shell script
hive -e 'select * from test_csv_data' | sed 's/[\t]/,/g'  > hive2csv_hive.csv

Execute the script and check the command output.

[root@quickstart bigdataetl]# ./hive2csv_hive.sh 
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 1.161 seconds, Fetched: 8 row(s)
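The hive CLI prints rows tab-separated, which is why the script pipes the output through sed to turn tabs into commas. A standalone sketch of that substitution (GNU sed; the input row is a made-up example, not real table data):

```shell
# Hive CLI output is tab-separated; sed rewrites each tab as a comma.
# GNU sed interprets \t inside the bracket expression as a tab.
printf '1\tAnn\t30\n' | sed 's/[\t]/,/g'
# -> 1,Ann,30
```

If you also want a header row in the CSV, Hive's hive.cli.print.header setting can be enabled in the query itself, e.g. hive -e 'set hive.cli.print.header=true; select * from test_csv_data'.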

Method 3: beeline

With beeline we connect to HiveServer2 over JDBC, set the output format to csv2 and redirect the query result to a file.

beeline -u 'jdbc:hive2://localhost:10000/default' --outputformat=csv2 -f hive2csv_beeline.sql > hive2csv_beeline.csv
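The -f flag points beeline at a SQL file. The post doesn't show the content of hive2csv_beeline.sql, but presumably it holds just the query; a minimal version could be created like this:

```shell
# Assumed content of hive2csv_beeline.sql (the file itself is not shown
# in the post): a single SELECT statement terminated by a semicolon.
mkdir -p /tmp/bigdataetl
echo 'SELECT * FROM test_csv_data;' > /tmp/bigdataetl/hive2csv_beeline.sql
cat /tmp/bigdataetl/hive2csv_beeline.sql
```

The csv2 output format prints a header row and quotes values only when they contain the delimiter, unlike the older csv format, which quotes every value.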

Command output:

scan complete in 4ms
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.113 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c): select * from test_csv_data
INFO  : Completed executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.002 seconds
INFO  : OK
8 rows selected (0.371 seconds)
Closing: 0: jdbc:hive2://localhost:10000/default

Results

Below you can find the generated files under the /tmp/bigdataetl path:

[root@quickstart bigdataetl]# find "$(pwd)"
/tmp/bigdataetl
/tmp/bigdataetl/hive2csv_beeline.csv
/tmp/bigdataetl/hive2csv_insOverLocal
/tmp/bigdataetl/hive2csv_insOverLocal/000000_0
/tmp/bigdataetl/hive2csv_insOverLocal/.000000_0.crc
/tmp/bigdataetl/hive2csv_beeline.sql
/tmp/bigdataetl/hive2csv_hive.csv
/tmp/bigdataetl/hive2csv_hive.sh

If you enjoyed this post, please leave a comment below or share it on Facebook, Twitter, LinkedIn or other social media.
Thanks in advance!
