How to Save Data as ORC, Parquet, Text, or CSV in Hive, or Any Other File Type? – 4 Types, 1 Easy Approach!


In this post I will show you how to save data as ORC, Parquet, Text, or CSV in Hive, and a few ways to export data from Hive to a CSV file. For this tutorial I have prepared a table “test_csv_data” with a few records in it.

Introduction

Hive is a data warehousing tool that runs on top of a Hadoop cluster. It provides a SQL-like interface for querying and managing large datasets stored in Hadoop’s distributed filesystem (HDFS).

File Formats

ORC, Parquet, Text, and CSV are all file formats that can be used to store data in Hive. Each of these formats has its own advantages and disadvantages, and the choice of which format to use will depend on the specific needs of your application.

ORC (Optimized Row Columnar)

ORC (Optimized Row Columnar) is a data storage format designed to store large amounts of data in a column-oriented manner, which can be more efficient than a row-oriented layout. ORC is often used in big data environments, such as Hadoop, to store and process large volumes of structured and semi-structured data.

One of the benefits of using ORC is that it allows for efficient querying and data processing, as data is stored in a columnar format, which can be more efficient for certain types of queries. ORC also includes features such as compression, predicate pushdown, and indexing, which can further improve query performance and reduce the amount of data that needs to be read and processed.

ORC is a popular choice for storing data in Hadoop and is supported by many tools and technologies in the Hadoop ecosystem, including Hive, Pig, and Spark. It is also used in other big data environments, such as Apache Impala and Apache Drill.
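To make this concrete, here is a minimal HiveQL sketch of an ORC-backed table. The table name and the SNAPPY compression setting are illustrative choices; the columns match the “test_csv_data” table used later in this post.

```sql
-- Hypothetical ORC table; "orc.compress" selects the compression codec.
CREATE TABLE test_orc_data (
  id   INT,
  name VARCHAR(20),
  age  INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Copy rows from the existing text-based table into the ORC-backed one.
INSERT INTO TABLE test_orc_data SELECT * FROM test_csv_data;
```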

Parquet

Apache Parquet is a columnar storage format for Hadoop and other distributed filesystems. It is designed to store and process large amounts of data efficiently, and is widely used in data warehousing and big data analytics applications.

One of the main advantages of using Parquet is that it allows you to store and process data in a columnar format, which can be much more efficient than storing and processing data in a row-based format. This is because columnar storage allows you to access and process only the specific columns that you need, rather than reading and processing an entire row of data.

In addition to its efficiency, Parquet is also highly portable, as it has been designed to work with a variety of different data processing frameworks and programming languages, including Apache Spark, Apache Flink, and Python.

Overall, Parquet is a popular choice for storing and processing large datasets in distributed systems, and is widely used in data engineering and data science applications.
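In Hive, switching a dataset to Parquet can be as simple as a CREATE TABLE AS SELECT. The sketch below assumes the “test_csv_data” table from this tutorial; the target table name is illustrative.

```sql
-- CTAS writes the query result straight into Parquet files,
-- deriving the column definitions from the SELECT.
CREATE TABLE test_parquet_data
STORED AS PARQUET
AS SELECT * FROM test_csv_data;
```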

Text

Text is a simple, flat file format that stores data as plain text, with each line representing a record and fields within each record separated by a delimiter such as a comma or tab. It is easy to work with and can be used to store data of any type, but it is not as efficient as more specialized formats like ORC or Parquet.
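A delimited text table in Hive looks like this; the table name is illustrative, and the delimiters match the ones used in Method 1 below.

```sql
-- Plain text storage: one line per record, comma-separated fields.
CREATE TABLE test_text_data (
  id   INT,
  name VARCHAR(20),
  age  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```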

CSV (Comma Separated Values)

CSV (Comma Separated Values) is similar to text format, but with a slightly more structured format. It stores data as plain text, with each line representing a record and fields within each record separated by a comma. CSV is often used to store tabular data and is easy to work with, but it is not as efficient as more specialized formats like ORC or Parquet.
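For CSV files that contain quoted fields or embedded commas, Hive ships an OpenCSVSerde that a plain delimited table cannot replace. A minimal sketch (the table name is hypothetical):

```sql
-- OpenCSVSerde handles quoting and embedded separators.
CREATE TABLE test_csv_serde_data (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```

Note that OpenCSVSerde exposes every column as STRING, so numeric columns have to be cast in queries.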

Avro

Apache Avro is a serialization format that is widely used in the Apache Hadoop ecosystem and other big data systems. It is a binary serialization format that uses a schema to specify the structure of the data being serialized. The schema is stored with the data, so Avro can be used to deserialize the data without requiring access to the schema that was used to serialize it.

Avro is designed to be compact, efficient, and easy to use. It is a popular choice for data serialization in big data systems because it is fast and has a low overhead, making it well-suited for use in high-volume, data-intensive environments.

Avro is implemented in a number of programming languages, including Java, C, C++, and Python. Besides its compact binary encoding, Avro also defines a JSON encoding of the same data.
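In Hive, an Avro-backed table can carry its schema inline via a table property. The sketch below reuses the columns of “test_csv_data”; the table and record names are illustrative.

```sql
-- Hive derives the table columns from the embedded Avro schema.
CREATE TABLE test_avro_data
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal' = '{
  "type": "record",
  "name": "TestRecord",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}');
```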


Save Data ORC Parquet Text CSV in Hive

Method 1: INSERT OVERWRITE LOCAL DIRECTORY…

Please find the HiveQL syntax below. I use the “INSERT OVERWRITE LOCAL DIRECTORY” statement to create a CSV file as the result of “SELECT * FROM test_csv_data”. With this method we have to execute the HiveQL using the hive or beeline command line, or Hue for instance.

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data;

Command output:

19/12/20 08:19:33 INFO ql.Driver: Compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Semantic Analysis Completed
19/12/20 08:19:34 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
19/12/20 08:19:34 INFO ql.Driver: Completed compiling command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 0.108 seconds
19/12/20 08:19:34 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/12/20 08:19:34 INFO ql.Driver: Executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246): INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigdataetl/hive2csv_insOverLocal' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','  
LINES TERMINATED BY "\n"
SELECT * FROM test_csv_data
19/12/20 08:19:34 INFO ql.Driver: Query ID = hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246
19/12/20 08:19:34 INFO ql.Driver: Total jobs = 1
19/12/20 08:19:34 INFO ql.Driver: Launching Job 1 out of 1
19/12/20 08:19:34 INFO ql.Driver: Starting task [Stage-1:MAPRED] in serial mode
19/12/20 08:19:34 INFO exec.Task: Number of reduce tasks is set to 0 since there's no reduce operator
19/12/20 08:19:34 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
19/12/20 08:19:34 INFO mr.ExecDriver: adding libjars: file:/usr/lib/hive/lib/hive-contrib.jar
19/12/20 08:19:34 ERROR mr.ExecDriver: yarn
19/12/20 08:19:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/12/20 08:19:34 INFO exec.Task: Starting Job = job_1576766377926_0012, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1576766377926_0012/
19/12/20 08:19:34 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1576766377926_0012
19/12/20 08:19:40 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
19/12/20 08:19:40 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
19/12/20 08:19:40 INFO exec.Task: 2019-12-20 08:19:40,838 Stage-1 map = 0%,  reduce = 0%
19/12/20 08:19:47 INFO exec.Task: 2019-12-20 08:19:47,130 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.6 sec
19/12/20 08:19:48 INFO exec.Task: MapReduce Total cumulative CPU time: 1 seconds 600 msec
19/12/20 08:19:48 INFO exec.Task: Ended Job = job_1576766377926_0012
19/12/20 08:19:48 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/12/20 08:19:48 INFO exec.Task: Copying data to local directory /tmp/bigdataetl/hive2csv_insOverLocal from hdfs://quickstart.cloudera:8020/tmp/hive/cloudera/ee559ee2-25dd-41b7-bd65-a0767a7a03b2/hive_2019-12-20_08-19-33_928_8164278149162791118-1/-mr-10000
19/12/20 08:19:48 INFO ql.Driver: MapReduce Jobs Launched: 
19/12/20 08:19:48 INFO ql.Driver: Stage-Stage-1: Map: 1   Cumulative CPU: 1.6 sec   HDFS Read: 3712 HDFS Write: 70 SUCCESS
19/12/20 08:19:48 INFO ql.Driver: Total MapReduce CPU Time Spent: 1 seconds 600 msec
19/12/20 08:19:48 INFO ql.Driver: Completed executing command(queryId=hive_20191220081919_33941327-d364-42ea-bfa4-7eb2739ef246); Time taken: 14.245 seconds
19/12/20 08:19:48 INFO ql.Driver: OK
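The same statement can export other file formats: the Hive DML syntax allows a STORED AS clause in place of the delimited row format. A hedged sketch, assuming an illustrative output path:

```sql
-- Write the query result as ORC files instead of delimited text;
-- swap ORC for PARQUET, AVRO, etc. as needed.
INSERT OVERWRITE DIRECTORY '/tmp/bigdataetl/hive2orc'
STORED AS ORC
SELECT * FROM test_csv_data;
```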

Method 2: Use hive -e

With this method we first have to create a shell/bash script, which we will then execute. In short: the script fetches data from Hive and writes the output to a file.

Create shell script

[root@quickstart bigdataetl]# vim hive2csv_hive.sh

#!/bin/bash
# Put this command into the shell file
hive -e 'select * from test_csv_data' | sed 's/[\t]/,/g'  > hive2csv_hive.csv

Execute the script and see the command output.

[root@quickstart bigdataetl]# ./hive2csv_hive.sh 
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 1.161 seconds, Fetched: 8 row(s)

Method 3: Use beeline

beeline -u 'jdbc:hive2://localhost:10000/default' --outputformat=csv2 -f hive2csv_beeline.sql > hive2csv_beeline.csv
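Here the -u flag gives the JDBC URL of HiveServer2, --outputformat=csv2 prints the result set as CSV, and -f points beeline at a SQL file. Judging by the query shown in the log output below, hive2csv_beeline.sql presumably contains just:

```sql
-- Assumed contents of hive2csv_beeline.sql.
select * from test_csv_data;
```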

Command output:

scan complete in 4ms
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_csv_data.id, type:int, comment:null), FieldSchema(name:test_csv_data.name, type:varchar(20), comment:null), FieldSchema(name:test_csv_data.age, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.113 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c): select * from test_csv_data
INFO  : Completed executing command(queryId=hive_20191220080303_a0466887-2fba-4b57-a6cc-81e11243c81c); Time taken: 0.002 seconds
INFO  : OK
8 rows selected (0.371 seconds)
Closing: 0: jdbc:hive2://localhost:10000/default

Results

Below you can find the generated files under the /tmp/bigdataetl path:

[root@quickstart bigdataetl]# find "$(pwd)"
/tmp/bigdataetl
/tmp/bigdataetl/hive2csv_beeline.csv
/tmp/bigdataetl/hive2csv_insOverLocal
/tmp/bigdataetl/hive2csv_insOverLocal/000000_0
/tmp/bigdataetl/hive2csv_insOverLocal/.000000_0.crc
/tmp/bigdataetl/hive2csv_beeline.sql
/tmp/bigdataetl/hive2csv_hive.csv
/tmp/bigdataetl/hive2csv_hive.sh

Summary

In summary, each of these file formats can be used to store data in Hive, and the choice of which format to use will depend on the specific needs of your application. ORC and Parquet are generally more efficient and optimized for fast read and write performance, while Text and CSV are simpler formats that are easier to work with but may not be as efficient.

That’s all about how to save data as ORC, Parquet, Text, or CSV in Hive!
