[SOLVED] Apache Hive Convert ORC to Parquet – the best way to convert data from one format to another (CSV, Parquet, Avro, ORC) – 1 cool approach for all cases!

You are currently viewing [SOLVED] Apache Hive Convert ORC to Parquet – the best way to convert data from one format to another (CSV, Parquet, Avro, ORC) – 1 cool approach for all cases!
Share This Post, Help Others, And Earn My Heartfelt Appreciation! :)
4.9
(1475)

[ Apache Hive Convert ORC to Parquet ] In this short tutorial I will give you a hint how you can convert the data in Hive from one to another format without any additional application.

Apache Hive Convert ORC to Parquet

Hint: Just copy data between Hive tables

Let’s concern the following scenario:

  • You have data in CSV format in table “data_in_csv”
  • You would like to have the same data but in ORC format in table “data_in_parquet” -> convert ORC to Parquet

Step #1 – Make copy of table but change the “STORED” format

You have table in CSV format like below:

CREATE TABLE data_in_csv (
  id Int,
  name String,
  age Int
)
PARTITIONED BY (INGESTION_ID BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ',',
  "quoteChar" = '"',
  "escapeChar" = '\'
)
STORED AS TEXTFILE;

Now we will create the same table but in ORC format: (Convert ORC to Parquet)

CREATE TABLE data_in_orc (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Step #2 – Copy the data between tables

Now, when you have created these two tables we will just copy the data from first to new one. The conversion will be done by Hive engine. You don’t have to know how it was performed 🙂

INSERT OVERWRITE TABLE DATA_IN_ORC PARTITION (INGESTION_ID)
SELECT ID, NAME, AGE, INGESTION_ID FORM DATA_IN_CSV;

For Avro and Parquet examples

-- Avro format
CREATE TABLE data_in_avro (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS AVRO;

-- Parquet format
CREATE TABLE data_in_parquet (
  id int,
  name string,
  age int
)
PARTITIONED BY (LOADING_DATE STRING)
STORED AS STORED AS PARQUET;

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

https://cwiki.apache.org/confluence/display/Hive//Home

Apache Hive Convert ORC to Parquet

Built on top of Apache Hadoop™, Hive provides the following features:

Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
Query execution via Apache Tez™, Apache Spark™, or MapReduce
Procedural language with HPL-SQL
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

https://cwiki.apache.org/confluence/display/Hive//Home

That’s all about: Apache Hive Convert ORC to Parquet!. Enjoy!

If you enjoyed this post please add the comment below and share this post on your Facebook, Twitter, LinkedIn or another social media webpage.
Thanks in advanced!

How useful was this post?

Click on a star to rate it!

Average rating 4.9 / 5. Vote count: 1475

No votes so far! Be the first to rate this post.

Subscribe
Notify of
guest
2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
shobha

good