[SOLVED] Apache Hive Convert ORC to Parquet – the best way to convert data from one format to another (CSV, Parquet, Avro, ORC) – 1 cool approach for all cases!

[ Apache Hive Convert ORC to Parquet ] In this short tutorial I will show you how to convert data in Hive from one format to another without any additional application.

ORC Format

ORC is a type-aware columnar file format created for Hadoop workloads that is self-descriptive. It’s built for huge streaming reads, but it also has built-in functionality for quickly finding required rows. The reader can read, decompress, and process only the values required for the current query by storing data in a columnar format. Because ORC files are type-aware, the writer selects the best encoding for the type and creates an internal index when writing the file.

By default, ORC files are separated into stripes of about 64MB each. A file’s stripes are self-contained and make up the natural unit of distributed work. The columns are divided from one another within each stripe so that the reader can read only the columns that are required. You can find more information in the official specification.
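The stripe size and compression codec mentioned above can be tuned per table through TBLPROPERTIES. A small sketch (the values below are illustrative examples, not recommendations):

```sql
-- Example only: tune ORC layout via table properties.
-- "orc.stripe.size" is in bytes; 67108864 = 64 MB (the default mentioned above).
CREATE TABLE data_in_orc_tuned (
  id int,
  name string,
  age int
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "ZLIB",        -- or SNAPPY / NONE
  "orc.stripe.size" = "67108864"  -- stripe size in bytes
);
```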

Apache Hive Convert ORC to Parquet

Hint: Just copy data between Hive tables

Consider the following scenario:

Step #1 – Make copy of table but change the “STORED” format

You have a table in CSV format like the one below:

CREATE TABLE data_in_csv (
  id Int,
  name String,
  age Int
)
PARTITIONED BY (INGESTION_ID BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ',',
  "quoteChar" = '"',
  "escapeChar" = '\'
)
STORED AS TEXTFILE;

Now we will create the same table, but stored in ORC format:

CREATE TABLE data_in_orc (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Step #2 – Copy the data between tables

Now that you have created these two tables, we will simply copy the data from the first one to the new one. The conversion is done by the Hive engine; you don’t have to know how it is performed 🙂

INSERT OVERWRITE TABLE data_in_orc PARTITION (INGESTION_ID)
SELECT id, name, age, INGESTION_ID FROM data_in_csv;
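Note that this INSERT uses dynamic partitioning: the partition value comes from the SELECT rather than being stated as a literal. If your Hive instance runs in strict mode it will reject such a statement, so you may first need to enable dynamic partitioning for the session; a minimal sketch using the standard Hive configuration properties:

```sql
-- Allow dynamic partitioning for this session (required when Hive is in strict mode).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```

After these settings, the INSERT OVERWRITE above creates one partition per distinct INGESTION_ID it encounters.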

Examples for the Avro and Parquet formats:

-- Avro format
CREATE TABLE data_in_avro (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS AVRO;

-- Parquet format
CREATE TABLE data_in_parquet (
  id int,
  name string,
  age int
)
PARTITIONED BY (LOADING_DATE STRING)
STORED AS PARQUET;
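The same copy pattern works for these tables. A sketch, assuming the data_in_csv source table from above (note that data_in_parquet is partitioned by LOADING_DATE rather than INGESTION_ID, so the SELECT has to supply a value for it; the date literal below is purely hypothetical):

```sql
-- Copy into the Avro table: same pattern as for ORC.
INSERT OVERWRITE TABLE data_in_avro PARTITION (INGESTION_ID)
SELECT id, name, age, INGESTION_ID FROM data_in_csv;

-- The Parquet table uses a different partition column, so the SELECT
-- must provide a LOADING_DATE value (example literal shown here).
INSERT OVERWRITE TABLE data_in_parquet PARTITION (LOADING_DATE)
SELECT id, name, age, '2020-01-01' AS LOADING_DATE FROM data_in_csv;
```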

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

https://cwiki.apache.org/confluence/display/Hive//Home


Built on top of Apache Hadoop™, Hive provides the following features:

Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
Query execution via Apache Tez™, Apache Spark™, or MapReduce
Procedural language with HPL-SQL
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.


That’s all about: Apache Hive convert ORC to Parquet! Enjoy!
