[SOLVED] Apache Hive Convert Orc To Parquet – The Best Way To Convert Data From One Format To Another (CSV, Parquet, Avro, ORC) – 1 Cool Approach For All Cases!


In this short tutorial I will dive into the topic: Apache Hive Convert ORC to Parquet.

I will show you how to convert data in Hive from one format to another without any additional application.

Introduction

ORC Format

ORC is a type-aware columnar file format created for Hadoop workloads that is self-descriptive. It’s built for huge streaming reads, but it also has built-in functionality for quickly finding required rows. The reader can read, decompress, and process only the values required for the current query by storing data in a columnar format. Because ORC files are type-aware, the writer selects the best encoding for the type and creates an internal index when writing the file.

By default, ORC files are separated into stripes of about 64MB each. A file's stripes are self-contained and make up the natural unit of distributed work. The columns are divided from one another within each stripe so that the reader can read only the columns that are required. More information can be found in the official ORC specification.

Parquet Format

Parquet is a columnar storage format for storing large amounts of data in a highly efficient and performant manner. It is a popular choice for storing and processing data in big data platforms such as Apache Hadoop and Apache Spark.

One of the main advantages of the Parquet format is that it is optimized for reading and writing large amounts of data. It stores data in a columnar fashion, which allows it to efficiently compress and store data, and it also supports efficient data processing, as data can be read and processed in a column-wise fashion rather than row-wise.

Another advantage of the Parquet format is that it is highly flexible and can store data of various types and structures. It supports a wide range of data types, including primitive types such as integers, floats, and strings, as well as more complex types such as arrays and maps. It also supports nested data structures, which allows it to store complex data in a highly efficient manner.

ORC VS Parquet

ORC (Optimized Row Columnar) and Parquet are both columnar storage formats that are designed to store and process large amounts of data in a highly efficient and performant manner. They are often used in big data platforms such as Apache Hadoop and Apache Spark for storing and processing data.

One key difference between ORC and Parquet is the way they lay out data on disk. Both are columnar formats, but ORC organizes a file into stripes, each containing index data, row data, and a stripe footer, while Parquet organizes a file into row groups made up of column chunks, which are further divided into pages.

Another difference is how they fit into the ecosystem. ORC is tightly integrated with Hive, where it enables features such as ACID transactions and built-in indexes, while Parquet enjoys broader support across engines and is often preferred for deeply nested data structures.

Overall, both ORC and Parquet are efficient choices for storing and processing large amounts of data, and the choice between them will depend on the specific needs and requirements of your application.

Avro Format

Apache Avro is a serialization and data exchange format that is designed to be compact, efficient, and easy to use. It is often used in big data platforms such as Apache Hadoop and Apache Spark for storing and exchanging data.

One of the main features of Avro is that it is a self-describing format, which means that it includes the schema for the data that it serializes within the data itself. This allows Avro to support data evolution and data compatibility, as the schema can be used to interpret and read data that was serialized using different versions of the schema.

Another feature of Avro is that it supports a wide range of data types and structures, including primitive types such as integers, floats, and strings, as well as more complex types such as arrays and maps. It also supports nested data structures, which allows it to store complex data in a highly efficient manner.

CSV Format

CSV is a popular choice for storing and exchanging data due to its simplicity and wide support. It stores data in a table structure with rows representing records and columns representing fields, and values are separated by commas.

CSV can store various data types and structures, including numeric, text, and date data. However, it is not ideal for storing large amounts of data and lacks some advanced features found in other data storage formats like support for complex data types and structures and data compression.

Advantages Of Parquet In Apache Hive

Apache Hive is a data warehousing and SQL-like query language for big data platforms such as Apache Hadoop. It provides a convenient interface for working with large datasets stored in the Hadoop Distributed File System (HDFS) and other storage systems, and it supports a range of data formats, including the Parquet format.

There are several advantages to using the Parquet format with Hive:

  1. Improved performance: Parquet is a columnar storage format that is optimized for reading and writing large amounts of data. It stores data in a column-based format, which allows it to compress and store data more efficiently, and it also supports efficient data processing, as data can be read and processed in a column-wise fashion rather than row-wise.
  2. Data compatibility: Parquet is a self-describing format, which means that it includes the schema for the data that it stores within the data itself. This allows it to support data evolution and data compatibility, as the schema can be used to interpret and read data that was written using different versions of the schema.
  3. Wide range of data types and structures: Parquet supports a wide range of data types and structures, including primitive types such as integers, floats, and strings, as well as more complex types such as arrays and maps. It also supports nested data structures, which allows it to store complex data in a highly efficient manner.

Overall, using the Parquet format with Hive can improve the performance and efficiency of Hive queries and data processing tasks, and it is a popular choice for storing and processing large amounts of data in the Hadoop ecosystem.

Parquet As Native Format For Impala

Apache Impala is an open-source distributed SQL query engine for Hadoop that is designed for fast, interactive analysis of large datasets. It is a popular choice for data warehousing and business intelligence workloads.

Using Parquet as the native file format for Impala can improve the performance and efficiency of Impala queries and data processing tasks, and it is a popular choice for storing and processing large amounts of data in the Hadoop ecosystem.

In addition to its performance benefits, Parquet is also a self-describing format that supports data evolution and data compatibility, and it supports a wide range of data types and structures.
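The DDL in Impala mirrors the Hive statements shown later in this post. A minimal sketch of creating a Parquet table directly from Impala (the table and column names here are illustrative, not from the original example):

```sql
-- Impala writes Parquet files natively when the table is STORED AS PARQUET
CREATE TABLE events_parquet (
  id BIGINT,
  payload STRING
)
STORED AS PARQUET;

-- an existing table can also be converted in one step with CTAS:
-- CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_csv;
```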

Apache Hive Convert ORC To Parquet

Hint: Just copy data between Hive tables

Consider the following scenario:

Step #1 – Make a copy of the table but change the “STORED AS” format

You have a table in CSV format like the one below:

CREATE TABLE data_in_csv (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ',',
  "quoteChar" = '"',
  "escapeChar" = '\\'
)
STORED AS TEXTFILE;

Now we will create the same table, but stored in ORC format:

CREATE TABLE data_in_orc (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Step #2 – Copy the data between tables

Now that you have created these two tables, we will simply copy the data from the first one to the new one. The conversion will be done by the Hive engine; you don't have to know how it was performed 🙂

-- enable dynamic partitioning so Hive can resolve INGESTION_ID at runtime
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE data_in_orc PARTITION (INGESTION_ID)
SELECT id, name, age, ingestion_id FROM data_in_csv;
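A quick sanity check after the copy is to compare row counts between the source and target tables (this step is my suggestion, not part of the original recipe):

```sql
-- both counts should match once the INSERT completes
SELECT COUNT(*) FROM data_in_csv;
SELECT COUNT(*) FROM data_in_orc;
```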

Avro To Parquet Example

Now let's consider the case where you want to convert data from Avro to Parquet format. All you have to do is create a new table with the target format and execute an insert-as-select statement.

-- Avro format
CREATE TABLE data_in_avro (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS AVRO;

-- Parquet format
CREATE TABLE data_in_parquet (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS PARQUET;
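The copy step works exactly as before. A sketch of the insert-as-select for this pair of tables (assuming both are partitioned by the same INGESTION_ID column, as in the earlier example):

```sql
-- enable dynamic partitioning for the partition column
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE data_in_parquet PARTITION (INGESTION_ID)
SELECT id, name, age, ingestion_id FROM data_in_avro;
```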

Summary

Overall, the Parquet format is a widely-used and efficient choice for storing and processing large amounts of data, and it is a key component of many big data platforms and pipelines.

Avro is a popular choice for storing and exchanging data in big data platforms and pipelines, thanks to its compact size, efficient encoding, and support for data evolution and compatibility.

That's all about: Apache Hive Convert ORC to Parquet! Enjoy!

Could You Please Share This Post? 
I appreciate It And Thank YOU! :)
Have A Nice Day!
