Hive – the best way to convert data from one format to another (CSV, Parquet, Avro, ORC)

Hive – the best way to convert data from one format to another (CSV, Parquet, Avro, ORC)

Hello! In this short tutorial I will give you a hint how you can convert the data in Hive from one to another format without any additional application.

Hint: Just copy data between Hive tables

Let’s concern the following scenario:

  • You have data in CSV format in table “data_in_csv”
  • You would like to have the same data but in ORC format in table “data_in_parquet”

Step #1 – Make copy of table but change the “STORED” format

You have table in CSV format like below:

CREATE TABLE data_in_csv (
  id Int,
  name String,
  age Int
)
PARTITIONED BY (INGESTION_ID BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ',',
  "quoteChar" = '"',
  "escapeChar" = '\'
)
STORED AS TEXTFILE;

Now we will create the same table but in ORC format:

CREATE TABLE data_in_orc (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Step #2 – Copy the data between tables

Now, when you have created these two tables we will just copy the data from first to new one. The conversion will be done by Hive engine. You don’t have to know how it was performed 🙂

INSERT OVERWRITE TABLE DATA_IN_ORC PARTITION (INGESTION_ID)
SELECT ID, NAME, AGE, INGESTION_ID FORM DATA_IN_CSV;

For Avro and Parquet examples

-- Avro format
CREATE TABLE data_in_avro (
  id int,
  name string,
  age int
)
PARTITIONED BY (INGESTION_ID BIGINT)
STORED AS AVRO;

-- Parquet format
CREATE TABLE data_in_parquet (
  id int,
  name string,
  age int
)
PARTITIONED BY (LOADING_DATE STRING)
STORED AS STORED AS PARQUET;

If you enjoyed this post please leave the comment below or share this post on your Facebook, Twitter, LinkedIn or another social media webpage.

Thanks in advanced!

Leave a Reply

avatar
  Subscribe  
Notify of
Close Menu