Hive Partitioning based on file name or value derived from one of the columns




I am trying to create a Hive partitioned table with the weather dataset shared during the course. The weather data is organized in files, and each file is named after the month it covers:

201201hourly.txt, 201202hourly.txt, 201203hourly.txt, 201204hourly.txt

What is the correct approach to partitioning this data in a Hive table, given that it is already organized into files the way we want?

- Can we partition the weather data table using these file names? Is there anything built into Hive that supports partitioning by input directory or file name?

- Is there a way to partition based on a value derived from one of the columns in a file?

          - For example, the weather data has dates in the format "20120101" (where the 5th and 6th characters represent the month). Can I partition based on those 2 characters in the column?

One option is to transform the data to add another column (for the month) and then partition and load the data into the Hive table, but that would be a resource-heavy process.


1 Answer(s)



Hi Pankaj,

Hive supports partitioning, but each partition has to be defined by the values of the partition columns, such as date, city, or department. Once a table is partitioned, it is easy to query just a portion of the data.
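To illustrate what supplying a partition value could look like for files named by month, here is a minimal sketch that reads the value off each file name and prints one `ALTER TABLE ... ADD PARTITION` statement per file. The table name `weather`, the partition column `yearmonth`, and the base path `/data/weather` are hypothetical. It also assumes each monthly file has been placed in its own directory, because a partition location is a directory rather than a single file.

```python
import re

# Hypothetical names: adjust the table, partition column, and base path.
TABLE = "weather"
PARTITION_COL = "yearmonth"
BASE_PATH = "/data/weather"

# Matches file names such as 201201hourly.txt and captures the YYYYMM prefix.
FILE_PATTERN = re.compile(r"^(\d{6})hourly\.txt$")


def partition_value(file_name):
    """Return the YYYYMM prefix of a file name like 201201hourly.txt."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group(1)


def add_partition_statement(file_name):
    """Build the Hive DDL that registers one month as a partition."""
    value = partition_value(file_name)
    return (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION ({PARTITION_COL}='{value}') "
        f"LOCATION '{BASE_PATH}/{PARTITION_COL}={value}';"
    )


if __name__ == "__main__":
    files = [
        "201201hourly.txt",
        "201202hourly.txt",
        "201203hourly.txt",
        "201204hourly.txt",
    ]
    for name in files:
        print(add_partition_statement(name))
```

The printed statements could then be run from the Hive shell against a table declared with `PARTITIONED BY (yearmonth STRING)`; the files themselves are not rewritten in this approach.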

You need to run a script outside Hive to create an extra column that contains only the month digits of the date string (the 5th and 6th characters of a value like "20120101").
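A script for adding that extra column might look roughly like the sketch below. It assumes the files are comma-separated with no header row and that the date sits in the second field (index 1); both are assumptions about the dataset, and the sample rows are made up purely for illustration.

```python
import csv
import io


def month_of(date_str):
    """Return the month (5th and 6th characters) of a YYYYMMDD string."""
    if len(date_str) != 8 or not date_str.isdigit():
        raise ValueError(f"expected YYYYMMDD, got: {date_str!r}")
    return date_str[4:6]


def add_month_column(rows, date_index):
    """Yield each row with the derived month appended as a new last field."""
    for row in rows:
        yield row + [month_of(row[date_index])]


def transform(src, dst, date_index=1):
    """Copy CSV rows from src to dst, appending the month column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerows(add_month_column(reader, date_index))


if __name__ == "__main__":
    # Made-up sample rows; the real files would be opened with open(...).
    sample = io.StringIO("03011,20120101,0053\n03011,20120215,0153\n")
    out = io.StringIO()
    transform(sample, out)
    print(out.getvalue(), end="")
```

The new last field can then serve as the partition column when the data is loaded. If rewriting the files turns out to be too heavy, Hive's built-in `substr` function can derive the same value inside an `INSERT ... SELECT` with dynamic partitioning, though that is a different route from the one suggested here.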

Hope this helps.

Thanks.