File Format and Data Manipulation in HDFS



Hi,

Could you please briefly explain the following?

How to convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/HCatalog?
How to write data with compression?
How to convert data from one set of values to another (e.g., Postal Address using an external library)?
How to purge bad records from a data set, e.g., null values?
How to perform deduplication and merge data in HDFS?
How to denormalize data from multiple disparate data sets?

2 Answer(s)



Hi Trinath,
How to convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/HCatalog?
>> Write a MapReduce program that reads the existing data values and writes them out in the new data format.
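As a rough illustration of that conversion step, here is a minimal sketch of the per-record logic in Python, written the way it could be used as a Hadoop Streaming mapper. Using Streaming and Python instead of a Java MapReduce job is an assumption made for brevity, and the CSV input, the JSON output, and the `FIELDS` schema are all hypothetical.

```python
import csv
import json

# Hypothetical schema of the old CSV records; adjust to the real data.
FIELDS = ["id", "name", "city"]


def convert_line(line):
    """Convert one CSV line to a JSON string, or return None if the field count is wrong."""
    row = next(csv.reader([line]))
    if len(row) != len(FIELDS):
        return None
    return json.dumps(dict(zip(FIELDS, row)), sort_keys=True)


def convert_lines(lines):
    """Yield the converted form of every well-formed, non-empty input line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        converted = convert_line(line)
        if converted is not None:
            yield converted
```

In a Streaming job the script would end with something like `for out in convert_lines(sys.stdin): print(out)` (after `import sys`), so that each input split is read from standard input and the converted records go to standard output.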
How to write data with compression?
>> To compress the data, use one of the Hadoop-supported compression formats, such as Snappy or gzip. See http://comphadoop.weebly.com/ for more details.
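To make this a bit more concrete, here is a small sketch that builds the command line for a map-only Hadoop Streaming job with compressed output. The two `mapreduce.output.fileoutputformat.compress*` properties and the codec class names are the standard Hadoop ones. The helper names, the jar location, the paths, and the choice of Streaming are assumptions for illustration only.

```python
# Codec classes shipped with Hadoop for the two formats named above.
CODECS = {
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def compression_options(codec):
    """Return the generic -D options that switch on compressed job output."""
    if codec not in CODECS:
        raise ValueError("unsupported codec: " + codec)
    return [
        "-D", "mapreduce.output.fileoutputformat.compress=true",
        "-D", "mapreduce.output.fileoutputformat.compress.codec=" + CODECS[codec],
    ]


def streaming_command(streaming_jar, input_path, output_path, mapper, codec):
    """Build the argument list for a map-only Streaming job (hypothetical helper).

    Generic options (-D, -files) have to come before the Streaming options.
    """
    return (
        ["hadoop", "jar", streaming_jar]
        + compression_options(codec)
        + ["-D", "mapreduce.job.reduces=0", "-files", mapper]
        + ["-input", input_path, "-output", output_path, "-mapper", mapper]
    )
```

The resulting list can be passed to `subprocess.run` on a machine where the `hadoop` client is installed. Keep in mind that gzip files are not splittable, which matters if a later job reads large compressed text files.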
How to convert data from one set of values to another?
>> It is not clear which tool you are using: Pig, Hive, or MapReduce?
How to purge bad records from a data set, e.g., null values?
>> That depends on the tool you are using. With MapReduce, use the reporter to capture the bad records, then either report them or delete them.
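For the MapReduce case, a minimal sketch of such a filter could look like the following, again assuming a Hadoop Streaming mapper in Python rather than a Java job. In Streaming, a line of the form `reporter:counter:<group>,<counter>,<amount>` written to standard error increments a job counter, which plays the role of the reporter here. The delimiter, the expected field count, the `NULL_TOKENS` set, and the counter names are hypothetical.

```python
import sys

# Values treated as null (an assumption); \N is the default null marker in Hive text tables.
NULL_TOKENS = {"", "null", "NULL", "\\N"}


def is_bad(fields, expected_count):
    """A record is bad if it has the wrong number of fields or any null-like field."""
    return len(fields) != expected_count or any(
        field.strip() in NULL_TOKENS for field in fields
    )


def purge(lines, expected_count, delimiter="\t", err=sys.stderr):
    """Yield only the good records and count every dropped one via a Streaming counter."""
    for line in lines:
        line = line.rstrip("\n")
        if is_bad(line.split(delimiter), expected_count):
            err.write("reporter:counter:DataQuality,BadRecords,1\n")
            continue
        yield line
```

Dropping the record deletes it from the output. To report bad records instead, the same branch could write the raw line to a side output for later inspection.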



Hi,

Thanks for answering my questions.

1. How to convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/HCatalog?
>> Write a MapReduce program that reads the existing data values and writes them out in the new data format.
** How can we do this in Pig or Hive?

2. How to convert data from one set of values to another?
>> It is not clear which tool you are using: Pig, Hive, or MapReduce?
** The tools I am using are Pig and Hive.

3. How to purge bad records from a data set, e.g., null values?
>> That depends on the tool you are using. With MapReduce, use the reporter to capture the bad records, then either report them or delete them.
** I am using Pig and Hive. Could you suggest ways to purge bad records using these tools?