Here is my homework Pig script. Technically it works, but is it good?



Technically, this script does a word count.
But because of the way it uses GROUP, it does not seem like there would be any way to break the work up across separate mappers, so I'm not sure whether this is a good Pig script.

---WORDCOUNT.PIG
-- load lines, seems about the same as USING TextLoader
LINES = LOAD 'wcdata.txt' AS (line:chararray);
-- TOKENIZE
TOKES = FOREACH LINES GENERATE TOKENIZE(line) AS linelist;
-- tuple of bag of tuple of chararray?? - must flatten out!

WL = FOREACH TOKES GENERATE FLATTEN(linelist) AS words;
-- wow that actually worked!

WG = GROUP WL BY words;

-- note: GROUP names the key field "group"
-- and names the (inner) bag after the relation that was grouped (WL)
WC = FOREACH WG GENERATE group AS word, COUNT(WL) AS ct;

1 Answer(s)



Hi Chip,

Here is the word count Pig script:

lines = LOAD 'file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Here is the explanation:

The first statement loads the file one line at a time. The second splits each line into words using the TOKENIZE function, which creates a bag of words, and then applies FLATTEN to un-nest that bag so that each word becomes its own tuple (row). In the third statement, identical words are grouped together so that they can be counted, and the fourth statement computes the count for each group.
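If you want to check the schema after each step, or see how Pig divides the script into map and reduce phases, DESCRIBE and EXPLAIN can help. The following is only a minimal sketch, assuming the same 'file.txt' and aliases as in the script above. The PARALLEL value of 4 is an arbitrary illustration, not a recommendation.

```pig
lines = LOAD 'file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- PARALLEL requests a number of reducers for this GROUP;
-- 4 is an arbitrary example value (assumption)
grouped = GROUP words BY word PARALLEL 4;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS ct;

-- Print the schema of the intermediate relations
DESCRIBE words;
DESCRIBE grouped;

-- Print the logical, physical and MapReduce plans without running the job
EXPLAIN wordcount;
```

In MapReduce mode, the LOAD and FOREACH steps normally run in the map phase, and GROUP is the point where data is shuffled to the reducers, so the GROUP should not stop the input from being split across several mappers. Because COUNT is an algebraic function, Pig can usually also apply a combiner to pre-aggregate counts on the map side. The EXPLAIN output should show whether that happens for your script.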

Thanks