Issue in nested grouping and MAX clause in PIG


3 Answer(s)


Hi Harshit,

Please try the below code:

battingAnalysis = LOAD '/Data/Batting/Batting.csv' USING PigStorage(',') AS (playerID:chararray,yearID:int, teamID:chararray, oppTeamID:chararray,runs:int,balls:int);
batting = FILTER battingAnalysis by playerID!='playerID';
groupByPlayer = GROUP batting BY playerID;
sumRuns = FOREACH groupByPlayer GENERATE group, SUM(batting.runs) as totalRuns;
sumBalls= FOREACH groupByPlayer GENERATE group, SUM(batting.balls) as totalBalls;


Hope this helps.

Thanks,


It is not giving desired output.


Hi Harshit,

The reason you are not getting the desired answer because you are try to group your result by year and player, however, your target was max runs scored by a player in a year. So your actual targeted attributes are year and runs, not the player. I hope this makes sense?

You must try this:

A = LOAD '/home/hduser/Batting.csv' using PigStorage(',') as (playerID:chararray, yearID:int, stint:int, teamID:chararray, lgID:chararray, G:int, Gbatting:int, AB:int, R:int, H:int, TwoB:int, ThreeB:int, HR:int, RBI:int, SB:int, CS:int, BB:int, SO:int, IBB:int, HBP:int, SH:int, SF:int, GIDP:int, Gold:int);
B = GROUP A BY (yearID, playerID);
C = FOREACH B GENERATE group.yearID, group.playerID, SUM(A.R) as runs;
D = GROUP C BY yearID;
E = FOREACH D GENERATE group as yearID, MAX(C.runs) as runs;
F = JOIN E BY (yearID, runs), C BY (yearID, runs);
G = FOREACH F GENERATE E::yearID, C::playerID, C::runs as max_runs;
DUMP G;

--More than one player in an year with same score (runs) (QC)
H = GROUP G BY yearID;
I = FOREACH H GENERATE group as yearID, COUNT(G.playerID) as counts;
OR
--Distinct player counts
I = FOREACH H { players = FOREACH G GENERATE playerID;
                distinct_players = DISTINCT players;
              GENERATE group as yearID, COUNT(distinct_players) as counts;
              };

J = FILTER I BY counts > 1;
DUMP J;