Explanation of filtering aggregated data (groupby with filter)



0

In module 4  Visualization there is a topic in the webinar about grouping and filtering. I was trying to get my head around it and succeeded by trying for myself.

I was trying to understand the following part:

dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
print dff
dff.groupby('B').filter(lambda x: len(x) > 2)

In the webinar explanation the groupby + filter part selects the last 2 rows of the list. I believe this is not right.

I tried for myself and found the following explanation working for me.

The filter works by handling the groups separately (split), apply the function to the group (apply), combine the result (combine). So: split -> apply -> combine.

dff_grouped = dff.groupby('B')
print dff_grouped.groups

Output: {'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}

So there are 3 groups

The filter "len(x) > 2" is applied to each group separately. It means that the length of the group (number of rows) must be larger than 2. Only the third group has more than 2 rows (it has 4). So only the rows of the third group are present in the combine step.

The result:

   A  B
2  2  b
3  3  b
4  4  b
5  5  b

Now the more difficult one:

print grouped.filter(lambda x: np.mean(x['MEDV']) > 23)

The lambda function is applied to each group separately. The rows from a group are in the resulting dataset when the mean of "MEDV" is larger the 23 for the rows in that group.

Lets print the mean of "MEDV" for each group (and remove some irrelevant columns):

print grouped.mean()['MEDV']

Result:

CHAS  RAD
0     1      23.015789
      2      26.833333
      3      27.927778
      4      21.040196
      5      25.521154
      6      20.976923
      7      27.105882
      8      31.505263
      24     15.438710
1     1      50.000000
      3      27.950000
      4      25.812500
      5      27.463636
      8      26.000000
      24     31.362500
Name: MEDV, dtype: float64

Notice that the mean of "MEDV" is not larger than 23 for the groups
CHAS = 0, RAD = 4 (mean MEDV = 21.0)
CHAS = 0, RAD = 6 (mean MEDV = 20.9)
CHAS = 0, RAD = 24 (mean MEDV = 15.4)

So those groups will not be present in the result:

print grouped_filtered.groupby(['CHAS','RAD']).mean()['MEDV']
CHAS  RAD
0     1      23.015789
      2      26.833333
      3      27.927778
      5      25.521154
      7      27.105882
      8      31.505263
1     1      50.000000
      3      27.950000
      4      25.812500
      5      27.463636
      8      26.000000
      24     31.362500
Name: MEDV, dtype: float64

And indeed those groups are missing.

Hope this help!

 


1 Answer(s)


0

Hi Maarten,

Hope you are doing great,

In the module 4 the groupby where we spent more informative time to explain about the groupby process You gave the same intuition again. 

Thanks it helps to revise.

regards

Jameel shaik

Your Answer

Click on this code-snippet-icon icon to add code snippet.

Upload Files (Maximum image file size - 1.5 MB, other file size - 10 MB, total size - not more than 50 MB)

Email
Password