In module 4 Visualization there is a topic in the webinar about grouping and filtering. I was trying to get my head around it and succeeded by trying for myself.

I was trying to understand the following part:

```
dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
print dff
dff.groupby('B').filter(lambda x: len(x) > 2)
```

In the webinar explanation the groupby + filter part selects the last 2 rows of the list. I believe this is not right.

I tried for myself and found the following explanation working for me.

The filter works by handling the groups separately (split), apply the function to the group (apply), combine the result (combine). So: split -> apply -> combine.

```
dff_grouped = dff.groupby('B')
print dff_grouped.groups
```

Output: {'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}

So there are 3 groups

The filter "len(x) > 2" is applied to each group separately. It means that the length of the group (number of rows) must be larger than 2. Only the third group has more than 2 rows (it has 4). So only the rows of the third group are present in the combine step.

The result:

A B 2 2 b 3 3 b 4 4 b 5 5 b

Now the more difficult one:

`print grouped.filter(lambda x: np.mean(x['MEDV']) > 23)`

The lambda function is applied to each group separately. The rows from a group are in the resulting dataset when the mean of "MEDV" is larger the 23 **for the rows in that group**.

Lets print the mean of "MEDV" for each group (and remove some irrelevant columns):

`print grouped.mean()['MEDV']`

Result:

CHAS RAD 0 1 23.015789 2 26.833333 3 27.927778 4 21.040196 5 25.521154 6 20.976923 7 27.105882 8 31.505263 24 15.438710 1 1 50.000000 3 27.950000 4 25.812500 5 27.463636 8 26.000000 24 31.362500 Name: MEDV, dtype: float64

Notice that the mean of "MEDV" is not larger than 23 for the groups

CHAS = 0, RAD = 4 (mean MEDV = 21.0)

CHAS = 0, RAD = 6 (mean MEDV = 20.9)

CHAS = 0, RAD = 24 (mean MEDV = 15.4)

So those groups will not be present in the result:

`print grouped_filtered.groupby(['CHAS','RAD']).mean()['MEDV']`

CHAS RAD 0 1 23.015789 2 26.833333 3 27.927778 5 25.521154 7 27.105882 8 31.505263 1 1 50.000000 3 27.950000 4 25.812500 5 27.463636 8 26.000000 24 31.362500 Name: MEDV, dtype: float64

And indeed those groups are missing.

Hope this help!

Apr 18 2016 09:32 PM