**Using groupby(): Split-Apply-Combine**

In data analysis workflows, operations like data loading, cleaning and merging are usually followed by summarization using some grouping variable(s). This includes computing summary statistics over variables (or groups within variables), within-group transformations (like variable standardization), pivot tables and by-group analyses. Pandas DataFrames have a .groupby() method that works much like SQL's GROUP BY.

This process involves three steps:

**Splitting** the data into groups based on the levels of a categorical variable. This is generally the simplest step.

**Applying** a function to each group individually. There are three classes of functions we might consider:

- Aggregate – estimate summary statistics (like counts and means) for each group. This reduces the size of the data.
- Transform – within-group standardization, or imputation using group values. The size of the data does not change.
- Filter – discard the rows that belong to particular groups.
- A combination of these three.

**Combining** the results into a Series or DataFrame.

The image below gives a graphical explanation of this process: we split by 'x', apply the function 'mean' to each group formed, and then append the results.

The Split-Apply-Combine Strategy

With pandas, we can implement this strategy as follows:

- **Split** - A DataFrame can be split up by *rows (axis=0)* or *columns (axis=1)* into groups. We use *df.groupby()* to create a GroupBy object.
- **Apply** - A function is applied to each group using *.agg()* or *.apply()*.
- **Combine** - The results of applying functions to groups are put together into an object. Data types of the returned objects are handled gracefully by pandas: aggregating a single column returns a Series, several columns a DataFrame.

We create a GroupBy object by calling the *groupby()* method on a DataFrame, passing the column name(s) we wish to use for grouping. These objects:

- have a *.size()* method, which returns the count of elements in each group
- can be subsetted using column names (or arrays of column names) to select variables for aggregation
- have optimized methods for common aggregation operations, such as:
  - count, sum
  - mean, median, std, var
  - first, last
  - min, max

- support specialized methods like *.describe()*
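As a quick illustration of these properties, here is a minimal, self-contained sketch (the toy frame and its column names are invented for this example):

```python
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 3.0]})
gb = toy.groupby('key')

print(gb.size())           # element count per group: a -> 2, b -> 1
print(gb['val'].mean())    # subset one column, then aggregate it
print(gb.describe())       # summary statistics for every group
```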

By far, the most important GroupBy methods are *.agg()*, *.transform()* and *.apply()*.
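The sections below work through *.agg()* and *.apply()*. Since *.transform()* (and the related *.filter()*) are not demonstrated later, here is a brief sketch of both on an invented toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'val': [1.0, 3.0, 2.0, 6.0]})
gb = toy.groupby('key')

# Transform: within-group standardization; the result has the same length as toy
print(gb['val'].transform(lambda s: (s - s.mean()) / s.std()))

# Filter: keep only the rows of groups whose mean of val exceeds 3
print(gb.filter(lambda g: g['val'].mean() > 3))
```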

Syntax:

```python
# Create a GroupBy object
gb_obj = my_df.groupby('col_x')

# Summarize each group
gb_obj.my_func()
```

Example:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

# Create a toy dataset with 2 categorical and 2 numeric variables
df = DataFrame({'k1': list('abcd' * 25),
                'k2': list('xy' * 25 + 'yx' * 25),
                'v1': np.random.rand(100),
                'v2': np.random.rand(100)})
print(df.head(15))
```

**Grouping by ONE key**

This results in a summarized data frame indexed by levels of the key.

```python
# Since k1 has 4 categories, this returns 4 rows
# (numeric_only=True skips the string columns)
print(df.groupby('k1').mean(numeric_only=True))

# Since k2 has 2 categories, this returns 2 rows
print(df.groupby('k2').sum(numeric_only=True))
```

**Grouping by TWO keys**

This will result in a summarized data frame with a hierarchical index.

```python
# A DataFrame with a hierarchical index formed by the combinations of levels
print(df.groupby([df['k1'], df['k2']]).sum(numeric_only=True))
```

**Column-wise aggregations – optimized statistical methods**

For simple statistical aggregations of the numeric columns of a DataFrame, we can call methods like *mean* and *sum*:

```python
# Summing a Series, grouped by another column
print(df['v1'].groupby(df['k1']).sum())

# Averaging all numeric Series of a DataFrame
print(df.groupby('k2').mean(numeric_only=True))
```

Or you can pass the name of a function as a string to the *.agg()* method:

```python
# Summing a Series
print(df['v1'].groupby(df['k1']).agg('sum'))

# Finding the mean of each grouped numeric column of a DataFrame
print(df.groupby(df.k1)[['v1', 'v2']].agg('mean').add_prefix('mu_'))
```

**The .agg() method**


When we have a GroupBy object, we may choose to apply one or more functions to one or more columns, even different functions to individual columns. The *.agg()* method allows us to specify these details easily and flexibly. It takes as arguments the following:

1. a list of function names, to be applied to all selected columns
2. tuples of (colname, function), to be applied to all selected columns
3. a dict of {df.col: function} pairs, applying each function to its column

4. We can apply more than one function to the selected column(s) by passing the names of the functions to *agg()* as a list:

```python
# Apply min, mean and max to v1 grouped by k1
print(df['v1'].groupby(df['k1']).agg(['min', 'mean', 'max']))

# Apply min and max to all numeric columns of df grouped by k2
# A hierarchical column index is created
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']))

# We can call .stack() on the returned object!
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']).stack())
```

5. We can supply names for the columns of the (new) aggregated DataFrame by passing *agg()* a list of (name, function) tuples:

```python
# Provide names for the aggregated columns
df[['v1', 'v2']].groupby(df['k1']).agg([('smallest', 'min'), ('largest', 'max')])
```

6. We can supply DataFrame column names and the functions to apply to each, in a dictionary:

```python
# Apply max and min to v1, and mean and sum to v2, all grouped by k1
df[['v1', 'v2']].groupby(df['k1']).agg({'v1': ['max', 'min'],
                                        'v2': ['mean', 'sum']})
```

**The .apply() method**

This method takes the following arguments:

- a general or **user-defined function**
- any other **parameters** that the function would take

```python
# Retrieve the top N cases from each group
def topN(data, col, N):
    return data.sort_values(by=col, ascending=False).loc[:, col].head(N)

print(df.groupby(df['k2']).apply(topN, col='v1', N=5))
```

**Combining Multiple Datasets - merge()**

The **merge()** function in pandas is similar to the SQL join operations; it links the rows of tables using one or more keys.

Syntax:

```python
pd.merge(df1, df2, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True)
```

The signature includes the following arguments:

**Which column to merge on:**

- *on='key'* if the same key is present in the two DataFrames,
- or *left_on='lkey'*, *right_on='rkey'* if the keys have different names in the DataFrames

**Note:** To merge on multiple keys, pass a list of column names
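For instance, a minimal sketch of a two-key merge (the frames and key names here are invented for this example):

```python
import pandas as pd

left = pd.DataFrame({'k1': ['a', 'a', 'b'], 'k2': ['x', 'y', 'x'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'k1': ['a', 'b', 'b'], 'k2': ['x', 'x', 'y'],
                      'rval': [4, 5, 6]})

# Rows match only when BOTH k1 and k2 agree
print(pd.merge(left, right, on=['k1', 'k2']))
```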

**The nature of the join:**

- the how= option, with values 'left', 'right', 'outer' and 'inner'
- By **default**, the merge is an __inner join__

**Overlapping column names:**

- the suffixes= option takes a tuple of string values to append to overlapping column names, to identify them in the merged dataset
- it **defaults** to ('_x', '_y')

- If you wish **to merge on the DataFrame index**, pass *left_index=True* or *right_index=True* (or both).

**Whether to sort the result DataFrame by the join keys** in lexicographical order:

- the sort= option
- defaults to False; sorting can be expensive on large datasets, so enable it only when ordered output matters
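A short sketch of the effect, on two invented frames:

```python
import pandas as pd

a = pd.DataFrame({'key': ['c', 'a', 'b'], 'v1': [1, 2, 3]})
b = pd.DataFrame({'key': ['b', 'c', 'a'], 'v2': [4, 5, 6]})

print(pd.merge(a, b))             # row order follows the left frame
print(pd.merge(a, b, sort=True))  # rows sorted by 'key': a, b, c
```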

Examples

```python
# Let's define a few toy datasets to use as examples
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': np.random.randn(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': np.random.randn(3)})
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data3': np.random.randn(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
                 'data4': np.random.randn(3)})
print(df1, '\n\n', df2, '\n\n', df3, '\n\n', df4)
```

**Default merge with no parameters**

```python
pd.merge(df1, df2)
```

We note that:

- It is an inner join by default (the output keys are the intersection of the input keys)
- The merge happens on the column 'key', which is common to both datasets
- We could have written *pd.merge(df1, df2, on='key')* to the same effect

```python
# Raises an error: there are no matching column names to merge on
pd.merge(df1, df4)
```

**Specifying which columns to merge on (if keys have different names in datasets)**

```python
pd.merge(df3, df4, left_on='lkey', right_on='rkey')  # still an inner join!
```

**Specifying the type of join**

```python
# The merged dataset will have the union of the keys, with NaNs where values aren't found
pd.merge(df1, df2, how='outer')

# The value 'c' is absent in df2, so there will be a NaN in column data2
pd.merge(df1, df2, how='left')
```

**Specifying Suffixes**

```python
# Add a column with the same name to df1 and df2
df1['colx'] = np.random.randn(7)
df2['colx'] = np.random.randn(3)

# Specify suffixes to identify the columns with the same name
pd.merge(df1, df2, on='key', suffixes=('_l', '_r'))
```

**Merging on columns and index**

```python
# Set lkey to be the index of df3
df3.set_index('lkey', inplace=True)
```

- Note: Do this only once. Re-running *set_index* will raise an error, because 'lkey' is no longer a column.
- You'll have to reset the index before you can set it again.
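If you do need to re-run the cell, a minimal sketch of the reset-then-set pattern:

```python
# Restore 'lkey' as an ordinary column, then set it as the index again
df3 = df3.reset_index().set_index('lkey')
```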

We specify that:

- for df2, we will use the column 'key', and
- for df3, we will use its index to merge

```python
pd.merge(df2, df3, how='left', left_on='key', right_index=True)
```